Chapter 5: Train-Test Split & Cross-Validation
Evaluating a Machine Learning model correctly is just as important as building it.
If a model performs well only on the training data but fails on new, unseen data, it is useless in practice.
This chapter explains Train-Test Split and Cross-Validation — the two core methods to measure model performance and prevent overfitting.
1. Why Do We Split Data?
Machine learning models must be tested on data they have never seen before.
This ensures the model is learning the pattern, not memorizing the data.
📌 Without splitting the data:
- Model memorizes training data
- Evaluation becomes unfair
- Leads to overfitting
📌 With Train-Test Split:
- Training set → Learn patterns
- Testing set → Measure real accuracy
2. Train-Test Split (Basic)
A simple and widely used method is dividing data into two parts:
- Training Set → 70–80%
- Test Set → 20–30%
Python Example

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train size:", len(X_train))
print("Test size:", len(X_test))
```
Setting `random_state` makes the split reproducible.
3. Why Is a Single Train-Test Split Not Always Enough?
If your dataset is small, splitting once is risky because:
- Train-test results depend on one random split
- The model may accidentally get “easy” or “hard” samples
- Performance can vary significantly
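To see this variability directly, the sketch below (using a synthetic dataset from `make_classification`, assumed here for illustration) repeats the same 80/20 split with different random seeds and prints the resulting test scores:

```python
# Sketch: how test accuracy varies with the random split (synthetic data assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

scores = []
for seed in range(5):
    # Same data, same model -- only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print("Scores per split:", [round(s, 2) for s in scores])
print("Spread:", round(max(scores) - min(scores), 2))
```

The spread between the best and worst split is exactly the instability the next section addresses.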
To solve this, we use Cross-Validation.
4. What is Cross-Validation?
Cross-Validation ensures the model is tested on every part of the dataset.
The most common form is K-Fold Cross-Validation.
✔ K-Fold Cross-Validation
The dataset is divided into K equal parts (folds). The process:
- Train on K-1 folds
- Test on the remaining fold
- Repeat K times
- Take the average score
Illustration (K = 5):
- Fold 1 → Test, Folds 2–5 → Train
- Fold 2 → Test, Folds 1,3,4,5 → Train
- …
- Fold 5 → Test, Folds 1–4 → Train
✔ Python Example

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
scores = cross_val_score(model, X, y, cv=5)
print("Scores:", scores)
print("Average Score:", scores.mean())
```
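The loop that `cross_val_score` runs for you can also be written out by hand with `KFold`, which makes the train/test rotation explicit. A minimal sketch, again assuming a synthetic dataset for illustration:

```python
# Sketch: the K-fold loop written out explicitly (synthetic data assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])            # train on K-1 folds
    score = model.score(X[test_idx], y[test_idx])    # test on the held-out fold
    fold_scores.append(score)
    print(f"Fold {fold}: {score:.2f}")

print("Average:", round(float(np.mean(fold_scores)), 2))
```

Each sample appears in exactly one test fold, so the average score reflects performance across the whole dataset.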
Benefits of Cross-Validation
- More reliable evaluation
- Reduces chances of overfitting
- Works well on small datasets
5. Train/Validation/Test Split
For deep learning or any model that requires hyperparameter tuning, we use a 3-way split:
- Training Set → Model learns
- Validation Set → Parameter tuning
- Test Set → Final unbiased evaluation
Example Split
- 70% — Training
- 15% — Validation
- 15% — Test
Python Example

```python
# First split: train vs temp (70% / 30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Second split: validation vs test (15% / 15% of the full data)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```
6. Stratified Splitting (For Classification)
If your dataset classes are imbalanced (e.g., 90% class A, 10% class B),
a normal split may distort the class distribution.
Use Stratified Split to preserve label proportions.
```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```
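For a one-off split, `train_test_split` also accepts a `stratify=` keyword that does the same job. A small sketch with toy imbalanced labels (90 of class 0, 10 of class 1, invented here for illustration):

```python
# Sketch: stratify= keeps class proportions identical in train and test.
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 samples of class 0, 10 of class 1 (assumed example data).
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("Test class counts:", np.bincount(y_test))  # → [18 2], preserving the 90/10 ratio
```

Without `stratify=y`, a 20-sample test set could easily contain zero or four minority samples purely by chance.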
7. Cross-Validation Variants
✔ LOOCV (Leave-One-Out)
- Each sample becomes a test set once
- Low-bias estimate, but very slow on large datasets (one model fit per sample)
✔ Stratified K-Fold
- K-fold preserving class ratios
- Best for classification
✔ Repeated K-Fold
- K-Fold performed multiple times
- Even more robust evaluation
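All three variants are available as scikit-learn splitter objects and can be passed directly to `cross_val_score` via `cv=`. A short sketch (synthetic data assumed) that just counts how many train/test rounds each one produces:

```python
# Sketch: the three CV variants as scikit-learn splitters, counting their folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, RepeatedKFold

X, y = make_classification(n_samples=30, random_state=0)

loo = LeaveOneOut()                                           # one fold per sample
skf = StratifiedKFold(n_splits=5)                             # folds preserve class ratios
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)  # 5-fold, run 3 times

print("LOOCV folds:          ", loo.get_n_splits(X))      # 30
print("Stratified K-Fold:    ", skf.get_n_splits(X, y))   # 5
print("Repeated K-Fold folds:", rkf.get_n_splits(X, y))   # 15
```

Any of these objects can replace the integer in `cross_val_score(model, X, y, cv=skf)`.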
8. Practical Tips
- Use train-test split for quick experiments.
- Use K-Fold CV for final evaluation.
- Always scale your data before distance-based models such as KNN, SVM, and K-Means.
- Use Stratified variants for classification.
- Use a validation set when tuning hyperparameters.
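The scaling tip interacts with cross-validation: fitting the scaler on the full dataset before CV leaks test-fold statistics into training. A `Pipeline` avoids this by refitting the scaler inside each fold. A minimal sketch, assuming a synthetic dataset:

```python
# Sketch: scaling inside a Pipeline so each CV fold scales on its own training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# The scaler is fit only on each fold's training portion, preventing leakage.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X, y, cv=5)
print("Average score:", round(float(scores.mean()), 3))
```

The same pattern works for any preprocessing step (encoding, imputation, feature selection).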
Conclusion
Train-Test Split and Cross-Validation ensure you evaluate your ML models fairly and prevent overfitting.
These techniques form the backbone of reliable machine learning workflows.
In the next chapter, we will dive into Evaluation Metrics like Accuracy, Precision, Recall, F1 Score, and Confusion Matrix with clear examples.
Assignments
Assignment 1 – Simple Train/Test Split Practice
Take any supervised dataset. Split it into train and test sets (e.g. 80-20 or 70-30). Train a basic model and evaluate on the test set. Report train accuracy and test accuracy.
Hint: Use `train_test_split()` from scikit-learn with `random_state` to ensure reproducibility.
Assignment 2 – Impact of Split Ratio
Use the same dataset, but try different test-size splits (e.g. 10%, 20%, 30%, 40%). For each split, train and test the model. Compare how the performance metrics (accuracy, error) change with split ratio.
Hint: Smaller test set → more training data but less reliable test estimate; larger test set → vice versa.
Assignment 3 – K-Fold Cross-Validation vs Single Split
Take a dataset and apply both (a) single train/test split and (b) K-Fold Cross-Validation (e.g. k = 5). Compare the results (mean accuracy, variability across folds, train vs test performance).
Hint: Use `KFold` or `cross_val_score`. K-Fold usually gives more stable estimates and reduces bias from a random split.
Assignment 4 – Cross-Validation on Small Dataset
Select or simulate a small dataset (few hundred records). Apply 5-fold cross-validation and observe how validation scores vary across folds. Explain why CV is especially helpful on small datasets.
Hint: A single split might give a misleadingly high or low score due to an unlucky partition; CV spreads that risk across multiple folds.
Assignment 5 – Overfitting Detection via Train vs Validation Performance
Train a model on a dataset and record training accuracy and cross-validation (or test) accuracy. Try increasing model complexity (e.g. depth of decision tree, number of features) and observe when overfitting appears.
Hint: Overfitting occurs when train accuracy is high but test/validation accuracy drops. This shows the model memorizes training data instead of generalizing.
Assignment 6 – Use Stratified Sampling (for Classification)
Take an imbalanced classification dataset. Use stratified train/test split and stratified K-Fold cross-validation. Compare performance vs simple random split. Report class distribution and model metrics.
Hint: Stratified splitting ensures class proportions are preserved in both splits, avoiding bias from class imbalance.
Assignment 7 – Hyperparameter Tuning with Cross-Validation
Pick an algorithm with hyperparameters (e.g. decision tree max_depth, number of neighbors in KNN). Use K-Fold CV to tune the hyperparameter. Show how CV helps pick a better hyperparameter than a single train/test split.
Hint: For each hyperparameter value, compute CV scores and choose the one with the best average performance.
Assignment 8 – Compare Overfitting vs Underfitting with Split and CV
Design two models on the same dataset: one simple (underfitting), one very complex (overfitting). Use both single split and K-Fold CV to evaluate. Compare which evaluation method better reveals overfitting or underfitting.
Hint: CV tends to reveal instability or poor generalization in overfitted models more reliably than a single split.
Assignment 9 – Nested Cross-Validation Conceptual Assignment
Read about nested cross-validation (train/validate/test inside CV). Describe when and why nested CV is useful (e.g. hyperparameter tuning + unbiased evaluation). Propose a scenario/dataset where nested CV would benefit.
Hint: Nested CV helps avoid selection bias — important when tuning many hyperparameters or comparing many models.
Assignment 10 – Report: Choosing Evaluation Strategy for a Given Problem
Take a hypothetical or real ML problem (with dataset description). Based on dataset size, class balance, and problem type, decide whether to use simple train/test split, K-Fold CV, or nested CV — and justify your choice in a short write-up.
Hint: Consider dataset size (small vs large), class balance, overfitting risk, and computational budget.
