Chapter 5: Train-Test Split & Cross-Validation
Evaluating a Machine Learning model correctly is just as important as building it.
If a model performs well only on the training data but fails on new, unseen data, it is useless in practice.
This chapter explains Train-Test Split and Cross-Validation — the two core methods to measure model performance and prevent overfitting.
1. Why Do We Split Data?
Machine learning models must be tested on data they have never seen before.
This ensures the model is learning the pattern, not memorizing the data.
📌 Without splitting the data:
- Model memorizes training data
- Evaluation becomes unfair
- Leads to overfitting
📌 With Train-Test Split:
- Training set → Learn patterns
- Testing set → Measure real accuracy
2. Train-Test Split (Basic)
A simple and widely used method is dividing data into two parts:
- Training Set → 70–80%
- Test Set → 20–30%
Python Example

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train size:", len(X_train))
print("Test size:", len(X_test))
```
Setting `random_state` makes the split reproducible.
3. Why Is a Single Train-Test Split Not Always Enough?
If your dataset is small, splitting once is risky because:
- Train-test results depend on one random split
- The model may accidentally get “easy” or “hard” samples
- Performance can vary significantly
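To see this variability directly, the sketch below (using a synthetic dataset from `make_classification`, assumed here for illustration) repeats the same 80/20 split with different random seeds and prints the resulting test scores:

```python
# Sketch: how test accuracy varies with the random split (synthetic data assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

scores = []
for seed in range(5):
    # Same data, same model -- only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print("Scores per split:", [round(s, 2) for s in scores])
print("Spread:", round(max(scores) - min(scores), 2))
```

The spread between the best and worst split is exactly the instability the next section addresses.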
To solve this, we use Cross-Validation.
4. What is Cross-Validation?
Cross-Validation ensures the model is tested on every part of the dataset.
The most common form is K-Fold Cross-Validation.
✔ K-Fold Cross-Validation
The dataset is divided into K equal parts (folds). The process:
- Train on K-1 folds
- Test on the remaining fold
- Repeat K times
- Take the average score
Illustration (K = 5):
- Fold 1 → Test, Folds 2–5 → Train
- Fold 2 → Test, Folds 1,3,4,5 → Train
- …
- Fold 5 → Test, Folds 1–4 → Train
✔ Python Example

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
scores = cross_val_score(model, X, y, cv=5)
print("Scores:", scores)
print("Average Score:", scores.mean())
```
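The loop that `cross_val_score` runs for you can also be written out by hand with `KFold`, which makes the train/test rotation explicit. A minimal sketch, again assuming a synthetic dataset for illustration:

```python
# Sketch: the K-fold loop written out explicitly (synthetic data assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])            # train on K-1 folds
    score = model.score(X[test_idx], y[test_idx])    # test on the held-out fold
    fold_scores.append(score)
    print(f"Fold {fold}: {score:.2f}")

print("Average:", round(float(np.mean(fold_scores)), 2))
```

Each sample appears in exactly one test fold, so the average score reflects performance across the whole dataset.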
Benefits of Cross-Validation
- More reliable evaluation
- Reduces chances of overfitting
- Works well on small datasets
5. Train/Validation/Test Split
For deep learning or any model that requires hyperparameter tuning, we use a 3-way split:
- Training Set → Model learns
- Validation Set → Parameter tuning
- Test Set → Final unbiased evaluation
Example Split
- 70% — Training
- 15% — Validation
- 15% — Test
Python Example

```python
# First split: train vs temp (70% / 30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Second split: validation vs test (15% / 15% of the full data)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```
6. Stratified Splitting (For Classification)
If your dataset classes are imbalanced (e.g., 90% class A, 10% class B),
a normal split may distort the class distribution.
Use Stratified Split to preserve label proportions.
```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```
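For a one-off split, `train_test_split` also accepts a `stratify=` keyword that does the same job. A small sketch with toy imbalanced labels (90 of class 0, 10 of class 1, invented here for illustration):

```python
# Sketch: stratify= keeps class proportions identical in train and test.
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 samples of class 0, 10 of class 1 (assumed example data).
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("Test class counts:", np.bincount(y_test))  # → [18 2], preserving the 90/10 ratio
```

Without `stratify=y`, a 20-sample test set could easily contain zero or four minority samples purely by chance.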
7. Cross-Validation Variants
✔ LOOCV (Leave-One-Out)
- Each sample becomes a test set once
- Low-bias estimate, but very slow on large datasets (one model fit per sample)
✔ Stratified K-Fold
- K-fold preserving class ratios
- Best for classification
✔ Repeated K-Fold
- K-Fold performed multiple times
- Even more robust evaluation
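All three variants are available as scikit-learn splitter objects and can be passed directly to `cross_val_score` via `cv=`. A short sketch (synthetic data assumed) that just counts how many train/test rounds each one produces:

```python
# Sketch: the three CV variants as scikit-learn splitters, counting their folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, RepeatedKFold

X, y = make_classification(n_samples=30, random_state=0)

loo = LeaveOneOut()                                           # one fold per sample
skf = StratifiedKFold(n_splits=5)                             # folds preserve class ratios
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)  # 5-fold, run 3 times

print("LOOCV folds:          ", loo.get_n_splits(X))      # 30
print("Stratified K-Fold:    ", skf.get_n_splits(X, y))   # 5
print("Repeated K-Fold folds:", rkf.get_n_splits(X, y))   # 15
```

Any of these objects can replace the integer in `cross_val_score(model, X, y, cv=skf)`.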
8. Practical Tips
- Use train-test split for quick experiments.
- Use K-Fold CV for final evaluation.
- Always scale your data before distance-based models such as KNN, SVM, and K-Means.
- Use Stratified variants for classification.
- Use a validation set when tuning hyperparameters.
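The scaling tip interacts with cross-validation: fitting the scaler on the full dataset before CV leaks test-fold statistics into training. A `Pipeline` avoids this by refitting the scaler inside each fold. A minimal sketch, assuming a synthetic dataset:

```python
# Sketch: scaling inside a Pipeline so each CV fold scales on its own training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# The scaler is fit only on each fold's training portion, preventing leakage.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X, y, cv=5)
print("Average score:", round(float(scores.mean()), 3))
```

The same pattern works for any preprocessing step (encoding, imputation, feature selection).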
Conclusion
Train-Test Split and Cross-Validation ensure you evaluate your ML models fairly and prevent overfitting.
These techniques form the backbone of reliable machine learning workflows.
In the next chapter, we will dive into Evaluation Metrics like Accuracy, Precision, Recall, F1 Score, and Confusion Matrix with clear examples.
Assignments
Assignment 1 – Simple Train/Test Split Practice
Take any supervised dataset. Split it into train and test sets (e.g. 80-20 or 70-30). Train a basic model and evaluate on the test set. Report train accuracy and test accuracy.
Hint: Use `train_test_split()` from scikit-learn with `random_state` to ensure reproducibility.
Assignment 2 – Impact of Split Ratio
Use the same dataset, but try different test-size splits (e.g. 10%, 20%, 30%, 40%). For each split, train and test the model. Compare how the performance metrics (accuracy, error) change with split ratio.
Hint: Smaller test set → more training data but less reliable test estimate; larger test set → vice versa.
Assignment 3 – K-Fold Cross-Validation vs Single Split
Take a dataset and apply both (a) single train/test split and (b) K-Fold Cross-Validation (e.g. k = 5). Compare the results (mean accuracy, variability across folds, train vs test performance).
Hint: Use `KFold` or `cross_val_score`. K-Fold usually gives more stable estimates and reduces bias from a random split.
Assignment 4 – Cross-Validation on Small Dataset
Select or simulate a small dataset (few hundred records). Apply 5-fold cross-validation and observe how validation scores vary across folds. Explain why CV is especially helpful on small datasets.
Hint: A single split might give a misleadingly high or low score due to an unlucky partition; CV spreads that risk across multiple folds.
Assignment 5 – Overfitting Detection via Train vs Validation Performance
Train a model on a dataset and record training accuracy and cross-validation (or test) accuracy. Try increasing model complexity (e.g. depth of decision tree, number of features) and observe when overfitting appears.
Hint: Overfitting occurs when train accuracy is high but test/validation accuracy drops. This shows the model memorizes training data instead of generalizing.
Assignment 6 – Use Stratified Sampling (for Classification)
Take an imbalanced classification dataset. Use stratified train/test split and stratified K-Fold cross-validation. Compare performance vs simple random split. Report class distribution and model metrics.
Hint: Stratified splitting ensures class proportions are preserved in both splits, avoiding bias from class imbalance.
Assignment 7 – Hyperparameter Tuning with Cross-Validation
Pick an algorithm with hyperparameters (e.g. decision tree max_depth, number of neighbors in KNN). Use K-Fold CV to tune the hyperparameter. Show how CV helps pick a better hyperparameter than a single train/test split.
Hint: For each hyperparameter value, compute CV scores and choose the one with the best average performance.
Assignment 8 – Compare Overfitting vs Underfitting with Split and CV
Design two models on the same dataset: one simple (underfitting), one very complex (overfitting). Use both single split and K-Fold CV to evaluate. Compare which evaluation method better reveals overfitting or underfitting.
Hint: CV tends to reveal instability or poor generalization in overfitted models more reliably than a single split.
Assignment 9 – Nested Cross-Validation Conceptual Assignment
Read about nested cross-validation (train/validate/test inside CV). Describe when and why nested CV is useful (e.g. hyperparameter tuning + unbiased evaluation). Propose a scenario/dataset where nested CV would benefit.
Hint: Nested CV helps avoid selection bias — important when tuning many hyperparameters or comparing many models.
Assignment 10 – Report: Choosing Evaluation Strategy for a Given Problem
Take a hypothetical or real ML problem (with dataset description). Based on dataset size, class balance, and problem type, decide whether to use simple train/test split, K-Fold CV, or nested CV — and justify your choice in a short write-up.
Hint: Consider dataset size (small vs large), class balance, overfitting risk, and computational budget.
