
Chapter 6: Evaluation Metrics – Accuracy, Precision, Recall, F1 Score

Model evaluation is one of the most important steps in machine learning.
A model is only useful if it performs well on new, unseen data.
In this chapter, we explore the most essential evaluation metrics: Accuracy, Precision, Recall, F1 Score, and the Confusion Matrix.

1. What Are Evaluation Metrics?

Evaluation metrics help measure how well a machine learning model performs.
Different problems require different evaluation metrics — especially classification problems like spam detection, fraud detection, medical diagnosis, or churn prediction.

  • Accuracy → Overall correctness
  • Precision → How many predicted positives are actually positive
  • Recall → How many actual positives are captured
  • F1 Score → Balance of precision & recall
  • Confusion Matrix → Detailed breakdown of errors

2. Understanding the Confusion Matrix

A confusion matrix shows how many predictions were correct or incorrect for each class.

                  Predicted Positive     Predicted Negative
Actual Positive   True Positive (TP)     False Negative (FN)
Actual Negative   False Positive (FP)    True Negative (TN)

✔ Example Confusion Matrix


TP = 80   # correctly predicted positive
FP = 10   # predicted positive but actually negative
FN = 5    # predicted negative but actually positive
TN = 100  # correctly predicted negative
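These four cells can be tallied directly from paired lists of true and predicted labels. A minimal sketch in plain Python, using small invented labels (1 = positive, 0 = negative):

```python
# Tally confusion-matrix cells from paired true/predicted labels
# (toy labels invented for illustration; 1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(TP, FP, FN, TN)  # 3 1 1 3
```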

3. Accuracy

Accuracy measures how many predictions were correct out of all predictions.


Accuracy = (TP + TN) / (TP + TN + FP + FN)

Example:


Accuracy = (80 + 100) / (80+100+10+5) = 180 / 195 = 92.3%

Accuracy is misleading for imbalanced data.
Example: 95% healthy patients, 5% cancer patients → model predicting all “healthy” gets 95% accuracy but is useless.
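Both points can be checked in a few lines of plain Python: first the chapter's example counts, then the imbalance pitfall with a naive "always healthy" predictor on 100 patients:

```python
# Accuracy from the chapter's example counts
TP, TN, FP, FN = 80, 100, 10, 5
acc_example = (TP + TN) / (TP + TN + FP + FN)
print(round(acc_example, 3))  # 0.923

# Imbalance pitfall: a model that predicts "healthy" (negative) for
# everyone in a population of 100 patients, 95 of them healthy
TP, TN, FP, FN = 0, 95, 0, 5
acc_naive = (TP + TN) / (TP + TN + FP + FN)
print(acc_naive)  # 0.95 -- high accuracy, zero cancer cases caught
```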

4. Precision

Precision answers: Of all positive predictions, how many were correct?


Precision = TP / (TP + FP)

Example:


Precision = 80 / (80 + 10) = 88.9%

High precision → Few false positives.
Useful for: Spam detection, fraud detection, face recognition.

5. Recall (Sensitivity)

Recall answers: Of all actual positives, how many did we identify?


Recall = TP / (TP + FN)

Example:


Recall = 80 / (80 + 5) = 94.1%

High recall → Few false negatives.
Useful for: Medical diagnosis, safety systems, cancer detection.
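Both formulas can be verified against the chapter's example counts:

```python
# Precision and recall from the chapter's example confusion matrix
TP, FP, FN = 80, 10, 5
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(round(precision, 3))  # 0.889
print(round(recall, 3))     # 0.941
```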

6. F1 Score

F1 Score is the harmonic mean of precision and recall.
It balances both — useful when data is imbalanced.


F1 = 2 * (Precision * Recall) / (Precision + Recall)

Example:


Precision = 0.889
Recall = 0.941
F1 = 2 * (0.889 * 0.941) / (0.889 + 0.941) 
F1 = 0.914 (91.4%)

A perfect model has F1 = 1.
A very poor model has F1 close to 0.
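A quick check of the worked example, plus an illustration of why the harmonic mean is used: unlike the arithmetic mean, it drags F1 down whenever precision and recall are far apart, so a lopsided model cannot hide behind one strong metric.

```python
# F1 from the worked example
precision, recall = 0.889, 0.941
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.914

# Why the harmonic mean: great precision with poor recall still
# yields a low F1, while the arithmetic mean would look decent.
p, r = 1.0, 0.1
print(round(2 * (p * r) / (p + r), 3))  # 0.182 (harmonic)
print((p + r) / 2)                      # 0.55  (arithmetic)
```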

7. Python Example using Scikit-Learn


from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1,1,1,0,0,0]
y_pred = [1,1,0,0,1,0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
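One caveat worth noting: scikit-learn orders the confusion matrix by label value (0 first), so for binary labels the printed matrix is [[TN, FP], [FN, TP]], the reverse of the layout shown in Section 2. The cells can be unpacked with ravel():

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

# Rows = actual class, columns = predicted class, ordered by label
# value (0 first), so the binary layout is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```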

8. Which Metric Should You Use?

✔ Use Accuracy When:

  • Data is balanced (equal classes)
  • Both types of errors have equal importance

✔ Use Precision When:

  • False positives are more dangerous
  • Example: Spam filter, loan approval, fraud detection

✔ Use Recall When:

  • Missing a positive case is dangerous
  • Example: Cancer detection, safety systems

✔ Use F1 Score When:

  • Data is imbalanced
  • Need balanced precision & recall

9. Summary Table

Metric      Best For                    Avoid When
Accuracy    Balanced data               Imbalanced datasets
Precision   Avoiding false positives    When false negatives matter more
Recall      Detecting all positives     When false positives matter more
F1 Score    Imbalanced classification   When interpretability is needed

Conclusion

Accuracy alone is not enough to evaluate most real-world machine learning models, especially when data is imbalanced.
Understanding precision, recall, F1, and the confusion matrix helps choose the right model for the right problem.

In the next chapter, we will study the Bias-Variance Tradeoff — how to balance underfitting and overfitting for better ML performance.

Assignments

Assignment 1 – Compute Metrics from Confusion Matrix

Create a small hypothetical confusion matrix with TP, TN, FP, FN (e.g. TP=50, TN=45, FP=5, FN=10). Compute Accuracy, Precision, Recall, and F1 Score from these values.

Hint: Use the formulas: Accuracy = (TP+TN)/(TP+TN+FP+FN), Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2*(Precision*Recall)/(Precision+Recall).

Assignment 2 – Analyze Class-Imbalanced Dataset

Take or simulate a binary dataset in which the positive class is rare (e.g. 5% positive, 95% negative). Assume a naive model that always predicts negative. Compute its Accuracy, Precision, Recall, and F1. Reflect on why accuracy is misleading here.

Hint: When data is imbalanced, accuracy may be high even with poor detection of positives — use confusion matrix metrics to reveal that.

Assignment 3 – Compare Two Models on Same Dataset

Assume you have two models for a classification task. Define two confusion matrices (for Model A and Model B). Compute Precision, Recall, F1 for both. Decide which model is better — considering trade-offs (e.g. one has higher precision but lower recall).

Hint: Sometimes precision > recall is better (if false positives are costly), other times recall > precision (if missing positives is costly). Use F1 or domain context to choose.

Assignment 4 – Real Data: Classification & Metric Reporting

Pick a small real-world classification dataset (e.g. about spam detection, churn, disease diagnosis). Split into train/test, train a model, then compute and report: confusion matrix, accuracy, precision, recall, F1. Include analysis: what do those metrics tell you?

Hint: Use built-in functions like `confusion_matrix`, `precision_score`, etc. from sklearn.

Assignment 5 – Threshold-Based Predictions & Metric Change

For a binary classification model that outputs probabilities (e.g. logistic regression), choose two different classification thresholds (e.g. 0.5 and 0.7). For each threshold, compute confusion matrix and evaluation metrics. Compare how precision and recall change.

Hint: Raising the threshold usually increases precision (fewer false positives) but may reduce recall (more false negatives). This trade-off matters most for imbalanced or high-risk tasks.
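As a starting point, the threshold effect can be simulated without training a model at all; the probabilities below are invented for illustration, standing in for the output of a probabilistic classifier such as logistic regression:

```python
# Invented probabilities from a hypothetical probabilistic classifier
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs  = [0.95, 0.80, 0.65, 0.40, 0.75, 0.55, 0.30, 0.10]

def metrics_at(threshold):
    """Return (precision, recall) after binarizing probs at threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return round(precision, 2), round(recall, 2)

print(metrics_at(0.5))  # (0.6, 0.75)  -> more positives caught, more FP
print(metrics_at(0.7))  # (0.67, 0.5)  -> higher precision, lower recall
```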

Assignment 6 – Metric Selection by Problem Type

Write a short report: for three different classification problems (e.g. medical diagnosis, spam filter, recommendation system), decide which metric(s) among Accuracy / Precision / Recall / F1 are most appropriate and why.

Hint: In disease detection, high recall (low FN) is critical; for a spam filter, high precision (low FP) might be more important; for balanced cases, accuracy may suffice.

Assignment 7 – Multi-Class Classification Metrics

Find a small multiclass dataset (three or more classes). Train a classifier, then compute per-class and overall metrics (precision, recall, F1) using macro, micro, or weighted averaging. Report the results and explain the differences between the averaging methods.

Hint: Multiclass confusion matrices plus per-class averaging help evaluate the model properly; scikit-learn's classification_report supports this.
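A minimal sketch of the three averaging schemes on invented 3-class labels: macro averages the per-class F1 scores equally, micro pools all individual decisions (equivalent to accuracy in single-label classification), and weighted scales each class's F1 by its support.

```python
from sklearn.metrics import classification_report, f1_score

# Invented 3-class labels, just to compare the averaging schemes
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0, 1]

for avg in ("macro", "micro", "weighted"):
    print(avg, "F1:", round(f1_score(y_true, y_pred, average=avg), 3))

# Per-class breakdown of precision, recall, and F1 in one call
print(classification_report(y_true, y_pred))
```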

Assignment 8 – Measure Impact of Class Imbalance Handling

Take an imbalanced classification dataset. Train model without any balancing. Then apply techniques (resampling: oversample minority / undersample majority / class weights). Compare confusion matrices and evaluation metrics before & after.

Hint: Class imbalance affects recall and precision — balancing may improve detection of the minority class (recall) but can reduce precision. Use the evaluation metrics to quantify the difference.

Assignment 9 – Compare Accuracy vs F1 Score on Skewed Data

Simulate or take a dataset with skewed class distribution (e.g. 90% negative). Build a trivial classifier that always predicts majority class. Compute accuracy and F1. Then build a “smart” classifier: compute its metrics too. Compare and write why F1 may be a better metric than accuracy here.

Hint: Accuracy may be high for the trivial classifier but its F1 will be low, showing that accuracy alone can mislead.

Assignment 10 – Write a Tutorial Section (Blog Post Style)

Write a blog-post style explanation (like TutorialRays) for someone new: define confusion matrix, explain TP/FP/TN/FN, define accuracy, precision, recall, F1. Include why accuracy alone isn’t enough and when to prefer other metrics.

Hint: Use simple language, and include a small numeric example to illustrate the confusion matrix and metric formulas.
