Machine Learning

Chapter 3: Overview of Machine Learning Algorithms – Intuition & Examples

This chapter explains the most widely used Machine Learning algorithms — why they work, when to use them, pros & cons, and concise Python examples using scikit-learn.

1. Linear Regression (Supervised — Regression)

Intuition: Fit a straight line that best predicts a continuous target from one or more features.
Use when: Target is numeric and relationship is (approximately) linear.

  • Pros: Simple, fast, interpretable.
  • Cons: Poor for nonlinear relationships, sensitive to outliers.

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([1,2,3,4,5]).reshape(-1,1)
y = np.array([2.1,4.0,5.9,8.1,10.2])

model = LinearRegression()
model.fit(X, y)
print("Coef:", model.coef_, "Intercept:", model.intercept_)
print("Predict for 6:", model.predict([[6]]))

Explanation of the Code

1. Importing Required Libraries

The code begins by importing two Python libraries:
numpy for creating and reshaping numerical arrays,
LinearRegression from scikit-learn for creating and training the model.

2. Creating the Dataset

A tiny dataset is created using NumPy arrays.
The feature values [1, 2, 3, 4, 5] are reshaped into a single column with reshape(-1, 1), because scikit-learn expects a 2-D feature matrix.
The target values [2.1, 4.0, 5.9, 8.1, 10.2] follow an approximately linear pattern (y ≈ 2x).

3. Training the Linear Regression Model

A LinearRegression model is created and fitted with model.fit(X, y).
During fitting, the model finds the slope (coefficient) and intercept of the line that best matches the data.

4. Inspecting the Learned Line

model.coef_ holds the learned slope (close to 2) and model.intercept_ holds the learned intercept (close to 0), matching the pattern built into the data.

5. Making a Prediction

model.predict([[6]]) extends the fitted line to the unseen input 6 and returns a value close to 12.
The prediction is then printed on the screen.

6. Summary

✔ The dataset is created as NumPy arrays and reshaped for scikit-learn.
✔ The model learns the slope and intercept of the best-fit line.
✔ The fitted line predicts ≈ 12 for the new input 6.
This example demonstrates the basic workflow of supervised machine learning.

2. Logistic Regression (Supervised — Classification)

Intuition: Models the probability of a binary outcome using the logistic (sigmoid) function.
Use when: Binary classification, linearly separable-ish data.

  • Pros: Probabilistic outputs, interpretable coefficients.
  • Cons: Assumes linear decision boundary unless features transformed.

Logistic Regression Example

📌 Code Example

from sklearn.linear_model import LogisticRegression
import numpy as np

# Dataset
X = np.array([[1],[2],[3],[4]])
y = np.array([0,0,1,1])

# Model
clf = LogisticRegression()
clf.fit(X, y)

# Prediction for 3.5
print("Probabilities:", clf.predict_proba([[3.5]]))
print("Class:", clf.predict([[3.5]]))

📌 Output

Probabilities: ≈ [[0.28 0.72]]
Class: [1]

Explanation of the Code

1. Importing Libraries

The program imports LogisticRegression from scikit-learn and NumPy for numerical data handling.
Logistic Regression is used for binary classification (0 or 1).

2. Creating the Dataset

A small dataset is created using NumPy arrays.
The feature values are [1, 2, 3, 4] and the labels are [0, 0, 1, 1].
This means lower values belong to class 0 and higher values belong to class 1.

3. Training the Model

A Logistic Regression model is created and trained using clf.fit(X, y).
The model learns a sigmoid curve that separates class 0 from class 1.

4. Making a Prediction

The value 3.5 is passed to the model.
(An input of 2.5 would sit exactly midway between the two classes, where the probabilities come out near 50/50, so 3.5 makes a clearer example.)
Two outputs are generated:

✔ Probability Output

≈ [0.28 0.72]
Meaning:
about a 28% chance → Class 0
about a 72% chance → Class 1

✔ Final Predicted Class

Since class 1 has the higher probability, the model predicts:
Class: 1

5. Summary

✔ Logistic Regression is used for classification
✔ The model learns from simple numeric data
✔ It predicts both probabilities and final class
✔ For input 3.5 → predicted class is 1
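The "unless features transformed" caveat from the pros & cons above can be sketched with a pipeline. This is an illustrative example with made-up 1-D data where class 1 occupies the middle of the range, so no single threshold separates the classes; adding a squared feature lets the otherwise linear model bend its boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Class 1 sits in the middle of the range -- not separable by one threshold
X = np.array([[0], [1], [2], [3], [4], [5], [6], [7]])
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])

# PolynomialFeatures adds x^2, so the "linear" boundary becomes a band
clf = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
clf.fit(X, y)

print(clf.predict([[3.5]]))  # inside the band -> class 1
print(clf.predict([[7.5]]))  # outside the band -> class 0
```

A plain LogisticRegression on the raw feature would be forced to put a single threshold somewhere and misclassify half the points.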


3. K-Nearest Neighbors (KNN) (Supervised)

Intuition: Predict label based on the majority label among the K nearest training points.
Use when: Small dataset, non-linear boundaries, easy baseline.

  • Pros: Simple, non-parametric.
  • Cons: Slow at prediction (large datasets), sensitive to scaling and irrelevant features.

K-Nearest Neighbors (KNN) Example

📌 Code Example


from sklearn.neighbors import KNeighborsClassifier

X = [[0],[1],[2],[3]]
y = [0,0,1,1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[1.5]]))

📌 Output

[0]

Explanation of the Code

1. Importing the KNN Classifier

The program imports KNeighborsClassifier from scikit-learn.
KNN is a simple machine learning algorithm that classifies data based on the nearest neighbors.

2. Creating the Dataset

The dataset contains four values: 0, 1, 2, 3.
Their corresponding classes are: 0, 0, 1, 1.
This means smaller numbers belong to class 0 and larger ones belong to class 1.

3. Initializing the Model

A KNN model is created with n_neighbors = 3, meaning the prediction is based on the 3 nearest data points.

4. Training the Model

The model is trained using knn.fit(X, y).
It stores the dataset internally and will use it during prediction.

5. Making a Prediction

The model predicts the class of the value 1.5.
It checks its 3 closest neighbors:

Nearest values to 1.5 → 1 and 2 (distance 0.5 each); 0 and 3 are tied at distance 1.5, and scikit-learn breaks the tie by index, picking 0
Classes of the 3 neighbors → 0, 1, 0

The majority class is 0, so the model predicts:

✔ Final Predicted Class

[0]

6. Summary

✔ KNN classifies based on nearest neighbors
✔ Model uses 3 closest points to decide
✔ For input 1.5, the majority class is 0
✔ Final prediction = 0
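The scaling sensitivity mentioned in the cons above is easy to see with made-up data where one feature dwarfs the other. This sketch (hypothetical numbers) contrasts a raw KNN with one wrapped in a StandardScaler pipeline:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Feature 1 is in single digits, feature 2 in the thousands.
# Unscaled, Euclidean distance is dominated by feature 2.
X = [[1, 1000], [2, 1100], [8, 1050], [9, 1150]]
y = [0, 0, 1, 1]

raw = KNeighborsClassifier(n_neighbors=1).fit(X, y)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=1)).fit(X, y)

query = [[8.5, 1000]]
print(raw.predict(query))     # [0] -- nearest by raw distance is (1, 1000)
print(scaled.predict(query))  # [1] -- after scaling, (8, 1050) is nearest
```

The same scaling advice applies to SVM and K-Means, as noted in the Practical Tips later in the chapter.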


4. Support Vector Machine (SVM) (Supervised)

Intuition: Finds a decision boundary that maximizes the margin between classes.
Use when: High-dimensional space, medium-sized datasets. Kernel trick handles non-linear separation.

  • Pros: Effective in high dimensions, flexible with kernels.
  • Cons: Can be slow for very large datasets, needs careful kernel/parameter tuning.

from sklearn.svm import SVC
X = [[0],[1],[2],[3]]
y = [0,0,1,1]

svm = SVC(kernel='rbf', probability=True)
svm.fit(X, y)
print("Pred:", svm.predict([[1.5]]), "Prob:", svm.predict_proba([[1.5]]))

5. Decision Trees (Supervised)

Intuition: Split data by feature thresholds into a tree structure of decisions.
Use when: Interpretable model required, handles categorical and numeric features.

  • Pros: Interpretable, no need to scale features, handles non-linearities.
  • Cons: Prone to overfitting unless pruned or regularized.

from sklearn.tree import DecisionTreeClassifier
X = [[0,0],[1,1],[1,0],[0,1]]
y = [0,1,1,0]

dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X, y)
print(dt.predict([[0.9, 0.1]]))

6. Random Forest (Supervised — Ensemble)

Intuition: Build many decision trees on random subsets and average their predictions (bagging).
Use when: Want robust performance with less tuning than single trees.

  • Pros: Reduces overfitting, strong out-of-the-box performance.
  • Cons: Less interpretable than single tree, heavier model.

from sklearn.ensemble import RandomForestClassifier
X = [[0],[1],[2],[3]]
y = [0,0,1,1]

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
print(rf.predict([[1.5]]))

7. K-Means Clustering (Unsupervised)

Intuition: Partition data into K clusters by minimizing distance to cluster centers.
Use when: Want to group similar observations quickly.

  • Pros: Simple and fast on large datasets.
  • Cons: Needs K, assumes spherical clusters and equal size, sensitive to initialization/outliers.

from sklearn.cluster import KMeans
import numpy as np

data = np.array([[1],[1.2],[0.8],[10],[10.1],[9.8]])
kmeans = KMeans(n_clusters=2, random_state=42).fit(data)
print("Centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)

8. Hierarchical Clustering (Unsupervised)

Intuition: Build a dendrogram by repeatedly merging (agglomerative) or splitting (divisive) clusters.
Use when: You want a tree-like partitioning and to explore cluster structure at multiple levels.

  • Pros: No need to pre-specify number of clusters (can cut dendrogram), interpretable hierarchy.
  • Cons: Computationally expensive for very large datasets.

from sklearn.cluster import AgglomerativeClustering
X = [[1],[1.1],[0.9],[5],[5.2],[4.9]]
agg = AgglomerativeClustering(n_clusters=2).fit(X)
print(agg.labels_)
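The "cut the dendrogram" idea can be sketched with SciPy's hierarchy functions (assuming SciPy is available; linkage builds the full merge tree, which can then be cut at any level without refitting):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1], [1.1], [0.9], [5], [5.2], [4.9]])

# Build the complete merge tree, then cut it into 2 flat clusters
Z = linkage(X, method='ward')
labels = fcluster(Z, t=2, criterion='maxclust')

# The first three points end up together, and the last three together
print(labels)
```

Cutting the same tree with t=3 would yield three clusters from the identical linkage matrix, which is the advantage over K-Means.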

9. PCA — Principal Component Analysis (Unsupervised / Dimensionality Reduction)

Intuition: Find orthogonal directions (principal components) that capture the most variance — project data to lower dimensions.
Use when: Reduce features for visualization, speed, and to remove multicollinearity.

  • Pros: Reduces dimensions, helps visualization, denoising.
  • Cons: Components are linear combos (less interpretable), variance-based (not supervised).

from sklearn.decomposition import PCA
import numpy as np

X = np.random.rand(100, 5)   # 5 features
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

Choosing an Algorithm — Quick Guide

  • If target is continuous: Linear Regression, Random Forest Regressor, etc.
  • If target is categorical: Logistic Regression, SVM, Random Forest, XGBoost (advanced)
  • If unlabeled data: K-Means, Hierarchical, DBSCAN, PCA
  • Large high-dimensional data: SVM (with kernel), Random Forest, or tree-based ensembles
  • Need interpretability: Linear models, Decision Trees

Practical Tips

  • Always start with simple baseline models (linear/logistic, KNN).
  • Scale features for distance-based models (KNN, SVM, K-Means).
  • Use cross-validation to evaluate models robustly.
  • For tabular data, tree ensembles (Random Forest / Gradient Boosting) often perform very well.
  • Check feature importance (trees) & coefficients (linear models) for interpretability.

This chapter gave a compact but practical overview of key ML algorithms.
The next chapter covers Data Preprocessing & Feature Scaling, with hands-on pipelines, missing-value strategies, encoding, and scaling best practices.

More Real-World Machine Learning Examples (Practical Code)

Below are real-world examples for each ML algorithm, such as churn prediction, fraud detection, house price prediction, and movie recommendation. All examples are written in Python using scikit-learn.


1. Linear Regression — House Price Prediction

Linear Regression Example

📌 Code Example


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Dataset
data = {
    "area_sqft": [1000,1500,1700,2000,2500,3000],
    "bedrooms": [2,3,3,4,4,5],
    "price": [120000,180000,210000,250000,300000,360000]
}

df = pd.DataFrame(data)

# Features & Target
X = df[["area_sqft", "bedrooms"]]
y = df["price"]

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict using DataFrame (prevents warning)
input_data = pd.DataFrame([[2200, 3]], columns=["area_sqft", "bedrooms"])
prediction = model.predict(input_data)[0]

print("Pred for 2200 sqft, 3BHK:", prediction)

📌 Output

Pred for 2200 sqft, 3BHK: 260000.0

Explanation of the Code

1. Creating the Dataset

A small dataset of houses is created using area (in sqft), number of bedrooms, and price.
This helps the model learn how these factors influence house prices.

2. Converting Data into a DataFrame

The dictionary is converted into a Pandas DataFrame, which stores data in a table-like structure.
Machine learning models work very well with this format.

3. Selecting Features and Target

We separate the dataset into:
X (features) → area_sqft and bedrooms
y (target) → price
The model will learn how these features combine to affect the final price.

4. Splitting the Dataset

Using train_test_split(), data is divided into:
✔ 70% for training
✔ 30% for testing
This ensures the model learns well and is evaluated correctly.

5. Training the Linear Regression Model

We create a LinearRegression model and fit it to the training data.
The model learns a best-fit line that represents the relationship between features and price.

6. Making a Prediction

A new DataFrame is created for a house with 2200 sqft and 3 bedrooms.
The trained model predicts the price using the pattern it learned from the dataset.

✔ Final Predicted Price

The model predicts:
260000.0 for a 2200 sqft, 3-bedroom house.

7. Summary

✔ Linear Regression is used for predicting continuous values
✔ Model learns from historical housing data
✔ Features → area and bedrooms
✔ Target → price
✔ Prediction for new data → 260000.0


2. Logistic Regression — Customer Churn Prediction

Logistic Regression Churn Prediction Example

📌 Code Example


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

data = {
    "monthly_charges": [25,30,50,80,95,100],
    "tenure_months": [2,4,12,20,3,1],
    "churned": [1,1,0,0,1,1]
}

df = pd.DataFrame(data)
X = df[["monthly_charges","tenure_months"]]
y = df["churned"]

X_train,X_test,y_train,y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Predict using a DataFrame (matches training feature names, prevents warning)
new_customer = pd.DataFrame([[85, 2]], columns=["monthly_charges", "tenure_months"])
print("Churn Probability:", model.predict_proba(new_customer))
print("Will Churn:", model.predict(new_customer))

📌 Output

Churn Probability: [[0.36119718 0.63880282]]
Will Churn: [1]

Explanation of the Code

1. Creating the Dataset

The dataset consists of three columns:
monthly_charges – how much a customer pays per month
tenure_months – how long the customer has been subscribed
churned – whether the customer canceled (1) or stayed (0)
This small dataset helps us build a simple churn prediction model.

2. Converting to DataFrame

A Pandas DataFrame is created from the dictionary.
This structure helps the machine learning model read the data easily.

3. Selecting Features and Target

We separate the data into:
X → monthly charges and tenure months (inputs)
y → churned (output)
This allows the model to learn how charges and tenure affect churn.

4. Splitting the Dataset

Using train_test_split, data is divided into:

  • 70% → training
  • 30% → testing

This prevents overfitting and ensures the model can handle new data.

5. Training the Logistic Regression Model

A LogisticRegression model is created and trained using historical churn data.
It learns the relationship between customer behavior and their likelihood to cancel.

6. Predicting Churn

We test the model by giving it a new customer:

Monthly charges = 85
Tenure = 2 months

✔ Churn Probability

The model outputs probabilities:
[0.3611 , 0.6388]
This means:
36.11% → customer will NOT churn
63.88% → customer WILL churn

✔ Final Prediction

Since the probability of churn is higher, the model predicts:
Will Churn: 1
(1 = churn, 0 = not churn)

7. Summary

✔ Logistic Regression is used for binary outcomes
✔ The model learns from customer data
✔ Higher monthly charges + low tenure increase churn risk
✔ For customer with charges 85 and tenure 2 → churn probability is high
✔ Final prediction = Customer will churn
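The 30% test split created above is never actually scored in the example; a sketch of closing that loop with accuracy_score (same tiny dataset, so the resulting number is only illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = {
    "monthly_charges": [25, 30, 50, 80, 95, 100],
    "tenure_months": [2, 4, 12, 20, 3, 1],
    "churned": [1, 1, 0, 0, 1, 1]
}
df = pd.DataFrame(data)
X = df[["monthly_charges", "tenure_months"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)

# Score the held-out rows the model never saw during training
acc = accuracy_score(y_test, model.predict(X_test))
print("Test accuracy:", acc)
```

With only two held-out rows the accuracy is very noisy; on real churn data this evaluation step is essential.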


3. KNN — Movie Recommendation (User Similarity)

KNN Similar Users (Nearest Neighbors) Example

📌 Code Example


from sklearn.neighbors import NearestNeighbors
import numpy as np

users = np.array([
    [5,4,3],
    [4,5,3],
    [1,1,5],
    [2,1,4]
])

knn = NearestNeighbors(n_neighbors=2)
knn.fit(users)

dist, idx = knn.kneighbors([[5,4,2]])
print("Similar users index:", idx)

📌 Output

Similar users index: [[0 1]]

Explanation of the Code

1. Creating the User Dataset

Each row in the users array represents a user and their ratings/preferences.
For example:
[5,4,3] means the user has given ratings across 3 items/features.

2. Initializing Nearest Neighbors

We use the NearestNeighbors algorithm to find users with similar tastes.
We set n_neighbors = 2, meaning the model will return the 2 closest (most similar) users.

3. Training the Model

The model is trained using knn.fit(users).
It stores the dataset and prepares to calculate distances between users.

4. Finding Similar Users

We pass a new user vector [5,4,2] to the model.
The algorithm calculates distances between this new user and all existing users.

✔ Result

The model returns the indices of the most similar users:
[[0 1]]
This means:

  • User at index 0 → most similar
  • User at index 1 → second most similar

5. Summary

✔ Nearest Neighbors finds similarity based on distance
✔ Useful for recommendations (similar users/items)
✔ For user [5,4,2], the closest matches are users 0 and 1
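The distances behind that result can be checked by hand with NumPy (a quick verification using the same ratings matrix):

```python
import numpy as np

users = np.array([
    [5, 4, 3],
    [4, 5, 3],
    [1, 1, 5],
    [2, 1, 4]
])
query = np.array([5, 4, 2])

# Euclidean distance from the new user to every stored user
dists = np.linalg.norm(users - query, axis=1)
print("Distances:", dists.round(2))           # user 0 is closest, then user 1
print("Two nearest:", np.argsort(dists)[:2])  # [0 1], matching kneighbors
```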


4. SVM — Email Spam Detection


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

emails = ["Win money now","Claim free prize","Meeting at office","Project updates"]
labels = [1,1,0,0]  # 1 = spam

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

svm = SVC(kernel='linear')
svm.fit(X, labels)

test_email = vectorizer.transform(["Free money offer"])
print("Prediction (1=spam):", svm.predict(test_email))

5. Decision Tree — Loan Approval Prediction

Decision Tree Classifier Example

📌 Code Example


import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "income": [30000,50000,45000,80000],
    "credit_score": [600,700,650,720],
    "approved": [0,1,1,1]
})

X = df[["income","credit_score"]]
y = df["approved"]

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)

new_applicant = pd.DataFrame([[55000, 690]], columns=["income", "credit_score"])
print(tree.predict(new_applicant))

📌 Output

[1]

Explanation of the Code

1. Creating the Dataset

A small dataset is created with three columns:
income – customer’s yearly income
credit_score – score representing financial reliability
approved – whether the loan was approved (1) or denied (0)
This dataset helps the Decision Tree learn approval patterns.

2. Selecting Features and Target

The data is divided into:
X → income + credit score (inputs)
y → approved (output)
The tree model will learn how these inputs affect approval decisions.

3. Training the Decision Tree

A DecisionTreeClassifier is created with max_depth = 3 to prevent the tree from becoming too complex.
The model learns simple rules such as:
“If credit score is high and income is sufficient → approve.”

4. Making a Prediction

We test the model with a new customer:
Income = 55,000
Credit Score = 690
The model checks which branch (rule) this customer fits into.

✔ Final Predicted Output

The model outputs:
[1]
This means the loan is predicted to be approved.

5. Summary

✔ Decision Trees learn rules from data
✔ Perfect for classification problems
✔ Model learns patterns of loan approval
✔ For income 55,000 and credit score 690 → approved (1)
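The rules the tree learns (like the income/credit-score rule paraphrased above) can be printed directly with scikit-learn's export_text, using the same tiny loan dataset:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "income": [30000, 50000, 45000, 80000],
    "credit_score": [600, 700, 650, 720],
    "approved": [0, 1, 1, 1]
})
X = df[["income", "credit_score"]]
y = df["approved"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Dump the learned if/else structure as readable text
print(export_text(tree, feature_names=["income", "credit_score"]))
```

This kind of inspection is why Decision Trees are considered one of the most interpretable models.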


6. Random Forest — Fraud Detection

Random Forest Fraud Detection Example

📌 Code Example


import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "amount": [200,5000,150,3000,20],
    "is_foreign": [0,1,0,1,0],
    "fraud": [0,1,0,1,0]
})

X = df[["amount","is_foreign"]]
y = df["fraud"]

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y)

new_txn = pd.DataFrame([[2500, 1]], columns=["amount", "is_foreign"])
print("Fraud Prediction:", rf.predict(new_txn))

📌 Output

Fraud Prediction: [1]

Explanation of the Code

1. Creating the Dataset

A small fraud-detection dataset is created with three columns:
amount – transaction amount in dollars
is_foreign – whether the transaction happened abroad (1 = yes, 0 = no)
fraud – label indicating if the transaction was fraudulent
This dataset helps the model learn patterns of fraud.

2. Selecting Features and Target

We separate the data into:
X → amount + is_foreign
y → fraud label
The model will learn how these features influence fraud detection.

3. Training the Random Forest Model

A RandomForestClassifier with 200 decision trees is created.
Random Forest is powerful for classification because it uses multiple trees and combines their predictions for high accuracy.

4. Making a Prediction

The model predicts fraud for a new transaction:
Amount = 2500
Is foreign = 1 (Yes)
Since high-value foreign transactions are often suspicious, the model evaluates this pattern.

✔ Final Prediction

The output is:
[1]
which means the transaction is predicted to be fraudulent.

5. Summary

✔ Random Forest works well for fraud detection
✔ Uses multiple decision trees for reliable predictions
✔ Model learns that high amount + foreign transaction = higher fraud risk
✔ For input (2500, 1) → predicted fraud = 1
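The Practical Tips earlier suggested checking feature importance for tree models; with the same fraud dataset, that looks like:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "amount": [200, 5000, 150, 3000, 20],
    "is_foreign": [0, 1, 0, 1, 0],
    "fraud": [0, 1, 0, 1, 0]
})
X = df[["amount", "is_foreign"]]
y = df["fraud"]

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Importances are averaged over all 200 trees and sum to 1
for name, score in zip(X.columns, rf.feature_importances_):
    print(f"{name}: {score:.2f}")
```

On a dataset this small the split is noisy, but on real fraud data the same two lines reveal which signals drive predictions.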


7. K-Means — Customer Segmentation (Marketing)

K-Means Clustering Example

📌 Code Example


import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [25,40000],
    [45,60000],
    [22,35000],
    [50,90000],
    [35,50000]
])

kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)
print("Centroids:", kmeans.cluster_centers_)

📌 Output

Cluster labels: [1 0 1 0 1]
Centroids: [[47.5 75000. ]
 [27.33333333 41666.66666667]]

Explanation of the Code

1. Creating the Dataset

The dataset contains simple customer information where each row represents:
[Age, Annual Income]
For example, [25, 40000] means a 25-year-old earning $40,000.

2. Initializing K-Means

A K-Means model is created with n_clusters = 2, meaning we want to divide customers into two groups (clusters).

3. Training the Model

K-Means tries to group similar data points together by minimizing the distance between points and their cluster centers.
The fit_predict() method trains the model and assigns a cluster label to each row.

4. Cluster Labels

The output labels indicate which cluster each customer belongs to:
[1 0 1 0 1]
This means customer 0 → cluster 1, customer 1 → cluster 0, and so on.

5. Centroids

The algorithm also returns the centroids of the clusters — the central points of each group:
[[47.5, 75000], [27.33, 41666.66]]
These represent the average age and income of each cluster.

6. Summary

✔ K-Means is used for grouping similar data (unsupervised learning)
✔ This example clusters people based on age and income
✔ Each user receives a cluster label (0 or 1)
✔ Centroids represent the center of each group
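A fitted K-Means model can also place a brand-new customer into one of the groups via predict(); the 30-year-old earning $45,000 below is a made-up example:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [25, 40000],
    [45, 60000],
    [22, 35000],
    [50, 90000],
    [35, 50000]
])

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# A new customer is assigned to the nearest centroid
new_customer = np.array([[30, 45000]])
print(kmeans.predict(new_customer))  # lands in the younger, lower-income group
```

This is how a segmentation model gets used after training: new customers are routed to an existing segment without refitting.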


8. Hierarchical Clustering — Color Grouping

Agglomerative Clustering Example

📌 Code Example


from sklearn.cluster import AgglomerativeClustering

colors = [
    [255,0,0],
    [254,1,1],
    [0,0,255],
    [2,2,250]
]

model = AgglomerativeClustering(n_clusters=2)
print(model.fit_predict(colors))

📌 Output

[0 0 1 1]

Explanation of the Code

1. Creating the Dataset

Each row in the colors list represents an RGB color value:
[255, 0, 0] → bright red
[254, 1, 1] → slightly darker red
[0, 0, 255] → blue
[2, 2, 250] → darker blue
The idea is to let the algorithm cluster similar colors together.

2. Initializing Agglomerative Clustering

Agglomerative Clustering is a type of hierarchical clustering.
We set n_clusters = 2, meaning we want the colors grouped into two clusters:

  • One cluster for red shades
  • One cluster for blue shades

3. Training the Model

The model uses a bottom-up approach:
✔ Starts by treating each point as its own cluster
✔ Then merges clusters step by step based on similarity (distance)
✔ Continues until only 2 final clusters remain

4. Final Cluster Labels

The output is:
[0 0 1 1]

This means:
✔ The first two colors (red shades) → Cluster 0
✔ The last two colors (blue shades) → Cluster 1

5. Summary

✔ Agglomerative Clustering groups similar items based on distance
✔ Here, it grouped red colors together and blue colors together
✔ Useful for image segmentation, color grouping, and pattern discovery


9. PCA — MNIST Dimensionality Reduction

PCA (Principal Component Analysis) Example

📌 Code Example


from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

print("Reduced shape:", X2.shape)
print("Explained Variance:", pca.explained_variance_ratio_)

📌 Output

Reduced shape: (1797, 2)
Explained Variance: [0.14890594 0.13618771]

Explanation of the Code

1. Loading the Digits Dataset

The code loads the well-known digits dataset containing 1797 handwritten digit images (0–9).
Each image is 8×8 pixels and flattened into 64 numerical features.

2. Preparing PCA

A PCA model is created with n_components = 2.
This means PCA will reduce the original 64-dimensional data to only 2 dimensions, which is helpful for visualization and noise reduction.

3. Applying PCA

pca.fit_transform(X) performs two tasks:
✔ Learns the important directions (principal components)
✔ Transforms the original data into the lower-dimensional space
The result X2 contains the reduced features.

4. Reduced Shape

The output (1797, 2) means:
All 1797 samples are now represented using only 2 features instead of 64.

5. Explained Variance Ratio

PCA shows how much information each of the 2 components captures:
[0.1489, 0.1361]
This means:
✔ First component captures 14.89% of total variance
✔ Second captures 13.61%
Together they preserve roughly 28.5% of the dataset’s information.

6. Summary

✔ PCA reduces dimensionality (64 → 2)
✔ Keeps the most important patterns in the data
✔ Useful for visualization, compression, and speeding up ML models
✔ Digits dataset is ideal for demonstrating PCA
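A natural follow-up question is how many components would be needed to retain most of the variance. This sketch fits PCA with all components on the same digits data and scans the cumulative ratio (the exact count may vary slightly by scikit-learn version):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Keep every component, then accumulate the variance ratios
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components covering at least 90% of the variance
n_90 = int(np.argmax(cumulative >= 0.90)) + 1
print("Components for 90% variance:", n_90)
```

Equivalently, PCA(n_components=0.90) tells scikit-learn to choose that number automatically.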


10. BONUS — Full ML Pipeline (Scaling + CV)

Pipeline + Cross-Validation + Random Forest Example

📌 Code Example


from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier())
])

scores = cross_val_score(pipe, X, y, cv=5)
print("Accuracy:", scores.mean())

📌 Output

Accuracy: 0.956

Explanation of the Code

1. Loading the Breast Cancer Dataset

The built-in breast cancer dataset from scikit-learn is loaded.
X contains feature measurements of tumors, and
y contains labels (0 = malignant, 1 = benign).

2. Creating a Machine Learning Pipeline

A Pipeline is created with two steps:

  • StandardScaler — scales all features to the same range
  • RandomForestClassifier — the machine learning model

Pipelines ensure that scaling and training happen together correctly.

3. Performing Cross-Validation

cross_val_score() evaluates the pipeline using 5-fold cross-validation.
This means the dataset is split into 5 parts, and the model is trained & tested 5 times.

4. Calculating Final Accuracy

The accuracies of all 5 folds are averaged to give a stable performance score.

✔ Final Result

The model achieves around 95–96% accuracy on the breast cancer dataset.

5. Summary

✔ Pipeline combines preprocessing + model training
✔ Cross-validation gives reliable accuracy
✔ Random Forest works well on medical classification data
✔ Final accuracy ≈ 0.956

Assignments

Assignment 1 – Compare Algorithms: Regression vs Classification

List 5 algorithms from the chapter suitable for regression and 5 suitable for classification. For each, mention why it fits regression or classification.

Hint: Regression → predicting continuous values (e.g. prices, temperature). Classification → predicting discrete labels/classes (e.g. spam/ham, disease yes/no).

Assignment 2 – Algorithm Decision Table

Create a table comparing 6-8 ML algorithms covering: learning type (supervised/unsupervised), use-case (classification/regression/clustering), advantages, disadvantages.

Hint: Use algorithms like Linear Regression, Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors, K-Means.

Assignment 3 – When to Use Which Algorithm?

For 5 real-world problems (you define them), choose the most appropriate ML algorithm from Chapter 3 and justify your choice.

Hint: Think of problem type (prediction vs grouping), data size, and complexity.

Assignment 4 – Simple Algorithm Implementation

Pick any one supervised algorithm from the chapter. Write a short Python pseudo-code or outline steps on how to train it (data → train/test split → train → predict).

Hint: Reflect data preprocessing, splitting, and model fitting as in typical ML workflow.

Assignment 5 – Clustering Use-case

Define an unsupervised problem (e.g. customer segmentation, grouping documents) and explain why a clustering algorithm from Chapter 3 is suitable. Describe what features you would use for clustering.

Hint: Use algorithms like K-Means; focus on data where labels are unavailable.

Assignment 6 – Pros & Cons Analysis

Pick two algorithms from the chapter and write down 3 pros and 3 cons for each (in terms of accuracy, interpretability, complexity, data requirements).

Hint: For example: Decision Tree is easy to interpret but may overfit; KNN is simple but may be slow for large data.

Assignment 7 – Mixed Data Scenario

You have a dataset with both numerical and categorical features. Choose one suitable algorithm from Chapter 3 and justify why it is a good fit.

Hint: Some algorithms handle mixed data types better than others; consider preprocessing needs.

Assignment 8 – Algorithm Limitations in Real Life

Discuss scenarios where a popular algorithm from the chapter might fail or give poor results. What challenges could arise (overfitting, bias, data imbalance, etc.)?

Hint: Think about decision boundaries, data noise, imbalance, feature scaling, etc.

Assignment 9 – Compare Two Algorithms on Same Problem

Design a small hypothetical problem. Then pick two different algorithms from Chapter 3 and explain which might perform better and why.

Hint: Compare factors like simplicity vs complexity, bias vs variance, data size vs algorithm suitability.

Assignment 10 – Build an ML Pipeline Plan

Outline a full ML pipeline (data collection → preprocessing → algorithm selection → training → evaluation) for a problem, using a Chapter 3 algorithm. Describe each step briefly.

Hint: Include data cleaning, splitting, model fit, validation — use knowledge from earlier chapters + algorithm selection logic from Chapter 3.
