Chapter 4: Data Preprocessing & Feature Scaling
Data preprocessing is one of the most important steps in Machine Learning.
Raw data often contains missing values, inconsistent formats, outliers, and mixed datatypes.
Without proper cleaning and preprocessing, even the best algorithms will produce poor results.
In this chapter, we cover everything you need to prepare high-quality data for ML models — including missing values, encoding categorical features, outlier detection, normalization, standardization, and feature engineering.
1. Understanding Data Preprocessing
Data preprocessing refers to transforming raw data into a clean, structured form that ML algorithms can work with. Typical steps include:
- Handling missing values
- Removing duplicates
- Fixing data types
- Encoding categorical columns
- Scaling and normalization
- Handling outliers
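Most of the issues above can be spotted in a few lines of pandas. A quick inspection sketch, using a small made-up DataFrame:

```python
import pandas as pd

# small example frame with typical problems (hypothetical data)
df = pd.DataFrame({
    'age': [20, 25, None, 25],
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
})

print(df.dtypes)               # check data types
print(df.isnull().sum())       # missing values per column
print(df.duplicated().sum())   # number of fully duplicated rows
```

Running these checks first tells you which of the steps below a dataset actually needs.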
2. Handling Missing Values
✔ Types of Missing Data
- MCAR – Missing Completely at Random
- MAR – Missing at Random
- MNAR – Missing Not at Random
✔ Strategies to Handle Missing Values
- Drop missing rows/columns
- Fill with mean/median/mode
- Use model-based imputation
Python Example (Fill Missing Values)
import pandas as pd
df = pd.DataFrame({
'age': [20, 25, None, 30],
'income': [30000, None, 40000, 50000]
})
df['age'] = df['age'].fillna(df['age'].mean())
df['income'] = df['income'].fillna(df['income'].median())
print(df)
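The third strategy, model-based imputation, estimates each missing value from similar rows instead of a single column statistic. A minimal sketch using scikit-learn's KNNImputer on the same kind of data (values are made up):

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'age': [20, 25, None, 30],
    'income': [30000, None, 40000, 50000],
})

# each missing entry is estimated from the 2 most similar rows,
# measured on the columns that are present
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

This tends to work better than mean/median filling when features are correlated, at the cost of extra computation.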
3. Encoding Categorical Variables
Most machine learning models work only with numbers, not text, so categorical values must be converted into numeric form.
✔ One-Hot Encoding
- Creates a new column for each category
- Used when categories have no natural order
import pandas as pd
data = {'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']}
df = pd.DataFrame(data)
encoded = pd.get_dummies(df, columns=['City'])
print(encoded)
✔ Label Encoding
- Assigns an integer to each category (in sorted order)
- Best suited to target labels; for ordered feature categories, use an ordinal encoding with an explicit category order
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['City'] = le.fit_transform(df['City'])
print(df)
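Note that LabelEncoder always numbers categories in sorted order, which may not match their real-world order. When a feature's categories are genuinely ordered, scikit-learn's OrdinalEncoder lets you state that order explicitly. A sketch with a made-up `size` column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# pass the order explicitly so small < medium < large maps to 0 < 1 < 2
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = encoder.fit_transform(df[['size']]).ravel()
print(df)
```

Without the explicit `categories` list, 'large' would sort before 'medium' alphabetically and the numeric order would be wrong.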
4. Outlier Detection & Handling
Outliers are extremely high or low values that distort model performance.
✔ Detecting Outliers Using IQR
import numpy as np
import pandas as pd
df = pd.DataFrame({'salary': [1000, 1200, 1100, 90000]})
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
print(df[(df['salary'] < lower) | (df['salary'] > upper)])
✔ How to Handle Outliers?
- Remove them
- Cap them using percentile limits
- Apply log/sqrt transformations
- Use robust models (Random Forest, Tree models)
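Capping (winsorizing) keeps the row but limits its value to the IQR fences. A sketch reusing the salary data from the detection example above:

```python
import pandas as pd

df = pd.DataFrame({'salary': [1000, 1200, 1100, 90000]})
Q1, Q3 = df['salary'].quantile(0.25), df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# cap values outside the IQR fences instead of dropping the rows
df['salary_capped'] = df['salary'].clip(lower=lower, upper=upper)
print(df)
```

The extreme salary is pulled down to the upper fence while the other rows are untouched.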
5. Feature Scaling
Feature scaling puts all numerical features on a comparable scale, so no feature dominates the model simply because of its units or magnitude.
It is crucial for algorithms that rely on distances or gradients, such as:
- KNN
- SVM
- K-Means
- Neural Networks
- Gradient-based models
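To see why these algorithms need scaling, compare Euclidean distances before and after standardization on toy data (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two features on very different scales: age (years) vs income (rupees)
X = np.array([[25, 30000],
              [26, 90000],
              [60, 31000]], dtype=float)

# raw distances: the income column dominates completely
d_raw_01 = np.linalg.norm(X[0] - X[1])   # similar age, very different income
d_raw_02 = np.linalg.norm(X[0] - X[2])   # 35-year age gap, similar income
print(d_raw_01, d_raw_02)

X_scaled = StandardScaler().fit_transform(X)
d_s_01 = np.linalg.norm(X_scaled[0] - X_scaled[1])
d_s_02 = np.linalg.norm(X_scaled[0] - X_scaled[2])
print(d_s_01, d_s_02)
```

On raw data, the second point looks far closer despite the large age difference; after scaling, both features contribute to the distances, which is exactly what KNN or K-Means needs.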
✔ Standardization (Z-Score Scaling)
Transforms data to have mean = 0 and standard deviation = 1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform([[10],[20],[30]])
print(scaled)
✔ Normalization (Min-Max Scaling)
Scales values between 0 and 1.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform([[10],[20],[30]])
print(scaled)
Standardization vs Normalization
| Method | Output Range | Best For |
|---|---|---|
| Standardization | No fixed range | SVM, logistic regression, PCA; less sensitive to outliers than min-max |
| Normalization | 0 to 1 | Neural networks, image pixels; cases that require a bounded range |
6. Feature Engineering Basics
- Extracting new features from date/time
- Binning numeric values (age groups)
- Combining related features
- Creating polynomial features
Python Example: Creating New Features
df['age_group'] = pd.cut(df['age'], bins=[0,20,40,60,100],
labels=['Teen','Adult','Mid-age','Senior'])
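Date columns can be decomposed the same way. A sketch with a hypothetical `signup` column:

```python
import pandas as pd

df = pd.DataFrame({'signup': ['2024-01-15', '2024-06-03', '2024-12-25']})
df['signup'] = pd.to_datetime(df['signup'])

# extract components the model can actually use as numeric features
df['year'] = df['signup'].dt.year
df['month'] = df['signup'].dt.month
df['dayofweek'] = df['signup'].dt.dayofweek   # Monday = 0
df['is_weekend'] = df['signup'].dt.dayofweek >= 5
print(df)
```

A raw timestamp is nearly useless to most models, but month and day-of-week often carry strong seasonal signal.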
7. Complete Preprocessing Pipeline
A full preprocessing pipeline automates the entire cleaning process.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
numeric_features = ['age', 'salary']
categorical_features = ['city']
numeric_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
cat_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer(
transformers=[
('num', numeric_pipeline, numeric_features),
('cat', cat_pipeline, categorical_features)
])
Conclusion
Data preprocessing is the foundation of building successful Machine Learning models.
Clean data → Better accuracy, faster training, and more reliable predictions.
In the next chapter, we will learn about Train-Test Split & Cross-Validation — essential techniques for evaluating model performance.
Assignments
Assignment 1 – Missing Value Handling
Take any dataset (real or synthetic) that contains missing values. Apply at least two different strategies to handle missing values (e.g. drop rows, fill with mean/median/mode). Compare how the dataset changes.
Hint: Use functions like `df.dropna()`, `df.fillna()` in pandas. Check before/after number of rows and missing-value counts.
Assignment 2 – Detect & Remove Duplicates
Load a dataset and check for duplicate rows. Remove duplicates and show how many rows were removed. Explain why removing duplicates might be important before training a model.
Hint: Use `df.drop_duplicates()` in pandas. Duplicates can bias the model if same data appears many times.
Assignment 3 – Encode Categorical Variables
Find a dataset with categorical (non-numeric) columns. Encode those columns into numeric form using at least two methods (e.g. label encoding, one-hot encoding). Show the result after encoding.
Hint: Use `pd.get_dummies()` or `LabelEncoder`. Make sure encoded data doesn’t introduce unintended ordinal relationships for nominal categories.
Assignment 4 – Feature Scaling: Normalization vs Standardization
Select a dataset with multiple numeric features having different ranges. Apply normalization (min-max scaling) and standardization (Z-score) separately. Compare summary statistics (mean, min, max) for both.
Hint: Use e.g. `MinMaxScaler` and `StandardScaler` from scikit-learn. Observe how mean and variance change.
Assignment 5 – Outlier Detection & Handling
Pick a dataset with numeric features. Identify outliers using IQR or Z-score method. Remove or cap them, then compare descriptive statistics before and after.
Hint: Compute Q1, Q3, IQR or compute z-scores. Be careful: do not remove valid but extreme data without justification.
Assignment 6 – Preprocessing Pipeline Design
Design a full preprocessing pipeline for a mixed dataset (numeric + categorical + missing values + outliers). Describe step by step how you will clean, encode, scale, and split the data for ML.
Hint: Think about order: e.g. handle missing → remove duplicates → encode → scale → split train/test.
Assignment 7 – Compare Model Performance with and without Preprocessing
Using the same dataset and ML algorithm, train a model twice: once with raw data (no preprocessing), and once after a full preprocessing pipeline. Compare model performance (accuracy, error) to show impact of preprocessing.
Hint: Choose a simple dataset + algorithm; ensure after preprocessing no missing/invalid data remains. Evaluate on test set.
Assignment 8 – Categorical Encoding & Model Impact
For a dataset with categorical features, encode them with both label encoding and one-hot encoding (separately), train same model, and compare results. Which encoding works better and why?
Hint: Label encoding may erroneously imply order; one-hot is safer for nominal categories. Observe effect on model metrics.
Assignment 9 – Scaling & Distance-Based Algorithms Sensitivity
Take a dataset and apply a distance-based algorithm (e.g. KNN, K-Means). Train/evaluate once with raw numeric data and once with scaled data. Compare results and comment on importance of feature scaling for such algorithms.
Hint: Distance-based algorithms are sensitive to feature magnitude differences; scaling often improves their performance.
Assignment 10 – Preprocessing Report for a Real Dataset
Select a publicly available dataset (UCI, Kaggle, etc.). Perform exploratory data analysis (EDA), identify data issues (missing values, outliers, inconsistent types), apply appropriate preprocessing steps, and write a short report describing what you did and why.
Hint: Use Pandas for EDA (`info()`, `describe()`, `isnull().sum()`). Document each preprocessing decision with justification.
