Chapter 4: Data Preprocessing & Feature Scaling
Data preprocessing is one of the most important steps in Machine Learning.
Raw data often contains missing values, inconsistent formats, outliers, and mixed datatypes.
Without proper cleaning and preprocessing, even the best algorithms will produce poor results.
In this chapter, we cover everything you need to prepare high-quality data for ML models — including missing values, encoding categorical features, outlier detection, normalization, standardization, and feature engineering.
1. Understanding Data Preprocessing
Data preprocessing refers to transforming raw data into a clean, structured form that ML algorithms can work with. Typical steps include:
- Handling missing values
- Removing duplicates
- Fixing data types
- Encoding categorical columns
- Scaling and normalization
- Handling outliers
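Most of the issues above can be spotted in a few lines of pandas. A quick inspection sketch, using a small made-up DataFrame:

```python
import pandas as pd

# small example frame with typical problems (hypothetical data)
df = pd.DataFrame({
    'age': [20, 25, None, 25],
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
})

print(df.dtypes)               # check data types
print(df.isnull().sum())       # missing values per column
print(df.duplicated().sum())   # number of fully duplicated rows
```

Running these checks first tells you which of the steps below a dataset actually needs.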
2. Handling Missing Values
✔ Types of Missing Data
- MCAR – Missing Completely at Random
- MAR – Missing at Random
- MNAR – Missing Not at Random
✔ Strategies to Handle Missing Values
- Drop missing rows/columns
- Fill with mean/median/mode
- Use model-based imputation
Python Example (Fill Missing Values)
import pandas as pd
df = pd.DataFrame({
'age': [20, 25, None, 30],
'income': [30000, None, 40000, 50000]
})
df['age'] = df['age'].fillna(df['age'].mean())
df['income'] = df['income'].fillna(df['income'].median())
print(df)
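The third strategy, model-based imputation, estimates each missing value from similar rows instead of a single column statistic. A minimal sketch using scikit-learn's KNNImputer on the same kind of data (values are made up):

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'age': [20, 25, None, 30],
    'income': [30000, None, 40000, 50000],
})

# each missing entry is estimated from the 2 most similar rows,
# measured on the columns that are present
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

This tends to work better than mean/median filling when features are correlated, at the cost of extra computation.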
3. Encoding Categorical Variables
Most machine learning models work only with numbers, not text, so categorical values must be converted into numeric form.
✔ One-Hot Encoding
- Creates a new column for each category
- Used when categories have no natural order
import pandas as pd
data = {'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']}
df = pd.DataFrame(data)
encoded = pd.get_dummies(df, columns=['City'])
print(encoded)
✔ Label Encoding
- Assigns an integer to each category (in sorted order)
- Best suited to target labels; for ordered feature categories, use an ordinal encoding with an explicit category order
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['City'] = le.fit_transform(df['City'])
print(df)
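Note that LabelEncoder always numbers categories in sorted order, which may not match their real-world order. When a feature's categories are genuinely ordered, scikit-learn's OrdinalEncoder lets you state that order explicitly. A sketch with a made-up `size` column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# pass the order explicitly so small < medium < large maps to 0 < 1 < 2
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = encoder.fit_transform(df[['size']]).ravel()
print(df)
```

Without the explicit `categories` list, 'large' would sort before 'medium' alphabetically and the numeric order would be wrong.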
4. Outlier Detection & Handling
Outliers are extremely high or low values that distort model performance.
✔ Detecting Outliers Using IQR
import numpy as np
import pandas as pd
df = pd.DataFrame({'salary': [1000, 1200, 1100, 90000]})
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
print(df[(df['salary'] < lower) | (df['salary'] > upper)])
✔ How to Handle Outliers?
- Remove them
- Cap them using percentile limits
- Apply log/sqrt transformations
- Use robust models (Random Forest, Tree models)
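Capping (winsorizing) keeps the row but limits its value to the IQR fences. A sketch reusing the salary data from the detection example above:

```python
import pandas as pd

df = pd.DataFrame({'salary': [1000, 1200, 1100, 90000]})
Q1, Q3 = df['salary'].quantile(0.25), df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# cap values outside the IQR fences instead of dropping the rows
df['salary_capped'] = df['salary'].clip(lower=lower, upper=upper)
print(df)
```

The extreme salary is pulled down to the upper fence while the other rows are untouched.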
5. Feature Scaling
Feature scaling puts all numerical features on a comparable scale, so no feature dominates the model simply because of its units or magnitude.
It is crucial for algorithms that rely on distances or gradients, such as:
- KNN
- SVM
- K-Means
- Neural Networks
- Gradient-based models
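To see why these algorithms need scaling, compare Euclidean distances before and after standardization on toy data (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two features on very different scales: age (years) vs income (rupees)
X = np.array([[25, 30000],
              [26, 90000],
              [60, 31000]], dtype=float)

# raw distances: the income column dominates completely
d_raw_01 = np.linalg.norm(X[0] - X[1])   # similar age, very different income
d_raw_02 = np.linalg.norm(X[0] - X[2])   # 35-year age gap, similar income
print(d_raw_01, d_raw_02)

X_scaled = StandardScaler().fit_transform(X)
d_s_01 = np.linalg.norm(X_scaled[0] - X_scaled[1])
d_s_02 = np.linalg.norm(X_scaled[0] - X_scaled[2])
print(d_s_01, d_s_02)
```

On raw data, the second point looks far closer despite the large age difference; after scaling, both features contribute to the distances, which is exactly what KNN or K-Means needs.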
✔ Standardization (Z-Score Scaling)
Transforms data to have mean = 0 and standard deviation = 1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform([[10],[20],[30]])
print(scaled)
✔ Normalization (Min-Max Scaling)
Scales values between 0 and 1.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform([[10],[20],[30]])
print(scaled)
Standardization vs Normalization
| Method | Output Range | Best For |
|---|---|---|
| Standardization | No fixed range | SVM, logistic regression, PCA; less sensitive to outliers than min-max |
| Normalization | 0 to 1 | Neural networks, image pixels; cases that require a bounded range |
6. Feature Engineering Basics
- Extracting new features from date/time
- Binning numeric values (age groups)
- Combining related features
- Creating polynomial features
Python Example: Creating New Features
df['age_group'] = pd.cut(df['age'], bins=[0,20,40,60,100],
labels=['Teen','Adult','Mid-age','Senior'])
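Date columns can be decomposed the same way. A sketch with a hypothetical `signup` column:

```python
import pandas as pd

df = pd.DataFrame({'signup': ['2024-01-15', '2024-06-03', '2024-12-25']})
df['signup'] = pd.to_datetime(df['signup'])

# extract components the model can actually use as numeric features
df['year'] = df['signup'].dt.year
df['month'] = df['signup'].dt.month
df['dayofweek'] = df['signup'].dt.dayofweek   # Monday = 0
df['is_weekend'] = df['signup'].dt.dayofweek >= 5
print(df)
```

A raw timestamp is nearly useless to most models, but month and day-of-week often carry strong seasonal signal.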
7. Complete Preprocessing Pipeline
A full preprocessing pipeline automates the entire cleaning process.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
numeric_features = ['age', 'salary']
categorical_features = ['city']
numeric_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
cat_pipeline = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
preprocess = ColumnTransformer(
transformers=[
('num', numeric_pipeline, numeric_features),
('cat', cat_pipeline, categorical_features)
])
Conclusion
Data preprocessing is the foundation of building successful Machine Learning models.
Clean data → Better accuracy, faster training, and more reliable predictions.
In the next chapter, we will learn about Train-Test Split & Cross-Validation — essential techniques for evaluating model performance.
Assignments
Assignment 1 – Missing Value Handling
Take any dataset (real or synthetic) that contains missing values. Apply at least two different strategies to handle missing values (e.g. drop rows, fill with mean/median/mode). Compare how the dataset changes.
Hint: Use functions like `df.dropna()`, `df.fillna()` in pandas. Check before/after number of rows and missing-value counts.
Assignment 2 – Detect & Remove Duplicates
Load a dataset and check for duplicate rows. Remove duplicates and show how many rows were removed. Explain why removing duplicates might be important before training a model.
Hint: Use `df.drop_duplicates()` in pandas. Duplicates can bias the model if same data appears many times.
Assignment 3 – Encode Categorical Variables
Find a dataset with categorical (non-numeric) columns. Encode those columns into numeric form using at least two methods (e.g. label encoding, one-hot encoding). Show the result after encoding.
Hint: Use `pd.get_dummies()` or `LabelEncoder`. Make sure encoded data doesn’t introduce unintended ordinal relationships for nominal categories.
Assignment 4 – Feature Scaling: Normalization vs Standardization
Select a dataset with multiple numeric features having different ranges. Apply normalization (min-max scaling) and standardization (Z-score) separately. Compare summary statistics (mean, min, max) for both.
Hint: Use e.g. `MinMaxScaler` and `StandardScaler` from scikit-learn. Observe how mean and variance change.
Assignment 5 – Outlier Detection & Handling
Pick a dataset with numeric features. Identify outliers using IQR or Z-score method. Remove or cap them, then compare descriptive statistics before and after.
Hint: Compute Q1, Q3, IQR or compute z-scores. Be careful: do not remove valid but extreme data without justification.
Assignment 6 – Preprocessing Pipeline Design
Design a full preprocessing pipeline for a mixed dataset (numeric + categorical + missing values + outliers). Describe step by step how you will clean, encode, scale, and split the data for ML.
Hint: Think about order: e.g. handle missing → remove duplicates → encode → scale → split train/test.
Assignment 7 – Compare Model Performance with and without Preprocessing
Using the same dataset and ML algorithm, train a model twice: once with raw data (no preprocessing), and once after a full preprocessing pipeline. Compare model performance (accuracy, error) to show impact of preprocessing.
Hint: Choose a simple dataset + algorithm; ensure after preprocessing no missing/invalid data remains. Evaluate on test set.
Assignment 8 – Categorical Encoding & Model Impact
For a dataset with categorical features, encode them with both label encoding and one-hot encoding (separately), train same model, and compare results. Which encoding works better and why?
Hint: Label encoding may erroneously imply order; one-hot is safer for nominal categories. Observe effect on model metrics.
Assignment 9 – Scaling & Distance-Based Algorithms Sensitivity
Take a dataset and apply a distance-based algorithm (e.g. KNN, K-Means). Train/evaluate once with raw numeric data and once with scaled data. Compare results and comment on importance of feature scaling for such algorithms.
Hint: Distance-based algorithms are sensitive to feature magnitude differences; scaling often improves their performance.
Assignment 10 – Preprocessing Report for a Real Dataset
Select a publicly available dataset (UCI, Kaggle, etc.). Perform exploratory data analysis (EDA), identify data issues (missing values, outliers, inconsistent types), apply appropriate preprocessing steps, and write a short report describing what you did and why.
Hint: Use Pandas for EDA (`info()`, `describe()`, `isnull().sum()`). Document each preprocessing decision with justification.
