🛠️ Working with Missing Data: Handling NaN in Pandas DataFrames
🔎 Introduction
Missing data, represented as NaN (Not a Number) in Pandas, can skew summary statistics and cause errors in machine learning models that do not accept missing inputs. Pandas provides multiple ways to handle NaN values, such as filling them with specific values, removing them, or estimating them from neighboring values.
Key methods for handling NaN values in Pandas include:
- 📌 fillna() – Replacing NaN values with a specified value or method.
- 📌 dropna() – Removing rows or columns containing NaN values.
- 📌 replace() – Replacing NaN values with alternative representations.
- 📌 interpolate() – Estimating missing values using interpolation.
In this tutorial, we will explore different ways to handle NaN values in Pandas DataFrames.
📌 Example 1: Checking for NaN Values
Before handling NaN values, it’s useful to check where they exist.
import pandas as pd
import numpy as np
# Creating a DataFrame with NaN values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, np.nan, 35, 40],
'Score': [85, 90, np.nan, 78]}
df = pd.DataFrame(data)
# Checking for NaN values
print(df.isna())
✅ Output:
    Name    Age  Score
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False
Here, True represents a missing (NaN) value.
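On a larger DataFrame the full True/False table is hard to scan, so a per-column count is often more practical. The following is a minimal sketch that assumes the same sample data as above; the variable names missing_per_column and rows_with_nan are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'Score': [85, 90, np.nan, 78],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Only the rows that contain at least one missing value
rows_with_nan = df[df.isna().any(axis=1)]
print(rows_with_nan)
```

With this data, Age and Score each have one missing value, and the rows for Bob and Charlie are the ones selected.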
📌 Example 2: Filling NaN Values Using fillna()
We can replace NaN values with a specific value, such as 0:
# Filling NaN values with zero
df_filled = df.fillna(0)
print(df_filled)
✅ Output:
      Name   Age  Score
0    Alice  25.0   85.0
1      Bob   0.0   90.0
2  Charlie  35.0    0.0
3    David  40.0   78.0
Alternatively, we can use column means to fill NaN values:
# Filling NaN values with column means
df_mean_filled = df.fillna(df.mean(numeric_only=True))
print(df_mean_filled)
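To make the effect of the mean fill concrete, here is a self-contained sketch using the same sample data. It also shows that fillna() accepts a dictionary, so each column can get its own fill value. The per-column policy (median for Age, 0 for Score) is a hypothetical choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'Score': [85, 90, np.nan, 78],
})

# Fill each numeric column with its own mean (NaN is skipped when averaging)
df_mean_filled = df.fillna(df.mean(numeric_only=True))
print(df_mean_filled)

# Hypothetical per-column policy: median for Age, 0 for Score
df_custom = df.fillna({'Age': df['Age'].median(), 'Score': 0})
print(df_custom)
```

Here the Age mean is (25 + 35 + 40) / 3 ≈ 33.33 and the Score mean is (85 + 90 + 78) / 3 ≈ 84.33, so those are the values that replace the NaN entries in the first result.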
📌 Example 3: Dropping NaN Values Using dropna()
To remove rows containing NaN values:
# Dropping rows with NaN values
df_dropped = df.dropna()
print(df_dropped)
✅ Output:
    Name   Age  Score
0  Alice  25.0   85.0
3  David  40.0   78.0
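dropna() can also drop columns or look only at selected columns. The sketch below assumes the same sample data and uses the axis, subset, and how parameters of DataFrame.dropna(); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'Score': [85, 90, np.nan, 78],
})

# Drop every column that contains at least one NaN
cols_dropped = df.dropna(axis=1)
print(cols_dropped)

# Drop rows only when Age is missing; a missing Score is tolerated
age_required = df.dropna(subset=['Age'])
print(age_required)

# Drop rows only when every value in the row is missing
all_missing_dropped = df.dropna(how='all')
print(all_missing_dropped)
```

With this data, only the Name column survives the column-wise drop, the subset version removes just Bob's row, and how='all' removes nothing because no row is entirely empty.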
📌 Example 4: Replacing NaN Values Using replace()
We can replace NaN values with a custom label. Note that putting a string into a numeric column changes that column's dtype to object:
# Replacing NaN values with 'Unknown'
df_replaced = df.replace(np.nan, 'Unknown')
print(df_replaced)
✅ Output:
      Name      Age    Score
0    Alice     25.0     85.0
1      Bob  Unknown     90.0
2  Charlie     35.0  Unknown
3    David     40.0     78.0
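The introduction also lists interpolate(), which estimates a missing value from its neighbors. The following is a minimal sketch, assuming the same sample data and the default linear method. Treating consecutive rows as evenly spaced points is an assumption that suits ordered data such as time series better than an unordered table like this one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'Score': [85, 90, np.nan, 78],
})

# Interpolate only the numeric columns; method='linear' is the default
numeric_cols = ['Age', 'Score']
interpolated = df[numeric_cols].interpolate()

# Put the estimated values back next to the Name column
df_interpolated = df.copy()
df_interpolated[numeric_cols] = interpolated
print(df_interpolated)
```

Bob's Age becomes 30.0 (halfway between 25 and 35) and Charlie's Score becomes 84.0 (halfway between 90 and 78).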
Question: If we can write a NaN value directly, why use NumPy?
We use np.nan from the NumPy library because Pandas internally represents missing numeric values as NaN (Not a Number), and np.nan is the standard way to introduce missing values in a DataFrame.
Why np.nan?
- Standard Representation: Python has no NaN literal. The built-in float('nan') and math.nan also produce NaN, but np.nan is the conventional way to write a missing value in Pandas code.
- Compatibility: Pandas is built on top of NumPy, and it recognizes np.nan as a missing value.
- Operations: Pandas provides functions like .isna(), .fillna(), and .dropna(), which specifically handle NaN values such as those introduced using np.nan.
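The points above can be checked directly. This short sketch shows that np.nan is an ordinary Python float, that NaN never compares equal to itself (which is why .isna() is used instead of ==), and that Pandas treats float('nan') the same way as np.nan. The Series s is only an illustrative example.

```python
import math

import numpy as np
import pandas as pd

# np.nan is a regular Python float holding the IEEE 754 NaN value
print(type(np.nan))

# NaN is not equal to anything, including itself
print(np.nan == np.nan)

# The standard library can produce and detect NaN as well
print(math.isnan(float('nan')))

# Pandas flags both spellings as missing
s = pd.Series([1.0, np.nan, float('nan')])
print(s.isna().tolist())
```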
Example Without np.nan
If you use None instead of np.nan, the DataFrame still works: Pandas automatically converts None to NaN in numerical columns. Using np.nan is preferred because it is explicitly meant for numerical operations.
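A minimal sketch of the None case, assuming the same sample data as in Example 1 with None in place of np.nan. The name df_none is illustrative.

```python
import pandas as pd

# Same sample data, but missing entries are written as None
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, None, 35, 40],
        'Score': [85, 90, None, 78]}
df_none = pd.DataFrame(data)

# In the numeric columns the None entries are stored as NaN
print(df_none)

# Age and Score are float64, the same dtype np.nan would have produced
print(df_none.dtypes)
```

The numeric columns end up as float64 with NaN in the missing positions, so .isna(), .fillna(), and .dropna() behave the same as with np.nan.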
📌 Summary
🔹 fillna() helps replace NaN values with specific values like mean or zero.
🔹 dropna() removes rows or columns containing NaN values.
🔹 replace() allows flexible replacement of NaN values.
🔹 Choosing the right method depends on the data context to maintain data integrity.
Effectively handling NaN values ensures cleaner and more reliable datasets for analysis and machine learning. 🚀