5.3. Working with Missing Data – Handling NaN in Pandas DataFrames

🔎 Introduction

Missing data, represented in Pandas as NaN (Not a Number), can distort analysis results and machine learning models. Pandas provides several ways to handle NaN values: filling them with a specific value, removing them, or estimating them from neighboring values.

Key methods for handling NaN values in Pandas include:

  1. 📌 fillna() – Replacing NaN values with a specified value or method.
  2. 📌 dropna() – Removing rows or columns containing NaN values.
  3. 📌 replace() – Replacing NaN values with alternative representations.
  4. 📌 interpolate() – Estimating missing values using interpolation.

In this tutorial, we will explore different ways to handle NaN values in Pandas DataFrames.

📌 Example 1: Checking for NaN Values

Before handling NaN values, it’s useful to check where they exist.

import pandas as pd
import numpy as np

# Creating a DataFrame with NaN values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, np.nan, 35, 40],
        'Score': [85, 90, np.nan, 78]}
df = pd.DataFrame(data)

# Checking for NaN values
print(df.isna())

✅ Output:

    Name    Age  Score
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False

Here, True represents a missing (NaN) value.
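On larger tables the boolean mask is hard to scan, so a per-column count is often more practical. The sketch below rebuilds the same sample DataFrame so that it runs on its own; the names nan_per_column and rows_with_nan are illustrative and do not come from the example above.

```python
import numpy as np
import pandas as pd

# Same sample data as in Example 1, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'Score': [85, 90, np.nan, 78],
})

# Number of missing values in each column
nan_per_column = df.isna().sum()
print(nan_per_column)

# Only the rows that contain at least one missing value
rows_with_nan = df[df.isna().any(axis=1)]
print(rows_with_nan)
```

Summing the mask works because True counts as 1 and False as 0.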

📌 Example 2: Filling NaN Values Using fillna()

We can replace NaN values with a specific value, such as 0:

# Filling NaN values with zero
df_filled = df.fillna(0)
print(df_filled)

✅ Output:

      Name   Age  Score
0    Alice  25.0   85.0
1      Bob   0.0   90.0
2  Charlie  35.0    0.0
3    David  40.0   78.0

Alternatively, we can use column means to fill NaN values:

# Filling NaN values with column means
df_mean_filled = df.fillna(df.mean(numeric_only=True))
print(df_mean_filled)
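Different columns often call for different fill values. As a minimal sketch, fillna() also accepts a dict that maps column names to fill values, and ffill() copies the previous row's value into each gap. The choice of median for Age and mean for Score is only an illustration, not a recommendation from this tutorial.

```python
import numpy as np
import pandas as pd

# Same sample data as in Example 1, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'Score': [85, 90, np.nan, 78],
})

# Per-column fill values (illustrative choice): median for Age, mean for Score
fill_values = {'Age': df['Age'].median(), 'Score': df['Score'].mean()}
df_custom = df.fillna(fill_values)
print(df_custom)

# Forward fill: each NaN takes the last valid value above it in the same column
df_ffill = df.ffill()
print(df_ffill)
```

Forward fill only makes sense when row order is meaningful, for example in time-ordered data.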

📌 Example 3: Dropping NaN Values Using dropna()

To remove rows containing NaN values:

# Dropping rows with NaN values
df_dropped = df.dropna()
print(df_dropped)

✅ Output:

    Name   Age  Score
0  Alice  25.0   85.0
3  David  40.0   78.0
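dropna() can also work on columns or look at only part of the table. The sketch below shows three commonly used parameters (axis, subset, how) on the same sample data; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as in Example 1, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'Score': [85, 90, np.nan, 78],
})

# Drop every column that contains a NaN (only Name survives here)
df_no_nan_cols = df.dropna(axis=1)
print(df_no_nan_cols)

# Drop rows only when Age is missing; a missing Score is tolerated
df_age_known = df.dropna(subset=['Age'])
print(df_age_known)

# Drop rows only when all of their values are missing (none are, in this data)
df_not_empty = df.dropna(how='all')
print(df_not_empty)
```

Restricting the check with subset usually discards less data than a plain dropna().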

📌 Example 4: Replacing NaN Values Using replace()

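We can replace NaN values with a custom label. Note that putting a string into a numeric column changes that column to the object dtype, so this is best kept for display or export rather than further calculation: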

# Replacing NaN values with 'Unknown'
df_replaced = df.replace(np.nan, 'Unknown')
print(df_replaced)

✅ Output:

      Name      Age    Score
0    Alice     25.0     85.0
1      Bob  Unknown     90.0
2  Charlie     35.0  Unknown
3    David     40.0     78.0
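The introduction also lists interpolate(), which the examples so far do not demonstrate. Here is a minimal sketch, assuming ordered, evenly spaced numeric data where estimating a value from its neighbors is reasonable. The readings Series is hypothetical and is not part of the tutorial's DataFrame, whose rows are unrelated people and therefore a poor fit for interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical evenly spaced readings with gaps (illustrative data only)
readings = pd.Series([20.0, np.nan, 24.0, np.nan, np.nan, 30.0])

# Linear interpolation places each missing value on a straight line
# between the nearest valid values on either side
filled = readings.interpolate(method='linear')
print(filled.tolist())  # [20.0, 22.0, 24.0, 26.0, 28.0, 30.0]
```

Linear is the default method, so readings.interpolate() gives the same result here.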

Question: If we can write a missing value directly, why use NumPy?

We use np.nan from the NumPy library because Pandas represents missing values in numeric columns as NaN (Not a Number), and np.nan is the conventional way to write one when building a DataFrame.

Why np.nan?

  1. Standard Representation: Python has no NaN literal. The built-in float('nan') and math.nan produce the same value, and np.nan is simply that float under a short, widely recognized name.

  2. Compatibility: Pandas is built on top of NumPy, and it recognizes np.nan as a missing value.

  3. Operations: Pandas functions like .isna(), .fillna(), and .dropna() treat np.nan as missing. They also treat None as missing, as the next example shows.

Example Without np.nan

If you use None instead:

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, None, 35, 40],  # Using None instead of np.nan
        'Score': [85, 90, None, 78]}
df = pd.DataFrame(data)
print(df.isna())

Output:

    Name    Age  Score
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False

None works here because Pandas automatically converts it to NaN in numeric columns. Using np.nan is still preferred because it is already a float, which makes the intent and the resulting float dtype explicit.
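The conversion described above can be checked directly. This is a small sketch with illustrative variable names (ages_none, ages_nan) that compares a numeric Series built with None against one built with np.nan.

```python
import math

import numpy as np
import pandas as pd

# np.nan is an ordinary Python float, like float('nan')
print(type(np.nan))
print(math.isnan(np.nan), math.isnan(float('nan')))

# NaN never compares equal to itself, so test with isna() instead of ==
print(np.nan == np.nan)  # False

# In a numeric Series, None is converted to NaN and the dtype becomes float64
ages_none = pd.Series([25, None, 35, 40])
ages_nan = pd.Series([25, np.nan, 35, 40])
print(ages_none.dtype, ages_nan.dtype)

# equals() treats NaNs in the same positions as matching
print(ages_none.equals(ages_nan))
```

Both Series end up identical, which is why isna() reports the same result for either spelling.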

📌 Summary

🔹 fillna() replaces NaN values with specific values such as the mean or zero.
🔹 dropna() removes rows or columns containing NaN values.
🔹 replace() allows flexible replacement of NaN values.
🔹 Choosing the right method depends on the data context to maintain data integrity.

Effectively handling NaN values ensures cleaner and more reliable datasets for analysis and machine learning. 🚀
