
4.4. Data Manipulation with Pandas – Handling Duplicates

🔄 Data Manipulation with Pandas: Handling Duplicates

🔍 Introduction

When working with real-world datasets, duplicate entries often appear due to data collection errors, multiple data sources, or inconsistencies introduced while merging. Pandas provides powerful tools to identify and remove duplicates efficiently, helping you maintain data integrity and accuracy.

The key functions for handling duplicates in Pandas include:

  1. 📌 duplicated() – Identifies duplicate rows.
  2. 📌 drop_duplicates() – Removes duplicate rows, either across all columns or based on specific columns.
  3. 📌 Advanced filtering – Custom techniques to retain or inspect only the duplicates you need (see the sketch after Example 3).

In this tutorial, we will explore how to detect and remove duplicates in Pandas with practical examples.

📌 Example 1: Identifying Duplicates

The duplicated() function returns a boolean Series that flags rows which are duplicates of rows seen earlier.

import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 35, 25, 40, 30]}
df = pd.DataFrame(data)

# Identifying duplicate rows
duplicates = df.duplicated()
print(df[duplicates])

✅ Output:

    Name  Age
3  Alice   25
5    Bob   30

Here, duplicated() returns True for rows that are identical to a row that appeared earlier in the DataFrame, so indexing df with this mask shows only the repeated rows.
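
By default, duplicated() treats the first occurrence of each row as unique and flags only the later copies. This behaviour is controlled by the keep parameter; the short sketch below reuses the df from Example 1 to show the standard options.

# keep='first' (default): flag every occurrence after the first
print(df.duplicated(keep='first'))

# keep='last': flag every occurrence except the last
print(df.duplicated(keep='last'))

# keep=False: flag all rows that are duplicated anywhere in the DataFrame
print(df.duplicated(keep=False))

keep=False is useful when you want to see every row involved in duplication, not just the repeats.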

📌 Example 2: Removing Duplicates

To remove duplicates while keeping the first occurrence, use drop_duplicates():

# Removing duplicates
df_unique = df.drop_duplicates()
print(df_unique)

✅ Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
4    David   40

By default, drop_duplicates() retains the first occurrence and removes subsequent duplicates.
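
The keep parameter works the same way with drop_duplicates(). A minimal sketch, again using the df from Example 1:

# Keep the last occurrence of each duplicate instead of the first
print(df.drop_duplicates(keep='last'))

# Drop every row that has a duplicate, keeping none of them
print(df.drop_duplicates(keep=False))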

📌 Example 3: Removing Duplicates Based on Specific Columns

If you want to remove duplicates based on a specific column (e.g., ‘Name’), use:

# Removing duplicates based on the 'Name' column
df_name_unique = df.drop_duplicates(subset=['Name'])
print(df_name_unique)

✅ Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
4    David   40
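
For the advanced filtering mentioned in the introduction, duplicated() can be combined with boolean indexing to keep exactly the rows you need. A small sketch, assuming you first want to review every row whose 'Name' occurs more than once before deciding what to drop:

# Boolean mask: True for every row whose 'Name' appears more than once
mask = df.duplicated(subset=['Name'], keep=False)

# Retain only the duplicated names for inspection
print(df[mask])

# Invert the mask to keep only the names that appear exactly once
print(df[~mask])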

📌 Summary

🔹 duplicated() helps identify duplicate rows in a dataset.
🔹 drop_duplicates() removes duplicates while keeping the first occurrence by default.
🔹 Subset filtering allows duplicate removal based on specific columns.

Handling duplicates is essential for maintaining clean and reliable datasets. Mastering these Pandas functions will help you efficiently clean and process data for analysis. 🚀

 
