🔄 Data Manipulation with Pandas: Handling Duplicates
🔍 Introduction
When working with real-world datasets, duplicate entries can often occur due to data collection errors, multiple sources, or merging inconsistencies. Pandas provides powerful tools to identify and remove duplicates efficiently, ensuring data integrity and accuracy.
The key functions for handling duplicates in Pandas include:
- 📌 duplicated() – Identifies duplicate rows.
- 📌 drop_duplicates() – Removes duplicate rows, across all columns or a chosen subset.
- 📌 Advanced filtering – Techniques to keep or drop exactly the duplicates you need.
In this tutorial, we will explore how to detect and remove duplicates in Pandas with practical examples.
📌 Example 1: Identifying Duplicates
The duplicated() function returns a boolean Series that flags duplicate rows.
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 35, 25, 40, 30]}
df = pd.DataFrame(data)
# Identifying duplicate rows
duplicates = df.duplicated()
print(df[duplicates])
✅ Output:
Name Age
3 Alice 25
5 Bob 30
Here, the duplicated() function flags rows that are identical to a row seen earlier in the dataset; by default, only the second and later occurrences are marked, not the first.
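If you want to see every member of a duplicate group, including the first occurrence, you can pass keep=False to duplicated(). A minimal sketch, reusing the DataFrame from the example above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
                   'Age': [25, 30, 35, 25, 40, 30]})

# keep=False flags every row that has a duplicate anywhere in the frame,
# including the first occurrence (rows 0, 1, 3, and 5 here)
all_dups = df.duplicated(keep=False)
print(df[all_dups])
```

This is useful when you want to inspect all copies of a duplicated record before deciding which one to keep.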
📌 Example 2: Removing Duplicates
To remove duplicates while keeping the first occurrence, use drop_duplicates():
# Removing duplicates
df_unique = df.drop_duplicates()
print(df_unique)
✅ Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
4 David 40
By default, drop_duplicates() retains the first occurrence and removes subsequent duplicates.
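The keep parameter controls which occurrence survives. For instance, keep='last' retains the final occurrence of each duplicate group instead of the first. A short sketch, using the same DataFrame as above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
                   'Age': [25, 30, 35, 25, 40, 30]})

# keep='last' drops the earlier copies and retains the final occurrence,
# so the duplicate rows at indices 0 and 1 are removed here
df_last = df.drop_duplicates(keep='last')
print(df_last)
```

Note that the original index labels are preserved; pass ignore_index=True if you want a clean 0..n-1 index on the result.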
📌 Example 3: Removing Duplicates Based on Specific Columns
If you want to remove duplicates based on a specific column (e.g., ‘Name’), use:
# Removing duplicates based on the 'Name' column
df_name_unique = df.drop_duplicates(subset=['Name'])
print(df_name_unique)
✅ Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
4 David 40
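Subset filtering combines with keep as well. For example, keep=False together with subset drops every row whose 'Name' appears more than once, leaving only names that were unique to begin with. A minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
                   'Age': [25, 30, 35, 25, 40, 30]})

# keep=False removes all rows whose 'Name' is duplicated,
# so only Charlie and David remain
df_strict = df.drop_duplicates(subset=['Name'], keep=False)
print(df_strict)
```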
📌 Summary
🔹 duplicated() identifies duplicate rows in a dataset.
🔹 drop_duplicates() removes duplicates, keeping the first occurrence by default.
🔹 Subset filtering allows duplicate removal based on specific columns.
Handling duplicates is essential for maintaining clean and reliable datasets. Mastering these Pandas functions will help you efficiently clean and process data for analysis. 🚀