Introduction
Filtering data is an essential operation in data analysis. Pandas provides powerful techniques to filter rows in a DataFrame based on conditions, making it easier to extract relevant information. In this tutorial, we will explore various ways to filter data using Boolean indexing, conditions, and query methods.
1. Creating a Sample DataFrame
Before we begin filtering, let’s create a sample DataFrame to work with.
import pandas as pd
# Creating a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000]
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City  Salary
0    Alice   25     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   75000
3    David   40      Houston   80000
4      Eve   28      Phoenix   65000
Now, let’s explore different ways to filter rows in this DataFrame.
2. Filtering Data Using Boolean Indexing
Filtering Rows Based on a Single Condition
To filter rows where Age is greater than 30:
filtered_df = df[df['Age'] > 30]
print(filtered_df)
Output:
      Name  Age     City  Salary
2  Charlie   35  Chicago   75000
3    David   40  Houston   80000
Filtering Rows Where Salary is Equal to 60000
filtered_df = df[df['Salary'] == 60000]
print(filtered_df)
Filtering Rows Based on Multiple Conditions
To filter rows where Age is greater than 30 and Salary is greater than 70000:
filtered_df = df[(df['Age'] > 30) & (df['Salary'] > 70000)]
print(filtered_df)
To filter rows where Age is greater than 30 or Salary is greater than 70000:
filtered_df = df[(df['Age'] > 30) | (df['Salary'] > 70000)]
print(filtered_df)
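Each comparison needs its own parentheses because & and | bind more tightly than > or == in Python, and the keywords and / or raise a ValueError when applied to a whole Series. As a rough sketch (the variable names below are illustrative, and the sample DataFrame is rebuilt so the snippet runs on its own), the conditions can be stored as named Boolean Series and combined or negated with ~:

```python
import pandas as pd

# Rebuild the sample DataFrame from section 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000],
})

# Store each condition as a named Boolean Series (names are illustrative)
older_than_30 = df['Age'] > 30
high_salary = df['Salary'] > 70000

both = df[older_than_30 & high_salary]    # AND: both conditions hold
either = df[older_than_30 | high_salary]  # OR: at least one condition holds
not_older = df[~older_than_30]            # NOT: invert the condition

print(both['Name'].tolist())       # ['Charlie', 'David']
print(either['Name'].tolist())     # ['Charlie', 'David']
print(not_older['Name'].tolist())  # ['Alice', 'Bob', 'Eve']
```

With this sample data the AND and OR filters happen to return the same rows, since Charlie and David are the only people who satisfy either condition.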
3. Filtering Rows Using query()
Pandas provides a more readable way to filter data using the query() method.
Filtering Rows Where City is ‘New York’
filtered_df = df.query("City == 'New York'")
print(filtered_df)
Filtering Rows Where Salary is Greater Than 60000
filtered_df = df.query("Salary > 60000")
print(filtered_df)
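query() can also combine conditions with and / or and refer to Python variables by prefixing them with @. The following is a minimal sketch, not part of the original examples; min_salary is an illustrative variable name, and the sample DataFrame is rebuilt so the snippet is self-contained:

```python
import pandas as pd

# Rebuild the sample DataFrame from section 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000],
})

# Threshold held in an ordinary Python variable (illustrative name)
min_salary = 60000

# '@min_salary' refers to the variable above; 'and' combines the two conditions
result = df.query("Salary > @min_salary and Age < 40")

print(result['Name'].tolist())  # ['Charlie', 'Eve']
```

Keeping the threshold in a variable avoids building the query string by hand when the value changes between runs.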
4. Filtering Rows Using isin()
To filter rows where City is either ‘New York’ or ‘Chicago’:
filtered_df = df[df['City'].isin(['New York', 'Chicago'])]
print(filtered_df)
Output:
      Name  Age      City  Salary
0    Alice   25  New York   50000
2  Charlie   35   Chicago   75000
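The same isin() mask can be inverted with ~ to keep every row whose City is not in the list. A short sketch under the same sample data (the names excluded and others are illustrative):

```python
import pandas as pd

# Rebuild the sample DataFrame from section 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000],
})

# Cities to exclude (illustrative list)
excluded = ['New York', 'Chicago']

# ~ flips the Boolean mask, so only rows outside the list remain
others = df[~df['City'].isin(excluded)]

print(others['Name'].tolist())  # ['Bob', 'David', 'Eve']
```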
5. Filtering Rows Using str.contains()
To filter rows where City contains the substring ‘York’ (na=False treats missing values as non-matches):
filtered_df = df[df['City'].str.contains('York', na=False)]
print(filtered_df)
Filtering Rows Where Name Starts with ‘A’
filtered_df = df[df['Name'].str.startswith('A')]
print(filtered_df)
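By default str.contains() is case-sensitive and treats the pattern as a regular expression. The sketch below is one possible variation, not taken from the original examples: it passes case=False for a case-insensitive match and regex=False for a literal substring, and adds str.endswith() as the counterpart to str.startswith().

```python
import pandas as pd

# Rebuild the sample DataFrame from section 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000],
})

# Case-insensitive, literal (non-regex) substring match
york_rows = df[df['City'].str.contains('york', case=False, regex=False, na=False)]

# Names that end with the letter 'e'
ends_with_e = df[df['Name'].str.endswith('e')]

print(york_rows['Name'].tolist())    # ['Alice']
print(ends_with_e['Name'].tolist())  # ['Alice', 'Charlie', 'Eve']
```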
6. Filtering Rows with between()
To filter rows where Salary is between 60000 and 80000 (both endpoints are included by default):
filtered_df = df[df['Salary'].between(60000, 80000)]
print(filtered_df)
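To make the endpoint behaviour concrete, here is a small sketch comparing the default call with one that excludes both endpoints. It assumes pandas 1.3 or later, where the inclusive parameter accepts the strings "both", "neither", "left", and "right":

```python
import pandas as pd

# Rebuild the sample DataFrame from section 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000],
})

# Default: 60000 <= Salary <= 80000
closed = df[df['Salary'].between(60000, 80000)]

# Assumes pandas >= 1.3: 60000 < Salary < 80000
open_range = df[df['Salary'].between(60000, 80000, inclusive='neither')]

print(closed['Name'].tolist())      # ['Bob', 'Charlie', 'David', 'Eve']
print(open_range['Name'].tolist())  # ['Charlie', 'Eve']
```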
Conclusion
Filtering data in Pandas is an essential skill that allows you to extract meaningful insights. You can use Boolean indexing, query(), isin(), str.contains(), and between() to filter rows based on different conditions. Mastering these techniques will help you manipulate datasets efficiently in Python.