
3.4. Pandas DataFrame: How to Filter Data in Python (Step-by-Step Guide)

📖 Introduction

Filtering rows is one of the most common operations in data analysis. Pandas lets you select the rows of a DataFrame that satisfy a condition, so you can work with only the records you need. This tutorial covers Boolean indexing, the query() method, and the helper methods isin(), str.contains(), str.startswith(), and between().

๐Ÿ—‚๏ธ 1. Creating a Sample DataFrame

Before we begin filtering, let’s create a sample DataFrame to work with.

import pandas as pd

# Creating a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000]
}

df = pd.DataFrame(data)
print(df)

✅ Output:

      Name  Age         City  Salary
0    Alice   25     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   75000
3    David   40      Houston   80000
4      Eve   28      Phoenix   65000

Now, let’s explore different ways to filter rows in this DataFrame.

🔎 2. Filtering Data Using Boolean Indexing

📌 Filtering Rows Based on a Single Condition

To filter rows where Age is greater than 30:

filtered_df = df[df['Age'] > 30]
print(filtered_df)

✅ Output:

      Name  Age     City  Salary
2  Charlie   35  Chicago   75000
3    David   40  Houston   80000

📌 Filtering Rows Where Salary is Equal to 60000

filtered_df = df[df['Salary'] == 60000]
print(filtered_df)

📌 Filtering Rows Based on Multiple Conditions

To filter rows where Age is greater than 30 and Salary is greater than 70000:

filtered_df = df[(df['Age'] > 30) & (df['Salary'] > 70000)]
print(filtered_df)

To filter rows where Age is greater than 30 or Salary is greater than 70000:

filtered_df = df[(df['Age'] > 30) | (df['Salary'] > 70000)]
print(filtered_df)
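
The parentheses around each condition are required: & and | bind more tightly than comparison operators such as >, and Python's and/or keywords do not work element-wise on a Series. One way to keep longer filters readable is to store each condition in a named mask first. The sketch below rebuilds the sample DataFrame so it runs on its own; the mask names is_older and is_high_earner are illustrative, not part of pandas. It also shows ~, which negates a mask.

```python
import pandas as pd

# Same sample data as in section 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000],
})

# Each mask is a Boolean Series aligned with df's index
is_older = df['Age'] > 30              # illustrative name
is_high_earner = df['Salary'] > 70000  # illustrative name

both = df[is_older & is_high_earner]        # AND: Charlie, David
either = df[is_older | is_high_earner]      # OR: Charlie, David
neither = df[~(is_older | is_high_earner)]  # NOT: Alice, Bob, Eve

print(neither)
```

With this data the AND and OR filters return the same two rows, because the people older than 30 are exactly the ones earning more than 70000.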

🎯 3. Filtering Rows Using query()

The query() method takes the filter as a string expression in which column names can be used directly, which is often easier to read than the equivalent Boolean indexing.

📌 Filtering Rows Where City is ‘New York’

filtered_df = df.query("City == 'New York'")
print(filtered_df)

📌 Filtering Rows Where Salary is Greater Than 60000

filtered_df = df.query("Salary > 60000")
print(filtered_df)
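
A query() expression can also combine conditions with and/or and refer to ordinary Python variables by prefixing them with @. The following is a minimal sketch; the variable names min_salary and max_age are chosen here for illustration, and the sample DataFrame is rebuilt so the snippet is self-contained.

```python
import pandas as pd

# Same sample data as in section 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000],
})

# Thresholds kept in local variables (illustrative names)
min_salary = 60000
max_age = 35

# '@name' pulls the value of a local Python variable into the expression
result = df.query("Salary > @min_salary and Age <= @max_age")
print(result)  # rows for Charlie and Eve
```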

🎯 4. Filtering Rows Using isin()

To filter rows where City is either ‘New York’ or ‘Chicago’:

filtered_df = df[df['City'].isin(['New York', 'Chicago'])]
print(filtered_df)

✅ Output:

      Name  Age      City  Salary
0    Alice   25  New York   50000
2  Charlie   35   Chicago   75000
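
The isin() filter can be inverted with ~ to keep every row whose value is not in the list. A small sketch, assuming the same sample data (the list name excluded_cities is illustrative):

```python
import pandas as pd

# Same sample data as in section 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000],
})

excluded_cities = ['New York', 'Chicago']  # illustrative name

# '~' flips True/False, so this keeps rows whose City is NOT in the list
others = df[~df['City'].isin(excluded_cities)]
print(others)  # rows for Bob, David and Eve
```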

🎯 5. Filtering Rows Using str.contains()

To filter rows where City contains the substring ‘York’ (na=False treats missing values as non-matches):

filtered_df = df[df['City'].str.contains('York', na=False)]
print(filtered_df)

📌 Filtering Rows Where Name Starts with ‘A’

filtered_df = df[df['Name'].str.startswith('A')]
print(filtered_df)
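
Two details of str.contains() are easy to miss: the match is case-sensitive by default, and the pattern is interpreted as a regular expression by default. The sketch below, which rebuilds the sample data so it runs on its own, shows a case-insensitive literal match and a regex alternation. The variable names are illustrative.

```python
import pandas as pd

# Same sample data as in section 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000],
})

# Case-insensitive literal match: regex=False treats the pattern as plain text
york_any_case = df[df['City'].str.contains('york', case=False, regex=False, na=False)]

# Default regex matching: '|' means "either pattern"
york_or_angeles = df[df['City'].str.contains('York|Angeles', na=False)]

print(york_any_case)    # row for Alice
print(york_or_angeles)  # rows for Alice and Bob
```

Passing regex=False is a reasonable default when the search text comes from user input, since characters such as . or ( would otherwise be treated as regex syntax.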

🎯 6. Filtering Rows with between()

To filter rows where Salary is between 60000 and 80000 (both endpoints are included by default):

filtered_df = df[df['Salary'].between(60000, 80000)]
print(filtered_df)
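
Whether the endpoints count can be controlled with the inclusive parameter. The sketch below assumes pandas 1.3 or later, where inclusive accepts the strings "both", "neither", "left", and "right"; older versions take a Boolean instead. It also shows the equivalent Boolean-indexing form of the default behaviour.

```python
import pandas as pd

# Same sample data as in section 1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 65000],
})

# Default: 60000 <= Salary <= 80000
closed = df[df['Salary'].between(60000, 80000)]

# Assumes pandas >= 1.3: exclude both endpoints, 60000 < Salary < 80000
open_range = df[df['Salary'].between(60000, 80000, inclusive='neither')]

# Equivalent to the default call, written with Boolean indexing
manual = df[(df['Salary'] >= 60000) & (df['Salary'] <= 80000)]

print(closed)      # rows for Bob, Charlie, David and Eve
print(open_range)  # rows for Charlie and Eve
```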

🎯 Conclusion

Filtering data in Pandas is an essential skill that allows you to extract meaningful insights. You can use Boolean indexing, query(), isin(), str.contains(), and between() to filter rows based on different conditions. Mastering these techniques will help you manipulate datasets efficiently in Python.
