Module 4.6: Data Manipulation Using Pandas

Data manipulation is one of the most important steps in Data Science, Artificial Intelligence (AI), Machine Learning (ML), and Business Analytics. Raw data collected from databases, websites, surveys, sensors, and business applications is often incomplete, inconsistent, or unorganized. Before meaningful analysis or machine learning can take place, the data must be cleaned, transformed, filtered, and structured properly.

Pandas is a powerful Python library that provides flexible and efficient tools for data manipulation. It allows users to modify datasets, filter records, handle missing values, transform columns, merge tables, group data, and perform complex operations with minimal code.

Data scientists and analysts spend a significant portion of their time preparing data. Therefore, mastering data manipulation techniques in Pandas is essential for building reliable machine learning models and generating accurate business insights.

In this tutorial, we will explore various data manipulation techniques using Pandas, including filtering, sorting, updating, grouping, merging, transforming, and aggregating data.

What is Data Manipulation?

Data manipulation refers to the process of modifying, organizing, cleaning, and transforming data to make it suitable for analysis and decision-making.

Common data manipulation tasks include:

  • Filtering records.
  • Sorting data.
  • Adding new columns.
  • Removing unwanted data.
  • Handling missing values.
  • Renaming columns.
  • Grouping and aggregation.
  • Merging datasets.
  • Data transformation.

These operations improve data quality and make datasets easier to analyze.

Why Use Pandas for Data Manipulation?

Pandas provides powerful data structures and built-in functions that simplify data handling tasks.

Benefits include:

  • Fast data processing.
  • Convenient handling of datasets that fit in memory.
  • Flexible data structures.
  • Powerful filtering capabilities.
  • Built-in statistical functions.
  • Integration with machine learning libraries.
  • Support for multiple file formats.

These features make Pandas one of the most widely used tools in Data Science.

Importing Pandas

Before performing data manipulation, import the Pandas library.

import pandas as pd

Create a sample DataFrame:

data = {
    "Name": ["John", "Emma", "Alex", "Sophia"],
    "Age": [25, 28, 30, 27],
    "Salary": [50000, 60000, 70000, 65000]
}

df = pd.DataFrame(data)

print(df)

Viewing Data

Before manipulating data, it is important to inspect the dataset.

Display First Rows

df.head()

Display Last Rows

df.tail()

Dataset Information

df.info()

Statistical Summary

df.describe()

These functions help users understand the structure and quality of data.

Selecting Data

Selecting specific data is one of the most common operations.

Select a Single Column

df["Name"]

Select Multiple Columns

df[["Name", "Salary"]]

Select a Specific Row

df.iloc[0]

Data selection allows focused analysis of relevant information.
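The selections above can be combined to pick rows and columns at the same time. The following is a minimal sketch that rebuilds the sample DataFrame so it runs on its own; the variable names by_position, by_label, and alex_salary are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emma", "Alex", "Sophia"],
    "Age": [25, 28, 30, 27],
    "Salary": [50000, 60000, 70000, 65000],
})

# iloc is position-based: first two rows, first two columns (end is exclusive).
by_position = df.iloc[0:2, 0:2]

# loc is label-based: rows labeled 1 through 3 (end is inclusive), two named columns.
by_label = df.loc[1:3, ["Name", "Salary"]]

# A single cell: the Salary value in the row labeled 2.
alex_salary = df.loc[2, "Salary"]

print(by_position)
print(by_label)
print(alex_salary)
```

Note that iloc slices exclude the end position while loc slices include the end label, which is a common source of off-by-one mistakes.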

Filtering Data

Filtering retrieves records that satisfy specific conditions.

Employees Older Than 26

df[df["Age"] > 26]

Output:

     Name  Age  Salary
1    Emma   28   60000
2    Alex   30   70000
3  Sophia   27   65000

Salary Greater Than 60000

df[df["Salary"] > 60000]

Filtering is essential for data exploration and reporting.
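Real filters often involve more than one condition. The sketch below, which rebuilds the sample DataFrame so it is self-contained, shows three common patterns; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emma", "Alex", "Sophia"],
    "Age": [25, 28, 30, 27],
    "Salary": [50000, 60000, 70000, 65000],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
older_and_well_paid = df[(df["Age"] > 26) & (df["Salary"] > 60000)]

# isin() keeps rows whose value appears in the given list.
selected = df[df["Name"].isin(["John", "Emma"])]

# query() expresses the same two-condition filter as a string.
same_result = df.query("Age > 26 and Salary > 60000")

print(older_and_well_paid)
print(selected)
```

Python's and/or keywords do not work element-wise on Series, which is why the & and | operators are used inside the brackets.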

Sorting Data

Sorting organizes data in ascending or descending order.

Sort by Age

df.sort_values("Age")

Sort by Salary Descending

df.sort_values(
    "Salary",
    ascending=False
)

Sorting helps identify trends and rankings within data.
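One detail worth seeing in practice is that sort_values() returns a sorted copy and leaves the original DataFrame in its existing order. A minimal, self-contained sketch using the sample data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emma", "Alex", "Sophia"],
    "Age": [25, 28, 30, 27],
    "Salary": [50000, 60000, 70000, 65000],
})

# sort_values() returns a new DataFrame; df keeps its original order.
by_salary = df.sort_values("Salary", ascending=False)

# reset_index(drop=True) renumbers the rows from 0 after sorting.
ranked = by_salary.reset_index(drop=True)

# Sort by several columns: Age descending, then Name ascending as a tie-breaker.
by_age_then_name = df.sort_values(["Age", "Name"], ascending=[False, True])

print(ranked)
```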

Adding New Columns

New columns can be created using calculations or derived values.

df["Bonus"] = df["Salary"] * 0.10

Output:

     Name  Age  Salary   Bonus
0    John   25   50000  5000.0
1    Emma   28   60000  6000.0
2    Alex   30   70000  7000.0
3  Sophia   27   65000  6500.0

Because the salary is multiplied by a decimal number, the Bonus column holds float values.

This technique is often used in feature engineering.

Updating Existing Data

Values can be modified using indexing methods.

df.loc[0, "Salary"] = 55000

This updates John’s salary from 50000 to 55000.
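The same loc syntax can update many rows at once when the row selector is a condition. The following sketch is an illustration only: the rule "add 5000 for employees aged 28 or older" is an assumed example, not part of the original dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emma", "Alex", "Sophia"],
    "Age": [25, 28, 30, 27],
    "Salary": [50000, 60000, 70000, 65000],
})

# Boolean mask as the row selector: only rows where Age >= 28 are changed.
# The raise amount (5000) is a made-up value for illustration.
df.loc[df["Age"] >= 28, "Salary"] += 5000

print(df)
```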

Renaming Columns

Meaningful column names improve readability.

df.rename(
    columns={
        "Salary": "Monthly_Salary"
    },
    inplace=True
)

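With inplace=True, rename() modifies the original DataFrame directly and returns None. Without it, rename() returns a new DataFrame and leaves the original unchanged.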

Removing Columns

Unnecessary columns can be deleted.

df.drop(
    "Bonus",
    axis=1,
    inplace=True
)

This removes the Bonus column.

Removing Rows

Specific records can also be removed.

df.drop(0)

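This returns a new DataFrame without the row labeled 0 (the first row here). The original df is unchanged unless the result is assigned back, for example df = df.drop(0).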
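A short, self-contained sketch of row removal on the sample data follows. It also shows dropping rows by condition, which is often more practical than dropping by label; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emma", "Alex", "Sophia"],
    "Age": [25, 28, 30, 27],
    "Salary": [50000, 60000, 70000, 65000],
})

# drop() removes rows by index label and returns a new DataFrame.
without_first = df.drop(0)

# The remaining labels are 1, 2, 3; reset_index renumbers them from 0.
without_first = without_first.reset_index(drop=True)

# Dropping by condition: keep every row except those where Age is 30.
not_thirty = df[df["Age"] != 30]

print(without_first)
print(not_thirty)
```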

Handling Missing Values

Missing data is a common challenge in real-world datasets.

Check Missing Values

df.isnull()

Count Missing Values

df.isnull().sum()

Remove Missing Values

df.dropna()

Replace Missing Values

df.fillna(0)

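Both dropna() and fillna() return new DataFrames, so assign the result to keep it. Whether to drop or fill depends on the data: filling with 0, for example, can distort averages. Proper handling of missing values improves data quality.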
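The sample DataFrame has no missing values, so the sketch below builds a variant with two gaps to show the methods in action. The missing entries and the choice of median and mean as fill values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Variant of the sample data with two missing values (np.nan), for illustration.
df_missing = pd.DataFrame({
    "Name": ["John", "Emma", "Alex", "Sophia"],
    "Age": [25, np.nan, 30, 27],
    "Salary": [50000, 60000, np.nan, 65000],
})

# Number of missing values in each column.
missing_per_column = df_missing.isnull().sum()

# Keep only rows with no missing values.
complete_rows = df_missing.dropna()

# Fill each column with its own statistic instead of a single constant.
filled = df_missing.fillna({
    "Age": df_missing["Age"].median(),
    "Salary": df_missing["Salary"].mean(),
})

print(missing_per_column)
print(filled)
```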

Replacing Values

Specific values can be replaced throughout the dataset.

df.replace(
    50000,
    52000
)

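This returns a new DataFrame in which every cell equal to 50000, in any column, is replaced with 52000. To limit the replacement to one column, call replace() on that column instead.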

Removing Duplicate Records

Duplicate data can lead to inaccurate analysis.

df.drop_duplicates()

This returns a new DataFrame that keeps the first occurrence of each fully identical row and drops the repeats.
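The sketch below uses a small made-up table with repeated names to show how duplicates can be counted and how the subset and keep parameters change what counts as a duplicate.

```python
import pandas as pd

# Made-up data: row 2 repeats row 0 exactly; row 3 repeats Emma's name only.
df_dup = pd.DataFrame({
    "Name": ["John", "Emma", "John", "Emma"],
    "Salary": [50000, 60000, 50000, 62000],
})

# Count rows that are exact repeats of an earlier row.
n_duplicates = df_dup.duplicated().sum()

# Drop exact repeats, keeping the first occurrence.
exact = df_dup.drop_duplicates()

# Treat rows with the same Name as duplicates and keep the last one seen.
by_name = df_dup.drop_duplicates(subset="Name", keep="last")

print(n_duplicates)
print(exact)
print(by_name)
```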

Changing Data Types

Data types can be converted when necessary.

df["Age"] = df["Age"].astype(float)

This converts the Age column to float values.
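astype() raises an error if any value cannot be converted. For columns read from messy sources, pd.to_numeric with errors="coerce" is a common alternative that turns unparseable entries into NaN. The raw values below are invented for illustration.

```python
import pandas as pd

# Ages stored as text, with one entry that is not a number (invented data).
raw = pd.DataFrame({"Age": ["25", "28", "unknown", "27"]})

# errors="coerce" converts what it can and replaces the rest with NaN.
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")

print(raw)
print(raw["Age"].dtype)
```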

Creating New Calculated Columns

Pandas allows complex calculations across columns.

df["Annual_Salary"] = df["Monthly_Salary"] * 12

print(df[["Name", "Annual_Salary"]])

Output:

     Name  Annual_Salary
0    John         660000
1    Emma         720000
2    Alex         840000
3  Sophia         780000

Calculated columns are frequently used in analytics projects.

Using Conditional Logic

Conditional operations can create categories.

# The Salary column was renamed to Monthly_Salary earlier in this tutorial.
df["Category"] = df["Monthly_Salary"].apply(
    lambda x: "High" if x > 60000 else "Standard"
)

This classifies employees based on salary levels.
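For simple rules like this one, a vectorized alternative to apply() is usually faster on large tables. The sketch below rebuilds the original sample data (with a Salary column) so it runs on its own, and shows numpy.where for a two-way split and pd.cut for several bands. The band edges and labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emma", "Alex", "Sophia"],
    "Salary": [50000, 60000, 70000, 65000],
})

# Two categories: np.where(condition, value_if_true, value_if_false).
df["Category"] = np.where(df["Salary"] > 60000, "High", "Standard")

# Several categories: pd.cut assigns each value to a bin.
# Bins are right-inclusive, so 65000 falls in (55000, 65000].
df["Band"] = pd.cut(
    df["Salary"],
    bins=[0, 55000, 65000, float("inf")],
    labels=["Low", "Mid", "High"],
)

print(df)
```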

Grouping Data

Grouping summarizes data by categories.

Example dataset (a separate DataFrame, again called df, with Department and Salary columns):

Department  Salary
Sales       50000
Sales       60000
IT          70000
IT          80000

Calculate average salary:

df.groupby(
    "Department"
)["Salary"].mean()

Output:

Department
IT       75000.0
Sales    55000.0
Name: Salary, dtype: float64

Grouping is widely used in business intelligence.
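A runnable version of the grouping example is sketched below. It builds the four-row department table shown above under the assumed name dept, then computes the average and several statistics in one call with agg().

```python
import pandas as pd

# The example dataset from this section, built explicitly.
dept = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT"],
    "Salary": [50000, 60000, 70000, 80000],
})

# Average salary per department.
avg_salary = dept.groupby("Department")["Salary"].mean()

# Several statistics per department in a single call.
summary = dept.groupby("Department")["Salary"].agg(["mean", "min", "max", "count"])

print(avg_salary)
print(summary)
```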

Aggregation Functions

Aggregation summarizes data using statistical functions.

Common functions include:

  • sum()
  • mean()
  • count()
  • min()
  • max()
  • median()

Example (on the original sample DataFrame, before the salary update and column rename):

df["Salary"].mean()

Output:

61250.0

Merging DataFrames

Organizations often store information across multiple datasets.

Merge two DataFrames:

pd.merge(
    df1,
    df2,
    on="Employee_ID"
)

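Here df1 and df2 stand for any two DataFrames that share an Employee_ID column. By default, merge() performs an inner join, keeping only rows whose Employee_ID appears in both tables.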
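To make the merge concrete, the sketch below defines two small tables with invented contents and compares the default inner join with the left and outer variants selected through the how parameter.

```python
import pandas as pd

# Invented example tables sharing an Employee_ID column.
df1 = pd.DataFrame({
    "Employee_ID": [1, 2, 3],
    "Name": ["John", "Emma", "Alex"],
})
df2 = pd.DataFrame({
    "Employee_ID": [2, 3, 4],
    "Salary": [60000, 70000, 65000],
})

# Inner join (default): only IDs present in both tables.
inner = pd.merge(df1, df2, on="Employee_ID")

# Left join: every row of df1; Salary is NaN where df2 has no match.
left = pd.merge(df1, df2, on="Employee_ID", how="left")

# Outer join: every ID from either table; indicator adds a _merge column
# showing where each row came from.
outer = pd.merge(df1, df2, on="Employee_ID", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```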

Joining DataFrames

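The join() method combines DataFrames by aligning them on their index rather than on a shared column.

df1.join(df2)

By default this is a left join: every row of df1 is kept, and rows of df2 are matched by index label. If both DataFrames contain a column with the same name, pass lsuffix or rsuffix to avoid an error. Joins are useful when integrating multiple data sources.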
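A minimal sketch of an index-based join follows. The two tables and their index labels are invented for illustration, with the index playing the role of an employee ID.

```python
import pandas as pd

# Invented tables whose index labels act as employee IDs.
names = pd.DataFrame({"Name": ["John", "Emma", "Alex"]}, index=[1, 2, 3])
salaries = pd.DataFrame({"Salary": [60000, 70000, 65000]}, index=[2, 3, 4])

# Default left join on the index: all rows of names, NaN where no salary matches.
joined = names.join(salaries)

# Inner join on the index: only labels present in both tables.
joined_inner = names.join(salaries, how="inner")

print(joined)
print(joined_inner)
```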

Concatenating DataFrames

Concatenation combines DataFrames vertically (axis=0, the default) or horizontally (axis=1).

pd.concat([df1, df2])

This operation is commonly used when combining monthly or yearly reports.
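The sketch below stacks two invented monthly reports and then places them side by side. The table contents and the Feb_Sales column name are assumptions made for illustration.

```python
import pandas as pd

# Invented monthly reports with the same columns.
jan = pd.DataFrame({"Name": ["John", "Emma"], "Sales": [10, 12]})
feb = pd.DataFrame({"Name": ["John", "Emma"], "Sales": [11, 15]})

# Vertical stacking; ignore_index=True renumbers rows instead of repeating 0, 1.
stacked = pd.concat([jan, feb], ignore_index=True)

# Horizontal concatenation aligns rows by index label.
side_by_side = pd.concat([jan, feb["Sales"].rename("Feb_Sales")], axis=1)

print(stacked)
print(side_by_side)
```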

Applying Functions to Columns

Custom functions can be applied to DataFrame columns.

def double_salary(x):
    return x * 2

df["Salary"].apply(
    double_salary
)

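This returns a new Series with every salary doubled. The Salary column itself is unchanged unless the result is assigned back to it.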
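For arithmetic like this, the same result can be obtained without apply() by operating on the whole column at once. The following self-contained sketch, using the original sample salaries, compares the two approaches.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emma", "Alex", "Sophia"],
    "Salary": [50000, 60000, 70000, 65000],
})

def double_salary(x):
    return x * 2

# apply() calls the Python function once per value.
doubled_apply = df["Salary"].apply(double_salary)

# Vectorized form: one operation on the whole column.
doubled_vectorized = df["Salary"] * 2

print(doubled_apply.equals(doubled_vectorized))
```

apply() remains useful when the logic cannot be expressed with built-in column operations.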

Pivot Tables

Pivot tables summarize large datasets efficiently.

df.pivot_table(
    values="Salary",
    index="Department",
    aggfunc="mean"
)

Pivot tables are widely used in business reporting and analytics.
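A pivot table becomes more useful with a second dimension in the columns. The sketch below extends the department data with a hypothetical Level column, which is not part of the original example, to show the columns parameter.

```python
import pandas as pd

# Department data with a hypothetical Level column added for illustration.
staff = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT"],
    "Level": ["Junior", "Senior", "Junior", "Senior"],
    "Salary": [50000, 60000, 70000, 80000],
})

# Rows are departments, columns are levels, cells hold the mean salary.
pivot = staff.pivot_table(
    values="Salary",
    index="Department",
    columns="Level",
    aggfunc="mean",
)

print(pivot)
```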

Working with Dates

Date manipulation is important in time-series analysis.

# Assumes df has a "Date" column holding date strings.
df["Date"] = pd.to_datetime(df["Date"])

Extract year:

df["Date"].dt.year

Extract month:

df["Date"].dt.month

Date operations are heavily used in forecasting projects.
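The sample DataFrame has no Date column, so the sketch below uses a small invented orders table to show the conversion, the dt accessors, and a monthly total built from them.

```python
import pandas as pd

# Invented orders table with dates stored as text.
orders = pd.DataFrame({
    "Date": ["2024-01-15", "2024-02-20", "2024-02-25"],
    "Amount": [100, 150, 200],
})

# Convert the text column to datetime values.
orders["Date"] = pd.to_datetime(orders["Date"])

# Extract date parts into their own columns.
orders["Year"] = orders["Date"].dt.year
orders["Month"] = orders["Date"].dt.month

# Total amount per month.
monthly_total = orders.groupby("Month")["Amount"].sum()

print(orders)
print(monthly_total)
```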

Exporting Manipulated Data

After processing, datasets can be saved for future use.

Export to CSV

df.to_csv(
    "processed_data.csv",
    index=False
)

Export to Excel

df.to_excel(
    "processed_data.xlsx",
    index=False
)

This allows processed data to be shared or stored. Writing .xlsx files requires an Excel engine such as openpyxl to be installed.
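A quick way to check an export is to read it back and compare. The sketch below does this for CSV using an in-memory buffer instead of a file on disk, which is a choice made here only to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emma", "Alex", "Sophia"],
    "Age": [25, 28, 30, 27],
    "Salary": [50000, 60000, 70000, 65000],
})

# Write CSV text into an in-memory buffer (a file path works the same way).
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the data back.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

With index=False the row labels are not written, so the restored table has the same columns as the original.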

Applications of Data Manipulation

Data manipulation is used in numerous industries.

  • Artificial Intelligence.
  • Machine Learning.
  • Financial Analytics.
  • Business Intelligence.
  • Healthcare Analytics.
  • Marketing Analysis.
  • Customer Behavior Analysis.
  • Scientific Research.

Nearly every data-driven project involves some data manipulation before analysis.

Best Practices for Data Manipulation

  • Use meaningful column names.
  • Validate data quality regularly.
  • Handle missing values carefully.
  • Remove duplicate records.
  • Maintain consistent data types.
  • Document transformation steps.
  • Use vectorized operations for efficiency.

Following these practices ensures reliable and reproducible results.

Advantages of Pandas Data Manipulation

  • Fast processing speed.
  • Simple syntax.
  • Powerful built-in functions.
  • Good performance on datasets that fit in memory.
  • Strong integration with AI libraries.
  • Easy handling of complex datasets.
  • Improved productivity.

Conclusion

Data manipulation is a critical skill in Data Science, Artificial Intelligence, and Machine Learning. Pandas provides a comprehensive set of tools for cleaning, filtering, sorting, transforming, merging, and analyzing data efficiently.

By mastering DataFrame operations such as filtering records, handling missing values, creating calculated columns, grouping data, merging datasets, and applying custom functions, learners can prepare high-quality datasets for analysis and machine learning. These skills form the foundation of every successful data-driven project and are essential for modern AI development.
