Pandas – Cleaning Empty Cells

Welcome to The Coding College, where we simplify coding and programming! In this tutorial, we’ll focus on cleaning empty cells in Pandas DataFrames, a critical step in data preparation.

Why Handle Empty Cells?

Empty cells, or NaN values, can lead to:

  • Errors during analysis or modeling.
  • Incorrect insights or skewed results.

Cleaning empty cells ensures your dataset is complete and reliable.

Sample DataFrame with Empty Cells

Let’s create a sample DataFrame to demonstrate:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", None, "David", "Eva"],
    "Age": [25, None, 35, 40, None],
    "Salary": [50000, 60000, None, 80000, 52000]
}

df = pd.DataFrame(data)
print(df)

Output:

     Name   Age   Salary
0   Alice  25.0  50000.0
1     Bob   NaN  60000.0
2    None  35.0      NaN
3   David  40.0  80000.0
4     Eva   NaN  52000.0

Techniques for Cleaning Empty Cells

1. Identifying Empty Cells

  • Check for Missing Values:
print(df.isnull())  # Boolean DataFrame indicating NaN values
  • Count Missing Values:
print(df.isnull().sum())  # Total NaN values per column

2. Filling Empty Cells

  • Fill with a Specific Value: Replace empty cells with a fixed value:
df["Age"].fillna(30, inplace=True)  # Replace NaN in "Age" with 30
  • Fill with Mean, Median, or Mode: Use statistical methods for numeric columns:
df["Age"].fillna(df["Age"].mean(), inplace=True)  # Replace NaN with the mean
df["Salary"].fillna(df["Salary"].median(), inplace=True)  # Replace NaN with the median
  • Forward Fill (Copy Previous Value): Fill using the previous row’s value:
df.fillna(method="ffill", inplace=True)
  • Backward Fill (Copy Next Value): Fill using the next row’s value:
df.fillna(method="bfill", inplace=True)

3. Dropping Rows or Columns with Empty Cells

  • Drop Rows with Missing Values: Remove rows that have any empty cells:
df.dropna(inplace=True)
  • Drop Columns with Missing Values: Remove columns with any empty cells:
df.dropna(axis=1, inplace=True)
  • Drop Rows with Missing Values in Specific Columns: Remove rows where certain columns have empty cells:
df.dropna(subset=["Salary"], inplace=True)

4. Replacing Empty Cells Based on Conditions

Replace empty cells conditionally:

df["Salary"] = df["Salary"].fillna(df["Age"].apply(lambda x: 30000 if x < 30 else 60000))

Example: Before and After Cleaning

Original DataFrame:

     Name   Age   Salary
0   Alice  25.0  50000.0
1     Bob   NaN  60000.0
2    None  35.0      NaN
3   David  40.0  80000.0
4     Eva   NaN  52000.0

Cleaned DataFrame (Fill NaN):

df["Age"].fillna(df["Age"].mean(), inplace=True)
df["Salary"].fillna(df["Salary"].median(), inplace=True)

Result:

     Name   Age   Salary
0   Alice  25.0  50000.0
1     Bob  33.3  60000.0
2    None  35.0  60000.0
3   David  40.0  80000.0
4     Eva  33.3  52000.0

Real-World Applications

  1. Data Science: Prepare datasets for training machine learning models.
  2. Business Analysis: Ensure complete and accurate data for decision-making.
  3. ETL Pipelines: Handle incomplete data during extraction and transformation processes.

Learn Data Cleaning at The Coding College

At The Coding College, we focus on practical, real-world coding skills. Our tutorials are designed to help you excel in your coding journey.

Visit The Coding College for:

  • Comprehensive coding and data analysis tutorials.
  • Practical projects to enhance your skills.
  • A supportive community of learners and experts.

Conclusion

Handling empty cells is a fundamental skill when working with Pandas. By using the techniques in this guide, you can clean your dataset and ensure it’s ready for analysis.

Leave a Comment