Welcome to The Coding College, where we simplify coding and programming! In this tutorial, we’ll focus on cleaning empty cells in Pandas DataFrames, a critical step in data preparation.
Why Handle Empty Cells?
Empty cells, or NaN
values, can lead to:
- Errors during analysis or modeling.
- Incorrect insights or skewed results.
Cleaning empty cells ensures your dataset is complete and reliable.
Sample DataFrame with Empty Cells
Let’s create a sample DataFrame to demonstrate:
import pandas as pd
data = {
"Name": ["Alice", "Bob", None, "David", "Eva"],
"Age": [25, None, 35, 40, None],
"Salary": [50000, 60000, None, 80000, 52000]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Salary
0 Alice 25.0 50000.0
1 Bob NaN 60000.0
2 None 35.0 NaN
3 David 40.0 80000.0
4 Eva NaN 52000.0
Techniques for Cleaning Empty Cells
1. Identifying Empty Cells
- Check for Missing Values:
print(df.isnull()) # Boolean DataFrame indicating NaN values
- Count Missing Values:
print(df.isnull().sum()) # Total NaN values per column
2. Filling Empty Cells
- Fill with a Specific Value: Replace empty cells with a fixed value:
df["Age"].fillna(30, inplace=True) # Replace NaN in "Age" with 30
- Fill with Mean, Median, or Mode: Use statistical methods for numeric columns:
df["Age"].fillna(df["Age"].mean(), inplace=True) # Replace NaN with the mean
df["Salary"].fillna(df["Salary"].median(), inplace=True) # Replace NaN with the median
- Forward Fill (Copy Previous Value): Fill using the previous row’s value:
df.fillna(method="ffill", inplace=True)
- Backward Fill (Copy Next Value): Fill using the next row’s value:
df.fillna(method="bfill", inplace=True)
3. Dropping Rows or Columns with Empty Cells
- Drop Rows with Missing Values: Remove rows that have any empty cells:
df.dropna(inplace=True)
- Drop Columns with Missing Values: Remove columns with any empty cells:
df.dropna(axis=1, inplace=True)
- Drop Rows with Missing Values in Specific Columns: Remove rows where certain columns have empty cells:
df.dropna(subset=["Salary"], inplace=True)
4. Replacing Empty Cells Based on Conditions
Replace empty cells conditionally:
df["Salary"] = df["Salary"].fillna(df["Age"].apply(lambda x: 30000 if x < 30 else 60000))
Example: Before and After Cleaning
Original DataFrame:
Name Age Salary
0 Alice 25.0 50000.0
1 Bob NaN 60000.0
2 None 35.0 NaN
3 David 40.0 80000.0
4 Eva NaN 52000.0
Cleaned DataFrame (Fill NaN):
df["Age"].fillna(df["Age"].mean(), inplace=True)
df["Salary"].fillna(df["Salary"].median(), inplace=True)
Result:
Name Age Salary
0 Alice 25.0 50000.0
1 Bob 33.3 60000.0
2 None 35.0 60000.0
3 David 40.0 80000.0
4 Eva 33.3 52000.0
Real-World Applications
- Data Science: Prepare datasets for training machine learning models.
- Business Analysis: Ensure complete and accurate data for decision-making.
- ETL Pipelines: Handle incomplete data during extraction and transformation processes.
Learn Data Cleaning at The Coding College
At The Coding College, we focus on practical, real-world coding skills. Our tutorials are designed to help you excel in your coding journey.
Visit The Coding College for:
- Comprehensive coding and data analysis tutorials.
- Practical projects to enhance your skills.
- A supportive community of learners and experts.
Conclusion
Handling empty cells is a fundamental skill when working with Pandas. By using the techniques in this guide, you can clean your dataset and ensure it’s ready for analysis.