Welcome to The Coding College, where coding meets practical insights! In this tutorial, we’ll explore how to fix wrong data in Pandas, ensuring the integrity of your datasets for accurate analysis and decision-making.
Why Fix Wrong Data?
Incorrect data can:
- Skew results and lead to false conclusions.
- Cause errors in models and calculations.
- Compromise the quality of your insights.
Fixing wrong data ensures your analysis is reliable and trustworthy.
Sample DataFrame with Wrong Data
Let’s create a sample dataset with errors:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
"Age": [25, -30, 35, 120, 29], # Negative age and unrealistic age
"Salary": [50000, 60000, None, 80000, -52000], # Missing and negative salary
"JoinDate": ["2023-01-01", "2023-13-01", "2023/03/01", "2023-02-30", "2023-05-03"], # Invalid dates
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Salary JoinDate
0 Alice 25 50000.0 2023-01-01
1 Bob -30 60000.0 2023-13-01
2 Charlie 35 NaN 2023/03/01
3 David 120 80000.0 2023-02-30
4 Eva 29 -52000.0 2023-05-03
Steps to Fix Wrong Data
1. Correcting Numeric Data
Identifying Wrong Data
- Negative or unrealistic values:
print(df[df["Age"] < 0]) # Rows with negative ages
print(df[df["Age"] > 100]) # Rows with unrealistic ages
Replacing Wrong Data
- Replace invalid numeric values with a default or calculated value:
df["Age"] = df["Age"].apply(lambda x: 30 if x < 0 or x > 100 else x) # Replace invalid ages
df["Salary"] = df["Salary"].apply(lambda x: abs(x) if x and x < 0 else x) # Fix negative salaries
2. Handling Missing Data
- Fill missing values with a default or calculated value:
df["Salary"].fillna(df["Salary"].mean(), inplace=True) # Replace NaN with the average salary
3. Correcting Dates
Identifying Invalid Dates
Convert dates to a standard format, replacing invalid ones with NaT
(Not a Time):
df["JoinDate"] = pd.to_datetime(df["JoinDate"], errors="coerce")
print(df["JoinDate"])
Replacing Invalid Dates
Replace NaT
values with a default date:
df["JoinDate"].fillna(pd.Timestamp("2023-01-01"), inplace=True)
4. Using Conditions to Fix Data
Replace values conditionally based on other columns:
df.loc[df["Age"] < 18, "Age"] = 18 # Set minimum age to 18
Example: Before and After Fixing
Original DataFrame:
Name Age Salary JoinDate
0 Alice 25 50000.0 2023-01-01
1 Bob -30 60000.0 2023-13-01
2 Charlie 35 NaN 2023/03/01
3 David 120 80000.0 2023-02-30
4 Eva 29 -52000.0 2023-05-03
Cleaned DataFrame:
df["Age"] = df["Age"].apply(lambda x: 30 if x < 0 or x > 100 else x)
df["Salary"].fillna(df["Salary"].mean(), inplace=True)
df["JoinDate"] = pd.to_datetime(df["JoinDate"], errors="coerce")
df["JoinDate"].fillna(pd.Timestamp("2023-01-01"), inplace=True)
Result:
Name Age Salary JoinDate
0 Alice 25 50000.0 2023-01-01
1 Bob 30 60000.0 2023-01-01
2 Charlie 35 60500.0 2023-03-01
3 David 30 80000.0 2023-01-01
4 Eva 29 52000.0 2023-05-03
Real-World Applications
- Data Science: Fix errors in training datasets for machine learning.
- Business Analytics: Ensure data accuracy for reporting and insights.
- ETL Pipelines: Correct raw data during extraction and transformation.
Learn with The Coding College
At The Coding College, we provide practical and user-focused tutorials to help you succeed in coding and data analysis.
Visit The Coding College for:
- Beginner-friendly tutorials with real-world applications.
- Hands-on projects to apply your skills.
- A community of learners and professionals to support your growth.
Conclusion
Fixing wrong data is a critical skill in data preprocessing. By following this guide, you’ll ensure your datasets are clean, accurate, and ready for analysis.