Pandas – Fixing Wrong Data

Welcome to The Coding College, where coding meets practical insights! In this tutorial, we’ll explore how to fix wrong data in Pandas, ensuring the integrity of your datasets for accurate analysis and decision-making.

Why Fix Wrong Data?

Incorrect data can:

  • Skew results and lead to false conclusions.
  • Cause errors in models and calculations.
  • Compromise the quality of your insights.

Fixing wrong data ensures your analysis is reliable and trustworthy.

Sample DataFrame with Wrong Data

Let’s create a sample dataset with errors:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, -30, 35, 120, 29],  # Negative age and unrealistic age
    "Salary": [50000, 60000, None, 80000, -52000],  # Missing and negative salary
    "JoinDate": ["2023-01-01", "2023-13-01", "2023/03/01", "2023-02-30", "2023-05-03"],  # Invalid dates
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age   Salary      JoinDate
0    Alice   25  50000.0   2023-01-01
1      Bob  -30  60000.0   2023-13-01
2  Charlie   35      NaN   2023/03/01
3    David  120  80000.0   2023-02-30
4      Eva   29 -52000.0   2023-05-03

Steps to Fix Wrong Data

1. Correcting Numeric Data

Identifying Wrong Data
  • Negative or unrealistic values:
print(df[df["Age"] < 0])  # Rows with negative ages
print(df[df["Age"] > 100])  # Rows with unrealistic ages
Replacing Wrong Data
  • Replace invalid numeric values with a default or calculated value:
df["Age"] = df["Age"].apply(lambda x: 30 if x < 0 or x > 100 else x)  # Replace invalid ages
df["Salary"] = df["Salary"].apply(lambda x: abs(x) if x and x < 0 else x)  # Fix negative salaries

2. Handling Missing Data

  • Fill missing values with a default or calculated value:
df["Salary"].fillna(df["Salary"].mean(), inplace=True)  # Replace NaN with the average salary

3. Correcting Dates

Identifying Invalid Dates

Convert dates to a standard format, replacing invalid ones with NaT (Not a Time):

df["JoinDate"] = pd.to_datetime(df["JoinDate"], errors="coerce")
print(df["JoinDate"])
Replacing Invalid Dates

Replace NaT values with a default date:

df["JoinDate"].fillna(pd.Timestamp("2023-01-01"), inplace=True)

4. Using Conditions to Fix Data

Replace values conditionally based on other columns:

df.loc[df["Age"] < 18, "Age"] = 18  # Set minimum age to 18

Example: Before and After Fixing

Original DataFrame:

      Name  Age   Salary      JoinDate
0    Alice   25  50000.0   2023-01-01
1      Bob  -30  60000.0   2023-13-01
2  Charlie   35      NaN   2023/03/01
3    David  120  80000.0   2023-02-30
4      Eva   29 -52000.0   2023-05-03

Cleaned DataFrame:

df["Age"] = df["Age"].apply(lambda x: 30 if x < 0 or x > 100 else x)
df["Salary"].fillna(df["Salary"].mean(), inplace=True)
df["JoinDate"] = pd.to_datetime(df["JoinDate"], errors="coerce")
df["JoinDate"].fillna(pd.Timestamp("2023-01-01"), inplace=True)

Result:

      Name  Age   Salary      JoinDate
0    Alice   25  50000.0   2023-01-01
1      Bob   30  60000.0   2023-01-01
2  Charlie   35  60500.0   2023-03-01
3    David   30  80000.0   2023-01-01
4      Eva   29  52000.0   2023-05-03

Real-World Applications

  1. Data Science: Fix errors in training datasets for machine learning.
  2. Business Analytics: Ensure data accuracy for reporting and insights.
  3. ETL Pipelines: Correct raw data during extraction and transformation.

Learn with The Coding College

At The Coding College, we provide practical and user-focused tutorials to help you succeed in coding and data analysis.

Visit The Coding College for:

  • Beginner-friendly tutorials with real-world applications.
  • Hands-on projects to apply your skills.
  • A community of learners and professionals to support your growth.

Conclusion

Fixing wrong data is a critical skill in data preprocessing. By following this guide, you’ll ensure your datasets are clean, accurate, and ready for analysis.

Leave a Comment