Pandas – Cleaning Data of Wrong Format

Welcome to The Coding College, your go-to resource for coding and programming! In this tutorial, we’ll explore how to clean data of the wrong format in Pandas, a vital step in preparing datasets for analysis.

Why Correct Data Formats?

Data with incorrect formats can:

  • Cause errors in calculations and operations.
  • Lead to inaccurate analysis.
  • Complicate data integration with other systems.

By ensuring your data is in the correct format, you improve its reliability and usability.

Sample DataFrame with Wrong Formats

Here’s an example of a dataset with inconsistent formats:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": ["25", "Thirty", "35", "40", "29"],
    "JoinDate": ["2023-01-01", "01-02-2023", "2023/03/01", "March 4, 2023", "05-03-2023"],
    "Salary": ["50,000", "60000", "$75,000", "80k", "52000"]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name     Age      JoinDate   Salary
0    Alice      25   2023-01-01   50,000
1      Bob  Thirty   01-02-2023    60000
2  Charlie      35   2023/03/01  $75,000
3    David      40  March 4, 2023     80k
4      Eva      29   05-03-2023    52000

Steps to Clean Data of Wrong Format

1. Correcting Numeric Columns

Convert to Numeric

Use pd.to_numeric to fix numeric columns:

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

Explanation: Converts valid numbers and replaces invalid values (like “Thirty”) with NaN.

Remove Non-Numeric Characters

Clean numeric columns with mixed formats:

df["Salary"] = df["Salary"].replace("[\$,k,]", "", regex=True).astype(float)
df["Salary"] *= 1000  # Convert "80k" to 80000

2. Standardizing Date Formats

Convert to Datetime

Use pd.to_datetime to normalize dates:

df["JoinDate"] = pd.to_datetime(df["JoinDate"], errors="coerce")

Explanation:

  • Converts various date formats to a standard datetime64 type.
  • Invalid formats are set to NaT (Not a Time).

3. Handling Invalid Formats

Replace or Drop Invalid Values

Replace invalid entries with a default value:

df["Age"].fillna(df["Age"].mean(), inplace=True)

Drop rows with invalid formats:

df.dropna(subset=["JoinDate"], inplace=True)

Example: Before and After Cleaning

Original DataFrame:

      Name     Age      JoinDate   Salary
0    Alice      25   2023-01-01   50,000
1      Bob  Thirty   01-02-2023    60000
2  Charlie      35   2023/03/01  $75,000
3    David      40  March 4, 2023     80k
4      Eva      29   05-03-2023    52000

Cleaned DataFrame:

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["JoinDate"] = pd.to_datetime(df["JoinDate"], errors="coerce")
df["Salary"] = df["Salary"].replace("[\$,k,]", "", regex=True).astype(float)

Result:

      Name   Age   JoinDate   Salary
0    Alice  25.0 2023-01-01  50000.0
1      Bob   NaN 2023-01-02  60000.0
2  Charlie  35.0 2023-03-01  75000.0
3    David  40.0 2023-03-04  80000.0
4      Eva  29.0 2023-05-03  52000.0

Common Techniques for Format Cleaning

  • String Methods:
df["Name"] = df["Name"].str.strip()  # Remove leading/trailing spaces
df["Name"] = df["Name"].str.title()  # Standardize capitalization
  • Regular Expressions: Use regex for advanced text cleaning:
df["Salary"] = df["Salary"].replace("[^\d.]", "", regex=True).astype(float)
  • Error Handling: Handle invalid formats gracefully:
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

Real-World Applications

  1. Data Science: Standardize formats for feature engineering.
  2. Business Analysis: Ensure accurate metrics by cleaning numeric and date fields.
  3. ETL Pipelines: Normalize data before loading it into databases.

Learn Data Cleaning at The Coding College

At The Coding College, we provide practical tutorials for mastering coding skills. Our resources are designed to prepare you for real-world challenges in programming and data analysis.

Visit The Coding College for:

  • Comprehensive coding tutorials.
  • Hands-on projects for skill application.
  • A thriving community of learners and experts.

Conclusion

Cleaning data of the wrong format is a fundamental task in data preprocessing. With Pandas, you can easily fix numeric, date, and string formats to prepare your data for analysis.

Leave a Comment