Pandas – Cleaning Data

Welcome to The Coding College, where we make coding and programming accessible to everyone! In this tutorial, we’ll explore data cleaning with Pandas, a crucial step in preparing data for analysis or machine learning.

Why Clean Data?

Data is rarely perfect. Cleaning data helps:

  • Remove inaccuracies or inconsistencies.
  • Handle missing values.
  • Standardize formats for analysis.

Getting Started

Let’s start by creating a sample DataFrame:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", None, "David", "Eva"],
    "Age": [25, None, 35, 40, "Unknown"],
    "Salary": [50000, 60000, None, 80000, 52000],
    "Department": ["HR", "IT", "Finance", None, "HR"]
}

df = pd.DataFrame(data)
print(df)

Output:

     Name      Age   Salary Department
0   Alice       25  50000.0        HR
1     Bob     None  60000.0        IT
2    None       35      NaN    Finance
3   David       40  80000.0      None
4     Eva  Unknown  52000.0        HR

Steps to Clean Data

1. Handling Missing Data

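Detect Missing Data

print(df.isnull())  # Boolean DataFrame marking missing values
print(df.isnull().sum())  # Count missing values per column

Fill Missing Values

# Assign the result back: fillna(..., inplace=True) on a selected column
# raises a warning in pandas 2.x and may not modify df in newer versions
df["Age"] = df["Age"].fillna(30)  # Replace missing ages with 30
df["Department"] = df["Department"].fillna("Unknown")  # Replace missing departments with "Unknown"
print(df)

Drop Rows or Columns with Missing Data

# These two calls are alternatives; pick the one that suits your data
df.dropna(inplace=True)  # Drop rows with any missing values
df.dropna(axis=1, inplace=True)  # Drop columns with any missing values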
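
Dropping every row with any gap can throw away more data than you intend. The sketch below is one possible variation, not the only approach: it counts missing values, fills one column by assignment, and drops rows only when a chosen column is missing. The small DataFrame and the variable names (`missing_before`, `filled`, `named`) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample, modeled on the tutorial's DataFrame
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "David", "Eva"],
    "Salary": [50000, 60000, None, 80000, 52000],
    "Department": ["HR", "IT", "Finance", None, "HR"],
})

# Count missing values per column before cleaning
missing_before = df.isnull().sum()

# Work on a copy and assign the filled column back
filled = df.copy()
filled["Department"] = filled["Department"].fillna("Unknown")

# Drop only the rows that have no Name; other gaps are left alone
named = filled.dropna(subset=["Name"])

print(missing_before)
print(named)
```

Using `subset` keeps the decision explicit: a row is removed only because the column you care about is empty.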

2. Correcting Data Types

Check column data types:

print(df.dtypes)

Convert columns to the correct types:

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")  # Convert to numeric; values that cannot be parsed (such as "Unknown") become NaN
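
To see what `errors="coerce"` does, here is a minimal sketch on a standalone Series that mirrors the Age column. The names `ages` and `numeric_ages` are just for this example.

```python
import pandas as pd

# Mixed values: integers, a missing entry, and a text placeholder
ages = pd.Series([25, None, 35, 40, "Unknown"], name="Age")
print(ages.dtype)  # mixed values are stored with the generic object dtype

# Values that cannot be parsed as numbers are turned into NaN
numeric_ages = pd.to_numeric(ages, errors="coerce")
print(numeric_ages.dtype)
print(numeric_ages)
```

After the conversion, both the original missing entry and the "Unknown" placeholder show up as NaN, so you can handle them together with `fillna` or `dropna`.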

3. Removing Duplicates

Detect Duplicates

print(df.duplicated())  # Boolean Series marking rows that repeat an earlier row

Drop Duplicates

df.drop_duplicates(inplace=True)
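
The sample DataFrame in this tutorial has no repeated rows, so here is a small sketch with a made-up `staff` table that does. It also shows the `subset` parameter, which is useful when you only want one row per value in a given column.

```python
import pandas as pd

# Hypothetical table with one exact duplicate (rows 0 and 2)
staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Alice"],
    "Department": ["HR", "IT", "HR", "Finance"],
})

# True for rows that repeat an earlier row in every column
dup_mask = staff.duplicated()

# Remove exact duplicates, keeping the first occurrence
deduped = staff.drop_duplicates()

# Keep one row per Name, whatever the other columns contain
by_name = staff.drop_duplicates(subset=["Name"], keep="first")

print(dup_mask)
print(deduped)
print(by_name)
```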

4. Fixing Incorrect Data

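Replace Specific Values

# Has an effect only if "Unknown" was not already coerced to NaN in step 2
df["Age"] = df["Age"].replace("Unknown", 0)  # Assign back instead of using inplace=True on a column

Apply Custom Functions

def clean_salary(salary):
    # Keep positive salaries; zero, negative, or missing values become missing
    return salary if salary > 0 else None

df["Salary"] = df["Salary"].apply(clean_salary)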
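
Replacing an unknown age with 0 keeps the column numeric, but it also pulls averages down. As one possible alternative, the sketch below treats the placeholder as missing and uses `Series.where` as a vectorized counterpart to a row-by-row salary check. The data and the `clean_age` helper are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, "Unknown", 40],
    "Salary": [50000.0, -100.0, 80000.0],
})

def clean_age(age):
    """Return NaN for the "Unknown" placeholder, otherwise the age unchanged."""
    return float("nan") if age == "Unknown" else age

# Apply the custom function, then make sure the column is numeric
df["Age"] = pd.to_numeric(df["Age"].apply(clean_age))

# Keep positive salaries and set everything else to NaN
df["Salary"] = df["Salary"].where(df["Salary"] > 0)

print(df)
print(df["Age"].mean())  # NaN is skipped, so only the known ages are averaged
```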

5. Standardizing Data

Rename Columns

df.rename(columns={"Salary": "Annual_Salary", "Age": "Employee_Age"}, inplace=True)

Strip Whitespace

df["Name"] = df["Name"].str.strip()

Format String Data

# capitalize() upper-cases the first letter and lower-cases the rest, so "HR" becomes "Hr"
df["Department"] = df["Department"].str.capitalize()
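
The string methods are easier to judge on data that actually needs them. This is a small sketch with invented values showing `str.strip()` on padded names and the difference between `str.capitalize()` and `str.upper()` on department codes; which one you want depends on whether your values are words or abbreviations.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["  Alice", "Bob  ", " Eva "],
    "Department": ["hr", "FINANCE", "it"],
})

# Remove leading and trailing whitespace
df["Name"] = df["Name"].str.strip()

# First letter upper case, the rest lower case
capitalized = df["Department"].str.capitalize()

# Everything upper case, which suits abbreviations such as HR and IT
upper = df["Department"].str.upper()

print(df["Name"].tolist())
print(capitalized.tolist())
print(upper.tolist())
```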

6. Outlier Detection

Using Statistical Metrics

# "Salary" was renamed to "Annual_Salary" in step 5
q1 = df["Annual_Salary"].quantile(0.25)
q3 = df["Annual_Salary"].quantile(0.75)
iqr = q3 - q1

# Detect outliers: values more than 1.5 * IQR below Q1 or above Q3
outliers = df[(df["Annual_Salary"] < (q1 - 1.5 * iqr)) | (df["Annual_Salary"] > (q3 + 1.5 * iqr))]
print(outliers)

Removing Outliers

df = df[~((df["Annual_Salary"] < (q1 - 1.5 * iqr)) | (df["Annual_Salary"] > (q3 + 1.5 * iqr)))]
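
With only five rows, the sample data is unlikely to show the IQR rule doing much. Here is a rough sketch on a made-up salary Series with one extreme value. The helper `iqr_bounds` and its `k` parameter are illustrative names, not part of pandas.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) limits of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries with one value far above the rest
salaries = pd.Series([50000, 52000, 55000, 60000, 62000, 500000], name="Salary")

lower, upper = iqr_bounds(salaries)

# Build the mask once and reuse it for detection and removal
is_outlier = (salaries < lower) | (salaries > upper)
outliers = salaries[is_outlier]
cleaned = salaries[~is_outlier]

print(lower, upper)
print(outliers)
print(cleaned)
```

Computing the mask once avoids repeating the long condition and keeps detection and removal consistent with each other.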

Real-World Applications of Data Cleaning

  1. Machine Learning: Models trained on clean data tend to perform better.
  2. Business Analysis: Accurate data supports more reliable insights.
  3. Data Visualization: Clean data is easier to interpret and present.

Why Learn Data Cleaning with The Coding College?

At The Coding College, we emphasize practical skills that help you handle real-world challenges. Our tutorials are beginner-friendly yet comprehensive, catering to coders at every level.

Visit The Coding College for:

  • In-depth coding and data analysis tutorials.
  • Hands-on projects to apply your skills.
  • A supportive community of learners and experts.

Conclusion

Data cleaning is an essential step in any data project. Pandas provides tools to handle missing values, correct data types, remove duplicates, standardize text, and filter outliers. A clean dataset gives your analysis a much sounder footing.

Ready to enhance your coding skills? Explore more tutorials at The Coding College and start mastering data manipulation today! 🚀
