Pandas – Cleaning Data - The Coding College

Welcome to The Coding College, where we make coding and programming accessible to everyone! In this tutorial, we’ll explore data cleaning with Pandas, a crucial step in preparing data for analysis or machine learning.

Why Clean Data?

Data is rarely perfect. Cleaning data helps:

Remove inaccuracies or inconsistencies.
Handle missing values.
Standardize formats for analysis.

Getting Started

Let’s start by creating a sample DataFrame:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", None, "David", "Eva"],
    "Age": [25, None, 35, 40, "Unknown"],
    "Salary": [50000, 60000, None, 80000, 52000],
    "Department": ["HR", "IT", "Finance", None, "HR"]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name      Age   Salary Department
0  Alice       25  50000.0         HR
1    Bob     None  60000.0         IT
2   None       35      NaN    Finance
3  David       40  80000.0       None
4    Eva  Unknown  52000.0         HR

Steps to Clean Data

1. Handling Missing Data

Detect Missing Data

print(df.isnull())  # Boolean DataFrame showing missing values
print(df.isnull().sum())  # Count missing values per column
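
A raw count can be hard to judge on larger tables. As a minimal sketch, assuming a small hypothetical frame named sample that mirrors two columns of the DataFrame above, the share of missing values per column and the affected rows could be inspected like this (isna() is an alias of isnull()):

```python
import pandas as pd

# Hypothetical sample mirroring two columns of the tutorial's DataFrame
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "David", "Eva"],
    "Salary": [50000, 60000, None, 80000, 52000],
})

# The mean of a boolean mask is the fraction of True values per column
missing_share = sample.isna().mean()
print(missing_share)

# Rows that contain at least one missing value
incomplete_rows = sample[sample.isna().any(axis=1)]
print(incomplete_rows)
```

Note that a placeholder string such as "Unknown" is not counted as missing by these checks; it has to be converted or replaced first.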

Fill Missing Values

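# Assign the result back; calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update df
df["Age"] = df["Age"].fillna(30)  # Replace missing ages with 30
df["Department"] = df["Department"].fillna("Unknown")  # Replace missing departments with "Unknown"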
print(df)
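
A fixed constant is not always a sensible default. One common alternative, sketched here under the assumption that the median is a reasonable stand-in for a missing salary, is to fill with a statistic computed from the column itself. The frame name salaries is hypothetical:

```python
import pandas as pd

# Hypothetical frame holding the same Salary values as the tutorial's DataFrame
salaries = pd.DataFrame({"Salary": [50000, 60000, None, 80000, 52000]})

# median() skips NaN by default, so it is computed from the four known values
median_salary = salaries["Salary"].median()

# Assign the filled column back to the frame
salaries["Salary"] = salaries["Salary"].fillna(median_salary)
print(salaries)
```

Whether the median, the mean, or a domain-specific value is the right choice depends on the data and on what the analysis will be used for.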

Drop Rows or Columns with Missing Data

# These are alternatives; each returns a new DataFrame and leaves df unchanged
df_rows = df.dropna()        # Drop rows that contain any missing value
df_cols = df.dropna(axis=1)  # Drop columns that contain any missing value
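
Dropping every row with any gap can discard a lot of usable data. As a sketch, dropna also accepts subset (only look at certain columns) and thresh (minimum number of non-missing values a row must have). The frame below is a hypothetical example, not the tutorial's DataFrame:

```python
import pandas as pd

# Hypothetical data with gaps in different columns
sample = pd.DataFrame({
    "Name": ["Alice", None, None, "David"],
    "Salary": [50000, 60000, None, 80000],
    "Department": ["HR", "IT", None, None],
})

# Drop a row only when Name is missing; gaps in other columns are tolerated
named = sample.dropna(subset=["Name"])
print(named)

# Keep rows that have at least 2 non-missing values
mostly_complete = sample.dropna(thresh=2)
print(mostly_complete)
```

Both calls return new DataFrames, so the original sample stays available for comparison.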

2. Correcting Data Types

Check column data types:

print(df.dtypes)

Convert columns to the correct types:

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")  # Convert to numeric
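
To see what errors="coerce" does, here is a small sketch on a standalone Series with the same Age values as above. The names ages, numeric_ages, and int_ages are illustrative only. The final step uses the nullable "Int64" dtype, which is optional and assumes the remaining values are whole numbers:

```python
import pandas as pd

# Same values as the Age column: numbers, a missing entry, and a placeholder string
ages = pd.Series([25, None, 35, 40, "Unknown"], name="Age")
print(ages.dtype)  # mixed values are stored as generic Python objects

# Anything that cannot be parsed as a number becomes NaN
numeric_ages = pd.to_numeric(ages, errors="coerce")
print(numeric_ages)

# Optional: a nullable integer dtype keeps whole numbers and marks gaps as <NA>
int_ages = numeric_ages.astype("Int64")
print(int_ages)
```

After coercion, the "Unknown" entry is indistinguishable from a value that was missing to begin with, so any special handling of that placeholder has to happen before the conversion.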

3. Removing Duplicates

Detect Duplicates

print(df.duplicated())  # Boolean series showing duplicate rows

Drop Duplicates

df.drop_duplicates(inplace=True)
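
The sample DataFrame above has no repeated rows, so here is a sketch on a hypothetical frame named staff. It also shows the subset and keep parameters, which control which columns are compared and which occurrence survives:

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly; Bob appears twice with different departments
staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "Department": ["HR", "IT", "HR", "Finance"],
})

# By default all columns are compared and the first occurrence is not flagged
exact = staff.duplicated()
print(exact)

# Compare on Name only and keep the last occurrence of each name
by_name = staff.drop_duplicates(subset=["Name"], keep="last")
print(by_name)
```

Deduplicating on a subset of columns silently discards the differing values in the other columns, so it is worth checking which occurrence should be kept.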

4. Fixing Incorrect Data

Replace Specific Values

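# Run this before pd.to_numeric(..., errors="coerce"); after that step "Unknown" is already NaN.
# Assign the result back instead of using inplace=True on a selected column.
df["Age"] = df["Age"].replace("Unknown", 0)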

Apply Custom Functions

def clean_salary(salary):
    return salary if salary > 0 else None

df["Salary"] = df["Salary"].apply(clean_salary)
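
For a simple condition like this one, a vectorized alternative avoids calling a Python function per row. The sketch below uses Series.where on a hypothetical frame named pay; it is one possible approach rather than a replacement for apply in every case:

```python
import pandas as pd

# Hypothetical salaries including invalid non-positive entries
pay = pd.DataFrame({"Salary": [50000.0, -100.0, 0.0, 80000.0]})

# where() keeps values where the condition is True and writes NaN elsewhere
pay["Salary"] = pay["Salary"].where(pay["Salary"] > 0)
print(pay)
```

Values that were already NaN stay NaN, because the comparison NaN > 0 evaluates to False.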

5. Standardizing Data

Rename Columns

df.rename(columns={"Salary": "Annual_Salary", "Age": "Employee_Age"}, inplace=True)

Strip Whitespace

df["Name"] = df["Name"].str.strip()

Format String Data

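# capitalize() uppercases the first letter and lowercases the rest:
# "finance" -> "Finance", but "HR" -> "Hr". Use str.upper() to keep acronyms intact.
df["Department"] = df["Department"].str.capitalize()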
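
Whitespace and casing problems often appear together, and they make identical categories look different. A minimal sketch, assuming department codes should be compared in upper case, on a hypothetical Series named departments:

```python
import pandas as pd

# Hypothetical messy input: stray spaces, mixed case, and one missing value
departments = pd.Series(["  hr", "IT ", "finance", None, "Hr"])

# .str methods skip missing values, so the gap is left as-is
cleaned = departments.str.strip().str.upper()
print(cleaned)

# Number of distinct departments after cleaning (missing values are not counted)
print(cleaned.nunique())
```

Before cleaning, "  hr" and "Hr" would have been treated as two separate departments.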

6. Outlier Detection

Using Statistical Metrics

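# The Salary column was renamed to "Annual_Salary" in step 5
q1 = df["Annual_Salary"].quantile(0.25)
q3 = df["Annual_Salary"].quantile(0.75)
iqr = q3 - q1

# Detect outliers: values more than 1.5 * IQR below Q1 or above Q3
outliers = df[(df["Annual_Salary"] < (q1 - 1.5 * iqr)) | (df["Annual_Salary"] > (q3 + 1.5 * iqr))]
print(outliers)

Removing Outliers

df = df[~((df["Annual_Salary"] < (q1 - 1.5 * iqr)) | (df["Annual_Salary"] > (q3 + 1.5 * iqr)))]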
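
Since the same bounds are needed for both detecting and removing outliers, the calculation can be wrapped in a small helper. This is a sketch: the function name iqr_bounds and the parameter k are hypothetical, and the sample values are made up so that one of them falls outside the bounds:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) limits of the k * IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries with one extreme value
values = pd.Series([50000, 52000, 60000, 80000, 500000])

lower, upper = iqr_bounds(values)
print(lower, upper)

# between() is inclusive on both ends by default
inliers = values[values.between(lower, upper)]
print(inliers)
```

Note that between() returns False for missing values, so rows with NaN are dropped by this filter, whereas the negated comparison above keeps them.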

Real-World Applications of Data Cleaning

Machine Learning: Models trained on clean data tend to perform better.
Business Analysis: Accurate data leads to reliable insights.
Data Visualization: Clean data is easier to interpret and present.

Why Learn Data Cleaning with The Coding College?

At The Coding College, we emphasize practical skills that help you handle real-world challenges. Our tutorials are beginner-friendly yet comprehensive, catering to coders at every level.

Visit The Coding College for:

In-depth coding and data analysis tutorials.
Hands-on projects to apply your skills.
A supportive community of learners and experts.

Conclusion

Data cleaning is an essential step in any data project. Pandas provides powerful tools to handle missing values, correct data types, remove duplicates, standardize formats, and detect outliers. With a clean dataset, your analysis rests on a more reliable foundation.

Ready to enhance your coding skills? Explore more tutorials at The Coding College and start mastering data manipulation today! 🚀