Welcome to The Coding College, where we make coding and programming accessible to everyone! In this tutorial, we’ll explore data cleaning with Pandas, a crucial step in preparing data for analysis or machine learning.
Why Clean Data?
Data is rarely perfect. Cleaning data helps:
- Remove inaccuracies or inconsistencies.
- Handle missing values.
- Standardize formats for analysis.
Getting Started
Let’s start by creating a sample DataFrame:
import pandas as pd
data = {
    "Name": ["Alice", "Bob", None, "David", "Eva"],
    "Age": [25, None, 35, 40, "Unknown"],
    "Salary": [50000, 60000, None, 80000, 52000],
    "Department": ["HR", "IT", "Finance", None, "HR"],
}
df = pd.DataFrame(data)
print(df)
Output:
    Name      Age   Salary Department
0  Alice       25  50000.0         HR
1    Bob     None  60000.0         IT
2   None       35      NaN    Finance
3  David       40  80000.0       None
4    Eva  Unknown  52000.0         HR
Steps to Clean Data
1. Handling Missing Data
Detect Missing Data
print(df.isnull()) # Boolean DataFrame showing missing values
print(df.isnull().sum()) # Count missing values per column
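Counts can be hard to compare across columns, so it may help to look at the share of missing values as well. The following is a minimal sketch on a small, made-up DataFrame; the variable names `missing_share` and `rows_with_gaps` are illustrative only.

```python
import pandas as pd

# Hypothetical sample with the same kinds of gaps as the tutorial's DataFrame
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "David", "Eva"],
    "Salary": [50000, 60000, None, 80000, 52000],
})

# isna() performs the same check as isnull();
# the mean of a boolean column is the fraction of missing values
missing_share = df.isna().mean()
print(missing_share)

# Select the rows that contain at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```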
Fill Missing Values
df["Age"] = df["Age"].fillna(30) # Replace missing ages with 30
df["Department"] = df["Department"].fillna("Unknown") # Replace missing departments with "Unknown"
print(df)
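A fixed constant is not always the best fill value. As one possible alternative, the sketch below fills a numeric column with its median and uses a dict to fill a text column. The sample data is made up, and the median is just one reasonable choice of statistic.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "Salary": [50000, 60000, None, 80000, 52000],
    "Department": ["HR", "IT", "Finance", None, "HR"],
})

# median() skips missing values by default
median_salary = df["Salary"].median()

# Assign the result back instead of using inplace=True on a column selection
df["Salary"] = df["Salary"].fillna(median_salary)

# fillna also accepts a dict that maps column names to fill values
df = df.fillna({"Department": "Unknown"})
print(df)
```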
Drop Rows or Columns with Missing Data
df.dropna(inplace=True) # Drop rows with any missing values
df.dropna(axis=1, inplace=True) # Drop columns with missing values
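Dropping every row or column that has any gap can discard a lot of usable data. The sketch below shows two narrower options that `dropna` supports, `subset` and `thresh`, on a made-up DataFrame; the result names are illustrative.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "David", "Eva"],
    "Salary": [50000, 60000, None, 80000, 52000],
    "Department": ["HR", "IT", "Finance", None, "HR"],
})

# Drop rows only when "Department" is missing; gaps elsewhere are tolerated
with_department = df.dropna(subset=["Department"])

# Keep rows that have at least 2 non-missing values
mostly_complete = df.dropna(thresh=2)

print(with_department)
print(mostly_complete)
```

Both calls return a new DataFrame and leave `df` unchanged, which makes it easy to compare the options before committing to one.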
2. Correcting Data Types
Check column data types:
print(df.dtypes)
Convert columns to the correct types:
df["Age"] = pd.to_numeric(df["Age"], errors="coerce") # Convert to numeric
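To see what `errors="coerce"` does in practice, here is a small sketch on a standalone Series with the same values as the Age column. The follow-up conversion to the nullable `Int64` dtype is an optional extra, not something the steps above require.

```python
import pandas as pd

# Same values as the tutorial's Age column
ages = pd.Series([25, None, 35, 40, "Unknown"], name="Age")

# errors="coerce" turns anything that cannot be parsed ("Unknown") into NaN
numeric_ages = pd.to_numeric(ages, errors="coerce")
print(numeric_ages.dtype)  # float64, because NaN is a float

# The nullable "Int64" dtype keeps whole numbers and marks gaps as <NA>
int_ages = numeric_ages.astype("Int64")
print(int_ages)
```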
3. Removing Duplicates
Detect Duplicates
print(df.duplicated()) # Boolean series showing duplicate rows
Drop Duplicates
df.drop_duplicates(inplace=True)
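Sometimes rows count as duplicates when only some columns match. The sketch below uses the `subset` and `keep` parameters of `drop_duplicates` on a made-up DataFrame; treating "Name" as the identifying column is an assumption for the example.

```python
import pandas as pd

# Hypothetical sample data with repeated names
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Alice"],
    "Department": ["HR", "IT", "HR", "Finance"],
})

# Full-row check: only row 2 repeats an earlier row exactly
print(df.duplicated())

# Treat rows as duplicates when "Name" matches, keeping the last occurrence
latest_per_name = df.drop_duplicates(subset=["Name"], keep="last")
print(latest_per_name)
```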
4. Fixing Incorrect Data
Replace Specific Values
df["Age"] = df["Age"].replace("Unknown", 0) # Only matches if Age has not already been coerced to numeric in step 2
Apply Custom Functions
def clean_salary(salary):
    # Keep positive salaries; zero, negative, or missing values become missing
    return salary if salary > 0 else None
df["Salary"] = df["Salary"].apply(clean_salary)
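For a simple rule like this one, a vectorized call can replace the row-by-row function. The sketch below uses `Series.where` on a made-up salary Series; it is an alternative to the `apply` approach, not a required change.

```python
import pandas as pd

# Hypothetical salaries, including invalid zero and negative entries
salaries = pd.Series([50000.0, -100.0, 0.0, 80000.0], name="Salary")

# where() keeps values where the condition is True and puts NaN elsewhere
cleaned = salaries.where(salaries > 0)
print(cleaned)
```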
5. Standardizing Data
Rename Columns
df.rename(columns={"Salary": "Annual_Salary", "Age": "Employee_Age"}, inplace=True)
Strip Whitespace
df["Name"] = df["Name"].str.strip()
Format String Data
df["Department"] = df["Department"].str.capitalize() # Uppercases the first letter and lowercases the rest, so "HR" becomes "Hr"
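String methods can be chained, which is handy when the same label appears with inconsistent spacing and casing. The sketch below normalizes a made-up Series of department labels; using uppercase as the canonical form is an assumption chosen so that acronyms such as "HR" survive.

```python
import pandas as pd

# Hypothetical labels with stray whitespace and mixed casing
departments = pd.Series(["  hr", "IT ", "finance", None, "Hr"], name="Department")

# Strip surrounding whitespace, then uppercase; missing values stay missing
cleaned = departments.str.strip().str.upper()
print(cleaned)

# After normalization, "  hr" and "Hr" count as the same label
print(cleaned.value_counts())
```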
6. Outlier Detection
Using Statistical Metrics
# "Salary" was renamed to "Annual_Salary" in step 5
q1 = df["Annual_Salary"].quantile(0.25)
q3 = df["Annual_Salary"].quantile(0.75)
iqr = q3 - q1
# Detect outliers: values more than 1.5 * IQR below Q1 or above Q3
is_outlier = (df["Annual_Salary"] < (q1 - 1.5 * iqr)) | (df["Annual_Salary"] > (q3 + 1.5 * iqr))
outliers = df[is_outlier]
print(outliers)
Removing Outliers
df = df[~is_outlier]
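The IQR rule above can be wrapped in a small helper so it is easy to reuse across columns. This is a minimal sketch: `iqr_bounds` and its `k` parameter are hypothetical names introduced here, and the salary values are made up to include one obvious outlier.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) limits of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries with one extreme value
salaries = pd.Series([50000, 52000, 60000, 80000, 500000], name="Salary")

lower, upper = iqr_bounds(salaries)
print(lower, upper)  # 10000.0 122000.0

# Flag values outside the limits, then keep the rest
is_outlier = (salaries < lower) | (salaries > upper)
without_outliers = salaries[~is_outlier]
print(without_outliers)
```

The multiplier 1.5 is the conventional default; a larger `k` flags fewer values.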
Real-World Applications of Data Cleaning
- Machine Learning: Models trained on clean data tend to perform better.
- Business Analysis: Accurate data leads to more reliable insights.
- Data Visualization: Clean data is easier to interpret and present.
Conclusion
Data cleaning is an essential step in any data project. Pandas provides powerful tools to handle missing values, correct data types, remove duplicates, standardize formats, and detect outliers. A clean dataset makes your analysis more accurate and more meaningful.