Welcome to The Coding College, your one-stop destination for practical coding tutorials! In this guide, we’ll explore how to handle duplicate data in Pandas, an essential skill for maintaining the integrity of your datasets.
Why Remove Duplicates?
Duplicate entries can:
- Skew data analysis and results.
- Inflate counts and metrics.
- Complicate data processing.
Removing duplicates ensures your data is clean and accurate.
Sample DataFrame with Duplicates
Let’s create a dataset to demonstrate:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
"Age": [25, 30, 35, 25, 40, 30],
"Salary": [50000, 60000, 70000, 50000, 80000, 60000]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
3 Alice 25 50000
4 David 40 80000
5 Bob 30 60000
How to Remove Duplicates
1. Identify Duplicates
- Check for duplicates:
print(df.duplicated()) # Returns a boolean series
Output:
0 False
1 False
2 False
3 True
4 False
5 True
dtype: bool
- Count duplicates:
print(df.duplicated().sum()) # Total number of duplicate rows
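The keep parameter controls which occurrences get flagged. A minimal sketch using the same sample data, contrasting the default keep="first" with keep=False (which flags every row that has a twin, not just the later repeats):

```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
}
df = pd.DataFrame(data)

# keep="first" (default): only the later repeats are flagged
print(df.duplicated().sum())            # 2 (rows 3 and 5)

# keep=False: every member of a duplicate group is flagged
print(df.duplicated(keep=False).sum())  # 4 (rows 0, 1, 3, 5)
```

keep=False is handy when you want to inspect all copies of a duplicated record side by side before deciding which to drop.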
2. Remove Duplicates
- Remove duplicate rows (by default the first occurrence is kept and later repeats are dropped):
df_cleaned = df.drop_duplicates()
print(df_cleaned)
Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
4 David 40 80000
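Notice the gap in the index above (row 3 is missing). If you want a clean 0-based index on the result, drop_duplicates accepts ignore_index=True (pandas 1.0+); a small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
})

# ignore_index=True renumbers the surviving rows 0..n-1
df_cleaned = df.drop_duplicates(ignore_index=True)
print(df_cleaned.index.tolist())  # [0, 1, 2, 3]
```

The same effect can be had after the fact with reset_index(drop=True).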
3. Remove Duplicates Based on Specific Columns
- Remove duplicates considering only specific columns:
df_cleaned = df.drop_duplicates(subset=["Name"])
print(df_cleaned)
Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
4 David 40 80000
- Keep the last occurrence of a duplicate:
df_cleaned = df.drop_duplicates(subset=["Name"], keep="last")
print(df_cleaned)
Output:
Name Age Salary
3 Alice 25 50000
5 Bob 30 60000
2 Charlie 35 70000
4 David 40 80000
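Besides "first" and "last", keep also accepts False, which drops every occurrence of a duplicated value rather than keeping one representative. A minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
})

# keep=False: any Name appearing more than once is removed entirely
df_unique_only = df.drop_duplicates(subset=["Name"], keep=False)
print(df_unique_only["Name"].tolist())  # ['Charlie', 'David']
```

This is useful when a duplicated record signals bad data and you cannot trust any of its copies.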
4. Marking Duplicates Without Removing
- Add a column to mark duplicates:
df["IsDuplicate"] = df.duplicated()
print(df)
Output:
Name Age Salary IsDuplicate
0 Alice 25 50000 False
1 Bob 30 60000 False
2 Charlie 35 70000 False
3 Alice 25 50000 True
4 David 40 80000 False
5 Bob 30 60000 True
Advanced Techniques
Counting Duplicates
To count occurrences of duplicate rows, filter with duplicated(keep=False) and group by the data columns. Pass subset= here so that the IsDuplicate column added in the previous step does not affect the row comparison:
print(df[df.duplicated(subset=["Name", "Age", "Salary"], keep=False)].groupby(["Name", "Age", "Salary"]).size())
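An alternative that may read more simply is DataFrame.value_counts (pandas 1.1+), which returns the multiplicity of every distinct row directly. A sketch on the original three-column data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
})

# Multiplicity of each distinct row; filter to those appearing more than once
counts = df.value_counts()
print(counts[counts > 1])  # (Alice, 25, 50000) and (Bob, 30, 60000), each 2
```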
Removing Duplicate Values in a Single Column
For a single column, Series.drop_duplicates() returns only the unique values. Avoid assigning the result back into the DataFrame column (df["Name"] = df["Name"].drop_duplicates()): index alignment would fill the dropped positions with NaN. Keep the unique values in a separate variable instead:
unique_names = df["Name"].drop_duplicates()
Real-World Applications
- Data Science: Clean data for machine learning and statistical analysis.
- ETL Pipelines: Deduplicate datasets before loading into databases.
- Business Analytics: Ensure unique customer or transaction records.
Learn with The Coding College
At The Coding College, we aim to make coding practical and accessible. Explore our tutorials to enhance your skills in data manipulation, analysis, and more.
Visit The Coding College for:
- In-depth coding tutorials.
- Real-world projects for skill building.
- A growing community of coders like you.
Conclusion
Handling duplicates is a crucial step in cleaning data. With Pandas, you can identify, remove, and manage duplicates efficiently.