Welcome to The Coding College, your one-stop destination for practical coding tutorials! In this guide, we’ll explore how to handle duplicate data in Pandas, an essential skill for maintaining the integrity of your datasets.
Why Remove Duplicates?
Duplicate entries can:
- Skew data analysis and results.
- Inflate counts and metrics.
- Complicate data processing.
Removing duplicates ensures your data is clean and accurate.
Sample DataFrame with Duplicates
Let’s create a dataset to demonstrate:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
"Age": [25, 30, 35, 25, 40, 30],
"Salary": [50000, 60000, 70000, 50000, 80000, 60000]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
3 Alice 25 50000
4 David 40 80000
5 Bob 30 60000
How to Remove Duplicates
1. Identify Duplicates
- Check for duplicates:
print(df.duplicated()) # Returns a boolean series
Output:
0 False
1 False
2 False
3 True
4 False
5 True
dtype: bool
- Count duplicates:
print(df.duplicated().sum()) # Total number of duplicate rows
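The keep parameter controls which occurrences get flagged. A minimal sketch using the same sample data, contrasting the default keep="first" with keep=False (which flags every row that has a twin, not just the later repeats):

```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
}
df = pd.DataFrame(data)

# keep="first" (default): only the later repeats are flagged
print(df.duplicated().sum())            # 2 (rows 3 and 5)

# keep=False: every member of a duplicate group is flagged
print(df.duplicated(keep=False).sum())  # 4 (rows 0, 1, 3, 5)
```

keep=False is handy when you want to inspect all copies of a duplicated record side by side before deciding which to drop.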
2. Remove Duplicates
- Remove duplicate rows (by default the first occurrence is kept and later repeats are dropped):
df_cleaned = df.drop_duplicates()
print(df_cleaned)
Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
4 David 40 80000
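Notice the gap in the index above (row 3 is missing). If you want a clean 0-based index on the result, drop_duplicates accepts ignore_index=True (pandas 1.0+); a small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
})

# ignore_index=True renumbers the surviving rows 0..n-1
df_cleaned = df.drop_duplicates(ignore_index=True)
print(df_cleaned.index.tolist())  # [0, 1, 2, 3]
```

The same effect can be had after the fact with reset_index(drop=True).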
3. Remove Duplicates Based on Specific Columns
- Remove duplicates considering only specific columns:
df_cleaned = df.drop_duplicates(subset=["Name"])
print(df_cleaned)
Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
4 David 40 80000
- Keep the last occurrence of a duplicate:
df_cleaned = df.drop_duplicates(subset=["Name"], keep="last")
print(df_cleaned)
Output:
Name Age Salary
3 Alice 25 50000
5 Bob 30 60000
2 Charlie 35 70000
4 David 40 80000
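Besides "first" and "last", keep also accepts False, which drops every occurrence of a duplicated value rather than keeping one representative. A minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
})

# keep=False: any Name appearing more than once is removed entirely
df_unique_only = df.drop_duplicates(subset=["Name"], keep=False)
print(df_unique_only["Name"].tolist())  # ['Charlie', 'David']
```

This is useful when a duplicated record signals bad data and you cannot trust any of its copies.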
4. Marking Duplicates Without Removing
- Add a column to mark duplicates:
df["IsDuplicate"] = df.duplicated()
print(df)
Output:
Name Age Salary IsDuplicate
0 Alice 25 50000 False
1 Bob 30 60000 False
2 Charlie 35 70000 False
3 Alice 25 50000 True
4 David 40 80000 False
5 Bob 30 60000 True
Advanced Techniques
Counting Duplicates
To count occurrences of duplicate rows, filter with duplicated(keep=False) and group by the data columns. Pass subset= here so that the IsDuplicate column added in the previous step does not affect the row comparison:
print(df[df.duplicated(subset=["Name", "Age", "Salary"], keep=False)].groupby(["Name", "Age", "Salary"]).size())
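An alternative that may read more simply is DataFrame.value_counts (pandas 1.1+), which returns the multiplicity of every distinct row directly. A sketch on the original three-column data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
})

# Multiplicity of each distinct row; filter to those appearing more than once
counts = df.value_counts()
print(counts[counts > 1])  # (Alice, 25, 50000) and (Bob, 30, 60000), each 2
```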
Removing Duplicate Values in a Single Column
For a single column, Series.drop_duplicates() returns only the unique values. Avoid assigning the result back into the DataFrame column (df["Name"] = df["Name"].drop_duplicates()): index alignment would fill the dropped positions with NaN. Keep the unique values in a separate variable instead:
unique_names = df["Name"].drop_duplicates()
Real-World Applications
- Data Science: Clean data for machine learning and statistical analysis.
- ETL Pipelines: Deduplicate datasets before loading into databases.
- Business Analytics: Ensure unique customer or transaction records.
Learn with The Coding College
At The Coding College, we aim to make coding practical and accessible. Explore our tutorials to enhance your skills in data manipulation, analysis, and more.
Visit The Coding College for:
- In-depth coding tutorials.
- Real-world projects for skill building.
- A growing community of coders like you.
Conclusion
Handling duplicates is a crucial step in cleaning data. With Pandas, you can identify, remove, and manage duplicates efficiently.