Pandas – Removing Duplicates

Welcome to The Coding College, your one-stop destination for practical coding tutorials! In this guide, we’ll explore how to handle duplicate data in Pandas, an essential skill for maintaining the integrity of your datasets.

Why Remove Duplicates?

Duplicate entries can:

  • Skew data analysis and results.
  • Inflate counts and metrics.
  • Complicate data processing.

Removing duplicates ensures your data is clean and accurate.

Sample DataFrame with Duplicates

Let’s create a dataset to demonstrate:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    Alice   25   50000
4    David   40   80000
5      Bob   30   60000

How to Remove Duplicates

1. Identify Duplicates

  • Check for duplicates:
print(df.duplicated())  # Boolean Series: True for rows that repeat an earlier row

Output:

0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool
  • Count duplicates:
print(df.duplicated().sum())  # Total number of duplicate rows
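The boolean series can also be used as a mask to inspect the duplicate rows themselves; passing keep=False flags every copy, not just the later ones. A minimal sketch using the sample data from above:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
})

# keep=False marks every row that has at least one twin,
# so both copies of the Alice and Bob rows are selected.
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```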

2. Remove Duplicates

  • Remove duplicate rows, keeping the first occurrence (the default):
df_cleaned = df.drop_duplicates()
print(df_cleaned)

Output:

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
4    David   40   80000
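Note that the surviving rows keep their original index labels (0, 1, 2, 4 above). If a contiguous 0-based index is preferred, ignore_index=True (available since pandas 1.0) renumbers the result; a follow-up reset_index(drop=True) achieves the same:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
})

# ignore_index=True relabels the deduplicated rows 0..n-1
# instead of keeping their original index labels.
df_cleaned = df.drop_duplicates(ignore_index=True)
print(df_cleaned.index.tolist())  # [0, 1, 2, 3]
```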

3. Remove Duplicates Based on Specific Columns

  • Remove duplicates considering only specific columns:
df_cleaned = df.drop_duplicates(subset=["Name"])
print(df_cleaned)

Output:

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
4    David   40   80000
  • Keep the last occurrence of a duplicate:
df_cleaned = df.drop_duplicates(subset=["Name"], keep="last")
print(df_cleaned)

Output:

      Name  Age  Salary
2  Charlie   35   70000
3    Alice   25   50000
4    David   40   80000
5      Bob   30   60000
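Besides "first" and "last", keep=False drops every copy of a duplicated value, which is handy when only strictly unique records should survive. A short sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
})

# keep=False discards all rows whose Name appears more than once,
# leaving only names that occur exactly once.
df_unique_only = df.drop_duplicates(subset=["Name"], keep=False)
print(df_unique_only)
```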

4. Marking Duplicates Without Removing

  • Add a column to mark duplicates:
df["IsDuplicate"] = df.duplicated()
print(df)

Output:

      Name  Age  Salary  IsDuplicate
0    Alice   25   50000        False
1      Bob   30   60000        False
2  Charlie   35   70000        False
3    Alice   25   50000         True
4    David   40   80000        False
5      Bob   30   60000         True

Advanced Techniques

Counting Duplicates

To count occurrences of duplicate rows:

print(df[df.duplicated(keep=False)].groupby(["Name", "Age", "Salary"]).size())
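A slightly more compact equivalent in recent pandas (1.1+) is DataFrame.value_counts(), which counts occurrences of each distinct row; filtering for counts above 1 isolates the duplicated records:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
    "Age": [25, 30, 35, 25, 40, 30],
    "Salary": [50000, 60000, 70000, 50000, 80000, 60000],
})

# value_counts() groups on all columns and counts each distinct row.
counts = df.value_counts()
print(counts[counts > 1])  # the Alice and Bob rows, each counted twice
```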

Removing Duplicate Values in a Single Column

Be careful here: dropping duplicates from a single column and assigning the result back does not shrink the DataFrame. The shorter Series is realigned by index, so the duplicate positions become NaN:

df["Name"] = df["Name"].drop_duplicates()  # rows 3 and 5 now hold NaN
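If only the distinct values of one column are needed, Series.unique() returns them directly without modifying the DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "David", "Bob"],
})

# unique() returns the distinct values as a NumPy array,
# in order of first appearance, without touching the DataFrame.
unique_names = df["Name"].unique()
print(unique_names)
```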

Real-World Applications

  1. Data Science: Clean data for machine learning and statistical analysis.
  2. ETL Pipelines: Deduplicate datasets before loading into databases.
  3. Business Analytics: Ensure unique customer or transaction records.

Learn with The Coding College

At The Coding College, we aim to make coding practical and accessible. Explore our tutorials to enhance your skills in data manipulation, analysis, and more.

Visit The Coding College for:

  • In-depth coding tutorials.
  • Real-world projects for skill building.
  • A growing community of coders like you.

Conclusion

Handling duplicates is a crucial step in cleaning data. With Pandas, you can identify, remove, and manage duplicates efficiently.
