Pandas – Analyzing DataFrames - The Coding College

Welcome to The Coding College, where we simplify coding and programming for learners and professionals! In this guide, we’ll delve into analyzing DataFrames with Pandas, a critical skill for working with structured data in Python.

Why Analyze DataFrames?

Analyzing DataFrames allows you to:

Gain insights from your dataset.
Identify trends, patterns, and anomalies.
Prepare data for further analysis or visualization.

Getting Started with Pandas DataFrames

To analyze a DataFrame, you first need one. Let’s create a sample DataFrame for our examples:

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 29],
    "Salary": [50000, 60000, 75000, 80000, 52000],
    "Department": ["HR", "IT", "Finance", "IT", "HR"]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age  Salary Department
0    Alice   25   50000         HR
1      Bob   30   60000         IT
2  Charlie   35   75000    Finance
3    David   40   80000         IT
4      Eva   29   52000         HR

Common Data Analysis Techniques in Pandas

1. Summary Statistics

Get an overview of your data using:

print(df.describe())  # Summary statistics for numeric columns

Output:

             Age        Salary
count   5.000000      5.000000
mean   31.800000  63400.000000
std     5.805170  13520.355025
min    25.000000  50000.000000
25%    29.000000  52000.000000
50%    30.000000  60000.000000
75%    35.000000  75000.000000
max    40.000000  80000.000000
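
Because describe() returns an ordinary DataFrame, individual statistics can be looked up by row and column label. The following is a minimal sketch, assuming only the numeric columns of the sample data; the variable names are illustrative. It also checks that the std row is the sample standard deviation (dividing by n - 1), the same value Series.std() reports.

```python
import pandas as pd

# Numeric columns of the sample DataFrame, rebuilt so the snippet runs on its own.
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 29],
    "Salary": [50000, 60000, 75000, 80000, 52000],
})

summary = df.describe()

# Look up single statistics by label instead of reading them off the printout.
mean_age = summary.loc["mean", "Age"]
median_salary = summary.loc["50%", "Salary"]

# The "std" row is the sample standard deviation (ddof=1), like Series.std().
std_age = summary.loc["std", "Age"]
same_as_series_std = abs(std_age - df["Age"].std()) < 1e-9

print(mean_age, median_salary, same_as_series_std)
```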

2. Basic Information

Check the structure of the DataFrame:

df.info()  # Prints column types and non-null counts; info() prints directly and returns None

Get a quick preview:

print(df.head(3))  # First 3 rows
print(df.tail(2))  # Last 2 rows
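
When the structure needs to be checked in code rather than read from a printout, the shape and columns attributes are a common complement to info(). A short sketch under that assumption, rebuilding the sample DataFrame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 29],
    "Salary": [50000, 60000, 75000, 80000, 52000],
    "Department": ["HR", "IT", "Finance", "IT", "HR"],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
column_names = list(df.columns)

# Check a column's type without depending on the exact dtype name.
age_is_integer = pd.api.types.is_integer_dtype(df["Age"])

# head() and tail() return new DataFrames, so they can be stored and reused.
first_three = df.head(3)
last_two = df.tail(2)

print(n_rows, n_cols, column_names, age_is_integer)
```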

3. Analyzing Specific Columns

Calculate key metrics:

print("Mean Age:", df["Age"].mean())
print("Total Salary:", df["Salary"].sum())

Find unique values:

print(df["Department"].unique())  # Array of the distinct values: HR, IT, Finance

Count occurrences:

print(df["Department"].value_counts())
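
Counts are often easier to compare as proportions. The sketch below assumes the same Department column as the sample data; it uses the normalize option of value_counts() and nunique() for the number of distinct values.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Department": ["HR", "IT", "Finance", "IT", "HR"],
})

# Absolute counts per department.
dept_counts = df["Department"].value_counts()

# normalize=True returns each department's share of the rows instead.
dept_share = df["Department"].value_counts(normalize=True)

# nunique() gives the number of distinct departments.
n_departments = df["Department"].nunique()

print(dept_counts)
print(dept_share)
print(n_departments)
```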

4. Filtering Data

Find employees older than 30:

filtered_df = df[df["Age"] > 30]
print(filtered_df)

Find employees in the HR department:

hr_employees = df[df["Department"] == "HR"]
print(hr_employees)
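
Filters can also be combined. As a rough sketch using the sample data: each condition goes in its own parentheses and is joined with & (and) or | (or), isin() matches against a list of values, and query() expresses the same kind of filter as a string.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 29],
    "Salary": [50000, 60000, 75000, 80000, 52000],
    "Department": ["HR", "IT", "Finance", "IT", "HR"],
})

# Combine conditions with & and wrap each one in parentheses.
older_it = df[(df["Age"] > 30) & (df["Department"] == "IT")]

# isin() keeps rows whose value appears in the given list.
hr_or_finance = df[df["Department"].isin(["HR", "Finance"])]

# query() takes the condition as a string expression.
well_paid = df.query("Salary >= 60000")

print(older_it)
print(hr_or_finance)
print(well_paid)
```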

5. Sorting Data

Sort by salary in descending order:

sorted_df = df.sort_values(by="Salary", ascending=False)
print(sorted_df)

Sort by multiple columns:

multi_sorted = df.sort_values(by=["Department", "Age"], ascending=[True, False])  # keep df itself unchanged
print(multi_sorted)
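
Sorting keeps the original index labels, and a full sort is not always needed when only the top rows matter. A minimal sketch, assuming the sample data: nlargest() picks the highest values directly, and reset_index(drop=True) renumbers the rows after a multi-column sort.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 29],
    "Salary": [50000, 60000, 75000, 80000, 52000],
    "Department": ["HR", "IT", "Finance", "IT", "HR"],
})

# The two highest salaries, without sorting the whole DataFrame first.
top_two = df.nlargest(2, "Salary")

# Department A-Z, then oldest first within each department; renumber rows from 0.
by_dept_age = (
    df.sort_values(by=["Department", "Age"], ascending=[True, False])
      .reset_index(drop=True)
)

print(top_two)
print(by_dept_age)
```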

6. Grouping and Aggregation

Analyze data by department:

grouped = df.groupby("Department")[["Age", "Salary"]].mean()  # select numeric columns; "Name" cannot be averaged
print(grouped)

Aggregate multiple statistics:

agg = df.groupby("Department").agg({"Salary": ["mean", "max"], "Age": "median"})
print(agg)
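
The dictionary form of agg() produces two-level column names such as ("Salary", "mean"). One alternative is named aggregation, available in pandas 0.25 and later, where each output column gets a flat name. The sketch below assumes the sample data; the output column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 29],
    "Salary": [50000, 60000, 75000, 80000, 52000],
    "Department": ["HR", "IT", "Finance", "IT", "HR"],
})

# Named aggregation: output_name=(source_column, aggregation_function).
summary = df.groupby("Department").agg(
    mean_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    median_age=("Age", "median"),
    headcount=("Name", "count"),
)

print(summary)
```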

7. Handling Missing Data

Fill missing values:

df["Salary"] = df["Salary"].fillna(0)  # assign back; inplace=True on a selected column may not update df

Drop rows with missing data:

df.dropna(inplace=True)
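
The sample DataFrame has no gaps, so the calls above leave it unchanged. To see them take effect, here is a small sketch with a hypothetical three-row DataFrame that contains missing values; the data is made up purely for illustration.

```python
import pandas as pd

nan = float("nan")

# Hypothetical data with one missing Age and one missing Salary.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, nan, 35],
    "Salary": [50000, 60000, nan],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = df.isna().sum()

# Fill each column with its own replacement: mean age, zero salary.
filled = df.fillna({"Age": df["Age"].mean(), "Salary": 0})

# Alternatively, keep only the rows that have no missing values.
complete_rows = df.dropna()

print(missing_per_column)
print(filled)
print(complete_rows)
```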

8. Correlation and Covariance

Analyze relationships between numeric columns:

print(df[["Age", "Salary"]].corr())  # Correlation matrix (numeric columns only)
print(df[["Age", "Salary"]].cov())   # Covariance matrix
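
Correlation and covariance are closely related: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below, assuming the Age and Salary columns of the sample data, computes both matrices and checks that relationship by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 29],
    "Salary": [50000, 60000, 75000, 80000, 52000],
})

corr_matrix = df.corr()  # Pearson correlation by default
cov_matrix = df.cov()    # sample covariance (divides by n - 1)

# Pick single values out of the matrices by label.
r = corr_matrix.loc["Age", "Salary"]
cov_age_salary = cov_matrix.loc["Age", "Salary"]

# Correlation is covariance scaled by both standard deviations.
r_manual = cov_age_salary / (df["Age"].std() * df["Salary"].std())

print(round(r, 3), round(r_manual, 3))
```

A value of r close to 1 means that salary tends to rise with age in this small sample; it says nothing about cause.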

Real-World Applications of DataFrame Analysis

Business Insights: Analyze sales, customer demographics, and operational data.
Data Science: Prepare and explore datasets for machine learning models.
Research: Conduct statistical analysis on experimental or survey data.

Conclusion

Analyzing DataFrames with Pandas is a fundamental skill for Python programmers, data analysts, and data scientists. By mastering these techniques, you’ll be equipped to handle a wide range of data analysis tasks.