R Statistics - The Coding College

Welcome to The Coding College! In this tutorial, we’ll dive into statistics in R, one of the most popular programming languages for statistical computing and data analysis. Whether you’re a beginner or an advanced data scientist, R provides a vast collection of tools to perform everything from basic descriptive statistics to advanced inferential methods.

By the end of this guide, you’ll understand:

How to perform basic statistical analysis in R.
How to summarize data using descriptive statistics.
How to conduct inferential statistical tests like t-tests, ANOVA, and correlation.

Why Use R for Statistical Analysis?

R is widely used for statistics because:

It includes built-in functions for common statistical techniques.
It has a rich ecosystem of packages like dplyr, ggplot2, and caret for advanced analysis.
It’s open-source and supported by a large community of statisticians and data scientists.

1. Descriptive Statistics in R

Descriptive statistics summarize and describe the characteristics of your data.

Example: Basic Summary Statistics

R has built-in functions such as mean(), median(), sd() (standard deviation), var() (variance), and summary().

# Sample data
data <- c(10, 20, 30, 40, 50)

# Calculate basic statistics
mean(data)    # Mean
median(data)  # Median
sd(data)      # Standard deviation
var(data)     # Variance

# Full summary of data
summary(data)

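What these functions return for the sample data:

Mean: The average value of the dataset (30).
Median: The middle value when sorted (30).
SD: The sample standard deviation, a measure of spread around the mean (about 15.81).
Variance: The square of the standard deviation (250).
Summary: The minimum, quartiles, median, mean, and maximum in a single call.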
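
To make the definitions above concrete, here is a minimal sketch that recomputes the variance and standard deviation by hand and compares them with the built-ins. It assumes the same sample vector; the names manual_var and manual_sd are only illustrative.

```r
# Same sample data as in the example above
data <- c(10, 20, 30, 40, 50)

n <- length(data)
deviations <- data - mean(data)

# Sample variance: sum of squared deviations divided by (n - 1)
manual_var <- sum(deviations^2) / (n - 1)
manual_sd <- sqrt(manual_var)

manual_var  # 250
manual_sd   # about 15.81

# Both agree with the built-in functions
all.equal(manual_var, var(data))
all.equal(manual_sd, sd(data))

# Quartiles and the interquartile range are also built in
quantile(data)
IQR(data)   # 20
```

Note that var() and sd() divide by n - 1 (the sample formulas), not by n.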

2. Frequency Tables and Mode

Frequency tables and the mode describe categorical data: how often each value occurs, and which value occurs most often.

Example: Frequency Table

# Sample data
categories <- c("A", "B", "A", "C", "B", "B", "A")

# Create a frequency table
table(categories)
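
Counts are often easier to compare as proportions. The following sketch, using the same sample vector, converts the frequency table to relative frequencies with prop.table() and sorts the counts.

```r
# Same sample data as above
categories <- c("A", "B", "A", "C", "B", "B", "A")

# Absolute counts per category
counts <- table(categories)

# Relative frequencies (they sum to 1)
proportions <- prop.table(counts)
round(proportions, 2)

# Counts sorted from most to least frequent
sort(counts, decreasing = TRUE)
```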

Example: Find the Mode

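# Custom function to calculate the mode (base R has no built-in for this;
# mode() returns an object's storage type instead)
find_mode <- function(x) {
  unique_x <- unique(x)
  # Count each unique value and return the most frequent one
  unique_x[which.max(tabulate(match(x, unique_x)))]
}

# "A" and "B" both occur three times; ties go to the value that appears first
find_mode(categories)  # "A"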
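
In the sample vector, "A" and "B" each occur three times, and find_mode() reports only the first of them. If ties matter, one possible variant returns every value that shares the highest count. The function name find_all_modes is hypothetical, and this is a sketch rather than a standard R function.

```r
# Hypothetical helper: return all values tied for the highest frequency.
# Values are returned as character strings because they come from table names.
find_all_modes <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

categories <- c("A", "B", "A", "C", "B", "B", "A")
find_all_modes(categories)  # "A" "B"
```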

3. Inferential Statistics in R

Inferential statistics help you make predictions or generalizations about a population based on sample data.

3.1 Hypothesis Testing: t-Test

A t-test compares means. A one-sample test compares a sample mean with a hypothesized value, and a two-sample test compares the means of two groups.

Example: One-Sample t-Test

# Sample data
data <- c(5, 6, 7, 8, 9)

# Perform a one-sample t-test
t.test(data, mu = 6)
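
t.test() returns an object whose components can be read individually, which helps when you need the numbers rather than the printed report. The sketch below uses the same sample and also recomputes the t statistic by hand; t_manual is an illustrative name.

```r
# Same sample data as above
data <- c(5, 6, 7, 8, 9)

result <- t.test(data, mu = 6)

result$statistic  # t statistic, about 1.41
result$parameter  # degrees of freedom: 4
result$p.value    # two-sided p-value
result$conf.int   # 95% confidence interval for the mean
result$estimate   # sample mean: 7

# The t statistic is (sample mean - mu) / standard error
t_manual <- (mean(data) - 6) / (sd(data) / sqrt(length(data)))
all.equal(t_manual, unname(result$statistic))
```

With only five observations, the p-value here is well above 0.05, so this sample gives no evidence that the mean differs from 6.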

Example: Two-Sample t-Test

# Two groups of data
group1 <- c(5, 6, 7, 8, 9)
group2 <- c(6, 7, 8, 9, 10)

# Perform a two-sample t-test (Welch's test by default, which does not assume equal variances)
t.test(group1, group2)
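
By default t.test() runs Welch's test; passing var.equal = TRUE requests the classic pooled-variance (Student's) test instead. A short sketch comparing the two on the same groups:

```r
# Same two groups as above
group1 <- c(5, 6, 7, 8, 9)
group2 <- c(6, 7, 8, 9, 10)

welch  <- t.test(group1, group2)                    # default: Welch's test
pooled <- t.test(group1, group2, var.equal = TRUE)  # pooled-variance test

welch$method
pooled$method

# Compare the p-values side by side
c(welch = welch$p.value, pooled = pooled$p.value)
```

These two groups have identical variances and sizes, so both versions give the same result. With unequal variances or group sizes they can differ, which is why Welch's test is the safer default.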

3.2 Analysis of Variance (ANOVA)

ANOVA compares the means of multiple groups.

Example: ANOVA Test

# Sample data
group <- factor(c("A", "A", "B", "B", "C", "C"))
values <- c(5, 6, 7, 8, 9, 10)

# Perform ANOVA
anova_result <- aov(values ~ group)
summary(anova_result)
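
ANOVA indicates whether any group means differ, but not which ones. A common follow-up is Tukey's HSD test, available in base R as TukeyHSD(). The sketch below reuses the same data and also shows how to pull the F statistic and p-value out of the summary table; the names anova_table and tukey are illustrative.

```r
# Same sample data as above
group <- factor(c("A", "A", "B", "B", "C", "C"))
values <- c(5, 6, 7, 8, 9, 10)

anova_result <- aov(values ~ group)

# The first element of the summary is the ANOVA table
anova_table <- summary(anova_result)[[1]]
anova_table[["F value"]][1]  # F statistic: 16
anova_table[["Pr(>F)"]][1]   # p-value

# Pairwise comparisons between groups, adjusted for multiple testing
tukey <- TukeyHSD(anova_result)
tukey$group
```

Each row of tukey$group is one pair of groups, with the difference in means, a confidence interval, and an adjusted p-value.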

4. Correlation and Regression Analysis

4.1 Correlation

Correlation measures the strength and direction of a relationship between two variables.

Example: Correlation Coefficient

# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Calculate correlation
cor(x, y)

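cor() returns the Pearson correlation coefficient by default, which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). Here y is exactly 2 * x, so the result is 1.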
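
Real data are rarely perfectly linear, and a coefficient alone does not say whether the relationship is statistically significant. The sketch below uses slightly noisy values (made up for illustration) to show a rank-based alternative and cor.test(), which adds a p-value and confidence interval.

```r
# Hypothetical data: roughly y = 2 * x with small deviations
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson (default) measures linear association
pearson <- cor(x, y)

# Spearman measures monotonic association using ranks
spearman <- cor(x, y, method = "spearman")

pearson
spearman  # 1, because y increases whenever x increases

# Test whether the Pearson correlation differs from zero
test_result <- cor.test(x, y)
test_result$p.value
test_result$conf.int
```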

4.2 Linear Regression

Linear regression models how a dependent variable changes with one or more independent variables. The fitted model can then be used for prediction.

Example: Simple Linear Regression

# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Fit a linear model
model <- lm(y ~ x)
# y is exactly 2 * x, so summary() warns about an "essentially perfect fit"
summary(model)
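
Once a model is fitted, you will usually want its coefficients and predictions for new values. The sketch below uses slightly noisy data (made up for illustration) so that the fit is not perfect.

```r
# Hypothetical data: roughly y = 2 * x with small deviations
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

model <- lm(y ~ x)

# Intercept and slope
coef(model)

# Proportion of variance in y explained by x
summary(model)$r.squared

# 95% confidence intervals for the coefficients
confint(model)

# Predict y for a new x value
new_prediction <- predict(model, newdata = data.frame(x = 6))
new_prediction
```

The column name in newdata must match the predictor name used in the formula (x here).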

5. Working with Statistical Packages

R offers several powerful packages for statistics. Below are a few commonly used ones:

dplyr: Data Manipulation

install.packages("dplyr")  # only needed once
library(dplyr)

# Summarize data
data <- data.frame(category = c("A", "A", "B", "B"), values = c(10, 20, 30, 40))
data %>% group_by(category) %>% summarize(mean_value = mean(values))
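
For comparison, the same grouped mean can be computed without any packages. This base R sketch may be useful for checking the dplyr result or when dplyr is not installed; the name group_means is illustrative.

```r
# Same data frame as in the dplyr example
data <- data.frame(category = c("A", "A", "B", "B"), values = c(10, 20, 30, 40))

# Mean of values within each category, as a named vector
group_means <- tapply(data$values, data$category, mean)
group_means  # A: 15, B: 35

# The same result as a data frame
aggregate(values ~ category, data = data, FUN = mean)
```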

ggplot2: Data Visualization

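install.packages("ggplot2")  # only needed once
library(ggplot2)

# Visualize the data frame from the dplyr example
# (stat = "identity" plots the values as given and stacks them within each category)
ggplot(data, aes(x = category, y = values)) +
  geom_bar(stat = "identity") +
  ggtitle("Bar Chart")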

caret: Machine Learning and Statistical Modeling

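install.packages("caret")  # only needed once
library(caret)

# Fit a linear model with resampling (x and y come from the regression example)
# train() uses bootstrap resampling by default; for cross-validation, pass
# trControl = trainControl(method = "cv")
model <- train(y ~ x, data = data.frame(x, y), method = "lm")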
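
To see what cross-validation does, independent of caret, here is a minimal base R sketch of 5-fold cross-validation for a linear model. The data are simulated, and all variable names are illustrative assumptions rather than part of any package.

```r
set.seed(42)  # make the random fold assignment reproducible

# Simulated data: y is roughly 2 * x plus random noise
n <- 20
dat <- data.frame(x = seq_len(n))
dat$y <- 2 * dat$x + rnorm(n)

# Randomly assign each row to one of k folds of equal size
k <- 5
folds <- sample(rep(seq_len(k), length.out = n))

rmse <- numeric(k)
for (i in seq_len(k)) {
  train_set <- dat[folds != i, ]  # fit on the other k - 1 folds
  test_set  <- dat[folds == i, ]  # evaluate on the held-out fold

  fit <- lm(y ~ x, data = train_set)
  pred <- predict(fit, newdata = test_set)

  # Root mean squared error on the held-out fold
  rmse[i] <- sqrt(mean((test_set$y - pred)^2))
}

# Average out-of-sample error across the folds
mean(rmse)
```

Each observation is used for testing exactly once, so the averaged error estimates how the model performs on data it has not seen.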

Common Mistakes in Statistical Analysis

Ignoring Assumptions: Ensure your data meets assumptions for statistical tests (e.g., normality, independence).
Overfitting: An overly complex model can fit the sample closely yet predict new data poorly, so validate models on data they were not trained on.
Misinterpreting Results: Statistical significance does not imply practical significance.
Improper Data Cleaning: Always preprocess and clean your data before analysis.

FAQs About R Statistics

1. How do I check if my data is normally distributed?

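Use the shapiro.test() function on a numeric vector of 3 to 5000 values (not a data frame). A p-value below your significance level, commonly 0.05, suggests the data are not normally distributed: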

shapiro.test(data)
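
As an illustration of how the test behaves, the sketch below applies it to two simulated samples, one drawn from a normal distribution and one from a skewed (exponential) distribution. The sample names are illustrative.

```r
set.seed(123)  # reproducible simulated samples

normal_sample <- rnorm(200, mean = 50, sd = 10)  # drawn from a normal distribution
skewed_sample <- rexp(200, rate = 1)             # strongly right-skewed

normal_result <- shapiro.test(normal_sample)
skewed_result <- shapiro.test(skewed_sample)

# Compare the p-values of the two samples
normal_result$p.value
skewed_result$p.value  # very small: normality is rejected
```

A large p-value does not prove normality; it only means the test found no evidence against it. A Q-Q plot (qqnorm() followed by qqline()) is a useful visual complement.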

2. How do I handle missing data in R?

Use na.omit() to remove missing values, or fill them in with an imputation package such as mice, whose mice() function performs multiple imputation.
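
The base R options can be sketched as follows. The vector scores is made up for illustration, and mean imputation is shown only as the simplest possible fill-in strategy, not as a recommendation.

```r
# Hypothetical data with two missing values
scores <- c(10, NA, 30, 40, NA)

# Most summary functions return NA if any value is missing
mean(scores)                # NA

# Count the missing values
sum(is.na(scores))          # 2

# Option 1: ignore missing values within a single calculation
mean(scores, na.rm = TRUE)  # about 26.67

# Option 2: drop missing values entirely
clean <- na.omit(scores)
length(clean)               # 3

# Option 3: simple mean imputation (replace each NA with the observed mean)
imputed <- ifelse(is.na(scores), mean(scores, na.rm = TRUE), scores)
imputed
```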

3. Can I automate statistical reports in R?

Yes, you can use R Markdown to generate automated reports in HTML, PDF, or Word formats.

Conclusion

R offers a comprehensive suite of tools for statistical analysis, making it an essential tool for anyone working with data. From descriptive statistics to advanced inferential techniques, R has you covered. Practice these examples, experiment with real datasets, and share your findings!