R Statistics

Welcome to The Coding College! In this tutorial, we’ll dive into statistics in R, one of the most popular programming languages for statistical computing and data analysis. Whether you’re a beginner or an advanced data scientist, R provides a vast collection of tools to perform everything from basic descriptive statistics to advanced inferential methods.

By the end of this guide, you’ll understand:

  • How to perform basic statistical analysis in R.
  • How to summarize data using descriptive statistics.
  • How to conduct inferential statistical tests like t-tests, ANOVA, and correlation.

Why Use R for Statistical Analysis?

R is widely used for statistics because:

  • It includes built-in functions for common statistical techniques.
  • It has a rich ecosystem of packages like dplyr, ggplot2, and caret for advanced analysis.
  • It’s open-source and supported by a large community of statisticians and data scientists.

1. Descriptive Statistics in R

Descriptive statistics summarize and describe the characteristics of your data.

Example: Basic Summary Statistics

R has built-in functions like mean(), median(), sd() (standard deviation), and summary().

# Sample data
data <- c(10, 20, 30, 40, 50)

# Calculate basic statistics
mean(data)    # Mean
median(data)  # Median
sd(data)      # Standard deviation
var(data)     # Variance

# Full summary of data
summary(data)

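What these measure:

  • Mean: The average value of the dataset (30 here).
  • Median: The middle value when the data is sorted (30 here).
  • SD: The standard deviation, which measures how spread out the values are around the mean (about 15.81 here).
  • Variance: The square of the standard deviation (250 here).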
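Quartiles and the range are also useful for describing spread. The following is a minimal sketch using base R's quantile(), IQR(), and range(). The vector with_missing is a hypothetical addition, included only to show how a missing value (NA) affects these functions.

```r
# Same sample data as in the example above
data <- c(10, 20, 30, 40, 50)

# Quartiles, interquartile range, and minimum/maximum
quartiles <- quantile(data, probs = c(0.25, 0.5, 0.75))
iqr_value <- IQR(data)
data_range <- range(data)

# Hypothetical vector containing one missing value
with_missing <- c(data, NA)
mean_default <- mean(with_missing)              # NA: missing values propagate
mean_no_na <- mean(with_missing, na.rm = TRUE)  # NA is dropped before averaging

print(quartiles)
print(iqr_value)
print(data_range)
print(mean_no_na)
```

Most summary functions in base R (mean(), median(), sd(), var()) accept na.rm = TRUE in the same way.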

2. Frequency Tables and Mode

Frequency tables and the mode (the most frequent value) are the basic tools for summarizing categorical data.

Example: Frequency Table

# Sample data
categories <- c("A", "B", "A", "C", "B", "B", "A")

# Create a frequency table
table(categories)
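Raw counts are often easier to compare as proportions. As a small sketch, prop.table() converts the counts from table() into relative frequencies, and sort() orders the categories by count. The variable names below are chosen for illustration.

```r
# Same sample data as in the example above
categories <- c("A", "B", "A", "C", "B", "B", "A")

# Absolute counts per category
counts <- table(categories)

# Relative frequencies (sum to 1) and rounded percentages
proportions <- prop.table(counts)
percentages <- round(100 * proportions, 1)

# Categories ordered from most to least frequent
sorted_counts <- sort(counts, decreasing = TRUE)

print(proportions)
print(percentages)
print(sorted_counts)
```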

Example: Find the Mode

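# Custom function to calculate the statistical mode.
# Note: base R's mode() returns an object's storage mode (e.g. "numeric"), not the most frequent value.
find_mode <- function(x) {
  unique_x <- unique(x)
  # Count how often each unique value occurs and pick the most frequent
  unique_x[which.max(tabulate(match(x, unique_x)))]
}

find_mode(categories)  # "A" (A and B tie with 3 each; the first one encountered wins)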
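find_mode() returns a single value. When several values share the highest count, as "A" and "B" do in categories, it reports only the first one. One possible variant that returns every tied value is sketched here; all_modes is a hypothetical helper name, not a base R function.

```r
# Same sample data as in the example above
categories <- c("A", "B", "A", "C", "B", "B", "A")

# Hypothetical helper: return every value that shares the highest count.
# The result is always a character vector, because it is built from the table's names.
all_modes <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

modes <- all_modes(categories)
print(modes)
```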

3. Inferential Statistics in R

Inferential statistics help you make predictions or generalizations about a population based on sample data.

3.1 Hypothesis Testing: t-Test

A t-test compares a sample mean against a hypothesized value (one-sample test) or compares the means of two groups (two-sample test).

Example: One-Sample t-Test

# Sample data
data <- c(5, 6, 7, 8, 9)

# Perform a one-sample t-test
t.test(data, mu = 6)
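t.test() returns an object whose components can be used directly instead of reading them off the printed output. The sketch below pulls out the test statistic, p-value, and confidence interval. The 0.05 significance level is an assumption made for illustration.

```r
# Same sample data as in the example above
data <- c(5, 6, 7, 8, 9)

result <- t.test(data, mu = 6)

t_stat <- unname(result$statistic)  # t statistic
p_value <- result$p.value           # two-sided p-value
ci <- result$conf.int               # 95% confidence interval for the mean

alpha <- 0.05  # assumed significance level
if (p_value < alpha) {
  cat("Reject the null hypothesis that the mean is 6\n")
} else {
  cat("No evidence that the mean differs from 6\n")
}
```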

Example: Two-Sample t-Test

# Two groups of data
group1 <- c(5, 6, 7, 8, 9)
group2 <- c(6, 7, 8, 9, 10)

# Perform a two-sample t-test (Welch's t-test by default, which does not assume equal variances)
t.test(group1, group2)
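The two-sample call can be adjusted through arguments of t.test(). As a brief sketch, var.equal = TRUE requests the classic Student's t-test, and alternative = "less" tests the one-sided hypothesis that the first group's mean is smaller. Which variant is appropriate depends on your data and question; the choices here are only illustrative.

```r
# Same groups as in the example above
group1 <- c(5, 6, 7, 8, 9)
group2 <- c(6, 7, 8, 9, 10)

# Default: Welch's t-test (unequal variances allowed)
welch <- t.test(group1, group2)

# Student's t-test, assuming both groups have the same variance
student <- t.test(group1, group2, var.equal = TRUE)

# One-sided test: is the mean of group1 less than the mean of group2?
one_sided <- t.test(group1, group2, alternative = "less")

print(student$statistic)
print(student$parameter)  # degrees of freedom
print(one_sided$p.value)
```

For before/after measurements on the same subjects, t.test() also accepts paired = TRUE when called with two vectors.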

3.2 Analysis of Variance (ANOVA)

ANOVA tests whether the means of several groups (typically three or more) differ.

Example: ANOVA Test

# Sample data
group <- factor(c("A", "A", "B", "B", "C", "C"))
values <- c(5, 6, 7, 8, 9, 10)

# Perform ANOVA
anova_result <- aov(values ~ group)
summary(anova_result)
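A significant ANOVA result says that at least one group mean differs, but not which one. A common follow-up, sketched here with base R's TukeyHSD(), compares every pair of groups. The sketch also shows one way to extract the F statistic and p-value from the summary table.

```r
# Same sample data as in the example above
group <- factor(c("A", "A", "B", "B", "C", "C"))
values <- c(5, 6, 7, 8, 9, 10)

anova_result <- aov(values ~ group)

# The first element of the summary is the ANOVA table (a data frame)
anova_table <- summary(anova_result)[[1]]
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# Pairwise comparisons between groups with Tukey's Honest Significant Difference
tukey <- TukeyHSD(anova_result)
print(tukey)
```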

4. Correlation and Regression Analysis

4.1 Correlation

Correlation measures the strength and direction of a relationship between two variables.

Example: Correlation Coefficient

# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Calculate correlation
cor(x, y)

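cor() returns a correlation coefficient between -1 and +1. It computes the Pearson coefficient by default; pass method = "spearman" or method = "kendall" for rank-based alternatives. Here the result is 1, because y is exactly 2 * x.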
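To check whether a correlation is statistically significant, base R provides cor.test(). The sketch below uses a hypothetical vector y_noisy (not from the original example) so that the relationship is strong but not perfect.

```r
x <- c(1, 2, 3, 4, 5)
y_noisy <- c(2.1, 3.9, 6.2, 7.8, 10.1)  # hypothetical data, roughly 2 * x

# Pearson (default) and Spearman rank correlation
pearson <- cor(x, y_noisy)
spearman <- cor(x, y_noisy, method = "spearman")

# Significance test for the Pearson correlation
test_result <- cor.test(x, y_noisy)

print(pearson)
print(spearman)
print(test_result$p.value)
print(test_result$conf.int)
```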

4.2 Linear Regression

Linear regression models the relationship between a dependent (response) variable and one or more independent (predictor) variables, and can be used to predict the response for new inputs.

Example: Simple Linear Regression

# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Fit a linear model
model <- lm(y ~ x)
summary(model)  # y is exactly 2 * x here, so R warns about an essentially perfect fit
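Once a model is fitted, its coefficients and predictions can be extracted programmatically. This sketch uses coef() and predict() on a hypothetical, slightly noisy response (y_noisy), chosen to avoid the perfect fit that y = 2 * x produces.

```r
x <- c(1, 2, 3, 4, 5)
y_noisy <- c(2.1, 3.9, 6.2, 7.8, 10.1)  # hypothetical data, roughly 2 * x

model <- lm(y_noisy ~ x)

# Intercept and slope as a named numeric vector
coefs <- coef(model)

# Proportion of variance explained by the model
r_squared <- summary(model)$r.squared

# Predict the response for new x values (the column name must match the predictor)
new_points <- data.frame(x = c(6, 7))
predictions <- predict(model, newdata = new_points)

print(coefs)
print(r_squared)
print(predictions)
```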

5. Working with Statistical Packages

R offers several powerful packages for statistics. Below are a few commonly used ones:

dplyr: Data Manipulation

install.packages("dplyr")
library(dplyr)

# Summarize data
data <- data.frame(category = c("A", "A", "B", "B"), values = c(10, 20, 30, 40))
data %>% group_by(category) %>% summarize(mean_value = mean(values))
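For comparison, the same grouped mean can be computed without installing any package. This sketch uses base R's aggregate() and tapply(); the name df_example is chosen here only for illustration.

```r
# Same data as in the dplyr example
df_example <- data.frame(category = c("A", "A", "B", "B"),
                         values = c(10, 20, 30, 40))

# Mean of values per category, returned as a data frame
group_means <- aggregate(values ~ category, data = df_example, FUN = mean)

# The same result as a named array
means_vector <- tapply(df_example$values, df_example$category, mean)

print(group_means)
print(means_vector)
```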

ggplot2: Data Visualization

install.packages("ggplot2")
library(ggplot2)

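# Visualize the data frame from the dplyr example.
# geom_col() is shorthand for geom_bar(stat = "identity"); rows sharing a category are stacked.
ggplot(data, aes(x = category, y = values)) +
  geom_col() +
  ggtitle("Bar Chart")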

caret: Machine Learning and Statistical Modeling

install.packages("caret")
library(caret)

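# Fit a linear model with resampling (x and y are the vectors from the regression example).
# train() uses bootstrap resampling by default; for k-fold cross-validation,
# pass trControl = trainControl(method = "cv", number = k).
model <- train(y ~ x, data = data.frame(x, y), method = "lm")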

Common Mistakes in Statistical Analysis

  1. Ignoring Assumptions: Ensure your data meets assumptions for statistical tests (e.g., normality, independence).
  2. Overfitting: Be cautious when creating overly complex models.
  3. Misinterpreting Results: Statistical significance does not imply practical significance.
  4. Improper Data Cleaning: Always preprocess and clean your data before analysis.
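As an illustration of the first point, the sketch below checks two common ANOVA assumptions with base R: normality of residuals (shapiro.test()) and equal variances across groups (bartlett.test()). The data are simulated and purely hypothetical, and the seed and group means are arbitrary choices.

```r
set.seed(42)  # arbitrary seed so the simulated data are reproducible

# Hypothetical data: three groups of ten observations each
group <- factor(rep(c("A", "B", "C"), each = 10))
values <- c(rnorm(10, mean = 5), rnorm(10, mean = 7), rnorm(10, mean = 9))

fit <- aov(values ~ group)

# Assumption 1: residuals are approximately normal
normality <- shapiro.test(residuals(fit))

# Assumption 2: groups have similar variances
homogeneity <- bartlett.test(values ~ group)

print(normality$p.value)
print(homogeneity$p.value)
```

In both tests, a small p-value suggests the assumption may be violated. These tests are a starting point and do not replace looking at plots of the data.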

FAQs About R Statistics

1. How do I check if my data is normally distributed?

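Use the shapiro.test() function on a numeric vector (it accepts between 3 and 5,000 values). Here data is assumed to be a numeric vector, not a data frame:

shapiro.test(data)

A p-value above your significance level (commonly 0.05) means the test found no evidence against normality.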

2. How do I handle missing data in R?

Use na.omit() to remove missing values, pass na.rm = TRUE to functions such as mean(), or use mice() from the mice package to impute (fill in) them.
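The base R options can be sketched as follows. The scores vector is hypothetical, and the mean imputation shown at the end is only the simplest possible approach: it shrinks the variance of the data, so a dedicated imputation package is usually preferable for real analyses.

```r
# Hypothetical data with two missing values
scores <- c(12, NA, 15, 18, NA, 21)

# Count the missing values
n_missing <- sum(is.na(scores))

# Option 1: drop missing values entirely
complete_scores <- na.omit(scores)

# Option 2: ignore missing values inside a single calculation
mean_score <- mean(scores, na.rm = TRUE)

# Option 3: simple mean imputation (replace each NA with the observed mean)
imputed <- scores
imputed[is.na(imputed)] <- mean_score

print(n_missing)
print(mean_score)
print(imputed)
```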

3. Can I automate statistical reports in R?

Yes, you can use R Markdown to generate automated reports in HTML, PDF, or Word formats.

Conclusion

R offers a comprehensive suite of tools for statistical analysis, making it an essential tool for anyone working with data. From descriptive statistics to advanced inferential techniques, R has you covered. Practice these examples, experiment with real datasets, and share your findings!
