Machine Learning - Train/Test Split

Welcome to The Coding College! The Train/Test Split is a fundamental concept in Machine Learning. It’s essential for evaluating a model’s performance and ensuring it generalizes well to unseen data.

In this guide, you’ll learn what Train/Test Split is, why it’s critical, and how to implement it using Python.

What Is Train/Test Split?

The Train/Test Split is a method of dividing your dataset into two parts:

Training Set: Used to train the machine learning model.
Testing Set: Used to evaluate the model’s performance on unseen data.

Typical Split Ratios:

80/20: 80% of the data for training, 20% for testing (a common default).
70/30 or 90/10: Used depending on dataset size. Larger datasets can spare a smaller test fraction, while smaller ones may need a larger one for a stable estimate.
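To make the ratios concrete, here is a minimal sketch that splits a made-up dataset of 100 samples at three different test_size values and reports the resulting set sizes. The arrays are placeholder data invented for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): 100 samples, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Record how many samples land in each set for each ratio
split_sizes = {}
for test_size in (0.1, 0.2, 0.3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    split_sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: {len(X_train)} train / {len(X_test)} test")
```

The test_size argument is the fraction held out for testing; the rest goes to training.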

Why Is Train/Test Split Important?

Detect Overfitting: A model that scores much higher on the training set than on unseen test data is likely overfitting.
Estimate Model Accuracy: Provides an estimate of how the model will perform on new data in production.
Check Generalization: Shows whether the model has learned real patterns rather than noise in the training data.
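The points above can be sketched with a model that is prone to overfitting. This example assumes a synthetic noisy dataset from make_classification and an unrestricted decision tree, both chosen only for illustration. The gap between training and test accuracy is the signal to look for.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels randomly flipped (illustrative assumption)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A tree with no depth limit can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy:     {test_acc:.2f}")
```

Without a held-out test set, only the training accuracy would be visible, and the memorization would go unnoticed.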

Train/Test Split in Python

The Scikit-Learn library offers an easy-to-use function to split datasets:

Example Dataset

from sklearn.model_selection import train_test_split
import pandas as pd

# Sample dataset
data = {
    "Feature1": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "Feature2": [15, 25, 35, 45, 55, 65, 75, 85, 95, 105],
    "Target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
}

df = pd.DataFrame(data)

# Features and target variable
X = df[["Feature1", "Feature2"]]
y = df["Target"]

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Features:")
print(X_train)
print("Testing Features:")
print(X_test)

Visualizing the Split

Scatter plots can help visualize how data is divided:

import matplotlib.pyplot as plt

plt.scatter(X_train["Feature1"], y_train, color="blue", label="Training Data")
plt.scatter(X_test["Feature1"], y_test, color="red", label="Testing Data")
plt.title("Train/Test Split Visualization")
plt.xlabel("Feature1")
plt.ylabel("Target")
plt.legend()
plt.show()

Evaluating Model Performance with Train/Test Split

Training and Testing a Logistic Regression Model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train a model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")

Common Pitfalls in Train/Test Splits

1. Data Leakage

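Cause: Information from the test set influences training, for example when a scaler or feature selector is fitted on the full dataset before splitting.
Solution: Split first, then fit every preprocessing step on the training data only.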
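One way to keep preprocessing from leaking test information is to wrap it in a Scikit-Learn pipeline, so that it is fitted only on the training data. The sketch below assumes a synthetic dataset and a StandardScaler step; both are illustrative choices, not part of the examples above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split BEFORE any preprocessing is fitted
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_train only, then reuses those
# statistics when transforming X_test inside score()/predict()
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

scaler = pipeline.named_steps["standardscaler"]
print("Scaler means match training means:",
      np.allclose(scaler.mean_, X_train.mean(axis=0)))
print(f"Test accuracy: {pipeline.score(X_test, y_test):.2f}")
```

Calling StandardScaler().fit(X) on the full dataset before splitting would instead bake test-set statistics into the training features.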

2. Unbalanced Splits

Cause: Uneven representation of classes in training and testing sets.
Solution: Use stratified sampling by passing stratify=y to train_test_split, which keeps class proportions roughly the same in both sets.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
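To check that stratification worked, count the classes in each set. This sketch assumes a made-up imbalanced label array (90 samples of class 0 and 10 of class 1) purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count samples per class in each set
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Training class counts:", train_counts)
print("Testing class counts: ", test_counts)
```

Both sets keep the original 9:1 class ratio. Without stratify=y, a small test set could by chance contain very few minority-class samples, or none at all.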

3. Small Datasets

Cause: Splitting leaves too few samples for training, and a very small test set gives a noisy performance estimate.
Solution: Use techniques like cross-validation so that every sample is used for both training and testing.

Advanced Techniques

1. K-Fold Cross-Validation

Instead of a single train/test split, divide the data into k folds. Each fold serves once as the test set while the model is trained on the remaining k-1 folds, and the k scores are averaged.

from sklearn.model_selection import cross_val_score

# Logistic Regression with Cross-Validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-Validation Scores: {scores}")
print(f"Mean Accuracy: {scores.mean()}")

2. Train/Validation/Test Split

Use a validation set for hyperparameter tuning and reserve the test set for final evaluation.
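A three-way split can be sketched with two calls to train_test_split. The example below assumes synthetic data, a 60/20/20 ratio, and the regularization strength C of LogisticRegression as the hyperparameter being tuned; all three are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Step 1: hold out 20% as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Step 2: split the remaining 80% into 75% train / 25% validation,
# which gives 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Choose C using the validation set only; the test set stays untouched
best_C, best_val_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    candidate = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = candidate.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Final evaluation: one pass over the test set with the chosen setting
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"Best C: {best_C}, validation accuracy: {best_val_acc:.2f}")
print(f"Test accuracy: {test_acc:.2f}")
```

Because the test set plays no part in choosing C, the final score is not biased by the tuning process.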

Exercises

Exercise 1: Train/Test Split

Load a dataset (e.g., the Iris dataset from Scikit-Learn) and split it into training and testing sets.

Exercise 2: Evaluate a Model

Train a Support Vector Machine (SVM) classifier on the training set and evaluate its performance on the test set.

Exercise 3: Cross-Validation

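Implement K-Fold Cross-Validation on a regression dataset, such as the California Housing dataset from Scikit-Learn, to calculate the mean squared error for a regression model. (The Boston Housing dataset was removed from Scikit-Learn in version 1.2.)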

Why Learn with The Coding College?

At The Coding College, we focus on making foundational concepts like Train/Test Split easy to understand and implement. By following our tutorials, you’ll gain practical skills to build and evaluate robust Machine Learning models.

Conclusion

The Train/Test Split is a cornerstone of Machine Learning: it lets you check whether a model generalizes to unseen data. By mastering this technique, you take the first step toward building reliable, high-performing models.
