Machine Learning – Logistic Regression

Logistic Regression is one of the most fundamental algorithms in machine learning for classification problems. Despite its name, Logistic Regression is not a regression algorithm but a classification technique.

In this tutorial on The Coding College, we’ll explore Logistic Regression, its mathematical foundation, implementation in Python, and its applications.

What Is Logistic Regression?

Logistic Regression is a supervised learning algorithm used to predict categorical outcomes based on input variables. The algorithm works by estimating probabilities using a logistic function (sigmoid function).

Key Characteristics:

  • Binary Classification: Commonly used for problems with two classes (e.g., spam vs. not spam).
  • Probabilistic Output: Outputs probabilities, which can be converted into class predictions.

Logistic Function (Sigmoid Function):

The sigmoid function maps any real number into a range between 0 and 1, making it ideal for probability prediction.

How Logistic Regression Works

  1. Input Features: Use input variables (features) to compute a weighted sum (zz).
  2. Apply Sigmoid Function: Convert zz into a probability using the sigmoid function.
  3. Classify: Assign a class label based on a threshold (e.g., 0.5).

Logistic Regression vs Linear Regression

AspectLogistic RegressionLinear Regression
OutputProbability (0-1)Continuous values
Type of ProblemClassificationRegression
Assumption of LinearityLinear relationship in log-oddsLinear relationship between input and output

Implementing Logistic Regression in Python

Example: Predicting Diabetes

We’ll use the Pima Indians Diabetes dataset to demonstrate Logistic Regression.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigree", "Age", "Outcome"]
data = pd.read_csv(url, header=None, names=columns)

# Split into features and target
X = data.iloc[:, :-1]
y = data["Outcome"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

Evaluating the Model

  1. Accuracy Score:
print("Accuracy:", accuracy_score(y_test, y_pred))
  1. Confusion Matrix:
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
  1. Classification Report:
print("Classification Report:\n", classification_report(y_test, y_pred))

Real-World Applications

  1. Medical Diagnosis: Predicting diseases based on patient data.
  2. Spam Detection: Identifying spam emails.
  3. Credit Scoring: Determining loan eligibility.
  4. Customer Retention: Predicting whether a customer will churn.

Advantages and Disadvantages

Advantages:

  • Simple and Interpretable: Easy to understand and explain.
  • Efficient: Works well with small to medium-sized datasets.
  • Probabilistic Predictions: Provides confidence scores for predictions.

Disadvantages:

  • Linearity Assumption: Assumes a linear relationship between features and log-odds.
  • Sensitive to Outliers: Outliers can affect model performance.
  • Limited to Binary Classification: Requires extensions for multi-class problems (e.g., one-vs-rest).

Exercises

Exercise 1: Logistic Regression on Titanic Dataset

Use the Titanic dataset to predict survival. Visualize the coefficients to interpret feature importance.

Exercise 2: Multi-Class Classification

Extend Logistic Regression for multi-class classification using the Iris dataset. Evaluate the model’s performance.

Exercise 3: Threshold Adjustment

Experiment with different classification thresholds (e.g., 0.4, 0.6) and observe the impact on precision and recall.

Why Learn at The Coding College?

At The Coding College, we make machine learning concepts like Logistic Regression accessible to everyone. Our hands-on tutorials and exercises are designed to help you apply these algorithms in real-world scenarios.

Conclusion

Logistic Regression is a foundational algorithm in machine learning, offering simplicity and effectiveness for classification tasks. By understanding its workings and applications, you can confidently tackle a variety of binary classification problems.

Leave a Comment