Machine Learning – K-Nearest Neighbors (KNN)

The K-Nearest Neighbors (KNN) algorithm is one of the simplest and most intuitive machine learning algorithms. It can be used for both classification and regression tasks, making it a versatile tool for beginners and experts alike.

At The Coding College, we aim to make machine learning concepts easy to grasp and apply. This guide explains how KNN works, its advantages and limitations, and how to implement it in Python.

What Is K-Nearest Neighbors (KNN)?

KNN is a non-parametric, instance-based (lazy learning) algorithm: rather than fitting an explicit model during training, it stores the training data and predicts the class or value of a new data point from the labels of its most similar neighbors.

How It Works:

  1. Choose the number of neighbors, K.
  2. Calculate the distance between the new data point and every point in the training set (commonly the Euclidean distance; see the from-scratch sketch after this list).
  3. Select the K closest neighbors.
  4. For classification: Assign the most common class among the neighbors.
    For regression: Take the average of the neighbors’ values.
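
To make the distance step concrete, here is a minimal from-scratch sketch. It computes Euclidean distances with NumPy, picks the K closest training points, and takes a majority vote; the helper name knn_predict and the toy arrays are illustrative, not part of any library.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the K closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustration: two clusters, query point near the first one
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # prints 0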

Key Features of KNN

  • Simplicity: There is no real training phase; the algorithm simply stores the training dataset and defers all computation to prediction time.
  • Versatility: Supports both classification and regression tasks.
  • Dynamic Decision Boundaries: Because every prediction is computed locally, the decision boundary adapts to the shape of the data.

Advantages and Disadvantages

Advantages:

  • Easy to understand and implement.
  • No assumptions about the data distribution.
  • Works well with small datasets.

Disadvantages:

  • Computationally expensive at prediction time as the dataset grows, since every query is compared against all stored points (see the note after this list).
  • Sensitive to noisy data and irrelevant features.
  • Requires careful selection of the value of K.
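
The prediction cost can be reduced in practice. Scikit-Learn's neighbor estimators accept an algorithm parameter ('auto', 'brute', 'kd_tree', or 'ball_tree'); the tree-based options speed up neighbor queries on larger, low-dimensional datasets. A one-line sketch:

from sklearn.neighbors import KNeighborsClassifier

# Use a k-d tree instead of brute-force distance computation
knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')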

KNN Implementation in Python

Here’s a step-by-step implementation using Scikit-Learn:

Example: KNN for Classification

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)  # Set K=5

# Train the model
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Example: KNN for Regression

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic data
X, y = make_regression(n_samples=200, n_features=1, noise=10, random_state=42)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize KNN regressor
knn_regressor = KNeighborsRegressor(n_neighbors=3)

# Train the model
knn_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = knn_regressor.predict(X_test)

# Evaluate the model
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

Choosing the Right Value of K

The choice of K affects the model’s performance significantly:

  • Low K (e.g., K = 1): the model follows individual noisy points and tends to overfit.
  • High K: predictions are averaged over many neighbors, which smooths the decision boundary and can underfit. (For binary classification, odd values of K also help avoid ties.)

Use cross-validation to select the optimal K. The snippet below scores each candidate K with 5-fold cross-validation on the iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Load the data (the same iris dataset used above)
X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validation accuracy for each candidate K
k_values = range(1, 20)
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean() for k in k_values]

optimal_k = k_values[np.argmax(scores)]
print("Optimal K:", optimal_k)

Applications of KNN

  • Recommendation Systems: Suggest items based on similar user profiles.
  • Image Recognition: Classify images based on feature similarity.
  • Medical Diagnosis: Predict disease based on patient data.

Exercises

Exercise 1: Experiment with Different K Values

Use a classification dataset and observe the model’s accuracy as you vary K.

Exercise 2: Feature Scaling

KNN is sensitive to feature scaling. Normalize your dataset (for example with the StandardScaler pipeline shown earlier) and compare the results with and without scaling.

Exercise 3: Real-World Dataset

Apply KNN to a real-world dataset, such as the UCI Machine Learning Repository’s datasets.

Why Learn KNN at The Coding College?

At The Coding College, we believe in hands-on learning. Understanding algorithms like KNN empowers you to solve real-world problems effectively.

Conclusion

The K-Nearest Neighbors (KNN) algorithm is a powerful yet simple tool for classification and regression tasks. While easy to implement, it requires careful preprocessing and parameter tuning for optimal performance.
