K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is one of the simplest and most intuitive supervised learning algorithms. It belongs to the family of Lazy Learners (or Instance-Based Learners) because it doesn't build a mathematical model during training. Instead, it stores the entire training dataset and performs a calculation only when a new prediction is requested.
1. How KNN Works
The core philosophy of KNN is: "Show me your neighbors, and I'll tell you who you are."
When you want to classify a new data point:
- Calculate Distance: Find the distance between the new point and all points in the training set.
- Find Neighbors: Pick the K closest points (the "neighbors").
- Vote: The new point is assigned the class that is most common among its neighbors.
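The three steps above translate almost directly into code. Below is a minimal, from-scratch sketch of the voting procedure (the `knn_predict` function and the toy data are illustrative, not part of any library):

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # 1. Calculate Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Find the indices of the k closest points (the "neighbors")
    nearest = np.argsort(distances)[:k]
    # 3. Vote: return the most common class among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two small clusters of 2-D points
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))  # -> 0
print(knn_predict(X_train, y_train, np.array([8, 7])))  # -> 1
```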
2. Distance Metrics
To find the "nearest" neighbor, we need a mathematical way to measure distance.
Euclidean Distance
The most common metric, representing the "straight-line" distance between two points p and q: d(p, q) = √(Σᵢ (pᵢ − qᵢ)²).
Manhattan Distance
Also known as "City Block" distance, it measures distance along axes at right angles: d(p, q) = Σᵢ |pᵢ − qᵢ|.
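As a quick illustration (plain NumPy, with made-up points), the two metrics give different values for the same pair:

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean: straight-line distance -> sqrt(3**2 + 4**2) = 5.0
euclidean = np.sqrt(np.sum((p - q) ** 2))

# Manhattan: sum of absolute per-axis differences -> 3 + 4 = 7.0
manhattan = np.sum(np.abs(p - q))

print(euclidean, manhattan)  # 5.0 7.0
```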
3. Choosing the Right 'K'
The choice of K (the number of neighbors) is critical to the model's performance:
- Small K (e.g., K = 1): The model is extremely sensitive to noise and outliers. This leads to Overfitting.
- Large K (e.g., K close to the number of training samples): The model becomes too "smooth" and might ignore local patterns. This leads to Underfitting.
Rule of Thumb: A common practice is to set K to the square root of the number of training samples:
K ≈ √N
where N is the number of training samples. However, always validate your choice of K using techniques like Cross-Validation.
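A sketch of that validation using scikit-learn's GridSearchCV (the Iris dataset and the grid of odd K values are illustrative assumptions, not prescriptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale inside the pipeline so each cross-validation fold is scaled correctly
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Try odd values of K and keep the one with the best cross-validated accuracy
param_grid = {'kneighborsclassifier__n_neighbors': list(range(1, 22, 2))}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```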
4. Implementation with Scikit-Learn
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 0. Example data (Iris) and a train/test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 1. IMPORTANT: KNN requires feature scaling!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Initialize and train (n_neighbors is the 'K' parameter)
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_scaled, y_train)

# 3. Predict
y_pred = knn.predict(X_test_scaled)
```
5. Pros and Cons
| Advantages | Disadvantages |
|---|---|
| Simple to understand and implement. | Slow Prediction: It must calculate distance to every point for every prediction. |
| No assumptions about data distribution. | Memory Intensive: Must store the entire dataset in RAM. |
| Naturally handles multi-class classification. | Sensitive to Scale: Features with larger units will dominate the distance. |
6. The Curse of Dimensionality
KNN suffers significantly from the "Curse of Dimensionality." As the number of features grows, the volume of the feature space expands so quickly that the data becomes sparse and even the "nearest" neighbors end up far away, making distance-based voting unreliable.
Solution: Always perform Dimensionality Reduction (e.g., PCA) or Feature Selection before using KNN on high-dimensional data, as in the sketch below.
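For instance, PCA and KNN can be chained in a single scikit-learn pipeline; the dataset and the choice of 10 components here are purely illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # 64 features per sample

# Scale, reduce to 10 dimensions with PCA, then run KNN in the reduced space
pipe = make_pipeline(StandardScaler(), PCA(n_components=10),
                     KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(pipe, X, y, cv=5).mean())
```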
References for More Details
- Scikit-Learn KNN Documentation: covers advanced neighbor-search algorithms such as BallTree and KDTree for faster lookups.
KNN is great for simple patterns, but what if you want a model that learns an explicit decision "boundary" or a tree of logical rules?