K-Nearest Neighbor (Geeks for Geeks)

What is the K-Nearest Neighbors Algorithm?

KNN is one of the most basic yet essential classification algorithms in machine learning. It belongs to the supervised learning domain and is widely used in pattern recognition, data mining, and intrusion detection.

It is broadly applicable in real-life scenarios because it is non-parametric: it makes no underlying assumptions about the distribution of the data (as opposed to algorithms such as GMM, which assume a Gaussian distribution of the given data). We are given some prior data (also called training data), which classifies coordinates into groups identified by an attribute.

Intuition Behind KNN Algorithm

If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given an unclassified point, we can assign it to a group by observing what group its nearest neighbors belong to. This means a point close to a cluster of points classified as 'Red' has a higher probability of getting classified as 'Red'.

Intuitively, we can see that the first point (2.5, 7) should be classified as 'Green', and the second point (5.5, 4.5) should be classified as 'Red'.

How to select the value of K in the K-NN Algorithm?

The value of K is critical in the K-NN algorithm. A small K can make the model sensitive to noise, while a large K can smooth out the predictions too much. Typically, K is chosen by experimenting with different values and using cross-validation to evaluate the performance. A common approach is to start with K = sqrt(n), where n is the number of data points, and adjust based on the model's performance.
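For illustration, here is a minimal sketch of that tuning loop using scikit-learn (an assumption on our part; the synthetic data and variable names below are not from the original article):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data; in practice X and y come from your dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

best_k, best_score = None, -1.0
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(model, X, y, cv=5).mean()  # mean 5-fold accuracy
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)  # the K with the best cross-validated accuracy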

How does K-NN work?

K-Nearest Neighbor (K-NN) is a simple, supervised machine learning algorithm used for classification and regression tasks. It works by finding the K closest data points (neighbors) to the query point and using them to make predictions. For classification, it assigns the class most common among the neighbors. For regression, it averages the values of the neighbors.
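To make the prediction step concrete, here is a small sketch (the helper names are ours, not from the article) of how the two task types combine the K neighbors once they are found:

from collections import Counter

def predict_classification(neighbor_labels):
    # Majority vote: the most common class among the K neighbors wins.
    return Counter(neighbor_labels).most_common(1)[0][0]

def predict_regression(neighbor_values):
    # Average of the neighbors' numeric target values.
    return sum(neighbor_values) / len(neighbor_values)

print(predict_classification(['Red', 'Red', 'Green']))  # Red
print(predict_regression([2.0, 3.0, 4.0]))              # 3.0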

Problem Description

Given a dataset and a query point, implement the K-Nearest Neighbor (KNN) algorithm to find the k points in the dataset that are closest to the query point.

Examples

Example 1:

Input:
dataset = [(2, 4), (4, 4), (4, 2), (2, 2)]
query_point = (3, 3)
k = 2
Output: [(2, 4), (4, 4)]

Example 2:

Input:
dataset = [(1, 2), (2, 3), (3, 4), (4, 5)]
query_point = (2, 2)
k = 3
Output: [(1, 2), (2, 3), (3, 4)]

Your Task

Implement the function knn() that takes in a dataset, a query point, and an integer k. The function should return the k points in the dataset that are closest to the query point based on Euclidean distance.

Expected Time Complexity: O(n log n) for sorting the dataset.

Expected Auxiliary Space: O(n) for storing the distances.

Constraints

  • 1 <= len(dataset) <= 10^4
  • 1 <= k <= len(dataset)
  • Each point in the dataset is a tuple of integers (x, y).
  • The query point is a tuple of integers (x, y).

Problem Explanation

Here's the step-by-step breakdown of the K-Nearest Neighbor algorithm:

  1. Calculate Distances: Compute the Euclidean distance between the query point and each point in the dataset (see the worked example after this list).
  2. Sort Distances: Sort the points based on their distances from the query point.
  3. Select Neighbors: Select the first k points from the sorted list as the nearest neighbors.
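For reference, the Euclidean distance between two points (x1, y1) and (x2, y2) is sqrt((x1 - x2)^2 + (y1 - y2)^2). Working through Example 1, the distance from the query point (3, 3) to (2, 4) is sqrt((3 - 2)^2 + (3 - 4)^2) = sqrt(2) ≈ 1.41; in fact, all four points in that dataset tie at sqrt(2), so the sort's tie-breaking determines which two are returned.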

Code Implementation

Written by @ngmuraqrdd

import math

def euclidean_distance(point1, point2):
    # Straight-line distance between two 2D points.
    return math.sqrt((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2)

def knn(dataset, query_point, k):
    # Pair each point with its distance from the query point.
    distances = []
    for point in dataset:
        distance = euclidean_distance(point, query_point)
        distances.append((point, distance))
    # Sort by distance, then keep the k closest points.
    distances.sort(key=lambda x: x[1])
    nearest_neighbors = [distances[i][0] for i in range(k)]
    return nearest_neighbors
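As a quick sanity check (driver code of our own, not part of the reference solution), calling knn() on Example 1's data reproduces the expected output; since all four points are equidistant, Python's stable sort keeps them in input order:

dataset = [(2, 4), (4, 4), (4, 2), (2, 2)]
print(knn(dataset, (3, 3), 2))  # [(2, 4), (4, 4)]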

Solution Logic

  1. Calculate Distances: Iterate through each point in the dataset and compute the Euclidean distance to the query point.
  2. Sort Distances: Sort the points by their distance using Python's built-in sort.
  3. Select Neighbors: Extract the k closest points from the sorted list.

Time Complexity

O(n log n), where n is the number of points in the dataset. This is due to the sorting step.
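As an aside (a variation of ours, not the reference solution), when only the k nearest points are needed, a heap-based selection brings this down to roughly O(n log k):

import heapq
import math

def knn_heap(dataset, query_point, k):
    # Same result as knn(), but heapq.nsmallest avoids sorting the whole list.
    return heapq.nsmallest(k, dataset, key=lambda p: math.dist(p, query_point))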

Space Complexity

O(n), for storing the distances of all points in the dataset.

Additional Topics

Advantages and Disadvantages

Advantages:

  • Simple to understand and implement.
  • No training phase, making it fast for small datasets.
  • Effective with a well-distributed dataset.

Disadvantages:

  • Computationally intensive for large datasets as it requires calculating the distance of each data point.
  • Performance degrades with high-dimensional data due to the curse of dimensionality.
  • Sensitive to irrelevant or redundant features, requiring feature scaling and selection (see the sketch after this list).
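To illustrate that last point, here is a minimal sketch (our own helper, with made-up numbers) of standardizing features so that no single dimension dominates the Euclidean distance:

def standardize(points):
    # Rescale each coordinate to zero mean and unit variance.
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]

# Unscaled, the second feature (tens of thousands) would dominate the first (hundreds).
print(standardize([(170, 30000), (160, 50000), (180, 40000)]))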