K-Nearest Neighbors (KNN)
- Predicts a point’s class by majority vote among its k nearest data points.
- Non-parametric and lazy: makes no assumptions about data distribution and requires no prior training.
- Sensitive to the choice of k and the distance metric; computes distances to existing points at prediction time.
Definition
K-Nearest Neighbors, or KNN, is a non-parametric, lazy classification algorithm: it makes no assumptions about the underlying data distribution and requires no prior training. For a given query point, KNN finds the k nearest data points and predicts the query's class as the majority class among those neighbors.
Explanation
KNN assigns a class to a new data point by identifying the k-nearest existing points (according to a chosen distance metric) and taking the majority class among them. Because it performs no model fitting beforehand, KNN is considered a lazy learner and relies on distance calculations at prediction time. An advantage is its simplicity and ease of implementation; the algorithm’s computation consists primarily of calculating distances between the new point and existing points. A limitation is sensitivity to the chosen value of k and to the distance metric, which can lead to overfitting or underfitting if not chosen carefully.
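To make the procedure concrete, here is a minimal from-scratch sketch in Python. The function name, toy data, and the choice of Euclidean distance are illustrative assumptions, not part of any standard library:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point; all work
    # happens here, at prediction time -- the "lazy" part of the algorithm.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority class among those k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two small clusters labeled 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0]), k=3))  # -> 1
```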
Examples
Classification with colored points
Given a dataset of red and blue points, KNN classifies a new point by identifying its k nearest neighbors. If the majority of those neighbors are red, the new point is classified as red; if the majority are blue, it is classified as blue.
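A sketch of this scenario using scikit-learn's KNeighborsClassifier; the point coordinates and labels below are invented for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D dataset of red and blue points.
X = np.array([[1, 1], [1, 2], [2, 1],    # red cluster
              [6, 6], [6, 7], [7, 6]])   # blue cluster
y = np.array(["red", "red", "red", "blue", "blue", "blue"])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

# The new point's 3 nearest neighbors are all red, so it is classified red.
print(clf.predict([[1.5, 1.5]]))  # -> ['red']
```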
Recommendation systems
With a dataset of users and the movies they have watched, KNN can predict which movies a given user would like by finding the k-nearest users based on similarity of movie preferences. Movies liked by the majority of those similar users are then recommended to the given user.
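A minimal sketch of this idea using scikit-learn's NearestNeighbors; the users, movie titles, and ratings matrix are made up for illustration, and cosine similarity is one common (but not the only) choice for comparing users:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical user-movie matrix: rows are users, columns are movies,
# 1 means the user liked/watched the movie.
movies = ["Alien", "Blade Runner", "Casablanca", "Dune", "E.T."]
ratings = np.array([
    [1, 1, 0, 1, 0],   # user 0 (the target)
    [1, 1, 0, 1, 1],   # user 1
    [0, 0, 1, 0, 1],   # user 2
    [1, 0, 0, 1, 1],   # user 3
])

target, k = 0, 2
# Ask for k + 1 neighbors because the nearest match is the user themselves.
nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(ratings)
_, idx = nn.kneighbors(ratings[[target]])
neighbors = idx[0][1:]  # drop the self-match

# Recommend movies liked by a majority of neighbors but unseen by the target.
votes = ratings[neighbors].sum(axis=0)
recs = [m for m, v, seen in zip(movies, votes, ratings[target])
        if v > k / 2 and not seen]
print(recs)  # -> ['E.T.']
```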
Use cases
- Recommendation systems: predict items a user would like based on preferences of similar users.
Notes or pitfalls
- Choice of k and distance metric strongly affects performance.
- Risk of overfitting (too small k) or underfitting (too large k) if k is not selected appropriately; see the selection sketch after this list.
- Computational cost occurs at prediction time due to distance calculations against existing points.
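One common way to choose k is to score candidate values with cross-validation and keep the best. A minimal sketch, in which a synthetic dataset from make_classification stands in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic dataset standing in for a real one.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Mean 5-fold cross-validation accuracy for a range of odd k values.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in range(1, 22, 2)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```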
Related terms
- Non-parametric
- Lazy learning
- Distance metric
- Classification algorithm