K-Nearest Neighbors (KNN)
- Predicts a point’s class by majority vote among its k nearest data points.
- Non-parametric and lazy: makes no assumptions about data distribution and requires no prior training.
- Sensitive to the choice of k and the distance metric; computes distances to existing points at prediction time.
Definition
K-Nearest Neighbors, or KNN, is a non-parametric, lazy classification algorithm: it makes no assumptions about the underlying data distribution and requires no prior training. For a given query point, KNN finds the k nearest data points and predicts the query's class as the majority class among those neighbors.
Explanation
KNN assigns a class to a new data point by identifying the k-nearest existing points (according to a chosen distance metric) and taking the majority class among them. Because it performs no model fitting beforehand, KNN is considered a lazy learner and relies on distance calculations at prediction time. An advantage is its simplicity and ease of implementation; the algorithm’s computation consists primarily of calculating distances between the new point and existing points. A limitation is sensitivity to the chosen value of k and to the distance metric, which can lead to overfitting or underfitting if not chosen carefully.
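To make the procedure concrete, here is a minimal from-scratch sketch in Python. The function name, toy data, and the choice of Euclidean distance are illustrative assumptions, not part of any standard library:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point; all work
    # happens here, at prediction time -- the "lazy" part of the algorithm.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority class among those k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two small clusters labeled 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0]), k=3))  # -> 1
```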
Examples
Classification with colored points
Given a dataset of red and blue points, KNN classifies a new point by identifying its k nearest neighbors. If the majority of those neighbors are red, the new point is classified as red; if the majority are blue, it is classified as blue.
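A sketch of this scenario using scikit-learn's KNeighborsClassifier; the point coordinates and labels below are invented for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D dataset of red and blue points.
X = np.array([[1, 1], [1, 2], [2, 1],    # red cluster
              [6, 6], [6, 7], [7, 6]])   # blue cluster
y = np.array(["red", "red", "red", "blue", "blue", "blue"])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

# The new point's 3 nearest neighbors are all red, so it is classified red.
print(clf.predict([[1.5, 1.5]]))  # -> ['red']
```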
Recommendation systems
With a dataset of users and the movies they have watched, KNN can predict which movies a given user would like by finding the k-nearest users based on similarity of movie preferences. Movies liked by the majority of those similar users are then recommended to the given user.
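A minimal sketch of this idea using scikit-learn's NearestNeighbors; the users, movie titles, and ratings matrix are made up for illustration, and cosine similarity is one common (but not the only) choice for comparing users:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical user-movie matrix: rows are users, columns are movies,
# 1 means the user liked/watched the movie.
movies = ["Alien", "Blade Runner", "Casablanca", "Dune", "E.T."]
ratings = np.array([
    [1, 1, 0, 1, 0],   # user 0 (the target)
    [1, 1, 0, 1, 1],   # user 1
    [0, 0, 1, 0, 1],   # user 2
    [1, 0, 0, 1, 1],   # user 3
])

target, k = 0, 2
# Ask for k + 1 neighbors because the nearest match is the user themselves.
nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(ratings)
_, idx = nn.kneighbors(ratings[[target]])
neighbors = idx[0][1:]  # drop the self-match

# Recommend movies liked by a majority of neighbors but unseen by the target.
votes = ratings[neighbors].sum(axis=0)
recs = [m for m, v, seen in zip(movies, votes, ratings[target])
        if v > k / 2 and not seen]
print(recs)  # -> ['E.T.']
```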
Use cases
- Recommendation systems: predict items a user would like based on preferences of similar users.
Notes or pitfalls
- Choice of k and distance metric strongly affects performance.
- Risk of overfitting (too small k) or underfitting (too large k) if k is not selected appropriately; see the selection sketch after this list.
- Computational cost occurs at prediction time due to distance calculations against existing points.
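One common way to choose k is to score candidate values with cross-validation and keep the best. A minimal sketch, in which a synthetic dataset from make_classification stands in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic dataset standing in for a real one.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Mean 5-fold cross-validation accuracy for a range of odd k values.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in range(1, 22, 2)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```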
Related terms
- Non-parametric
- Lazy learning
- Distance metric
- Classification algorithm