K-Means
- Partitions a dataset into a specified number K of clusters by repeatedly assigning points to the nearest cluster mean and recomputing those means.
- Initialization typically chooses K initial centers (often at random); assignment uses a distance measure (the source uses Euclidean distance).
- Stop when cluster centers no longer change or when a stopping criterion is met (e.g., maximum iterations or a threshold on center change).
Definition
K-means is an iterative clustering algorithm that divides a dataset into a specified number of clusters (K) by assigning each data point to the cluster with the closest mean and updating cluster centers until convergence.
Explanation
The algorithm proceeds by:
- Choosing a number K of clusters and selecting K initial cluster centers (commonly at random).
- For each data point, computing its distance to each cluster center (the source uses the Euclidean distance) and assigning the point to the cluster with the nearest center.
- Updating each cluster center to be the mean of all points assigned to that cluster.
- Repeating assignment and update steps until the cluster centers converge (they stop changing) or a stopping criterion is reached (for example, a maximum number of iterations or a threshold on the change in the centers).
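The loop described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not an optimized implementation; the `init` parameter for supplying fixed starting centers is added here for reproducibility and is not part of the algorithm as described.

```python
import numpy as np

def kmeans(points, k, init=None, max_iters=100, tol=1e-6, seed=0):
    """Basic K-means: assign points to the nearest center, then re-average."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Choose K initial centers by sampling distinct data points at random.
        centers = points[rng.choice(len(points), size=k, replace=False)]
    else:
        centers = np.asarray(init, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iters):
        # Assignment step: Euclidean distance from every point to every center,
        # then the index of the nearest center for each point.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers change by less than the tolerance.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels

# Demo on four well-separated points; centers converge near (0, 0.5) and (10, 10.5).
centers, labels = kmeans(np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]]),
                         2, init=[[0., 0.], [10., 10.]])
print(centers)
```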
Examples
2D points example
Dataset of five 2D points: (1,2), (3,4), (5,6), (7,8), (9,10); we want K = 2 clusters.
- Initial cluster centers (selected at random): (1,2) and (9,10).
- Compute distances (using Euclidean distance as stated in the source):
  - Point (1,2): distance to (1,2) = 0, distance to (9,10) = 11.3
  - Point (3,4): distance to (1,2) = 2.8, distance to (9,10) = 8.5
  - Point (5,6): distance to (1,2) = 5.7, distance to (9,10) = 5.7 (a tie, broken here in favor of the second center)
  - Point (7,8): distance to (1,2) = 8.5, distance to (9,10) = 2.8
  - Point (9,10): distance to (1,2) = 11.3, distance to (9,10) = 0
- Assign each point to the nearest center:
  - First cluster: (1,2) and (3,4)
  - Second cluster: (5,6), (7,8), (9,10)
- Update cluster centers by computing the mean of assigned points:
  - First cluster center → (2,3)
  - Second cluster center → (7,8)
- Repeat assignment and update until the centers converge. Re-running the assignment with centers (2,3) and (7,8) leaves both clusters unchanged, so the final cluster centers are (2,3) and (7,8), with final assignments:
  - First cluster: (1,2), (3,4)
  - Second cluster: (5,6), (7,8), (9,10)
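The update and convergence check from this example can be verified numerically. This sketch takes the first-pass cluster assignments as given (point (5,6), which is equidistant from the two initial centers, is placed in the second cluster as in the walkthrough) and confirms that the updated centers leave the assignments unchanged:

```python
import numpy as np

points = np.array([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)], dtype=float)

# Assignments after the first pass: 0 = first cluster, 1 = second cluster.
labels = np.array([0, 0, 1, 1, 1])

# Update step: each center is the mean of its cluster's points.
centers = np.array([points[labels == j].mean(axis=0) for j in (0, 1)])
print(centers)  # first center (2, 3), second center (7, 8)

# Reassign with the updated centers; the labels stay [0 0 1 1 1], so we have converged.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
print(dists.argmin(axis=1))
```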
Customer segmentation example
A retail company collects data on products purchased, amount spent, and purchase frequency for each customer. Using K-means, the company can group customers into segments (for example, high-value customers, frequent shoppers, and occasional buyers) by finding cluster centers from the purchase data and assigning each customer to the nearest cluster.
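In practice this kind of segmentation is usually done with a library implementation rather than hand-rolled code. A possible sketch using scikit-learn's `KMeans` is shown below; the customer feature values are invented for illustration, and in real use the features would typically be scaled first so that no single column dominates the distance:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [amount spent, purchase frequency].
# These numbers are made up for illustration only.
customers = np.array([
    [900.0, 2],   # high-value, infrequent
    [850.0, 3],
    [120.0, 30],  # low spend, very frequent
    [100.0, 25],
    [40.0, 2],    # occasional buyer
    [55.0, 3],
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(model.labels_)           # segment index assigned to each customer
print(model.cluster_centers_)  # mean spend/frequency per segment
```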
Use cases
- Customer segmentation for grouping customers by purchasing behavior (as described in the customer segmentation example).
Notes or pitfalls
- K must be specified in advance.
- The algorithm typically stops when cluster centers converge or when a stopping criterion is met, such as a maximum number of iterations or a threshold on the change in cluster centers.
Related terms
- Euclidean distance
- Cluster center (mean)
- K (number of clusters)