K-Means
- Partitions a dataset into a specified number K of clusters by repeatedly assigning points to the nearest cluster mean and recomputing those means.
- Initialization typically chooses K initial centers (often at random); assignment uses a distance measure (the source uses Euclidean distance).
- Stop when cluster centers no longer change or when a stopping criterion is met (e.g., maximum iterations or a threshold on center change).
Definition
K-means is an iterative clustering algorithm that divides a dataset into a specified number of clusters (K) by assigning each data point to the cluster with the closest mean and updating cluster centers until convergence.
Explanation
The algorithm proceeds by:
- Choosing a number K of clusters and selecting K initial cluster centers (commonly at random).
- For each data point, computing its distance to each cluster center (the source uses the Euclidean distance) and assigning the point to the cluster with the nearest center.
- Updating each cluster center to be the mean of all points assigned to that cluster.
- Repeating assignment and update steps until the cluster centers converge (they stop changing) or a stopping criterion is reached (for example, a maximum number of iterations or a threshold on the change in the centers).
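The loop described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not an optimized implementation; the `init` parameter for supplying fixed starting centers is added here for reproducibility and is not part of the algorithm as described.

```python
import numpy as np

def kmeans(points, k, init=None, max_iters=100, tol=1e-6, seed=0):
    """Basic K-means: assign points to the nearest center, then re-average."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Choose K initial centers by sampling distinct data points at random.
        centers = points[rng.choice(len(points), size=k, replace=False)]
    else:
        centers = np.asarray(init, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iters):
        # Assignment step: Euclidean distance from every point to every center,
        # then the index of the nearest center for each point.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers change by less than the tolerance.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels

# Demo on four well-separated points; centers converge near (0, 0.5) and (10, 10.5).
centers, labels = kmeans(np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]]),
                         2, init=[[0., 0.], [10., 10.]])
print(centers)
```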
Examples
2D points example
Dataset of five 2D points: (1,2), (3,4), (5,6), (7,8), (9,10); we want K = 2 clusters.
- Initial cluster centers (selected at random): (1,2) and (9,10).
- Compute distances (using Euclidean distance as stated in the source):
  - Point (1,2): distance to (1,2) = 0, distance to (9,10) = 11.3
  - Point (3,4): distance to (1,2) = 2.8, distance to (9,10) = 8.5
  - Point (5,6): distance to (1,2) = 5.7, distance to (9,10) = 5.7 (a tie, broken here in favor of the second center)
  - Point (7,8): distance to (1,2) = 8.5, distance to (9,10) = 2.8
  - Point (9,10): distance to (1,2) = 11.3, distance to (9,10) = 0
- Assign each point to the nearest center:
  - First cluster: (1,2) and (3,4)
  - Second cluster: (5,6), (7,8), (9,10)
- Update cluster centers by computing the mean of assigned points:
  - First cluster center → (2,3)
  - Second cluster center → (7,8)
- Repeat assignment and update until the centers converge. Re-running the assignment with centers (2,3) and (7,8) leaves both clusters unchanged, so the final cluster centers are (2,3) and (7,8), with final assignments:
  - First cluster: (1,2), (3,4)
  - Second cluster: (5,6), (7,8), (9,10)
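The update and convergence check from this example can be verified numerically. This sketch takes the first-pass cluster assignments as given (point (5,6), which is equidistant from the two initial centers, is placed in the second cluster as in the walkthrough) and confirms that the updated centers leave the assignments unchanged:

```python
import numpy as np

points = np.array([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)], dtype=float)

# Assignments after the first pass: 0 = first cluster, 1 = second cluster.
labels = np.array([0, 0, 1, 1, 1])

# Update step: each center is the mean of its cluster's points.
centers = np.array([points[labels == j].mean(axis=0) for j in (0, 1)])
print(centers)  # first center (2, 3), second center (7, 8)

# Reassign with the updated centers; the labels stay [0 0 1 1 1], so we have converged.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
print(dists.argmin(axis=1))
```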
Customer segmentation example
A retail company collects data on products purchased, amount spent, and purchase frequency for each customer. Using K-means, the company can group customers into segments (for example, high-value customers, frequent shoppers, and occasional buyers) by finding cluster centers from the purchase data and assigning each customer to the nearest cluster.
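In practice this kind of segmentation is usually done with a library implementation rather than hand-rolled code. A possible sketch using scikit-learn's `KMeans` is shown below; the customer feature values are invented for illustration, and in real use the features would typically be scaled first so that no single column dominates the distance:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [amount spent, purchase frequency].
# These numbers are made up for illustration only.
customers = np.array([
    [900.0, 2],   # high-value, infrequent
    [850.0, 3],
    [120.0, 30],  # low spend, very frequent
    [100.0, 25],
    [40.0, 2],    # occasional buyer
    [55.0, 3],
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(model.labels_)           # segment index assigned to each customer
print(model.cluster_centers_)  # mean spend/frequency per segment
```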
Use cases
- Customer segmentation for grouping customers by purchasing behavior (as described in the customer segmentation example).
Notes or pitfalls
- K must be specified in advance.
- The algorithm typically stops when cluster centers converge or when a stopping criterion is met, such as a maximum number of iterations or a threshold on the change in cluster centers.
Related terms
- Euclidean distance
- Cluster center (mean)
- K (number of clusters)