Skip to content

K Means

  • Partitions a dataset into a specified number K of clusters by repeatedly assigning points to the nearest cluster mean and recomputing those means.
  • Initialization typically chooses K initial centers (often at random); assignment uses a distance measure (the source uses Euclidean distance).
  • Stop when cluster centers no longer change or when a stopping criterion is met (e.g., maximum iterations or a threshold on center change).

K-means is an iterative clustering algorithm that divides a dataset into a specified number of clusters (K) by assigning each data point to the cluster with the closest mean and updating cluster centers until convergence.

The algorithm proceeds by:

  • Choosing a number K of clusters and selecting K initial cluster centers (commonly at random).
  • For each data point, computing its distance to each cluster center (the source uses the Euclidean distance) and assigning the point to the cluster with the nearest center.
  • Updating each cluster center to be the mean of all points assigned to that cluster.
  • Repeating assignment and update steps until the cluster centers converge (they stop changing) or a stopping criterion is reached (for example, a maximum number of iterations or a threshold on the change in the centers).

Dataset of five 2D points: (1,2), (3,4), (5,6), (7,8), (9,10). Want K = 2 clusters.

  1. Initial cluster centers (selected at random): (1,2) and (9,10).

  2. Compute distances (using Euclidean distance as stated in the source):

    • Point (1,2): distance to (1,2) = 0, distance to (9,10) = 13.4
    • Point (3,4): distance to (1,2) = 2.8, distance to (9,10) = 10.6
    • Point (5,6): distance to (1,2) = 5.7, distance to (9,10) = 8.6
    • Point (7,8): distance to (1,2) = 8.5, distance to (9,10) = 6.4
    • Point (9,10): distance to (1,2) = 13.4, distance to (9,10) = 0
  3. Assign each point to the nearest center:

    • First cluster: (1,2) and (3,4)
    • Second cluster: (5,6), (7,8), (9,10)
  4. Update cluster centers by computing the mean of assigned points:

    • First cluster center → (2,3)
    • Second cluster center → (7,8)
  5. Repeat assignment and update until centers converge. In this example the final cluster centers are (2,3) and (7,8), with final assignments:

    • First cluster: (1,2), (3,4)
    • Second cluster: (5,6), (7,8), (9,10)

A retail company collects data on products purchased, amount spent, and purchase frequency for each customer. Using K-means, the company can group customers into segments (for example, high-value customers, frequent shoppers, and occasional buyers) by finding cluster centers from the purchase data and assigning each customer to the nearest cluster.

  • Customer segmentation for grouping customers by purchasing behavior (as described in the customer segmentation example).
  • K must be specified in advance.
  • The algorithm typically stops when cluster centers converge or when a stopping criterion is met, such as a maximum number of iterations or a threshold on the change in cluster centers.
  • Euclidean distance
  • Cluster center (mean)
  • K (number of clusters)