~/K-means++

Brandon Rozek

Photo of Brandon Rozek

PhD Student @ RPI studying Automated Reasoning in AI and Linux Enthusiast.

K-means++ is an algorithm for choosing the initial values or seeds for the k-means clustering algorithm. This was proposed as a way of avoiding the sometimes poor clustering found by a standard k-means algorithm.

Intuition

The intuition behind this approach involves spreading out the $k$ initial cluster centers. The first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point’s closest existing cluster center.

Algorithm

The exact algorithm is as follows

  1. Choose one center uniformly at random from among data points
  2. For each data point $x$, compute $D(x)$, the distance between $x$ and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point $x$ is chosen with probability proporitonal to $D(x)^2$
  4. Repeat steps 2 and 3 until $k$ centers have been chosen
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering