Centroid-based Clustering

Brandon Rozek

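In centroid-based clustering, each cluster is represented by a central vector, which may or may not be a member of the dataset. In practice, the number of clusters is fixed to $k$, and the goal is to solve an optimization problem: choose $k$ centroids and assign each point to one of them so that a cost, such as the total squared distance from the points to their centroids, is minimized.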

The similarity of two clusters is defined as the similarity of their centroids.

The optimization problem is NP-hard in general, so efficient heuristic algorithms are commonly employed instead of exact ones. These usually converge quickly to a local optimum.

K-means clustering

K-means clustering aims to partition $n$ observations into $k$ clusters, where each observation belongs to the cluster with the nearest mean, and that mean serves as the centroid of the cluster.

This technique results in partitioning the data space into Voronoi cells.

Description

Given a set of observations $(x_1, \ldots, x_n)$, k-means clustering aims to partition the $n$ observations into $k$ sets $S = \{S_1, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (i.e. variance). More formally, the objective is to find $$ \operatorname{argmin}_S \sum_{i = 1}^k \sum_{x \in S_i} \|x-\mu_i\|^2 = \operatorname{argmin}_S \sum_{i = 1}^k |S_i| \operatorname{Var}(S_i) $$ where $\mu_i$ is the mean of the points in $S_i$. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster: $$ \operatorname{argmin}_S \sum_{i = 1}^k \frac{1}{2|S_i|} \sum_{x, y \in S_i} \|x-y\|^2 $$
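The equivalence between the two objectives can be checked numerically on a small example. The sketch below is only an illustration, not part of the original notes: the helper names (`sq_dist`, `mean`, `wcss`, `pairwise_form`) and the toy clusters are made up for this purpose, and the pairwise sum runs over all ordered pairs $(x, y)$, which is what the factor $\frac{1}{2|S_i|}$ assumes.

```python
from itertools import product


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def wcss(clusters):
    """Within-cluster sum of squares: sum over clusters of ||x - mu_i||^2."""
    total = 0.0
    for cluster in clusters:
        mu = mean(cluster)
        total += sum(sq_dist(x, mu) for x in cluster)
    return total


def pairwise_form(clusters):
    """Same objective written with pairwise squared deviations."""
    total = 0.0
    for cluster in clusters:
        # Sum over all ordered pairs (x, y), including x == y (distance 0).
        pair_sum = sum(sq_dist(x, y) for x, y in product(cluster, repeat=2))
        total += pair_sum / (2 * len(cluster))
    return total


# Hypothetical toy partition with two clusters.
clusters = [
    [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
    [(10.0, 10.0), (12.0, 14.0)],
]

print(wcss(clusters), pairwise_form(clusters))  # prints 18.0 18.0
```

Both forms give the same value for any partition, so minimizing one minimizes the other.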

Algorithm

Given an initial set of $k$ means, the algorithm proceeds by alternating between two steps.

Assignment step: Assign each observation to the cluster whose mean has the least squared Euclidean distance to it.

Update step: Calculate the new means to be the centroids of the observations in the new clusters.

The algorithm has converged when the assignments no longer change. There is no guarantee that the global optimum is found using this algorithm; it may stop at a local optimum.
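The two alternating steps and the stopping rule can be sketched in a few lines of plain Python. This is a minimal illustration under some assumptions that the text does not specify: the function name `kmeans`, the `max_iter` safety cap, and the choice to keep the previous mean when a cluster ends up empty are all made up for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, means, max_iter=100):
    """Alternate assignment and update steps until assignments stop changing."""
    assignments = None
    for _ in range(max_iter):  # max_iter is an assumed safety cap
        # Assignment step: index of the nearest mean for every observation.
        new_assignments = [
            min(range(len(means)), key=lambda j: sq_dist(p, means[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no observation changed cluster
        assignments = new_assignments

        # Update step: each mean becomes the centroid of its cluster.
        new_means = []
        for j in range(len(means)):
            members = [p for p, a in zip(points, assignments) if a == j]
            # Assumption: an empty cluster keeps its previous mean.
            new_means.append(mean(members) if members else means[j])
        means = new_means
    return means, assignments


# Hypothetical data: two well-separated groups and two initial means.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
means, assignments = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(assignments)  # prints [0, 0, 0, 1, 1, 1]
```

Ties in the assignment step go to the lowest cluster index here, which is just a consequence of how `min` breaks ties.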

The result depends on the initial clusters. It is common to run the algorithm multiple times with different starting conditions and keep the run with the lowest within-cluster sum of squares.

Using a distance function other than the squared Euclidean distance may stop the algorithm from converging.

Initialization methods

Commonly used initialization methods are Forgy and Random Partition.

Forgy Method: This method randomly chooses $k$ observations from the data set and uses these as the initial means.

This method tends to spread the initial means out.

Random Partition Method: This method first randomly assigns a cluster to each observation and then proceeds to the update step.

This method is known to place most of the means close to the center of the dataset.
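Both initialization methods are short enough to sketch directly. The code below is an illustration rather than a reference implementation: the function names `forgy_init` and `random_partition_init` are invented for this example, and falling back to a random observation when a randomly assigned cluster turns out empty is an assumption, since the description above does not cover that case.

```python
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def forgy_init(points, k, rng):
    """Forgy: pick k distinct observations and use them as initial means."""
    return rng.sample(points, k)


def random_partition_init(points, k, rng):
    """Random Partition: label every observation at random, then run the
    update step once to get the initial means."""
    labels = [rng.randrange(k) for _ in points]
    means = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        # Assumption: an empty cluster falls back to a random observation.
        means.append(mean(members) if members else rng.choice(points))
    return means


# Hypothetical data: a 20 x 20 grid of points.
points = [(float(i), float(j)) for i in range(20) for j in range(20)]
rng = random.Random(0)  # seeded only to make the example repeatable

print(forgy_init(points, 3, rng))
print(random_partition_init(points, 3, rng))
```

On data like this grid, the Forgy means are actual grid points scattered across the data, while each Random Partition mean averages roughly a third of all points and so tends to land near the overall center, matching the behavior described above.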