~/K-Medoids

Brandon Rozek

PhD Student @ RPI studying Automated Reasoning in AI and Linux Enthusiast.

A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal.
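
As a minimal sketch of this definition, the R snippet below finds the medoid of a small cluster. The data points and the helper name `find_medoid` are made up for illustration, and Euclidean distance stands in for whichever dissimilarity is actually in use.

```r
# Hypothetical helper: index of the row whose average dissimilarity
# to all rows of x is smallest
find_medoid <- function(x, method = "euclidean") {
  d <- as.matrix(dist(x, method = method))  # pairwise dissimilarities
  which.min(rowMeans(d))
}

# Made-up cluster of five 2-D points; the last one sits far from the rest
pts <- rbind(c(1, 1), c(2, 1), c(2, 2), c(3, 2), c(9, 9))
medoid_idx <- find_medoid(pts)

# The medoid is always one of the original observations
pts[medoid_idx, ]
```

Unlike a centroid, the result is a row of `pts` itself, so it remains meaningful even when averaging the observations would not be.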

The k-medoids algorithm is related to k-means and the medoidshift algorithm. Both k-means and k-medoids are partitional (they break the dataset up into groups), and both attempt to minimize the distance between the points in a cluster and its center. In contrast to k-means, k-medoids chooses actual data points as centers and can work with an arbitrary dissimilarity between data points, such as the Manhattan norm, instead of the squared Euclidean distance.

K-medoids is generally more robust to noise and outliers than k-means because it minimizes a sum of dissimilarities between points and their medoids instead of a sum of squared Euclidean distances, so a single distant point has less influence on the result.
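
To make the robustness claim a little more concrete, here is a small sketch with made-up one-dimensional data. It compares the center k-means would use for a single cluster (the mean) with the center k-medoids would use (the data point with the smallest total distance to the others). The values are purely illustrative.

```r
# Made-up 1-D cluster with one outlier
values <- c(1, 2, 3, 4, 100)

# k-means style center: the mean, which minimizes squared Euclidean distance
center_mean <- mean(values)

# k-medoids style center: the data point with the smallest total
# absolute distance to all points in the cluster
total_dist <- sapply(values, function(v) sum(abs(values - v)))
center_medoid <- values[which.min(total_dist)]

c(mean = center_mean, medoid = center_medoid)
```

The outlier drags the mean well away from the bulk of the data, while the medoid stays among the four nearby points.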

Algorithms

Several algorithms have been developed as heuristics that avoid an exhaustive search over all possible sets of medoids. In this section, we’ll discuss PAM, the Voronoi iteration method, and CLARA.

Partitioning Around Medoids (PAM)

  1. Select $k$ of the $n$ data points as medoids
  2. Associate each data point to the closest medoid
  3. While the cost of the configuration decreases:
    1. For each medoid $m$, for each non-medoid data point $o$:
      1. Swap $m$ and $o$, recompute the cost (sum of distances of points to their medoid)
      2. If the total cost of the configuration increased in the previous step, undo the swap
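
The steps above can be sketched in R roughly as follows. This is a naive illustration, not the optimized implementation found in libraries: the names `config_cost` and `pam_sketch` are hypothetical, the initial medoids are drawn at random, and a swap is kept only when it strictly lowers the cost (which is equivalent to undoing swaps that do not help).

```r
# Cost of a configuration: sum of distances from each point to its closest
# medoid. d is a full dissimilarity matrix, medoids a vector of row indices.
config_cost <- function(d, medoids) {
  sum(apply(d[, medoids, drop = FALSE], 1, min))
}

# Hypothetical helper following the numbered steps
pam_sketch <- function(x, k, method = "euclidean") {
  d <- as.matrix(dist(x, method = method))
  n <- nrow(d)
  medoids <- sample(n, k)              # step 1: pick k data points
  cost <- config_cost(d, medoids)      # step 2 is implicit in the cost
  improved <- TRUE
  while (improved) {                   # step 3: loop while the cost decreases
    improved <- FALSE
    for (i in seq_len(k)) {
      for (o in setdiff(seq_len(n), medoids)) {
        candidate <- medoids
        candidate[i] <- o              # swap medoid i with non-medoid o
        new_cost <- config_cost(d, candidate)
        if (new_cost < cost) {         # keep the swap only if it helps
          medoids <- candidate
          cost <- new_cost
          improved <- TRUE
        }
      }
    }
  }
  clustering <- unname(apply(d[, medoids, drop = FALSE], 1, which.min))
  list(medoids = medoids, clustering = clustering, cost = cost)
}

# Made-up data: two well-separated groups of ten points each
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
fit <- pam_sketch(x, k = 2)
fit$medoids
```

Recomputing the full cost for every candidate swap keeps the code close to the description but is expensive, which is one reason sampling variants exist for large datasets.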

Voronoi Iteration Method

  1. Select $k$ of the $n$ data points as medoids
  2. While the cost of the configuration decreases:
    1. In each cluster, make the point that minimizes the sum of distances within the cluster the medoid
    2. Reassign each point to the cluster defined by the closest medoid determined in the previous step.
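
A rough R sketch of this iteration is shown below. The function name `voronoi_sketch` is hypothetical, and the sketch assumes one detail the steps leave open: points are assigned to their closest medoid once before the loop starts, so that clusters exist for the first update.

```r
# Hypothetical helper following the numbered steps
voronoi_sketch <- function(x, k, method = "euclidean") {
  d <- as.matrix(dist(x, method = method))
  # Index (1..k) of the closest medoid for every point
  closest <- function(m) unname(apply(d[, m, drop = FALSE], 1, which.min))
  # Sum of distances from every point to its closest medoid
  total_cost <- function(m) sum(apply(d[, m, drop = FALSE], 1, min))

  medoids <- sample(nrow(d), k)        # step 1: pick k data points
  clustering <- closest(medoids)       # assumed initial assignment
  cost <- total_cost(medoids)

  repeat {                             # step 2: loop while the cost decreases
    # 2.1: in each cluster, the member with the smallest sum of
    # within-cluster distances becomes the medoid
    new_medoids <- sapply(seq_len(k), function(j) {
      members <- which(clustering == j)
      if (length(members) == 0) return(medoids[j])  # only with duplicate points
      members[which.min(rowSums(d[members, members, drop = FALSE]))]
    })
    # 2.2: reassigning to the closest new medoid determines the new cost
    new_cost <- total_cost(new_medoids)
    if (new_cost >= cost) break
    medoids <- new_medoids
    clustering <- closest(medoids)
    cost <- new_cost
  }
  list(medoids = medoids, clustering = clustering, cost = cost)
}

# Made-up data: two groups of ten points each
set.seed(2)
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
fit <- voronoi_sketch(x, k = 2)
fit$cost
```

Because medoids only move within their current cluster, the result depends on the random start, so in practice one would likely run it from several starting points and keep the lowest-cost result.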

Clustering Large Applications (CLARA)

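CLARA is a variant of the PAM algorithm that relies on sampling to handle large datasets: it runs PAM on several random subsets of the data and keeps the set of medoids that gives the lowest cost. The cost of a particular cluster configuration is the mean dissimilarity between each observation in the full dataset and its closest medoid.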
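
The sampling idea can be sketched in R as follows. This is only an illustration built on top of `pam` from the `cluster` package: the helper name `clara_sketch` is hypothetical, it assumes `x` is a numeric matrix with Euclidean distance and `k` of at least 2, and the real `clara` function differs in its details.

```r
library(cluster)

# Hypothetical helper: run PAM on random subsets and keep the medoids
# with the lowest mean dissimilarity over the full dataset
clara_sketch <- function(x, k, samples = 5,
                         sampsize = min(nrow(x), 40 + 2 * k)) {
  best <- list(cost = Inf)
  for (s in seq_len(samples)) {
    idx <- sample(nrow(x), sampsize)           # draw a random subset
    fit <- pam(x[idx, , drop = FALSE], k)      # PAM on the subset only
    # Euclidean distance from every observation in x to each medoid
    d <- sapply(seq_len(k), function(j) {
      sqrt(rowSums(sweep(x, 2, fit$medoids[j, ])^2))
    })
    cost <- mean(apply(d, 1, min))             # mean dissimilarity to closest medoid
    if (cost < best$cost) {
      best <- list(medoids = fit$medoids,
                   clustering = unname(apply(d, 1, which.min)),
                   cost = cost)
    }
  }
  best
}

# Made-up data: two groups of 100 points each
set.seed(3)
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(200, mean = 6, sd = 0.5), ncol = 2))
fit <- clara_sketch(x, k = 2)
fit$cost
```

Only the sampled rows ever enter a pairwise distance computation, so the work per sample depends on `sampsize` rather than on the size of the full dataset.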

R Implementations

Both PAM and CLARA are implemented in the cluster package in R.

clara(x, k, metric = "euclidean", stand = FALSE, samples = 5, 
      sampsize = min(n, 40 + 2 * k), trace = 0, medoids.x = TRUE,
      keep.data = medoids.x, rngR = FALSE)
pam(x, k, metric = "euclidean", stand = FALSE)
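
As a brief usage sketch, the two functions might be called like this on made-up data. The dataset, the seed, and the choice of `k = 2` are assumptions for illustration only.

```r
library(cluster)

# Made-up data: two groups of 100 two-dimensional points each
set.seed(42)
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(200, mean = 4, sd = 0.5), ncol = 2))

# PAM with Manhattan distance instead of the default Euclidean
pam_fit <- pam(x, k = 2, metric = "manhattan")
pam_fit$medoids             # coordinates of the chosen medoids
table(pam_fit$clustering)   # cluster sizes

# CLARA with the default sample size
clara_fit <- clara(x, k = 2, samples = 5)
clara_fit$medoids
clara_fit$objective         # mean dissimilarity to the closest medoid
```

For a dataset this small the two fits should look very similar; CLARA is more likely to pay off when the full dissimilarity matrix that PAM needs becomes too large to compute.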