~/Silhouette

Brandon Rozek

PhD Student @ RPI studying Automated Reasoning in AI and Linux Enthusiast.

Silhouette analysis validates the consistency within clusters of data. It provides a succinct graphical representation of how well each object fits within its cluster.

The silhouette value ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

If many points have a low or negative silhouette value, the current clustering configuration may have too many or too few clusters.

Definition

For each datum $i$, let $a(i)$ be the average distance from $i$ to all other data within the same cluster.

$a(i)$ can be interpreted as how well $i$ is assigned to its cluster (lower values mean better agreement).

We can then define the average dissimilarity of point $i$ to a cluster $c$ as the average distance from $i$ to all points in $c$.

Let $b(i)$ be the lowest average distance from $i$ to all points in any cluster of which $i$ is not a member.

The cluster with this lowest average dissimilarity is said to be the neighboring cluster of $i$. From here we can define a silhouette:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} $$

The average $s(i)$ over all data of a cluster is a measure of how tightly grouped all the data in the cluster are. A silhouette plot may be used to visualize the agreement between each of the data and its cluster.
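The definition can be sketched in plain Python. This is a minimal illustration, not a library implementation: it assumes Euclidean distance (the text does not fix a metric), at least two clusters, and the common convention that a point alone in its cluster gets $s(i) = 0$. The name `silhouette_values` and the sample data are made up for this example.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def silhouette_values(points, labels):
    """Return s(i) for every point (hypothetical helper, Euclidean distance)."""
    values = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = labels[i]
        if own not in by_cluster:
            # Assumption: a point alone in its cluster gets s(i) = 0.
            values.append(0.0)
            continue
        # a(i): average distance to the other members of i's own cluster.
        a = sum(by_cluster[own]) / len(by_cluster[own])
        # b(i): lowest average distance to the members of any other cluster.
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != own)
        denom = max(a, b)
        # Guard against all-zero distances (duplicate points).
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Illustrative data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
s = silhouette_values(points, labels)
print(s)
```

With this toy data every point sits close to its own pair and far from the other one, so each $s(i)$ comes out well above zero.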

Properties

Recall that $a(i)$ measures how dissimilar $i$ is to its own cluster; a smaller value means that $i$ agrees well with its cluster. For $s(i)$ to be close to 1, we require $a(i) \ll b(i)$.

If $s(i)$ is close to $-1$, then by the same logic $a(i) \gg b(i)$, so $i$ would be more appropriately placed in its neighboring cluster.

$s(i)$ near zero means that the datum is on the border of two natural clusters.
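As a restatement derived directly from the definition (not an additional result), the formula can be split according to which of $a(i)$ and $b(i)$ is larger, which makes the three cases easier to read off:

$$ s(i) = \begin{cases} 1 - \dfrac{a(i)}{b(i)} & \text{if } a(i) < b(i) \\ 0 & \text{if } a(i) = b(i) \\ \dfrac{b(i)}{a(i)} - 1 & \text{if } a(i) > b(i) \end{cases} $$

For instance, with the illustrative values $a(i) = 1$ and $b(i) = 4$ we get $s(i) = 1 - 1/4 = 0.75$; swapping them gives $s(i) = 1/4 - 1 = -0.75$.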

Determining the number of Clusters

The silhouette can also help determine the number of clusters in a dataset. The ideal number of clusters is the one that produces the highest average silhouette value.

Another good indication that one has too many clusters is when some clusters have the majority of their observations below the mean silhouette value.
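Picking the number of clusters by the highest average silhouette can be sketched as follows. This is only an illustration under stated assumptions: the candidate labelings are written by hand for a toy one-dimensional dataset (in practice they would come from running a clustering algorithm such as k-means once per candidate $k$), distance is Euclidean, and a point alone in its cluster is given $s(i) = 0$. The names `mean_silhouette` and `candidates` are hypothetical.

```python
from math import dist
from statistics import mean


def mean_silhouette(points, labels):
    """Average s(i) over all points (hypothetical helper, Euclidean distance)."""
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = labels[i]
        if own not in by_cluster:
            scores.append(0.0)  # assumption: singleton clusters score 0
            continue
        a = mean(by_cluster[own])
        b = min(mean(d) for c, d in by_cluster.items() if c != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Toy data with three visually obvious groups on a line.
points = [(0.0,), (0.5,), (1.0,),
          (10.0,), (10.5,), (11.0,),
          (20.0,), (20.5,), (21.0,)]

# Hand-written labelings standing in for the output of a clustering algorithm.
candidates = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],  # merges the first two groups
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],  # one cluster per group
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],  # splits the last group
}

scores = {k: mean_silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this data both merging two groups and splitting one lower the average silhouette, so the three-cluster labeling scores highest.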

https://kapilddatascience.wordpress.com/2015/11/10/using-silhouette-analysis-for-selecting-the-number-of-cluster-for-k-means-clustering/
