# ~/Cluster Analysis

## Brandon Rozek

PhD Student @ RPI studying Automated Reasoning in AI and Linux Enthusiast.

### Distance, Dimensionality Reduction, and Tendency

• Distance
  • Euclidean Distance
  • Squared Euclidean Distance
  • Manhattan Distance
  • Maximum Distance
  • Mahalanobis Distance
  • Which distance function should you use?
• PCA
• Cluster Tendency
  • Hopkins Statistic
• Scaling Data
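
As a rough illustration of the first four distance functions in the list above, here is a minimal sketch in plain Python. The function names are my own and do not come from any particular library. Mahalanobis distance is left out because it also needs the inverse covariance matrix of the data.

```python
import math

def squared_euclidean(p, q):
    # Sum of squared coordinate differences (no square root taken).
    return sum((a - b) ** 2 for a, b in zip(p, q))

def euclidean(p, q):
    # Straight-line distance: square root of the squared Euclidean distance.
    return math.sqrt(squared_euclidean(p, q))

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def maximum(p, q):
    # Largest absolute coordinate difference (also called Chebyshev distance).
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), squared_euclidean(p, q), manhattan(p, q), maximum(p, q))
```

For the two points above, the four functions give 5.0, 25.0, 7.0, and 4.0 respectively, which shows how much the choice of distance can change the notion of "close".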

### Validating Clustering Models

• Clustering Validation
• Cross Validation

### Connectivity Models

• Agglomerative Clustering
• Unweighted Pair Group Method with Arithmetic Mean (If time permits)
• Dendrograms
• Divisive Clustering
• CURE (Clustering using REpresentatives) algorithm (If time permits)

### Cluster Evaluation

• Internal Evaluation
  • Dunn Index
  • Silhouette Coefficient
  • Davies-Bouldin Index (If time permits)
• External Evaluation
  • Rand Measure
  • Jaccard Index
  • Dice Index
  • Confusion Matrix
  • F Measure (If time permits)
  • Fowlkes-Mallows Index (If time permits)
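
To make the external measures above a little more concrete, here is a small sketch of the pair-counting form of the Rand measure and the Jaccard index, which compare a clustering against known reference labels. The helper names (`pair_counts`, `rand_index`, `jaccard_index`) and the toy labels are assumptions for illustration only.

```python
from itertools import combinations

def pair_counts(labels_true, labels_pred):
    # Classify every pair of points by whether the two labelings agree:
    # ss: same cluster in both, sd: same in truth only,
    # ds: same in prediction only, dd: different in both.
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            ss += 1
        elif same_true:
            sd += 1
        elif same_pred:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(labels_true, labels_pred):
    # Fraction of pairs on which the two labelings agree.
    ss, sd, ds, dd = pair_counts(labels_true, labels_pred)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_index(labels_true, labels_pred):
    # Like the Rand measure, but ignores pairs separated in both labelings.
    # Undefined (division by zero) if no pair is ever placed together.
    ss, sd, ds, _ = pair_counts(labels_true, labels_pred)
    return ss / (ss + sd + ds)

truth = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
print(rand_index(truth, pred), jaccard_index(truth, pred))
```

In this toy example the prediction splits one true cluster in two, so one of the six pairs is misjudged: the Rand measure is 5/6 while the Jaccard index is 1/2.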

### Centroid Models

• Jenks Natural Breaks Optimization
• Voronoi Diagram
• K-means clustering
• K-medoids clustering
• K-medians/modes clustering
• When to use K-means as opposed to K-medoids or K-medians?
• How many clusters should you use?
• Lloyd’s Algorithm for Approximating K-means (If time permits)
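
As a preview of the last item above, here is a minimal sketch of Lloyd's algorithm in plain Python. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The function name `lloyd`, the `max_iter` parameter, and the choice to pass in the starting centroids explicitly are assumptions made for this example, not a reference implementation.

```python
def sq_dist(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centroids, max_iter=100):
    # Assumes max_iter >= 1 and at least one starting centroid.
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[k].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(c)
        if new_centroids == centroids:
            break  # Converged: the centroids stopped moving.
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
final_centroids, final_clusters = lloyd(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
```

The result depends on the starting centroids, and the algorithm only finds a local optimum, which is why it is described as approximating K-means.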

### Density Models

• DBSCAN Density Based Clustering Algorithm
• OPTICS Ordering Points To Identify the Clustering Structure
• DeLi-Clu Density Link Clustering (If time permits)
• What should be your density threshold?

### Analysis of Model Appropriateness

• When do we use each of the models above?

### Distribution Models (If time permits)

• Fuzzy Clusters
• EM (Expectation Maximization) Clustering
• Maximum Likelihood Gaussian
• Probabilistic Hierarchical Clustering

## Textbooks

Cluster Analysis 5th Edition

Cluster Analysis: 2014 Edition (Statistical Associates Blue Book Series 24)

## Schedule

In an ideal world, each topic below would take about as long as I have estimated for it. Of course, you have more experience with how long these topics actually take to learn, so I’ll leave this mostly to your discretion.

Distance, Dimensionality Reduction, and Tendency – 3 Weeks

Validating Clustering Models – 1 Week

Connectivity Models – 2 Weeks

Cluster Evaluation – 1 Week

Centroid Models – 3 Weeks

Density Models – 3 Weeks

Analysis of Model Appropriateness – 1 Week

The schedule above accounts for 14 weeks, which leaves one free week as a buffer.

## Conclusion

Creating this document got me really excited for this independent study. Feel free to give me feedback :)