Kmeans, Hierarchical, and DBSCAN
Clustering
Overview / Comparison of Clustering Methods
The following analysis applies k-means clustering, hierarchical clustering, and DBSCAN clustering to a subset of the ultra-running dataset. Each of the three is an unsupervised machine learning algorithm that clusters data.
K-means (partition) clustering partitions data into a user-defined number of groups, k. The method works by initializing k cluster centroids and assigning each data point to the nearest centroid (using Euclidean distance in the analysis below). While far from the only distance metric, Euclidean distance is the default in R and Python and measures the straight-line distance between two points in n-dimensional space. Other options include Manhattan distance (the sum of absolute coordinate differences) and cosine similarity (which compares the angle between vectors rather than their magnitudes). After the data points are assigned to initial clusters, new centroids are computed as the mean of all points in each cluster, and the points are then reassigned to the nearest centroid. This iterative process repeats until convergence. K-means is most applicable to well-separated, roughly spherical clusters.
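As a minimal sketch of that workflow, scikit-learn's KMeans can be fit to a standardized feature matrix; the random data below is purely illustrative and stands in for the actual dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: rows are observations, columns are numeric features
X = np.random.rand(500, 4)

# Standardize so Euclidean distance is not dominated by any one feature's scale
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means with a user-chosen k; scikit-learn uses Euclidean distance internally
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)

print(km.cluster_centers_)  # final centroids (cluster means)
print(labels[:10])          # cluster assignments of the first 10 points
```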
Hierarchical clustering builds a hierarchy of data clusters and, unlike k-means, does not require a predefined number of clusters. The following analysis uses a bottom-up (agglomerative) approach, which begins with each data point as its own cluster and merges clusters until everything converges into a single cluster. The linkage method used in this analysis is Ward's method, which minimizes the increase in total within-cluster variance: at each hierarchical step, the two clusters whose merger produces the smallest increase in total variance are combined.
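A small illustration of Ward's merge criterion, computed directly in NumPy on made-up toy clusters; the formula is the standard Ward increase-in-variance cost, and the cluster arrays are hypothetical.

```python
import numpy as np

def ward_merge_cost(cluster_a, cluster_b):
    """Increase in total within-cluster variance if the two clusters are merged
    (Ward's criterion): |A||B| / (|A| + |B|) * ||mean(A) - mean(B)||^2."""
    n_a, n_b = len(cluster_a), len(cluster_b)
    diff = cluster_a.mean(axis=0) - cluster_b.mean(axis=0)
    return (n_a * n_b) / (n_a + n_b) * np.dot(diff, diff)

# Toy clusters: the pair with the smallest cost would be merged next
a = np.array([[0.0, 0.0], [0.1, 0.2]])
b = np.array([[0.2, 0.1], [0.3, 0.0]])
c = np.array([[5.0, 5.0], [5.2, 4.9]])

print(ward_merge_cost(a, b))  # small: merging a and b barely increases variance
print(ward_merge_cost(a, c))  # large: merging a and c would inflate variance
```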
DBSCAN (density-based) clustering groups data points by density while also identifying outlier points, which it labels as noise. The algorithm groups points based on a user-defined epsilon (the radius within which neighboring points are considered close) and the minimum number of points required to form a "dense region". Unlike k-means, DBSCAN does not require a predefined number of clusters and works well with arbitrarily shaped clusters.
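A brief sketch of DBSCAN on synthetic data, assuming scikit-learn; the blob locations and parameter values are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D data: two dense blobs plus a few stray points
rng = np.random.default_rng(0)
blob1 = rng.normal(loc=[0, 0], scale=0.1, size=(100, 2))
blob2 = rng.normal(loc=[3, 3], scale=0.1, size=(100, 2))
strays = rng.uniform(low=-2, high=5, size=(5, 2))
X = np.vstack([blob1, blob2, strays])

# eps: neighborhood radius; min_samples: points required to form a dense region
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# Label -1 marks noise points that belong to no dense region
print(np.unique(db.labels_, return_counts=True))
```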
Example of K-means Clustering (above)
Example of Hierarchical Clustering Dendrogram (left)
Comparison of K-means vs DBSCAN clustering (right)
Data Prep
Clustering (k-means, hierarchical, and DBSCAN) will be performed on 4-dimensional quantitative data after the removal of the label (the Hours category). After standardization, the data will be transformed into two principal components using PCA. Clustering will then be performed on the transformed data, which retains approximately 70% of the variation present in the original data (see screenshot to the right).
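A rough sketch of this preparation pipeline, assuming pandas and scikit-learn; the file name and the "Hours" column reference are placeholders for the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical file name; the label column is the Hours category described above
df = pd.read_csv("ultra_running_subset.csv")
X = df.drop(columns=["Hours"])  # drop the label, keep the 4 numeric features

# Standardize, then project onto two principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Roughly 70% of the original variance should be retained in the 2 PCs
print(pca.explained_variance_ratio_.sum())
```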
Snippet of labeled data to be used for clustering (above)
Snippet of unlabeled data transformed through PCA to 2 PCs (above)
PCA output for 2 PCs (above)
Kmeans Clustering
Based on the silhouette plot (right), k-means clustering will be performed with k = 2, k = 6, and k = 13.
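A sketch of how the silhouette scores behind that plot could be computed, assuming scikit-learn; the random array stands in for the actual 2-component PCA data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for the PCA-transformed data prepared above
X_pca = np.random.rand(500, 2)

# Average silhouette score for a range of candidate k values
scores = {}
for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)

# Higher average silhouette indicates better-separated clusters
for k, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"k = {k}: silhouette = {s:.3f}")
```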
Clustering with k = 2
Clustering from K-means with k = 2
Clustering from K-means with k = 6
Clustering from K-means with k = 13
Clustering as it relates to original labels from K-means with k = 2
Clustering with k = 6
Clustering as it relates to original labels from K-means with k = 6
Clustering with k = 13
Clustering as it relates to original labels from K-means with k = 13
Discussion of Results
The visualizations above show clear variations in clustering results depending on the chosen k-value. As the silhouette graph demonstrated, a k-value of 2 produced the highest silhouette score and therefore the most cleanly separated clusters. This was further supported by the visualization, which showed two distinct clusters. Additionally, when the clusters are examined alongside the original labels, they appear to separate the data into race results under 20 hours and race results over 20 hours. Conversely, k-means with k-values of 6 or 13, which yield much lower silhouette scores, does not produce distinct, obvious clusters. Furthermore, those clusters and centroids do not appear to group the data into meaningful categories when considering the original label.
Hierarchical Clustering
The plot (left) displays a dendrogram created from the same data used in the k-means analysis above. The dendrogram was built using Ward's method, a bottom-up hierarchical clustering approach that minimizes the increase in within-cluster variance at each merge. (Cosine similarity required too much computational power for my machine given the size of the dataset.)
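A minimal sketch of how such a dendrogram could be generated with SciPy's Ward linkage; the random array stands in for the actual PCA-transformed data.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Stand-in for the 2-component PCA data used throughout the analysis
X_pca = np.random.rand(500, 2)

# Ward linkage: each merge minimizes the increase in within-cluster variance
Z = linkage(X_pca, method="ward")

plt.figure(figsize=(10, 4))
dendrogram(Z, truncate_mode="lastp", p=30)  # show only the last 30 merges
plt.ylabel("Merge distance (Ward)")
plt.show()
```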
The dendrogram provides intriguing results when considered in the context of the k-means clusters created earlier. The graph displays two clear, obvious clusters, supporting the k-means analysis. While 6 distinct clusters can still be recognized, the distinction becomes less obvious, and choosing 13 distinct groups becomes difficult. The hierarchical clustering therefore further supports the k-means analysis.
DBSCAN
DBSCAN was also performed on the same PCA-transformed data used for k-means and the dendrogram. The k-distance plot (below) was created to inform the choice of eps value when coding DBSCAN (eps = 0.1).
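One possible way to produce a k-distance plot and then run DBSCAN with the chosen eps, assuming scikit-learn; the data array and min_samples value are stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

# Stand-in for the PCA-transformed data
X_pca = np.random.rand(500, 2)

# k-distance plot: sort each point's distance to its k-th nearest neighbor;
# the "elbow" of the curve suggests a reasonable eps
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X_pca)
distances, _ = nn.kneighbors(X_pca)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.ylabel(f"Distance to {k}th nearest neighbor")
plt.show()

# DBSCAN with the eps read off the plot (0.1 in this analysis)
db = DBSCAN(eps=0.1, min_samples=k).fit(X_pca)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters, "clusters;", np.sum(db.labels_ == -1), "noise points")
```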
The DBSCAN analysis resulted in 12 clusters. However, the clusters are extremely convoluted and do not add any clear value to the analysis of the data, as the plot (right) shows. The DBSCAN results are quite distinct from the more helpful k-means and hierarchical clustering results, which generally supported one another.
Conclusions
When considering the ultra-running dataset, K-means is clearly a more relevant clustering method than DBSCAN. The distribution of two primarily spherical, well-separated clusters lends itself to K-means, and the hierarchical clustering results further support the K-means conclusions. Performing K-means with a k-value of 2 provided the most useful information about the ultra-running data: it created 2 distinct groups closely corresponding to results where the athlete took shorter or longer than 20 hours to complete the race. This information is very helpful when segmenting race results and is consistent with a non-technical sanity check, since performances in which athletes race for over 20 hours are extreme and occur under unique circumstances compared to shorter, sub-20-hour races.