Hierarchical clustering organizes data into nested groups, revealing relationships between points. Agglomerative methods build clusters from the bottom up, while divisive approaches split from the top down. Each method has unique strengths for different dataset sizes and shapes.
Hierarchical clustering results are often visualized with dendrograms, tree-like diagrams showing cluster relationships. Various linkage methods determine how distances between clusters are calculated. While hierarchical clustering offers flexibility, it can be computationally expensive for large datasets.
Agglomerative clustering builds clusters from the bottom up, starting with each data point as its own cluster and iteratively merging the closest clusters until all points form a single cluster
Divisive clustering takes a top-down approach, starting from one large cluster containing all data points and splitting iteratively until each point becomes its own cluster
Key differences include the direction of cluster formation (bottom-up vs. top-down), computational complexity (agglomerative is generally more efficient, since exhaustive divisive splitting grows combinatorially), and suitability for different dataset sizes (heuristic divisive approaches can scale better to large datasets); the agglomerative direction is sketched below
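To make the agglomerative direction concrete, here is a minimal sketch using scikit-learn's AgglomerativeClustering (the toy data and parameter values are illustrative assumptions; note that scikit-learn does not ship a divisive implementation):

```python
# Minimal agglomerative clustering sketch (toy data, illustrative settings).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two loose 2-D blobs.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Bottom-up merging; n_clusters=2 stops the merge hierarchy at two groups.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster numbering may vary)
```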
Single linkage uses the minimum distance between points in different clusters (nearest neighbor)
Complete linkage employs the maximum distance between points in different clusters (farthest neighbor)
Average linkage calculates the average distance between all pairs of points in different clusters
Ward's method minimizes within-cluster variance by merging the pair of clusters whose merger gives the smallest increase in the total sum of squared differences; all four criteria are compared in the sketch below
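A short sketch contrasting these four linkage criteria with SciPy's linkage function (the toy data and the printed statistic are illustrative assumptions):

```python
# Comparing linkage criteria with SciPy (toy data, illustrative choices).
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (5, 2)),   # one tight blob
               rng.normal(3, 0.3, (5, 2))])  # another tight blob

# Each method changes only how inter-cluster distance is measured.
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)       # (n-1) x 4 merge table
    print(method, "final merge distance:", Z[-1, 2].round(3))
```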
A dendrogram is a tree-like diagram whose vertical axis represents distance or dissimilarity and whose horizontal axis shows the individual data points or clusters
Leaf nodes represent individual data points, while internal nodes indicate merged clusters
The height of each vertical line signifies the dissimilarity between the merged clusters
Interpret results by identifying natural clusters where vertical lines are long, and cut the dendrogram horizontally at a chosen height to obtain the desired number of clusters
Visualization techniques include color-coding clusters, adding labels to data points, and adjusting the dendrogram orientation (vertical or horizontal), as shown in the sketch below
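A minimal sketch tying the dendrogram ideas together with SciPy and Matplotlib (the data, point labels, and cut threshold are illustrative assumptions):

```python
# Plotting a dendrogram and cutting it for a desired number of clusters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, (6, 2)), rng.normal(4, 0.4, (6, 2))])

Z = linkage(X, method="ward")

# Horizontal axis: data points; vertical axis: merge dissimilarity.
dendrogram(Z, labels=[f"p{i}" for i in range(len(X))],
           orientation="top", color_threshold=2.0)
plt.ylabel("dissimilarity")
plt.show()

# "Cutting" the tree into a fixed number of groups yields flat labels.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 1 1 1 2 2 2 2 2 2]
```

Here fcluster with criterion="maxclust" is the programmatic equivalent of cutting the dendrogram at whatever height produces the requested number of clusters.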
Advantages: no need to specify number of clusters beforehand, produces hierarchical representation, suitable for small to medium-sized datasets, handles non-globular cluster shapes
Limitations: computationally expensive for large datasets (time complexity O(n³) for naive agglomerative algorithms, roughly O(n² log n) with optimizations, and space complexity O(n²) for the distance matrix), sensitive to noise and outliers, and cannot undo previous merge or split decisions
When comparing with other clustering techniques, consider dataset size, dimensionality, desired cluster shapes, presence of noise or outliers, and the need for a hierarchical representation when choosing a clustering method