Statistical Methods for Data Science

K-means clustering is a popular unsupervised learning technique that groups similar data points into clusters. It works by iteratively assigning points to the nearest centroid and updating centroids based on the mean of assigned points.

The algorithm aims to minimize inertia, which measures cluster compactness. Techniques like K-means++ initialization and the elbow method help optimize clustering results. Evaluation metrics such as silhouette score assess the quality of the final clusters.
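
As a quick orientation, the whole pipeline can be run in a few lines with scikit-learn. This is a minimal sketch; the synthetic two-dimensional data and the choice of k=3 are illustrative assumptions, not values from the text:

    import numpy as np
    from sklearn.cluster import KMeans

    # Illustrative synthetic data: three 2-D blobs (an assumption for the demo)
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 3, 6)])

    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)      # cluster index for each point
    print(km.cluster_centers_)      # final centroids
    print(km.inertia_)              # within-cluster sum of squares (inertia)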

Cluster Formation

Centroid Selection and Distance Calculation

  • K-means clustering aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean (centroid)
  • Centroids serve as the representative points for each cluster and are typically initialized randomly or using techniques like K-means++
  • Euclidean distance is commonly used to measure the dissimilarity between data points and centroids, calculated as the square root of the sum of squared differences between corresponding features, as sketched below
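
A minimal NumPy sketch of this distance computation; the array shapes (n points, k centroids, d features) are assumptions for illustration:

    import numpy as np

    def pairwise_euclidean(X, centroids):
        # d(x, c) = sqrt(sum over features of (x_j - c_j)^2)
        diffs = X[:, None, :] - centroids[None, :, :]   # shape (n, k, d)
        return np.sqrt((diffs ** 2).sum(axis=-1))       # shape (n, k)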

Cluster Assignment and Iterative Refinement

  • Each data point is assigned to the cluster whose centroid is closest to it based on the Euclidean distance
  • After assigning all data points to clusters, the centroids are updated by calculating the mean of all data points within each cluster
  • The process of assigning data points to clusters and updating centroids is repeated iteratively until the cluster assignments stabilize or a maximum number of iterations is reached
  • Convergence occurs when the centroids no longer change significantly between iterations, indicating that the clusters have stabilized
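One way to sketch this assignment/update loop in plain NumPy; the tolerance, iteration cap, and empty-cluster handling below are illustrative conventions, not prescribed by the text:

    import numpy as np

    def kmeans_iterate(X, centroids, max_iter=100, tol=1e-4):
        # X: (n, d) data matrix; centroids: (k, d) float array (assumed shapes)
        for _ in range(max_iter):
            # Assignment step: each point goes to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points
            new_centroids = centroids.copy()
            for j in range(len(centroids)):
                members = X[labels == j]
                if len(members) > 0:    # keep the old centroid if a cluster empties
                    new_centroids[j] = members.mean(axis=0)
            # Convergence check: stop when the centroids barely move
            shift = np.linalg.norm(new_centroids - centroids)
            centroids = new_centroids
            if shift < tol:
                break
        return labels, centroids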

Initialization and Optimization

K-means++ Initialization

  • K-means++ is an initialization technique that aims to choose better initial centroids compared to random initialization
  • It selects the first centroid uniformly at random from the data points, then chooses each subsequent centroid from the data points with probability proportional to each point's squared distance from its nearest already-chosen centroid
  • K-means++ helps to spread out the initial centroids and can lead to faster convergence and more stable clustering results compared to random initialization
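A sketch of K-means++ seeding following these rules; the seed parameter is an assumption added for reproducibility:

    import numpy as np

    def kmeans_pp_init(X, k, seed=None):
        rng = np.random.default_rng(seed)
        # First centroid: a uniformly random data point
        centroids = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Squared distance from each point to its nearest chosen centroid
            d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2) \
                     .sum(axis=-1).min(axis=1)
            # Sample the next centroid with probability proportional to d2
            probs = d2 / d2.sum()
            centroids.append(X[rng.choice(len(X), p=probs)])
        return np.array(centroids)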

Inertia as Optimization Criterion

  • Inertia, also known as within-cluster sum of squares (WCSS), measures the compactness of the clusters and is used as an optimization criterion in K-means clustering
  • It is calculated as the sum of squared distances between each data point and its assigned centroid across all clusters
  • The goal of K-means is to minimize the inertia, making each cluster as compact as possible (note that inertia measures compactness only; it does not directly measure separation between clusters)
  • A lower inertia value suggests a better clustering solution, as data points are closer to their respective centroids
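Inertia follows directly from this definition; a small sketch assuming X, labels, and centroids are NumPy arrays as in the earlier sketches:

    import numpy as np

    def inertia(X, labels, centroids):
        # Sum of squared distances from each point to its assigned centroid
        diffs = X - centroids[labels]
        return float((diffs ** 2).sum())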

Evaluation Metrics

Elbow Method for Determining Optimal Number of Clusters

  • The elbow method is a visual technique used to determine the optimal number of clusters (k) in K-means clustering
  • It involves plotting the inertia (WCSS) against different values of k and looking for an "elbow" point where the rate of decrease in inertia slows down significantly
  • The idea is that adding more clusters beyond the elbow point does not provide substantial improvements in reducing the inertia
  • The elbow point suggests a good balance between the number of clusters and the compactness of the clusters (trade-off between model complexity and goodness of fit)
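A typical elbow-plot loop, assuming the feature matrix X from the earlier sketch; the range of k values tried is an arbitrary illustrative choice:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    ks = range(1, 11)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]

    plt.plot(ks, inertias, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Inertia (WCSS)")
    plt.title("Elbow method")
    plt.show()    # pick k near the bend where the curve flattens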

Silhouette Score for Assessing Clustering Quality

  • Silhouette score is a metric used to evaluate the quality of the clustering results
  • It measures how well each data point fits into its assigned cluster compared to other clusters
  • The silhouette score for a data point is calculated as (b - a) / max(a, b), where a is the average distance between the data point and all other points in the same cluster, and b is the average distance between the data point and all points in the nearest neighboring cluster
  • The silhouette score ranges from -1 to 1, where a higher value indicates better clustering quality (data points are well-matched to their own clusters and poorly-matched to neighboring clusters)
  • A silhouette score close to 1 suggests the data point is appropriately clustered; a score close to 0 indicates it lies on the border between two clusters; a negative score suggests it may have been assigned to the wrong cluster
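
These scores can be computed with scikit-learn, again assuming the feature matrix X and k=3 from the earlier sketch:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    print(silhouette_score(X, labels))   # mean (b - a) / max(a, b) over all points
    s = silhouette_samples(X, labels)    # per-point silhouette values
    print((s < 0).sum())                 # negative scores flag possible misassignments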