Statistical Methods for Data Science

K-means clustering is a popular unsupervised learning technique that groups similar data points into clusters. It works by iteratively assigning points to the nearest centroid and updating centroids based on the mean of assigned points.

The algorithm aims to minimize inertia, which measures cluster compactness. Techniques like K-means++ initialization and the elbow method help optimize clustering results. Evaluation metrics such as silhouette score assess the quality of the final clusters.
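
As a quick orientation, the whole pipeline can be run in a few lines with scikit-learn. This is a minimal sketch; the synthetic two-dimensional data and the choice of k=3 are illustrative assumptions, not values from the text:

    import numpy as np
    from sklearn.cluster import KMeans

    # Illustrative synthetic data: three 2-D blobs (an assumption for the demo)
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 3, 6)])

    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)      # cluster index for each point
    print(km.cluster_centers_)      # final centroids
    print(km.inertia_)              # within-cluster sum of squares (inertia)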

Cluster Formation

Centroid Selection and Distance Calculation

  • K-means clustering aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean (centroid)
  • Centroids serve as the representative points for each cluster and are typically initialized randomly or using techniques like K-means++
  • Euclidean distance is commonly used to measure the dissimilarity between data points and centroids, calculated as the square root of the sum of squared differences between corresponding features, as sketched below
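
A minimal NumPy sketch of this distance computation; the array shapes (n points, k centroids, d features) are assumptions for illustration:

    import numpy as np

    def pairwise_euclidean(X, centroids):
        # d(x, c) = sqrt(sum over features of (x_j - c_j)^2)
        diffs = X[:, None, :] - centroids[None, :, :]   # shape (n, k, d)
        return np.sqrt((diffs ** 2).sum(axis=-1))       # shape (n, k)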

Cluster Assignment and Iterative Refinement

  • Each data point is assigned to the cluster whose centroid is closest to it based on the Euclidean distance
  • After assigning all data points to clusters, the centroids are updated by calculating the mean of all data points within each cluster
  • The process of assigning data points to clusters and updating centroids is repeated iteratively until the cluster assignments stabilize or a maximum number of iterations is reached
  • Convergence occurs when the centroids no longer change significantly between iterations, indicating that the clusters have stabilized
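One way to sketch this assignment/update loop in plain NumPy; the tolerance, iteration cap, and empty-cluster handling below are illustrative conventions, not prescribed by the text:

    import numpy as np

    def kmeans_iterate(X, centroids, max_iter=100, tol=1e-4):
        # X: (n, d) data matrix; centroids: (k, d) float array (assumed shapes)
        for _ in range(max_iter):
            # Assignment step: each point goes to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points
            new_centroids = centroids.copy()
            for j in range(len(centroids)):
                members = X[labels == j]
                if len(members) > 0:    # keep the old centroid if a cluster empties
                    new_centroids[j] = members.mean(axis=0)
            # Convergence check: stop when the centroids barely move
            shift = np.linalg.norm(new_centroids - centroids)
            centroids = new_centroids
            if shift < tol:
                break
        return labels, centroids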

Initialization and Optimization

K-means++ Initialization

  • K-means++ is an initialization technique that aims to choose better initial centroids compared to random initialization
  • It selects the first centroid uniformly at random from the data points, then chooses each subsequent centroid from the data points with probability proportional to each point's squared distance from its nearest already-chosen centroid
  • K-means++ helps to spread out the initial centroids and can lead to faster convergence and more stable clustering results compared to random initialization
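A sketch of K-means++ seeding following these rules; the seed parameter is an assumption added for reproducibility:

    import numpy as np

    def kmeans_pp_init(X, k, seed=None):
        rng = np.random.default_rng(seed)
        # First centroid: a uniformly random data point
        centroids = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Squared distance from each point to its nearest chosen centroid
            d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2) \
                     .sum(axis=-1).min(axis=1)
            # Sample the next centroid with probability proportional to d2
            probs = d2 / d2.sum()
            centroids.append(X[rng.choice(len(X), p=probs)])
        return np.array(centroids)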

Inertia as Optimization Criterion

  • Inertia, also known as within-cluster sum of squares (WCSS), measures the compactness of the clusters and is used as an optimization criterion in K-means clustering
  • It is calculated as the sum of squared distances between each data point and its assigned centroid across all clusters
  • The goal of K-means is to minimize the inertia, making each cluster as compact as possible (note that inertia measures compactness only; it does not directly measure separation between clusters)
  • A lower inertia value suggests a better clustering solution, as data points are closer to their respective centroids
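Inertia follows directly from this definition; a small sketch assuming X, labels, and centroids are NumPy arrays as in the earlier sketches:

    import numpy as np

    def inertia(X, labels, centroids):
        # Sum of squared distances from each point to its assigned centroid
        diffs = X - centroids[labels]
        return float((diffs ** 2).sum())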

Evaluation Metrics

Elbow Method for Determining Optimal Number of Clusters

  • The elbow method is a visual technique used to determine the optimal number of clusters (k) in K-means clustering
  • It involves plotting the inertia (WCSS) against different values of k and looking for an "elbow" point where the rate of decrease in inertia slows down significantly
  • The idea is that adding more clusters beyond the elbow point does not provide substantial improvements in reducing the inertia
  • The elbow point suggests a good balance between the number of clusters and the compactness of the clusters (trade-off between model complexity and goodness of fit)
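A typical elbow-plot loop, assuming the feature matrix X from the earlier sketch; the range of k values tried is an arbitrary illustrative choice:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    ks = range(1, 11)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in ks]

    plt.plot(ks, inertias, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Inertia (WCSS)")
    plt.title("Elbow method")
    plt.show()    # pick k near the bend where the curve flattens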

Silhouette Score for Assessing Clustering Quality

  • Silhouette score is a metric used to evaluate the quality of the clustering results
  • It measures how well each data point fits into its assigned cluster compared to other clusters
  • The silhouette score for a data point is calculated as (b - a) / max(a, b), where a is the average distance between the data point and all other points in the same cluster, and b is the average distance between the data point and all points in the nearest neighboring cluster
  • The silhouette score ranges from -1 to 1, where a higher value indicates better clustering quality (data points are well-matched to their own clusters and poorly-matched to neighboring clusters)
  • A silhouette score close to 1 suggests the data point is appropriately clustered; a score close to 0 indicates it lies on the border between two clusters; a negative score suggests it may have been assigned to the wrong cluster
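
These scores can be computed with scikit-learn, again assuming the feature matrix X and k=3 from the earlier sketch:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    print(silhouette_score(X, labels))   # mean (b - a) / max(a, b) over all points
    s = silhouette_samples(X, labels)    # per-point silhouette values
    print((s < 0).sum())                 # negative scores flag possible misassignments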