Skip to main content

KMeans

K-Means is a popular unsupervised machine learning algorithm used for clustering data points into groups. It minimizes the variance within clusters and works by iteratively refining cluster assignments.

How K-Means Works

  1. Choose K (number of clusters)
    • The user specifies the number of clusters (K).
  2. Initialize Centroids
    • Randomly select K data points as the initial centroids.
  3. Assign Data Points to Clusters
    • Each point is assigned to the closest centroid based on Euclidean distance.
  4. Update Centroids
    • Compute the new centroid of each cluster by averaging the points in that cluster.
  5. Repeat Until Convergence
    • The process repeats until centroids no longer change significantly or a stopping criterion is met.

Mathematical Formula (Objective Function)

The goal is to minimize the within-cluster sum of squares (WCSS):

kmeans-mathematical-formula-01.png

Choosing K: The Elbow Method

To determine the optimal K, use the Elbow Method:

  • Plot WCSS for different K values
  • Look for an "elbow" where WCSS stops decreasing significantly

Advantages

✔ Simple and fast ✔ Works well with large datasets ✔ Scales linearly with number of samples

Disadvantages

❌ Needs pre-defined K ❌ Sensitive to outliers ❌ May converge to local optima

Would you like a Python example to implement K-Means using sklearn? 🚀


Citation

K-Means Clustering in Machine Learning

K-means is an iterative algorithm that splits a dataset into non-overlapping subgroups that are called clusters.

...

https://serokell.io/blog/k-means-clustering-in-machine-learning

Citation

K means Clustering – Introduction

K-means clustering is a technique used to organize data into groups based on their similarity. For example online store uses K-Means to group customers based on purchase frequency and spending creating segments like Budget Shoppers, Frequent Buyers and Big Spenders for personalised marketing.

...

https://www.geeksforgeeks.org/k-means-clustering-introduction/