K-means is a machine learning algorithm designed to find “clusters” in data by measuring the “distance” between points.
The “K” in K-means refers to the number of clusters you want to identify.
The algorithm follows these steps:
Step 1: Choose K, the number of clusters.
This is typically done with the “elbow method”: plot the “error” for different values of K and pick the value where the curve starts to flatten out.
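The elbow is usually judged by eye from the plot, but the idea can be sketched in code. The snippet below is a minimal illustration, not a standard routine: the function name `elbow`, the use of the second difference to measure the bend, and the error values are all assumptions made for this example. It also assumes the candidate values of K are evenly spaced.

```python
def elbow(errors):
    """Return the K where the error curve bends most sharply.

    `errors` maps each candidate K to its clustering error. The bend is
    measured with the second difference, which is one heuristic among
    several and only an assumption of this sketch.
    """
    ks = sorted(errors)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        # How much the improvement shrinks when going past this K.
        bend = (errors[prev] - errors[k]) - (errors[k] - errors[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Made-up error values for illustration only, not measured results.
example_errors = {1: 100.0, 2: 55.0, 3: 12.0, 4: 10.0, 5: 9.0}
print(elbow(example_errors))  # prints 3
```

Here the error drops sharply up to K = 3 and barely improves afterwards, so 3 is returned.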
Step 2: Select a random starting point for each cluster.
Step 3: Measure the distance between a point and each of the K “centroids” (three in this example).
Step 4: Assign it to the closest centroid.
Step 5: Repeat for all points.
Step 6: Find the new center of each cluster by averaging the points currently assigned to it.
Step 7: Repeat steps 3-6 using the new centroids until no point changes cluster.
Step 8: Score the clustering by calculating the sum of the distances (in practice, usually the squared distances) from each point to its cluster center.
*Depending on the random initialization of the centroids, a single run can produce a very poor clustering.
Step 9: Repeat steps 2-8 many times, keeping track of the “loss” for each run. Choose the model with the lowest overall loss.
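The steps can be pulled together in a short sketch that uses only the Python standard library. It is one possible reading of the steps, not a reference implementation: the function names, the `restarts`, `max_iter`, and `seed` parameters, the choice to start from randomly picked data points, and the handling of empty clusters are all assumptions made for this example. The loss follows the text and sums plain Euclidean distances; many implementations sum squared distances instead.

```python
import math
import random

def closest(point, centroids):
    # Steps 3-4: index of the centroid nearest to this point.
    return min(range(len(centroids)),
               key=lambda j: math.dist(point, centroids[j]))

def mean(points):
    # Coordinate-wise average of a non-empty list of points.
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans_once(points, k, rng, max_iter=100):
    # Step 2: use k randomly chosen data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3-5: assign every point to its closest centroid.
        new_labels = [closest(p, centroids) for p in points]
        if new_labels == labels:
            break  # Step 7: no point changed cluster, so stop.
        labels = new_labels
        # Step 6: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    # Step 8: loss is the sum of distances to the assigned centroids.
    loss = sum(math.dist(p, centroids[lab])
               for p, lab in zip(points, labels))
    return loss, centroids, labels

def kmeans(points, k, restarts=20, seed=0):
    # Step 9: rerun from different random starts, keep the lowest loss.
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[0])

# Two small, well-separated groups of 2-D points (illustrative data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
loss, centroids, labels = kmeans(points, k=2)
print(labels, round(loss, 3))
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random start, which is why only the grouping, not the label numbers, should be compared across runs.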
Overall, K-means is very fast and easy to understand. However, because its result depends on the random initialization, other clustering algorithms can be more consistent and repeatable than K-means.