K-means is a machine learning algorithm designed to find “clusters” in data by measuring the “distance” between points.
The “K” in K-means refers to the number of clusters you want to identify.
The algorithm follows these steps:
Step 1: Choose K, the number of clusters.
This is typically done with the “elbow method”: plot the “error” for a range of K values and pick the K where the curve starts to flatten out.
Step 2: Select a random starting point for each cluster.
Step 3: Measure the distance between a point and each of the K “centroids” (the current cluster centers).
Step 4: Assign it to the closest centroid.
Step 5: Repeat for all points.
Step 6: Find the new center for each cluster by taking the mean of the points currently assigned to it.
Step 7: Repeat steps 3-6 using the new centroids until no point changes cluster.
Step 8: Score the clustering by calculating the sum of the squared distances from each point to its cluster center.
*Depending on the random initialization of the centroids, a single run can converge to a very poor clustering.
Step 9: Repeat steps 2-8 many times, keeping track of the “loss” for each run. Choose the model with the lowest overall loss.
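The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation, and it rests on some assumptions that are not in the text: distance is squared Euclidean distance, the starting centroids are K randomly chosen data points, an empty cluster simply keeps its previous centroid, and the names kmeans_once, kmeans, n_restarts and max_iter are made up for this example.

```python
import math
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest(point, centroids):
    """Index of the centroid nearest to point (Steps 3-4)."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans_once(points, k, rng, max_iter=100):
    """One run of Steps 2-8. Returns (centroids, labels, loss)."""
    # Step 2: use k distinct data points as the random starting centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Steps 3-5: assign every point to its closest centroid.
        new_labels = [closest(p, centroids) for p in points]
        # Step 7: stop once no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 6: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Step 8: loss = sum of squared distances to the assigned centroid.
    loss = sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))
    return centroids, labels, loss


def kmeans(points, k, n_restarts=10, seed=0):
    """Step 9: repeat from several random starts and keep the lowest loss."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])


# Toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels, loss = kmeans(points, k=2)

# Step 1 (elbow method): compute the best loss for several values of K
# and look for the K after which the loss stops dropping sharply.
losses = {k: kmeans(points, k)[2] for k in range(1, 5)}
```

On this toy data the loss drops steeply from K=1 to K=2 and only slightly after that, which is the “elbow” described in Step 1. In practice you would more likely reach for a library implementation; the sketch is only meant to make the individual steps concrete.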
Overall, K-means is very fast and easy to understand. However, because its result depends on the random starting points, other clustering algorithms can be more consistent and repeatable than K-means.