K-means clustering is one of the most commonly used unsupervised techniques in machine learning. It partitions data into K distinct clusters. In a typical setting, we provide the input data and the number of clusters K, and the K-means algorithm assigns each data point to one of the K clusters.
In this post, we will implement the K-means clustering algorithm from scratch in Python, using Pandas, and visualize each step of the clustering.
Let us first load the packages needed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
We need a dataset to apply K-means clustering to. Let us simulate 500 data points with three well-defined clusters using scikit-learn's make_blobs function, a handy tool for testing out various machine learning algorithms.
from sklearn.datasets import make_blobs

# create simulated clusters using scikit-learn's make_blobs
data, true_cluster = make_blobs(n_samples=500, centers=3,
                                random_state=0, cluster_std=0.60)
In addition to the data points, make_blobs also gives us the true cluster identity of each data point. Let us save the data and the true cluster identities as a Pandas data frame.
data_df = pd.DataFrame(data)
data_df.columns = ['x', 'y']
data_df['true_cluster'] = true_cluster
data_df.head(n=3)
Let us also hard-code a color for each cluster using a dictionary and create a new column for the colors.
color_map = {0: 'red', 1: 'yellow', 2: 'green'}
data_df['true_color'] = data_df.true_cluster.map(color_map)
data_df.head(n=3)
Let us visualize the cluster data with a scatter plot. We can see that our simulated data contains well-separated clusters.
plt.scatter(x='x', y='y', c='true_color', data=data_df)
plt.xlabel("x")
plt.ylabel("y")
plt.savefig('kmeans_data.png')
We will use the K-means clustering algorithm to find the cluster identity of each data point from the data alone.
Let us pretend we do not know the true cluster identities and have only the data points themselves. For the sake of simplicity, let us also assume that the number of clusters in our data is K=3.
In our current example, this is obvious from the scatter plot. In practice, we often do not know the number of clusters; there are ways to estimate it, but we will not cover them here.
K-means algorithm
The K-means algorithm is an iterative approach, similar in spirit to the EM algorithm.
It starts with a random cluster assignment for each data point. In this example, each data point could initially belong to any one of the K=3 clusters.
Then we iterate over two steps:
- For each cluster, we compute the center (mean) of its data points. This is the “means” part of K-means clustering.
- Then, for each data point, we compute the distance to all K=3 cluster centers and assign the point to the closest cluster, that is, the cluster whose center is at the smallest distance from the point. We typically use the Euclidean distance between a data point and a cluster center.
We are done iterating when the cluster centers stabilize, that is, when they do not change between successive iterations.
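To make these two steps concrete before applying them to our dataset, here is a tiny self-contained toy example on four made-up one-dimensional points (all numbers here are invented purely for illustration):

# toy illustration of the two K-means steps on made-up 1-d data
points = [1.0, 1.5, 9.0, 10.0]
centers = [1.5, 9.0]  # pretend these were picked at random
# assignment step: give each point the label of its closest center
labels = [min(range(2), key=lambda k: (p - centers[k]) ** 2) for p in points]
# update step: recompute each center as the mean of its assigned points
centers = [sum(p for p, l in zip(points, labels) if l == k) / labels.count(k)
           for k in range(2)]
print(labels)   # [0, 0, 1, 1]
print(centers)  # [1.25, 9.5]

On this toy data a single pass already lands on sensible centers; on real data we repeat the two steps until nothing changes.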
K-means algorithm: First Iteration
Let us first pick three cluster centers at random and visualize the data. This is roughly equivalent to assigning a random cluster to each data point and computing each cluster's center.
We use Pandas’ sample function to select K=3 data points and use them as our initial cluster centers.
current_centers = data_df.sample(3, random_state=1)
plt.scatter(x='x', y='y', c='black', data=data_df)
plt.scatter(x='x', y='y', data=current_centers,
            c=['red', 'blue', 'yellow'], s=100)
plt.xlabel("x")
plt.ylabel("y")
Since we picked random points as cluster centers, we can see that they do not really match the clusters in the data.
Our next step is to use the initial cluster centers to assign a cluster to each data point. The way we do that is to find the distance from each data point to the three cluster centers and assign the closest cluster.
Let us first write a function that computes the squared Euclidean distance between a data point and a cluster center. Skipping the square root does not change which center is closest, since the square root is monotonic.
# squared Euclidean distance between two points
def dist(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))
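As a quick sanity check, here is what the function returns for a simple made-up pair of points (a 3-4-5 right triangle, so the squared distance is 5**2 = 25):

# sanity check on a made-up pair of points: 3**2 + 4**2 = 25
dist((0, 0), (3, 4))
25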
Let us compute the distance between each data point and all the cluster centers and assign each point to its closest cluster. We will write a small Python function to do that; it takes the data points and the current cluster centers as input and outputs a new cluster label for each point.
Basically, for each data point we compute its distance to all K=3 cluster centers and assign the label of the closest one.
def assign_cluster_labels(data, centers):
    cluster_labels = []
    for point in data:
        # compute the distance from the data point to each cluster center
        distances = [dist(point, center) for center in centers]
        # find the closest cluster and assign its label to the data point
        cluster_labels.append(distances.index(min(distances)))
    return cluster_labels
Let us use the assign_cluster_labels() function on our current centers (which were randomly chosen) and get cluster labels for each data point.
Since our data lives in a data frame, we pass the underlying values to the function instead of the data frame directly. We can then inspect the current cluster labels for the first few data points.
current_labels = assign_cluster_labels(data_df[['x','y']].values,
                                       current_centers[['x','y']].values)
current_labels[0:10]
[2, 0, 0, 0, 0, 2, 0, 0, 0, 0]
Note that the current cluster labels will be very different from the true labels.
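One quick way to see this (a small aside, not part of the main walkthrough) is to cross-tabulate the current labels against the true labels we saved earlier. Keep in mind that the label numbers themselves are arbitrary, so even a good clustering may have its rows and columns permuted; with random starting centers, though, the table will typically be far from any clean one-to-one mapping.

# cross-tabulate the current (random-start) labels against the true labels
pd.crosstab(data_df['true_cluster'], pd.Series(current_labels, name='current_label'))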
K-means algorithm: Second Iteration
Now that we have new cluster labels for each point, we can use them to compute new cluster centers. These new centers will be slightly better than the previous ones.
We will use Pandas’ groupby function to compute the cluster centers/means. Basically, we compute the mean of the data points for each value of the current cluster labels.
current_centers = data_df[['x','y']].groupby(current_labels).mean()
Now we can iterate the two steps above, which are to
- update cluster labels based on current cluster centers and
- compute cluster centers from the updated cluster labels.
current_centers = data_df[['x','y']].groupby(current_labels).mean()
print(current_centers)
current_labels = assign_cluster_labels(data_df[['x','y']].values,
                                       current_centers.values)
plt.scatter(x='x', y='y', c=current_labels, data=data_df)
plt.scatter(x='x', y='y', data=current_centers,
            c=['purple', 'blue', 'yellow'], marker='*', s=200)
plt.xlabel("x")
plt.ylabel("y")
The plot shows our updated cluster assignments and cluster centers.
K-means algorithm: Third Iteration
We see some improvement in the cluster assignments, but they are still not perfect. So we iterate again using the same two steps and visualize the result.
current_centers = data_df[['x','y']].groupby(current_labels).mean()
print(current_centers)
current_labels = assign_cluster_labels(data_df[['x','y']].values,
                                       current_centers.values)
plt.scatter(x='x', y='y', c=current_labels, data=data_df)
plt.scatter(x='x', y='y', data=current_centers,
            c=['purple', 'blue', 'yellow'], marker='*', s=200)
plt.xlabel("x")
plt.ylabel("y")
We can see that by the third iteration the cluster assignments look great. One can also see that we could nicely wrap the iteration steps into a function that iterates until the cluster means/centers stabilize.
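As a rough sketch of such a wrap-up, reusing the assign_cluster_labels function defined above, the function could look like the following. This is a minimal version that, for simplicity, assumes no cluster ever ends up empty.

def kmeans(data_df, k=3, random_state=1, max_iterations=100):
    points = data_df[['x', 'y']].values
    # initialize centers by sampling k data points, as we did above
    centers = data_df[['x', 'y']].sample(k, random_state=random_state).values
    labels = None
    for _ in range(max_iterations):
        new_labels = assign_cluster_labels(points, centers)
        if new_labels == labels:
            break  # no point changed cluster, so the centers have stabilized
        labels = new_labels
        # recompute each center as the mean of its assigned points
        centers = data_df[['x', 'y']].groupby(labels).mean().values
    return labels, centers

final_labels, final_centers = kmeans(data_df, k=3)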
In summary, we implemented the K-means clustering algorithm in Python using Pandas and saw a step-by-step example of how K-means clustering works. Stay tuned for a comparison of our K-means implementation with the one available in scikit-learn.