K-Means clustering is an unsupervised machine learning model.
Unsupervised learning is where we do not provide the model with the actual outputs from the data. It aims to model the underlying structure or distribution in the data in order to learn more about it. One of the most popular use cases of unsupervised learning is association rules, where we uncover rules that describe a large chunk of our data. For example, Amazon uses such learning to state that people who bought this also bought that (Kodey 2020).
Essentially, this model looks at the data and tries to find clusters of information. If we think about customer recommendations, you may see ‘other customers like you also liked…’ and often this is driven by a clustering algorithm, which looks at the many data points it holds about you and finds a cluster, or group, of customers who are similar to you.
Thinking about Netflix, they would train their model on their entire customer population to produce the clustering logic. Then, when you sign up, Netflix starts to collect data about you: what you’ve watched, whether you frequently play kids’ shows (suggesting you have children), your age, gender and more. Once they have enough information about you, they’ll place you into one of those clusters. Hence, over time, your recommendations will become more relevant.
Okay, so why is it called K-Means, what’s K? In this case K is the number of clusters that the model should output. How do we define K? Well, often you can decide what value K should be by looking at the data visually. Let’s look at an overly simplified example:
Here, we can say we have 3 clusters of customers. We can see that immediately, but machines can’t, so they need to run the k-means algorithm. It starts by placing 3 centroids, one per cluster, onto the plot, like below:
The distance between each datapoint and each centroid is measured, and each datapoint is assigned to its closest centroid.
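To make the assignment step concrete, here is a minimal sketch using NumPy. It assumes Euclidean distance (the usual choice for k-means, though the text above does not name a metric), and the function name assign_to_nearest and the sample points are made up purely for illustration.

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    # points has shape (n, d); centroids has shape (k, d).
    # distances[i, j] is the Euclidean distance from point i to centroid j.
    distances = np.linalg.norm(
        points[:, None, :] - centroids[None, :, :], axis=2
    )
    # Pick the centroid with the smallest distance for every point.
    return distances.argmin(axis=1)

# Hypothetical toy data: two points near (1, 1) and two near (8, 9).
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.5], [8.5, 9.0]])

labels = assign_to_nearest(points, centroids)
print(labels)  # → [0 0 1 1]
```

The first two points land with centroid 0 and the last two with centroid 1, which is exactly the "closest centroid wins" rule described above.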
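We then recalculate each centroid’s value as the mean of its cluster (the sum of all values belonging to the centroid / the number of values). This means the centroid will now be in the middle of its cluster. These two steps, assigning points and recalculating centroids, are repeated until the centroids stop moving. Each centroid then has a kind of decision boundary, where all values within that boundary belong to that cluster.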
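Putting the steps together, the whole procedure can be sketched in a few lines of NumPy. This is only an illustrative sketch, not a production implementation: it assumes Euclidean distance, picks the starting centroids at random from the data points, and leaves a centroid where it is if no points are assigned to it. The names k_means, n_iters and seed are hypothetical choices for this example.

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Minimal k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(n_iters):
        # Step 1: assign every point to its closest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)

        # Step 2: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid (an assumption).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment against the centroids we return.
    distances = np.linalg.norm(
        points[:, None, :] - centroids[None, :, :], axis=2
    )
    return centroids, distances.argmin(axis=1)

# Hypothetical toy data: two obvious groups of two points each.
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
final_centroids, final_labels = k_means(data, k=2)
print(final_centroids)
print(final_labels)
```

On this toy data the two centroids settle at the middle of each pair of points. On real data the result can depend on the random starting positions, which is why library implementations typically run the algorithm several times and keep the best result.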
The algorithm is really easy to implement in Python, which is something we will cover in an upcoming post.