First things first, what is k-means clustering? It’s an algorithm that groups similar datapoints together into the same ‘cluster’. So, if we have 5,000,000 customers in our dataset, we may decide to group them into 10 clusters of customers. We may do this for marketing purposes, for example.

So, what’s K? In this case K is the number of clusters that the model should output. How do we define K? Well, often you can decide what value K should be by looking at the data visually. Let’s look at an overly simplified example:

Here, we can say we have 3 clusters of customers. We can see that immediately, but a machine can’t, so it needs to run the k-means algorithm. The algorithm starts by placing 3 centroids onto the plot, like below:

The distance between each datapoint and each centroid is measured, and each datapoint is assigned to its closest centroid.

We then recalculate each centroid’s position (the sum of all values belonging to the centroid / the number of values). This means the centroid will now be in the middle of its cluster, and it then has a kind of decision boundary, where all values within that boundary belong to that cluster. These two steps — assignment and recalculation — repeat until the centroids stop moving.
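The assign-then-recompute loop can be sketched in a few lines of NumPy. This is a toy illustration with made-up points and caller-supplied starting centroids, not how scikit-learn implements it (real implementations use smarter initialisation such as k-means++):

```python
import numpy as np

def kmeans(points, k, init_idx, n_iter=10):
    """Toy k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    centroids = points[init_idx].astype(float)
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)          # nearest centroid wins
        for j in range(k):                     # recompute each centroid
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Three obvious blobs of customers (hypothetical data)
pts = np.array([[0.0, 0.0], [0.1, 0.2],
                [5.0, 5.0], [5.2, 4.9],
                [10.0, 0.0], [9.8, 0.3]])
labels, cents = kmeans(pts, k=3, init_idx=[0, 2, 4])
print(labels)   # [0 0 1 1 2 2] - each blob gets its own cluster
```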

## Okay, got it! What do I need to keep in mind?

First and foremost, we need to avoid ‘**the curse of dimensionality**’. What is that? Well, as we move to more and more dimensions, datapoints which were close together become further apart, so they may no longer be neighbours. If we look at the below charts (which I found on this great website), you can see that the points are clustered together pretty closely on the 1D chart. As we move to 2D, some of those points move to the extremities of the axis, pushing them further apart.

Look at our points at 0.4 in the original graph. As we move to two-dimensional data, those points have been cast to opposite ends of the Y axis. They are no longer nearest neighbours. When we move to 3 dimensions, the problem is further exacerbated.

We can somewhat mitigate this issue by using **PCA** (Principal Component Analysis). You can read more about it on my post here. But essentially, Principal Component Analysis reduces the number of features in our dataset by squashing highly correlated features together. This is less accurate than having independent features, but it makes things computationally more efficient and reduces the number of dimensions in our dataset for algorithms like KMeans or Nearest Neighbours.
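A minimal scikit-learn sketch of this, using synthetic data where two of the three features are highly correlated (the data and the choice of 2 components are just for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(size=200)
# Feature 2 is almost a copy of feature 1; feature 3 is independent
X = np.column_stack([x,
                     x * 2 + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])

# Squash the 3 columns down to 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (200, 2)
print(pca.explained_variance_ratio_.sum())   # nearly all the variance survives
```

Because the two correlated columns carry nearly the same information, dropping a dimension loses almost none of the variance.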

Beyond dimensionality, we also need to look into **scaling/standardizing** our data, which brings every field onto a comparable range. This is important because if one field stores age (between 18 and 90) and another stores salary (between 10,000 and 200,000), the machine learning algorithm may bias its results towards the larger numbers, as it may assume they’re more important. The scikit-learn documentation states that “If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.”
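With scikit-learn, `StandardScaler` does this per column — it subtracts the column mean and divides by the column standard deviation (the age/salary values below are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age spans ~70 units; salary spans ~190,000 and would dominate distances
X = np.array([[18, 10_000],
              [35, 45_000],
              [52, 90_000],
              [90, 200_000]], dtype=float)

scaler = StandardScaler()          # (x - mean) / std, per column
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))       # ~[0, 0]
print(X_scaled.std(axis=0))        # [1, 1] - both fields now on the same scale
```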

We also need to handle categorical data which we can do through **encoding**. Let’s say we have this data in a column of our dataset:

- Red
- Blue
- Yellow
- Red
- Red
- Green
- Purple

We want to encode these values to numbers to achieve the best possible outcome from our model. If we use label encoding we can simply convert those values to numbers:

- 1
- 2
- 3
- 1
- 1
- 4
- 5

The problem, though, is that numbers imply a relationship. Think about a salary field: the bigger the number, the better. Or a risk field: the higher the number, the higher the risk. But in our example, just because purple is colour 5 doesn’t make it 5 times better than red, so this approach can lead to more weight being given to certain colours & hence can bias our model.

So the lesson here is this – if your values have no relationship to one another – just like colours, where red has no relationship to yellow – then you should not use label encoding. However, if your values are ordered (e.g. risk, where risk 5 is higher priority and should be given more weight than risk 1), then label encoding is appropriate.
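To make the distinction concrete, here’s a small pandas sketch (the colour and risk values are made up). Label encoding assigns arbitrary numbers to colours, while an explicit mapping preserves the genuine ordering of risk levels:

```python
import pandas as pd

# Unordered category: the codes are alphabetical, not meaningful
colours = pd.Series(["Red", "Blue", "Yellow", "Red", "Red", "Green", "Purple"])
codes = colours.astype("category").cat.codes
print(codes.tolist())   # [3, 0, 4, 3, 3, 1, 2] - implies Yellow > Red, which is meaningless

# Ordered category: an explicit mapping keeps the ranking we actually want
risk = pd.Series(["low", "high", "medium", "low"])
risk_codes = risk.map({"low": 1, "medium": 2, "high": 3})
print(risk_codes.tolist())   # [1, 3, 2, 1] - higher number really does mean higher risk
```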

If there is no relationship between your values, you should use one hot encoding. This creates a binary output in a new column. For example:

| Data | Red | Yellow | Blue | Green | Purple |
| ---- | --- | ------ | ---- | ----- | ------ |
| row1 | 1 | 0 | 0 | 0 | 0 |
| row2 | 0 | 0 | 1 | 0 | 0 |
| row3 | 0 | 1 | 0 | 0 | 0 |
| row4 | 1 | 0 | 0 | 0 | 0 |
| row5 | 1 | 0 | 0 | 0 | 0 |
| row6 | 0 | 0 | 0 | 1 | 0 |
| row7 | 0 | 0 | 0 | 0 | 1 |

Here, you can see that we have a new column per colour & a binary response ‘was it that colour?’.

This approach solves the problem we face with label encoding, but it adds a tonne more columns to your dataset, especially if you have several categorical columns, with many unique values within.
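In pandas, `get_dummies` produces exactly this kind of table from a categorical column (same hypothetical colour data as above):

```python
import pandas as pd

colours = pd.Series(["Red", "Blue", "Yellow", "Red", "Red", "Green", "Purple"],
                    name="colour")

# One new binary column per unique colour; exactly one 'hot' value per row
one_hot = pd.get_dummies(colours)
print(one_hot.shape)   # (7, 5) - five colours, five new columns
```

Note how the column count grows with the number of unique values — with several high-cardinality categorical columns, this widens the dataset very quickly.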

**Outliers** can be a real pain & can skew our datasets. We can handle them in a number of ways (whichever method you use, you always run the risk of eliminating valid datapoints, but sometimes this is a necessary evil):

- Statistically, 99.7% of normally distributed data falls within 3 standard deviations from the mean, so it is not unreasonable to take values outside of 3 standard deviations as outliers.
- Another method we could use is to drop the top and bottom 5% of data, which will reduce skew but does also risk removing legitimate datapoints.
- We could also set a limit (often called capping or winsorizing), where we set our upper and lower limits to the 95th and 5th percentiles. We then say ‘if the value falls below the lower limit, set the value equal to the lower limit’ and ‘if the value exceeds the upper limit, set the value equal to the upper limit’.
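All three approaches can be sketched with NumPy (the salary figures below are synthetic, and the 3-standard-deviation and 5th/95th-percentile thresholds are the ones discussed above):

```python
import numpy as np

rng = np.random.default_rng(0)
# 995 ordinary salaries plus a handful of extreme outliers
salary = np.append(rng.normal(50_000, 10_000, 995),
                   [900_000, 1_200_000, -5, 2_000_000, 950_000])

# 1. Drop values more than 3 standard deviations from the mean
mu, sigma = salary.mean(), salary.std()
kept = salary[np.abs(salary - mu) <= 3 * sigma]

# 2. Drop the top and bottom 5% of values
lo, hi = np.percentile(salary, [5, 95])
trimmed = salary[(salary >= lo) & (salary <= hi)]

# 3. Cap to the 5th/95th percentile instead of dropping
capped = np.clip(salary, lo, hi)

print(len(salary), len(kept), len(trimmed), len(capped))
```

The capping approach is the only one that keeps every row, which matters when the rest of the row still carries useful information.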

We also must handle missing values, which we can do by:

- Dropping the missing value
- Imputing a fixed value
- Imputing a calculated value (e.g. mean/median of the field)
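Each of these options maps to a one-liner in pandas (the toy DataFrame below is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, np.nan, 40, 31],
                   "salary": [30_000, 52_000, np.nan, 45_000]})

dropped = df.dropna()                        # drop rows with any missing value
fixed   = df.fillna({"age": 0, "salary": 0}) # impute a fixed value
imputed = df.fillna(df.median(numeric_only=True))  # impute the column median

print(len(dropped))               # 2 - two rows had a missing value
print(imputed.loc[1, "age"])      # 31.0 - the median of the age column
```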

Finally, we must also eliminate collinearity where possible to give the best chance of our datapoints remaining close together.
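One common sketch for spotting collinearity is to scan the correlation matrix and drop one column from each highly correlated pair. The column names, data, and 0.95 threshold below are illustrative choices, not a fixed rule:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"height_cm": rng.normal(170, 10, 100)})
df["height_in"] = df["height_cm"] / 2.54      # perfectly collinear duplicate
df["weight_kg"] = rng.normal(70, 8, 100)      # independent feature

# Look only at the upper triangle so each pair is counted once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)   # ['height_in'] - it duplicates height_cm
```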

This is by no means an exhaustive list of activities to undertake to get your clustering to work optimally, but it’s a good start & I hope it has been useful.