There are many things to consider when getting your data ready for machine learning models. In this article, I am going to cover encoding and standardization (or scaling) of our data.
Standardization / scaling to remove or reduce bias
Scaling is a method used to standardise the range of data. This matters because if one field stores age (between 18 and 90) and another stores salary (between 10,000 and 200,000), the machine learning algorithm may bias its results towards the larger numbers, treating them as more important. The scikit-learn documentation states that "If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected."
Using scikit-learn, we can transform each feature to have a mean of zero and a standard deviation of one, removing this potential bias from the model.
For some models this is an absolute requirement, as certain algorithms assume your data is normally distributed and centred around zero.
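For example, here is a minimal sketch using scikit-learn's StandardScaler (the age and salary values are made up for illustration):

```python
# A minimal sketch of standardisation with scikit-learn's StandardScaler.
# The 'age' and 'salary' columns hold made-up illustrative values.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'age': [18, 35, 52, 90],
    'salary': [10_000, 45_000, 120_000, 200_000],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df)  # each column now has mean 0 and std 1

print(pd.DataFrame(scaled, columns=df.columns))
```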
Encoding
Many machine learning algorithms don't accept strings, and even those that do tend to perform better and produce more accurate results with numeric inputs. Let's say we have this data in a column of our dataset:
- Red
- Blue
- Yellow
- Red
- Red
- Green
- Purple
We want to encode these values as numbers to achieve the best possible outcome from our model. If we use label encoding, we can simply convert those values to numbers (see the sketch after this list):
- 1
- 2
- 3
- 1
- 1
- 4
- 5
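A sketch of the same idea with scikit-learn's LabelEncoder. Note that it assigns integers alphabetically starting at 0, so the exact numbers differ from the 1-5 mapping above:

```python
# Label encoding with scikit-learn's LabelEncoder.
# It assigns integers in alphabetical order starting at 0
# (Blue=0, Green=1, Purple=2, Red=3, Yellow=4), so the exact
# numbers differ from the illustration above.
from sklearn.preprocessing import LabelEncoder

colours = ['Red', 'Blue', 'Yellow', 'Red', 'Red', 'Green', 'Purple']

encoder = LabelEncoder()
encoded = encoder.fit_transform(colours)

print(list(encoded))     # [3, 0, 4, 3, 3, 1, 2]
print(encoder.classes_)  # ['Blue' 'Green' 'Purple' 'Red' 'Yellow']
```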
The problem is that numbers imply a relationship. Think of a salary field, where bigger means better, or a risk field, where a higher number means higher risk. But in our example, purple being colour 5 doesn't make it five times better than red, so this approach can give more weight to certain colours and hence bias our model.
So the lesson here is this: if your values have no relationship to one another, as with colours (red has no relationship to yellow), then you should not use label encoding. However, if your values do have a meaningful order (e.g. risk, where risk 5 is higher priority and should be given more weight than risk 1), then it is the right choice.
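Where the order does carry meaning, scikit-learn's OrdinalEncoder lets you state that order explicitly (the risk levels below are hypothetical; strictly, scikit-learn intends LabelEncoder for target labels and OrdinalEncoder for input features):

```python
# Encoding an ordered category with scikit-learn's OrdinalEncoder.
# The 'risk' levels here are hypothetical illustrative values.
from sklearn.preprocessing import OrdinalEncoder

risk = [['low'], ['high'], ['medium'], ['low']]

# Passing the categories explicitly preserves the intended order,
# so 'high' receives the largest number.
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
print(encoder.fit_transform(risk))  # [[0.] [2.] [1.] [0.]]
```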
If there is no relationship between your values, you should use one hot encoding instead. This creates a new column per category with a binary output. For example:
| Data | Red | Yellow | Blue | Green | Purple |
|------|-----|--------|------|-------|--------|
| row1 | 1   | 0      | 0    | 0     | 0      |
| row2 | 0   | 0      | 1    | 0     | 0      |
| row3 | 0   | 1      | 0    | 0     | 0      |
| row4 | 1   | 0      | 0    | 0     | 0      |
| row5 | 1   | 0      | 0    | 0     | 0      |
| row6 | 0   | 0      | 0    | 1     | 0      |
| row7 | 0   | 0      | 0    | 0     | 1      |
Here, you can see that we have a new column per colour and a binary response to 'was it that colour?'.
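A sketch producing the table above with pandas' get_dummies (the column name 'colour' is an assumption for illustration):

```python
# One hot encoding with pandas' get_dummies.
# The column name 'colour' is a made-up assumption for illustration;
# dtype=int gives 0/1 output rather than True/False.
import pandas as pd

df = pd.DataFrame({'colour': ['Red', 'Blue', 'Yellow', 'Red',
                              'Red', 'Green', 'Purple']})

one_hot = pd.get_dummies(df['colour'], dtype=int)
print(one_hot)  # one binary column per colour, as in the table above
```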
This approach solves the problem we face with label encoding, but it adds a tonne of extra columns to your dataset, especially if you have several categorical columns with many unique values in each.
So to summarise:
- If your values have a relationship to one another (like salary or risk), where a higher value genuinely means more emphasis should be placed on it, then you should use a label encoder.
- If your values have no relation to one another (e.g. colours or cities), then you should use the one hot encoding method.
As you can see below, these concepts are straightforward to implement:
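Here is a minimal combined sketch using scikit-learn's ColumnTransformer to scale the numeric columns and one hot encode the categorical one in a single step (the column names are assumptions for illustration):

```python
# Scaling numeric columns and one hot encoding a categorical column
# in one step with scikit-learn's ColumnTransformer.
# All column names are made-up assumptions for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'age': [18, 35, 52, 90],
    'salary': [10_000, 45_000, 120_000, 200_000],
    'colour': ['Red', 'Blue', 'Green', 'Red'],
})

preprocess = ColumnTransformer([
    # sparse_output=False keeps the result a dense array (scikit-learn >= 1.2)
    ('onehot', OneHotEncoder(sparse_output=False), ['colour']),
    ('scale', StandardScaler(), ['age', 'salary']),
])

print(preprocess.fit_transform(df))
```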
