ML Series: Feature Engineering and Selection

As with the previous sections in this series, there is a little overlap – but not a huge amount. The techniques we’re going to discuss are related to feature engineering and feature selection.

Binning data

Binning is a really useful technique. It’s a way to convert continuous variables into discrete variables by bucketing them in some way. For example, age ranges of 18-25, 26-30 and so on.

We can also use binning to reduce the number of different options in our categorical fields. For example, if you have a list of towns from all across the world, you may choose to bin them. Instead of London, you may say UK, or you may even say Europe, depending on the granularity you want your data to have.
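As a rough sketch of that idea, the snippet below bins towns into countries or regions with plain lookup tables. The tables, the bin_town helper and the 'Other' fallback are all made up for illustration; in practice you would load the mapping from reference data.

```python
# Hypothetical lookup tables, for illustration only.
TOWN_TO_COUNTRY = {
    "London": "UK",
    "Slough": "UK",
    "Paris": "France",
    "Osaka": "Japan",
}
COUNTRY_TO_REGION = {"UK": "Europe", "France": "Europe", "Japan": "Asia"}


def bin_town(town, level="country"):
    """Map a town to a coarser category; unknown values fall into 'Other'."""
    country = TOWN_TO_COUNTRY.get(town, "Other")
    if level == "country":
        return country
    return COUNTRY_TO_REGION.get(country, "Other")


towns = ["London", "Paris", "Slough", "Osaka", "Atlantis"]
by_country = [bin_town(t) for t in towns]
by_region = [bin_town(t, level="region") for t in towns]
print(by_country)
print(by_region)
```

Choosing the level is the granularity decision described above: the coarser the bin, the fewer categories the model has to deal with.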

Ultimately with binning, you can write your own bespoke rules or you can bin data based on some mathematical formulae – like binning by percentile.
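Both approaches can be sketched with the Python standard library alone. The age edges, the labels and the sample salaries below are assumptions made up for the example; libraries such as pandas offer similar functionality (cut and qcut), but this version keeps the logic visible.

```python
from bisect import bisect_left
from statistics import quantiles

# Bespoke rules: upper edges are inclusive, so 25 falls in "18-25".
# Note that anything below 18 would also land in the first bin.
AGE_EDGES = [25, 30, 40, 60]
AGE_LABELS = ["18-25", "26-30", "31-40", "41-60", "61+"]


def bin_age(age):
    """Return the label of the age range that contains `age`."""
    return AGE_LABELS[bisect_left(AGE_EDGES, age)]


def bin_by_quartile(values):
    """Assign each value to quartile 1-4 using the data's own cut points."""
    cuts = quantiles(values, n=4)  # three cut points split the data in four
    return [bisect_left(cuts, v) + 1 for v in values]


# Made-up salaries with one extreme value at the top.
salaries = [18_000, 22_000, 25_000, 31_000, 40_000, 52_000, 75_000, 200_000]
quartile_bins = bin_by_quartile(salaries)
print(bin_age(27), quartile_bins)
```

Notice that the 200,000 salary simply lands in the top quartile alongside 75,000, which is one way binning can soften the effect of outliers.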

Right, so that’s what it is, but why do we do it? Well, working with raw, continuous metrics can cause us issues because their distribution is often skewed: some values appear very frequently while others are quite rare. Many machine learning models perform better when numeric values are closer to a normal distribution.

Binning can help improve the accuracy of a machine learning model by reducing the noise in the data and helping to reduce the impact of outliers – it can also lead to less overfitting where the additional granularity is not required.

Personally, I have not found a reasonable use-case for converting continuous data to discrete, but it is very useful for grouping together categorical data into a smaller number of categories (e.g. W1 = London).

Deriving new features 

This is where feature engineering gets interesting: we develop new features which we think will improve our model’s accuracy.

Let’s think about a customer visiting our restaurant. They tell us that they travelled from the town of Slough, so we have a dataset which provides each customer’s start point. That, though, is pretty useless on its own. What would be useful is knowing how far they travelled in km, so we can create a new feature to store this and pass it into the model.
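One possible way to derive that distance is the haversine formula, which gives the straight-line (great-circle) distance between two points rather than the actual road distance. The coordinates below are approximate, and the restaurant’s location is purely hypothetical (assumed to be somewhere in central London).

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius of the Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate coordinates, assumed for illustration only.
TOWN_COORDS = {"Slough": (51.51, -0.59)}
RESTAURANT = (51.51, -0.13)  # hypothetical restaurant location


def distance_travelled_km(town):
    """Derive the new feature: distance from the customer's town."""
    lat, lon = TOWN_COORDS[town]
    return haversine_km(lat, lon, *RESTAURANT)


slough_km = distance_travelled_km("Slough")
print(round(slough_km, 1))
```

The town name goes in and a single number comes out, which is a far easier signal for a model to use.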

And what about customer tenure? They became our customer in August 2001 and it’s now September 2020. Again, passing the start date into the model is not going to help us much. Instead, we can calculate their tenure in months (229 in this case) and pass that into the model.
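A minimal sketch of the tenure calculation is shown here. We only know the month and year, so the first day of each month is assumed, and tenure_months is a name invented for the example.

```python
from datetime import date


def tenure_months(start, as_of):
    """Whole months elapsed between two dates."""
    months = (as_of.year - start.year) * 12 + (as_of.month - start.month)
    if as_of.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


# Day of month is an assumption; the source only gives month and year.
tenure = tenure_months(date(2001, 8, 1), date(2020, 9, 1))
print(tenure)  # 229
```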

Next, we can extract new features from existing features. If we have a list of albums in the format The Eminem Show (2000), we might want to extract the release year and then calculate the age of the album. So we would split this feature, extract 2000 and subtract it from the current year (2020 in this example), giving us an age of 20 years, or 240 months.
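Here is a small sketch of that extraction using a regular expression. It assumes every value follows the "Title (YYYY)" format, and it takes the current year as a parameter so the result is reproducible; album_age_months is a made-up helper name.

```python
import re

# Matches a trailing four-digit year in brackets, e.g. "Some Title (2000)".
TITLE_YEAR = re.compile(r"^(?P<title>.+?)\s*\((?P<year>\d{4})\)\s*$")


def album_age_months(album, current_year):
    """Return the album's age in months, or None if no year is found."""
    match = TITLE_YEAR.match(album)
    if match is None:
        return None
    return (current_year - int(match["year"])) * 12


age = album_age_months("The Eminem Show (2000)", current_year=2020)
print(age)  # 240
```

Returning None for values that don’t match keeps badly formatted rows visible, rather than silently turning them into a wrong age.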

As a final example, let’s think about domain names. If you have domain names in your data with a tonne of subdomains, like graph.example.com or something.something.example.com, you could group both of them as example.com, which might be more valuable information than hundreds of subdomains for each domain.
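A deliberately naive sketch of that grouping simply keeps the last two labels of the hostname. This is an assumption that works for names like example.com but not for suffixes such as .co.uk, where you would need a public suffix list instead. The registered_domain helper is invented for this example.

```python
def registered_domain(hostname, labels=2):
    """Keep only the last `labels` parts of a hostname (naive approach)."""
    parts = hostname.lower().strip(".").split(".")
    return ".".join(parts[-labels:])


hosts = ["graph.example.com", "something.something.example.com", "example.com"]
grouped = [registered_domain(h) for h in hosts]
print(grouped)
```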

Scaling and encoding data

Scaling is a method used to standardise the range of data. This is important because, if one field stores age (between 18 and 90) and another stores salary (between 10,000 and 200,000), the machine learning algorithm might bias its results towards the larger numbers, as it may assume they’re more important. The scikit-learn documentation states that “If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.”

Using the scikit-learn library (for example, its StandardScaler), we can convert each feature to have a mean of zero and a standard deviation of 1, removing the potential bias in the model.

For some models, this is an absolute requirement, as certain algorithms expect your data to be normally distributed and centred around zero.
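To make the calculation concrete, here is a standard-library sketch of standardisation: subtract the mean and divide by the (population) standard deviation. As far as I’m aware, scikit-learn’s StandardScaler performs the same calculation per feature, but treat this as an illustration rather than a replacement for it. The sample values are made up.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Rescale values to have a mean of 0 and a standard deviation of 1."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]  # a constant feature carries no signal
    return [(v - mean) / std for v in values]


ages = [18, 35, 52, 90]
salaries = [10_000, 45_000, 80_000, 200_000]

scaled_ages = standardise(ages)
scaled_salaries = standardise(salaries)
print([round(v, 2) for v in scaled_ages])
print([round(v, 2) for v in scaled_salaries])
```

After scaling, age and salary sit on the same scale, so neither dominates purely because of the size of its raw numbers.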

Encoding

Many machine learning algorithms don’t accept or understand strings, and those that do often still perform better and produce more accurate results with numeric inputs. Let’s say we have this data in a column of our dataset:

  • Red
  • Blue
  • Yellow
  • Red
  • Red
  • Green
  • Purple

We want to encode these values to numbers to achieve the best possible outcome from our model. If we use label encoding we can simply convert those values to numbers:

  • 1
  • 2
  • 3
  • 1
  • 1
  • 4
  • 5
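The mapping above can be reproduced with a few lines of Python. This sketch numbers the values from 1 in order of first appearance to match the list; label_encode is a made-up helper, and library encoders may number things differently (scikit-learn’s LabelEncoder, for instance, starts at 0 and sorts the values first).

```python
def label_encode(values):
    """Assign integer codes starting at 1, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
        encoded.append(codes[value])
    return encoded, codes


colours = ["Red", "Blue", "Yellow", "Red", "Red", "Green", "Purple"]
encoded, codes = label_encode(colours)
print(encoded)  # [1, 2, 3, 1, 1, 4, 5]
```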


The problem, though, is that numbers imply a relationship. Think about a salary field, where the bigger it is, the better; or a risk field, where the higher the number, the higher the risk. But in our example, just because purple is colour 5 doesn’t make it 5 times better than red. This approach can lead to more weight being given to certain colours and hence can bias our model.

So the lesson here is this: if your values have no natural order (colours are a good example, as red is not “more” than yellow), you should not use label encoding. However, if your values are ordered (e.g. risk, where risk 5 is higher priority and should be given more weight than risk 1), then you should use it.

If you have no relationship between your datapoints, you should use one hot encoding. This creates a new binary column for each distinct value. For example:

Data   Red  Yellow  Blue  Green  Purple
row1   1    0       0     0      0
row2   0    0       1     0      0
row3   0    1       0     0      0
row4   1    0       0     0      0
row5   1    0       0     0      0
row6   0    0       0     1      0
row7   0    0       0     0      1

Here, you can see that we have a new column per colour and a binary response to the question ‘was it that colour?’.
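As a sketch, one hot encoding can be written in a couple of lines of plain Python. The column order is passed in explicitly so it matches the table above; one_hot is a made-up helper, and in a real project you would more likely reach for a library function that does the same job.

```python
def one_hot(values, categories):
    """Return one row per value, with a 1 in the matching category column."""
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]


categories = ["Red", "Yellow", "Blue", "Green", "Purple"]
colours = ["Red", "Blue", "Yellow", "Red", "Red", "Green", "Purple"]

rows = one_hot(colours, categories)
for number, row in enumerate(rows, start=1):
    print(f"row{number}", row)
```

Each row contains exactly one 1, so no colour ends up looking bigger or more important than another.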

This approach solves the problem we face with label encoding, but it adds a tonne more columns to your dataset, especially if you have several categorical columns, with many unique values within.