ML Series: Supervised Regression Model Options

In this article, we’re going to cover what models you could use to predict continuous numeric values. You have a number of model choices – let’s discuss three:

  1. Linear regression
  2. Random forest regressor 
  3. Gradient boosted trees

These are big enough topics to have their own articles and indeed, they shall. But for now, let’s give some high-level descriptions of these models & when they perform best.

Linear Regression

I have described linear regression in detail in this article. In short, it’s a supervised learning algorithm which fits a line of best fit to your data and uses that line to predict a continuous value as an output given an input.

This model is really good for data which has a genuine linear relationship – like calories consumed vs weight.
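To make the "line of best fit" idea concrete, here is a minimal sketch of least-squares fitting for a single input, written in plain Python. The calorie and weight numbers are made up for illustration, and the helper names `fit_line` and `predict` are my own rather than from any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Read the prediction off the fitted line."""
    return slope * x + intercept


# Made-up illustrative data: daily calories consumed vs weight in kg
calories = [1800, 2000, 2200, 2400, 2600]
weight_kg = [60.0, 65.0, 70.0, 75.0, 80.0]

slope, intercept = fit_line(calories, weight_kg)
print(predict(2300, slope, intercept))
```

In practice you would likely reach for a library implementation, but the arithmetic underneath is no more than this.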

That said, on messier real-world data, where relationships are rarely perfectly linear, a random forest regressor will often outperform the linear regression model.

Random Forest

A random forest is a collection of decision trees. Each data point is run through every tree in the forest, each tree produces its own prediction and, for regression, those predictions are averaged to give the final result. It’s important that we do this because, while individual decision trees are great & easy to read, they tend to overfit. They’re good with their training data, but aren’t flexible enough to accurately predict outcomes for new data. Each tree works on predicting the same thing, but it does so using a different bootstrapped dataset. Because we build lots of different trees from different samples of the data, the forest as a whole is much more likely to capture the range of patterns in it. You can read more about random forests in my article here.
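Here is a rough sketch of that idea in plain Python: many small trees, each trained on its own bootstrapped sample, with their predictions averaged. To keep it short it makes two simplifying assumptions: each "tree" is a single split (a stump), and there is only one input feature, so the random feature selection a real random forest also performs is left out. The data and all function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Depth-1 regression tree: choose the split on x with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Every x in this sample was identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # Bootstrapped dataset: n row indices drawn with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    # For regression, the forest's answer is the average of the trees' answers
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


# Made-up illustrative data: age vs salary (in thousands)
ages = [22, 25, 30, 35, 40, 45, 50, 55]
salaries = [25, 28, 35, 42, 50, 55, 58, 60]

forest = fit_forest(ages, salaries)
print(predict_forest(forest, 22), predict_forest(forest, 55))
```

Any single stump here is a crude model, but averaging many of them, each built from a slightly different sample, gives a smoother and more stable prediction.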

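So how do random forests compare to the linear regression model? Well, with linear regression, you need to massage your data a lot before you ingest it: you typically need to scale your numeric features, encode your categorical ones and deal with outliers. A random forest is far less demanding. Tree splits don’t depend on the scale of a feature, so you don’t have to scale, and outliers have much less influence, so you often don’t even have to remove them! Categorical data is the one caveat: some implementations handle it for you in the background, but others (scikit-learn’s, for example) still expect you to encode it as numbers first.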
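For a feel of what that "massaging" involves, here is a small sketch of the two most common steps, one-hot encoding a categorical column and standardising a numeric one, in plain Python. The helper names `one_hot` and `standardise` and the sample values are hypothetical; libraries provide their own versions of both.

```python
def one_hot(values):
    """Turn category labels into 0/1 columns, one column per distinct label."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


def standardise(values):
    """Rescale numbers so they have mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]


# Made-up illustrative columns
education = ["BSc", "PhD", "MSc", "BSc"]
ages = [25, 40, 33, 52]

categories, encoded = one_hot(education)
scaled_ages = standardise(ages)

print(categories)
print(encoded)
print(scaled_ages)
```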

Random forests are good at handling messy data with messy relationships. For example, if we simply had age vs salary, we might use a linear model. It’s simple data and it’s probably pretty linear by nature. But if we have age, salary, education level, gender, country and more, it becomes less clear, because the salary of a 40-year-old male with a PhD would (you would hope) differ quite significantly from that of your average 40-year-old. Hence, a random forest would likely handle this data better.

Gradient Boosted Trees

Remember bagging in random forests? Well if not, don’t worry – I’ll explain it. Bagging is where we randomly sample data from our training set with replacement. Replacement means we put each sampled row back, so it could be picked again.
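Sampling with replacement is easy to see in a few lines of Python. This sketch uses the standard library's `random.choices`, which draws with replacement; the row names are placeholders.

```python
import random

rng = random.Random(42)  # fixed seed so the example is repeatable
training_rows = ["row_a", "row_b", "row_c", "row_d", "row_e"]

# Draw as many rows as the training set has, putting each one back after it is drawn
bag = rng.choices(training_rows, k=len(training_rows))

# Rows that were never drawn for this bag
out_of_bag = [row for row in training_rows if row not in bag]

print(bag)
print(out_of_bag)
```

Because rows go back into the pool, a bag will usually contain some rows more than once and miss others entirely, which is what makes each tree's dataset a little different.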

When you sample blocks of data for your random forest (bagging), the samples are drawn independently of each other, so the trees can be built in parallel. Each bootstrapped sample forms the data which will be used for one tree. However, a boosted tree model uses ‘boosting’ instead. This means each model is trained sequentially and builds on the previously trained model.

To start, we select a bunch of data and train the first model on it. We then test that model with ALL of our training data – not just the chunk we took.

We then look at the outcome and we find that some points had a high error (not well predicted) which I have denoted in red. 

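When we go to select the datapoints for our next bag of data, the points are weighted, which makes it more likely that the datapoints with high error will be selected for bag 2. We then test the model with BOTH trees we’ve created so far and again look at the error. This continues until we have the number of trees we require. (Strictly speaking, this reweighting is the AdaBoost flavour of boosting. Gradient boosting follows the same sequential idea, but each new tree is trained to predict the errors, or residuals, left by the trees before it.)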
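Here is a minimal sketch of the sequential idea in plain Python, with one assumption worth flagging: rather than resampling weighted points as described above, it uses the residual-fitting form in which gradient boosting is usually defined, where each new tree is trained on the errors left by the ensemble so far. Each "tree" is a single split (a stump), the learning rate of 0.5 is an arbitrary choice, and the data and function names are hypothetical.

```python
def fit_stump(xs, ys):
    """Depth-1 regression tree. Assumes xs holds at least two distinct values."""
    best = None
    # Skip the largest x: splitting there would leave the right side empty
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # start from the simplest model: the mean
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Errors of the ensemble so far, measured on ALL the training data
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)  # the next tree learns those errors
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Made-up illustrative data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 4.0, 3.5, 6.0, 8.5, 8.0, 10.0]

model = fit_boosted(xs, ys)
baseline_mse = mse([sum(ys) / len(ys)] * len(ys), ys)
boosted_mse = mse([predict_boosted(model, x) for x in xs], ys)
print(baseline_mse, boosted_mse)
```

Notice that, unlike the bagging loop, each pass through this loop depends on the predictions from the previous pass, so the trees can't be built in parallel.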

This algorithm can help improve accuracy, but it can also overfit and it takes longer to train than your standard random forest. Tuning parameters such as tree depth, and reducing the number of dimensions, helps to reduce the overfitting risk (parameters are discussed in detail here).