ML Series: A Look At Gradient Boosted Trees

In a traditional random forest, there is parallel learning. In the below, we can see that each model samples data from the overall dataset and produces a model from it. It does this in parallel and independently – no model has any influence on any other model.

Gradient boosting seeks to improve on a weak learner (in this case a decision tree) by taking the residuals (errors) from the first model and feeding them into model 2. The below from Quora shows this in action.

The first graph shows the original dataset. The second graph shows the line that was fitted with model 1.

Now, when we input ONLY the errors into model 2, we get a very different line which captures those mis-predicted datapoints well.

We then input the errors from model 2 into model 3. Model three looks at the combined output of model 1 and model 2. It then takes those errors and fits its own line to them.

The addition of subsequent models does not impact/change the original. But, when it comes to making a decision at the end, we ensure that the data is accurately captured by at least one tree.

Each subsequent model is built to accurately predict the cases that the previous model performed poorly on (i.e. those cases which were poorly predicted).  Each successive model attempts to correct the COMBINED shortcomings of the previous models.

Note: the more trees we have, the more accurate our model is with a standard random forest. This is not the case with gradient boosted trees. If you reach a point of excessive tree counts, you will risk overfitting.