A random forest is a collection of decision trees. Each data point is run through every tree in the forest, and each tree provides a classification. This matters because, while single decision trees are great and easy to read, they tend to overfit. They’re good with their training data, but aren’t flexible enough to accurately predict outcomes for new data.
Each tree works on predicting the same thing, but each one is trained on a different bootstrapped dataset. Because we build lots of different trees from different sample data, the mistakes of individual trees tend to cancel out, and the forest as a whole copes better with new data.
What is a bootstrapped dataset?
A bootstrapped dataset is a random sample of the original dataset, drawn with replacement. So, let’s say we have a dataset: 55, 31, 42, 88, 12, 44, 19, 91.
Here, we would randomly draw one number at a time (say 55, then 12, then 91) and repeat this N times, where N is usually the size of the original dataset. Because each number is selected from, but not removed from, the original dataset, the same number can be randomly selected several times and you’ll have duplicates in your bootstrapped sample. That’s OK.
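To make the draw concrete, here is a minimal sketch in Python. It assumes the common convention that the bootstrapped sample has the same number of values as the original dataset; the function name `bootstrap_sample` and the fixed seed are just illustrative choices.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) values from data, with replacement."""
    # rng.choice never removes the value it picks, so duplicates can occur.
    return [rng.choice(data) for _ in range(len(data))]


data = [55, 31, 42, 88, 12, 44, 19, 91]
rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable
sample = bootstrap_sample(data, rng)
print(sample)
```

Running it a few times with different seeds should show some values appearing more than once and others not at all.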
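We don’t need to hold data back on purpose to get out-of-bag samples. Because we draw with replacement, some rows get picked more than once and others never get picked at all. With N draws from N rows, roughly a third of the rows are left out of any given bootstrapped sample, and those left-out rows are that tree’s out-of-bag samples. We use these to validate our random forest. Without them, we would have no data to test our forest with that its trees had not already seen.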
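As a rough check on how many rows end up out-of-bag, the sketch below draws N row indices with replacement and lists the rows that were never picked. The chance that a given row is missed in all N draws is (1 - 1/N)^N, which approaches 1/e (about 36.8%) as N grows. The helper names and the choice of 1,000 rows are assumptions for illustration.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag(n_rows, drawn):
    """Return the row indices that were never drawn."""
    drawn_set = set(drawn)
    return [i for i in range(n_rows) if i not in drawn_set]


rng = random.Random(42)  # fixed seed (an assumption) for repeatability
n_rows = 1000
drawn = bootstrap_indices(n_rows, rng)
oob = out_of_bag(n_rows, drawn)
oob_fraction = len(oob) / n_rows
print(f"{oob_fraction:.1%} of rows are out-of-bag for this tree")
```

Each tree gets its own bootstrapped sample, so each tree also has its own set of out-of-bag rows.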
How many trees do I need?
Generally speaking, the more trees we have in our forest, the more accurate our model will be, although the gains level off after a certain point. There is also a trade-off with computation time. We can use the accuracy on our out-of-bag samples to decide how many trees we need: keep adding trees until that accuracy stops improving. You can also make the choice between running a few trees with relatively large sample data versus lots of trees with smaller sample data, based on the accuracy of the model in both scenarios.
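One way to get a feel for why extra trees help, and why the benefit tails off, is a simplified calculation. The sketch below assumes every tree is right with the same probability and that the trees make their mistakes independently. Real trees in a forest are correlated, so treat this as an idealised illustration and not as a way to pick the number of trees. The function name and the 0.6 per-tree accuracy are made up for the example.

```python
from math import comb


def majority_correct_probability(n_trees, p_tree):
    """Probability that more than half of n_trees independent trees are right."""
    needed = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p_tree**k * (1 - p_tree) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )


# Odd tree counts avoid tied votes in a two-class problem.
for n in (1, 3, 11, 51, 101):
    print(n, round(majority_correct_probability(n, 0.6), 3))
```

Under these assumptions the printed probability climbs quickly at first and then flattens out. In practice, checking the out-of-bag accuracy of the real forest is the more reliable guide.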
How do we know what the prediction is if each tree can provide a different output?
Once each tree has provided a classification, we take the classification with the most votes as the forest’s prediction. So imagine we run a loan application through a 3-tree forest:
- Tree 1 = Approved
- Tree 2 = Rejected
- Tree 3 = Approved
We count the approved and rejected votes across all the trees, and the majority wins. So, in this case, the loan would be approved.
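The vote count for the loan example can be sketched in a few lines of Python. The function name `majority_vote` is an illustrative choice, and the three votes are the ones from the list above.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that received the most votes."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


tree_votes = ["Approved", "Rejected", "Approved"]
decision = majority_vote(tree_votes)
print(decision)  # Approved
```

With two possible outcomes, an odd number of trees means the vote can never be tied.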
How do we test our forest model?
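We can then test our model using our out-of-bag samples. Each out-of-bag sample is run through only the trees that did not see it during training, and the majority vote of those trees is its prediction. The accuracy of the model is the proportion of out-of-bag samples that were correctly classified. In the above example, if the correct result was approved, the forest classified that application correctly even though one of the 3 trees was wrong, because it’s the majority vote that counts. If 67 out of 100 out-of-bag samples were classified correctly, we would have 67% accuracy.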
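Here is a small sketch of that accuracy calculation. The `records` list is made-up data: each entry pairs a sample’s true label with the votes from only those trees for which the sample was out-of-bag. The function name `oob_accuracy` is likewise just for illustration.

```python
from collections import Counter


def oob_accuracy(records):
    """Share of samples whose out-of-bag majority vote matches the true label."""
    correct = 0
    for true_label, oob_votes in records:
        predicted = Counter(oob_votes).most_common(1)[0][0]
        if predicted == true_label:
            correct += 1
    return correct / len(records)


# Hypothetical (true label, votes from trees that never saw the sample) pairs.
records = [
    ("Approved", ["Approved", "Rejected", "Approved"]),  # majority is right
    ("Rejected", ["Rejected", "Rejected"]),              # majority is right
    ("Approved", ["Rejected", "Rejected", "Approved"]),  # majority is wrong
    ("Rejected", ["Rejected"]),                          # majority is right
]
accuracy = oob_accuracy(records)
print(accuracy)  # 0.75
```

Note that the first sample counts as fully correct even though one of its trees voted the wrong way.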