In the last article, we built a simple machine learning model which predicted the quality of wine based on a number of features. In this article, we’re going to tune that model.
Our starting point was an accuracy of 0.68 (68%), so we aim to get better than that.
The way we make models more accurate is threefold: feature reduction, feature engineering, and hyperparameter tuning. In this article, we’re going to look specifically at tuning our hyperparameters.
What parameters can we tune?
In this case, we have a random forest algorithm. I’ve listed out the parameters we can tune below, along with a description of each.
| Parameter | Description |
|---|---|
| `n_estimators` | The number of trees you want to include in your forest. The trees essentially vote and the majority wins. More trees generally gives you better, more stable model performance once the model is released into the wild, but training and prediction get slower. The key is to use the maximum number of trees your processor can comfortably handle. |
| `min_samples_leaf` | The minimum number of data points permitted on a leaf node. The leaf is the end node of a decision tree (a node is what we call each bubble of the tree). A smaller minimum makes the tree more prone to capturing noise in the training data and overfitting; keeping it at around 50 observations should help remove some of that overfitting. |
| `bootstrap` | All about the sampling methodology. We are almost always working with a sample of the population as our training dataset, and bootstrapping repeatedly samples that dataset *with replacement*: instead of removing a drawn data point, it puts it back, so the same data points can appear in multiple samples. Each tree therefore sees a slightly different distribution of the data, giving a more representative sampling strategy. |
| `max_depth` | The maximum number of levels in each decision tree. The deeper the tree, the more it understands about the data. But if the tree is too deep, it’ll overfit your model and start to sacrifice accuracy on new data. |
| `min_samples_split` | The minimum number of data points a node must hold before the node is split. Tomorrow’s article will cover decision trees and random forests in more theoretical detail. |
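To make the table concrete, here is a minimal sketch of how these parameters map onto scikit-learn’s `RandomForestClassifier`. The parameter values below are illustrative assumptions (not tuned), and a synthetic dataset stands in for the wine data:

```python
# Sketch: the hyperparameters above as RandomForestClassifier arguments.
# Values are illustrative, not the tuned values from this article.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in dataset: 500 rows, 8 features.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

model = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    max_depth=10,           # maximum levels in each tree
    min_samples_leaf=50,    # minimum data points permitted on a leaf
    min_samples_split=100,  # minimum data points a node needs before splitting
    bootstrap=True,         # sample the training data with replacement
    random_state=42,
)
model.fit(X, y)
print(model.score(X, y))  # training accuracy, between 0 and 1
```

With `min_samples_leaf=50` on only 500 rows, the trees stay shallow by design; that is the overfitting guard described in the table.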
To find the best values for these parameters, we used GridSearchCV: we defined some candidate values in a parameter grid and let the search run. We get the following results:
This affects the model score massively: from 68% to 84% in one fell swoop. So you can see the importance of selecting the correct hyperparameters for the job!
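For reference, a grid search like the one described can be sketched as below. The grid values are illustrative assumptions (the article’s actual grid isn’t shown here), and scikit-learn’s built-in wine dataset stands in for the wine-quality data from the previous article:

```python
# Sketch of a GridSearchCV run over random forest hyperparameters.
# Grid values are assumptions; load_wine stands in for the wine-quality data.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
    "bootstrap": [True],
}

# cv=3 means 3-fold cross-validation for every parameter combination.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)

print(search.best_params_)          # the winning combination
print(search.best_score_)           # mean cross-validated accuracy
print(search.score(X_test, y_test)) # accuracy on held-out data
```

GridSearchCV refits the model on the full training set with the best parameters, so `search` can be used directly for prediction afterwards.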