An end-to-end Random Forest Classifier implementation guide

In this article, we are going to go through some of the key steps to implementing a random forest machine learning model. The data we will be using is from Kaggle – it’s a dataset describing whether patients have diabetes or not.

Luckily, this dataset is nice and clean, so we don’t need to worry about cleaning & encoding our data.

The first thing we do is import a whole lot of libraries. The use of these will become evident in the coming stages of the ML modelling. We then ingest our data, specifying what we consider to be a missing value. We then split the data into two chunks: the features and the target (the thing we are trying to predict).

import numpy as np
import pandas as pd
import pickle
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV

#Ingest data, split features & target
#---------------------------------
missing_values = ['?', '-', '']
data = pd.read_csv('/home/Datasets/diabetes2.csv', na_values = missing_values)

features = data[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']]

target = data[['Outcome']]

Next, let’s make sure our data is as clean as we can get it by handling null values. That is, if a field is empty, what should we do with it? In the below, I have made the assumption that the mean of the column makes sense in most cases. You will need to use your domain expertise to make a decision around the correct imputation technique for your use-case.

#Handle null values in each column
#---------------------------------
#fillna returns a new object, so the result must be assigned back or nothing changes
features = features.fillna({'Pregnancies': 0}) #we assume empty means 0 pregnancies

#for the remaining columns we assume the column mean is the right measure
mean_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                'BMI', 'DiabetesPedigreeFunction', 'Age']
features = features.fillna(features[mean_columns].mean())
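It is worth confirming that the imputation actually took effect, because fillna returns a new object rather than changing the column in place. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from the Kaggle file) to show the difference between calling fillna and assigning its result.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real features (values are made up)
toy = pd.DataFrame({
    'Pregnancies': [1, np.nan, 3],
    'Glucose': [100.0, np.nan, 140.0],
})

# Calling fillna without assigning the result leaves the data unchanged
toy['Glucose'].fillna(toy['Glucose'].mean())
nulls_before = int(toy.isna().sum().sum())

# Assigning the result back is what actually fills the gaps
toy = toy.fillna({'Pregnancies': 0, 'Glucose': toy['Glucose'].mean()})
nulls_after = int(toy.isna().sum().sum())

print(nulls_before, nulls_after)
```

Running the same isna().sum() check on the real features DataFrame is a cheap way to catch this kind of silent no-op.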

Next, we need to check if we have a class imbalance. A class imbalance is where you have a much larger proportion of one class than another. This can lead to a biased model because it will have very few samples of the minority class to learn from & hence will lean towards predicting the majority class. In our example below, we have a reasonable split between the two classes.

#check for class imbalance
#---------------------------------
print(target[['Outcome']].value_counts())
'''
0          500
1          268'''
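Raw counts are easier to judge as proportions. As a small sketch, the snippet below rebuilds a Series with the same class counts printed above (the Series itself is a stand-in, not the real data) and asks value_counts for shares instead of counts.

```python
import pandas as pd

# Rebuild the class counts printed above: 500 negatives, 268 positives
outcome = pd.Series([0] * 500 + [1] * 268, name='Outcome')

# normalize=True returns proportions instead of raw counts
proportions = outcome.value_counts(normalize=True)
print(proportions.round(3))
```

On the real data the equivalent call would be target['Outcome'].value_counts(normalize=True).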

Next, we split out our testing and training data and again check for a class imbalance on the output. It’s always possible that, in splitting the data, most of your minority class ends up in one of the two sets, which would skew your model or its evaluation. In this instance the 0 class is 64% of samples in the test set and 65% of the training set. So, it’s nicely split.

#Split testing and training data
#---------------------------------
train_features, test_features, train_labels, test_labels = train_test_split(features, target, test_size = 0.2, random_state = 42)
print(len(train_features))
print(len(test_features))

#check for class imbalance in split data
#---------------------------------
print(test_labels[['Outcome']].value_counts())
'''
0          99
1          55
'''

print(train_labels[['Outcome']].value_counts())
'''
0          401
1          213'''
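If a split did come out lopsided, one option (not used in this article) is the stratify argument of train_test_split, which keeps the class proportions the same in both sets. The sketch below uses synthetic arrays with the same class counts as our data. The variable names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the article's data
X = np.arange(768).reshape(-1, 1)
y = np.array([0] * 500 + [1] * 268)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

train_share = (y_train == 0).mean()
test_share = (y_test == 0).mean()
print(round(train_share, 3), round(test_share, 3))
```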

Now to the interesting bit. We train a model without setting any parameters, so scikit-learn’s defaults are used. The idea is that we get a baseline for performance & can look at the most influential features.

#train the initial model - no parameters
#---------------------------------
clf = RandomForestClassifier()
#labels are passed as a 1-D array to avoid scikit-learn's DataConversionWarning
clf = clf.fit(train_features, train_labels.values.ravel())

predictions = clf.predict(test_features)

print(confusion_matrix(test_labels, predictions))
'''
[[79 20]
 [21 34]]
'''

print ("Accuracy", accuracy_score(test_labels,predictions))
'''
Accuracy 0.7337662337662337
'''
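Accuracy alone hides how the errors are distributed, so it can help to read precision and recall off the confusion matrix. The arithmetic below is a worked example using the baseline matrix printed above. scikit-learn puts actual classes in rows and predicted classes in columns.

```python
import numpy as np

# Confusion matrix from the baseline run above
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = np.array([[79, 20],
               [21, 34]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

The already imported classification_report(test_labels, predictions) reports the same figures for both classes in one call.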

Drilling down on the feature importance, we can look at which features matter the most to the model. This must be interpreted carefully. The feature importances do not indicate a direct dependence between the predictor and the target and they must not be interpreted as correlations. They are however useful when it comes to feature selection.

#what features are important in your model?
#---------------------------------
importance = clf.feature_importances_
print(importance)

#make it readable with names of features
feat_importances = pd.Series(clf.feature_importances_, index= features.columns)
feat_importances.nlargest(20).plot(kind='barh', figsize=(20,10))
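For a text version of the same ranking, the importances can be sorted rather than plotted. This is a minimal sketch on synthetic data from make_classification, and the feature names f0 to f3 are hypothetical placeholders. The importances are normalised, so they sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the feature names below are hypothetical placeholders
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
names = ['f0', 'f1', 'f2', 'f3']

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort, largest first
ranked = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)
print(ranked)
print(ranked.sum())
```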


Let’s tune our model and see what we can do. Below, I have described some of the key hyperparameters that we can tune.

PARAMETER: DESCRIPTION

n_estimators: The number of trees you want to include in your forest. The trees essentially vote & the majority wins. More trees generally give more accurate and more stable predictions when released into the wild, but the gains flatten out while training and prediction keep getting slower. The key is to use as many trees as your processor and time budget can handle.

Min Leaf Size: The leaf is the end node of the decision tree. A smaller leaf makes the tree more prone to capturing noise in training data & overfitting. Keeping the leaf min size at around 50 observations should help remove some of the overfitting.

Bootstrap: Bootstrap is all about the sampling methodology. Each tree is trained on a sample drawn from the training dataset. With bootstrapping, a data point is put back after it is drawn rather than removed, so the same data point can appear more than once in a sample and in the samples for several trees. This means each tree sees a slightly different distribution of the data. If bootstrap is switched off, every tree is trained on the whole training dataset.

Max Depth: The maximum number of levels in each decision tree. The deeper the tree, the more detail it captures about the training data. But, if the tree is too deep, it’ll overfit & start to sacrifice accuracy on new data.

Min Samples Leaf: The minimum number of data points that must remain on a leaf node; this is scikit-learn’s name (min_samples_leaf) for the min leaf size above. A node is what we call each bubble of a decision tree.

Min Samples Split: The minimum number of data points a node must hold before it is allowed to be split. Tomorrow’s article will cover decision trees / random forests in more theoretical detail.
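As a rough guide to how these descriptions map onto scikit-learn, the sketch below passes each one as a keyword argument to RandomForestClassifier. The values are illustrative only, not recommendations or the tuned values found later.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only, not recommendations
example = RandomForestClassifier(
    n_estimators=500,       # number of trees in the forest
    max_depth=10,           # maximum number of levels per tree
    min_samples_split=5,    # samples a node needs before it may be split
    min_samples_leaf=2,     # samples every leaf must keep after a split
    bootstrap=True,         # sample rows with replacement for each tree
)

params = example.get_params()
print({k: params[k] for k in ['n_estimators', 'max_depth', 'min_samples_split',
                              'min_samples_leaf', 'bootstrap']})
```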


To do this, we will use RandomizedSearchCV. I have described it here. But essentially, you define a list of parameter values you would like to try. Randomized Search then randomly samples your list (where the number of samples it takes is defined by n_iter) and finds the best model. Remember, the CV bit stands for cross validation, which I have discussed here.

Cross validation splits our data into a chosen number of blocks (folds). In four-fold cross validation, the data would be split into four chunks. It then runs 4 times across the whole dataset, each time holding out a different chunk as the test chunk and training on the other three.
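The rotation of chunks can be seen with scikit-learn's KFold on a toy array. This is only a sketch with eight made-up sample indices. RandomizedSearchCV handles the splitting for us when we pass cv.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(8)  # eight toy samples, indexed 0 to 7
kfold = KFold(n_splits=4)

test_chunks = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(samples), start=1):
    test_chunks.append(test_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each sample lands in the test chunk exactly once and in the training chunks the other three times.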


So, together, we have a randomized search, which will try different parameters on different chunks of our data (defined by cross validation).

In the below I have defined a number of parameters and passed them into the RandomizedSearch. It’s going to take 100 samples of parameters from our list and it’s going to use 3 fold cross validation. From this, it returned the best model of: {‘n_estimators’: 1200, ‘min_samples_split’: 10, ‘min_samples_leaf’: 1, ‘max_features’: ‘sqrt’, ‘max_depth’: 118, ‘bootstrap’: False}.

#let's tune the model
#---------------------------------

#generate a list of 10 integers between 200 and 2000. 
n_estimators = np.linspace(start = 200, stop = 2000, num = 10, dtype = int)

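#'auto' was removed in scikit-learn 1.3 (for classifiers it meant 'sqrt'), so 'log2' is tried instead
max_features = ['sqrt', 'log2']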

max_depth = np.linspace(start = 10, stop = 150, num = 10, dtype = int) 

min_samples_split = [2, 5, 10]

min_samples_leaf = [1, 2, 4]

bootstrap = [True, False]


parameters = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

tuned = RandomizedSearchCV(estimator = clf, param_distributions=parameters, n_iter = 100, cv = 3, random_state = 43, n_jobs = -1)

tuned.fit(train_features, train_labels.values.ravel())

print(tuned.best_params_)
'''
{'n_estimators': 1200, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 118, 'bootstrap': False}
'''

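We can now run our model again, using these tuned parameters. You can see in the below that the model accuracy moved from 0.733 to 0.740. That is one extra correct prediction out of 154 test samples, so it’s not huge and may simply reflect the randomness in how the forest is built, but it’s something.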

#train the revised model 
#---------------------------------
clf = RandomForestClassifier(n_estimators= 1200, min_samples_split= 10, min_samples_leaf= 1, max_features = 'sqrt', 
                             max_depth= 118, bootstrap= False)
clf = clf.fit(train_features, train_labels.values.ravel())

predictions = clf.predict(test_features)

print(confusion_matrix(test_labels, predictions))
'''
[[79 20]
 [20 35]]
'''

print ("Accuracy", accuracy_score(test_labels,predictions))
'''
Accuracy 0.7402597402597403
'''
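Retyping the best parameters by hand works, but it is easy to mistype one. An alternative, sketched here on a small synthetic dataset with a deliberately tiny search space (both are assumptions made so the example runs quickly), is to take best_estimator_ from the search object, which is already refitted on the full training data when refit is left at its default of True.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Small synthetic dataset so the sketch runs quickly
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={'n_estimators': [20, 50], 'min_samples_leaf': [1, 2, 4]},
    n_iter=4, cv=3, random_state=43)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already refitted on all of X_train
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_, round(search.best_score_, 3), round(test_accuracy, 3))
```

Here best_score_ is the mean cross-validated accuracy, which is a separate number from the accuracy on the held-out test set.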

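We can also check whether GridSearchCV gives a better result. Instead of sampling the parameter grid, it tries every possible combination, so it can take far longer: the grid above has 3,600 combinations, which means 10,800 model fits with 3-fold cross validation. We define it in the same way: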

parameters = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

tuned = GridSearchCV(estimator = clf, param_grid=parameters, cv = 3, n_jobs = -1)

tuned.fit(train_features, train_labels.values.ravel())

print(tuned.best_params_)
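Before launching a full grid search it is worth counting how many models it will train. The sketch below rebuilds a grid of the same shape as the one above and counts the combinations with scikit-learn's ParameterGrid. Only the number of options per parameter matters here, so the max_features entries are just two placeholder options.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same shape as the search space in the article
parameters = {
    'n_estimators': np.linspace(start=200, stop=2000, num=10, dtype=int).tolist(),
    'max_features': ['sqrt', 'log2'],  # two options, as in the article
    'max_depth': np.linspace(start=10, stop=150, num=10, dtype=int).tolist(),
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

n_combinations = len(ParameterGrid(parameters))
total_fits = n_combinations * 3  # 3-fold cross validation

print(n_combinations, total_fits)
```

By comparison, the randomized search with n_iter = 100 and cv = 3 trains 300 models.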

The next steps to improving the quality of your model are to look into feature selection and further tuning, discussed here. Once you are happy with your model you can pickle it (to store the fitted model, including its configuration) and use it in your API calls, described here.

#the with block closes the file once the model has been written
with open('/home/pickled_model.pickle', 'wb') as model_file:
    pickle.dump(clf, model_file)
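Loading the model back is the mirror image of saving it. The round trip below is a sketch on synthetic data. It writes to a temporary directory (a hypothetical location chosen for this example, not the path used above) and checks that the restored model predicts the same labels.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Hypothetical path in a temporary directory, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), 'pickled_model.pickle')

with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code. It is also safest to load the model with the same scikit-learn version that saved it.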

In upcoming articles, we will look at how other types of models may provide better results and look into the reasons why that may be.

Kodey