Ensemble modelling to improve your model performance

In my last article, I spoke about auto-sklearn. I said that the library trains several models and then uses them in conjunction with one another to make a final prediction. This is what we call ensemble modelling (the techniques are sometimes called meta-algorithms). It’s the process of combining multiple models in the prediction process with the goal of improving prediction accuracy.

If we think about an example, we may need to predict whether someone has diabetes using our input dataset. We could choose a single model (e.g. random forest) and get good levels of accuracy. But we may also want to test ensemble learning to see how much we can improve our predictions. In this case, we might have something like the following (the code here happens to use a credit card dataset with a binary Class column, but the approach is the same):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('/home/Datasets/creditcard.csv')

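output = df['Class']
# Drop the label column so it cannot leak into the features
features = df.drop(columns='Class')

# Split the features (not the full dataframe, which still contains the label)
train_features, test_features, train_labels, test_labels = train_test_split(features, output, test_size = 0.2, random_state = 42)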

models = (RandomForestClassifier(), LogisticRegression())

preds = [] 

for model in models:
    print(model)
    # Fit each model on the training data and predict the held-out rows
    model.fit(train_features, train_labels)
    prediction = model.predict(test_features)
    preds.append(list(prediction))
    
x = pd.DataFrame(preds).T
x.columns = ['pred1', 'pred2']

x['same'] = x['pred1'] == x['pred2']
x['same'].value_counts()

Now we have two predictions for each observation. The value counts from my run show that the two models agree on most rows, but not all of them. So, we need some logic to determine what the final output should be.

True     56883
False       79

What are our options? With a binary classifier and an odd number of models in the ensemble, we can simply take the majority vote (i.e., if most models assign a given class, we take that class as the prediction). Below, I’ve added a third model into the mix and applied some logic which takes the majority (if class 1 has been predicted at least twice, it’s the majority; otherwise class 0 is). In this case, we are assuming all models should be given the same weight (or significance). You can adjust the weights too in order to make the best prediction.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('/home/Datasets/creditcard.csv')

output = df['Class']
# Drop the label column so it cannot leak into the features
features = df.drop(columns='Class')

# Split the features (not the full dataframe, which still contains the label)
train_features, test_features, train_labels, test_labels = train_test_split(features, output, test_size = 0.2, random_state = 42)

# The third model here is a second random forest; any other classifier could be swapped in
models = [RandomForestClassifier(), LogisticRegression(), RandomForestClassifier()]

preds = [] 

for model in models:
    print(model)
    # Fit each model on the training data and predict the held-out rows
    model.fit(train_features, train_labels)
    prediction = model.predict(test_features)
    preds.append(list(prediction))
    
x = pd.DataFrame(preds).T
x.columns = ['pred1', 'pred2', 'pred3']

def find_majority(row):
    first = row['pred1']
    second = row['pred2']
    third = row['pred3']
    
    if first + second + third >= 2:
        return 'class 1'
    else:
        return 'class 0'

x['winner'] = x.apply(find_majority, axis = 1)
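
The majority rule above can be generalised to any number of models and to unequal weights. The following is a minimal sketch rather than a definitive implementation: the function name weighted_vote, the toy votes array and the weights are all made up for illustration, and it assumes the labels are coded as 0 and 1.

```python
import numpy as np

def weighted_vote(preds, weights=None):
    """Combine 0/1 predictions from several models into one label per row.

    preds   : array of shape (n_samples, n_models) holding 0/1 labels
    weights : one non-negative weight per model (defaults to equal weights)
    """
    preds = np.asarray(preds)
    if weights is None:
        weights = np.ones(preds.shape[1])
    weights = np.asarray(weights, dtype=float)
    # Share of the total weight that voted for class 1, per row
    share_for_one = preds @ weights / weights.sum()
    # Class 1 wins only with strictly more than half of the weight
    return (share_for_one > 0.5).astype(int)

# Hypothetical votes: one row per observation, one column per model
votes = np.array([
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
])

equal = weighted_vote(votes)                         # every model counts the same
trusted_first = weighted_vote(votes, weights=[3, 1, 1])  # first model counts triple
print(equal, trusted_first)
```

With equal weights, the third row goes to class 0 because only one model voted for class 1. Once the first model is trusted three times as much as the others, its single vote is enough to flip that row to class 1.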

We could alternatively work with the predicted class probabilities (as below). Once we have the probabilities, we have a few options:

  • Take a weighted average of the predicted probabilities and evaluate the result against a class threshold
  • Take the mean of the predicted probabilities and evaluate the result against a class threshold
  • Take each individual model’s predicted class probabilities, assess them against a threshold and assign a class as a result. Then, use majority voting to determine which should be the final predicted class.

In the example below, the code has been adjusted to calculate the class probabilities and to extract the probability of belonging to class 1 as the variable ‘probs’ (the predict_proba function returns two probabilities, one for each class).

preds = []  # start from an empty list so earlier class predictions are not mixed in

for model in models:
    print(model)
    model.fit(train_features, train_labels)
    # predict_proba returns one column per class; column 1 is P(class 1)
    prediction = model.predict_proba(test_features)
    probs = list(prediction[:, 1])
    preds.append(probs)

x = pd.DataFrame(preds).T
x.columns = ['pred1', 'pred2', 'pred3']

If we look at the output of x, we see three columns of class probabilities. From here, we can apply one of the above options.

In the example below, I have simply taken the mean of the three probabilities and then assessed whether that mean is at least my threshold of 0.4.

def decide_winner(row):
    one = row['pred1']
    two = row['pred2']
    three = row['pred3']
    # Compare the unrounded mean so values just under 0.4 are not rounded up across the threshold
    calc = (one + two + three) / 3

    if calc >= 0.4:
        return 1
    else:
        return 0

x['winner'] = x.apply(decide_winner, axis=1)
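
To make the difference between the three options concrete, here is a small sketch that applies each of them to the same set of probabilities. The probabilities, the 0.5 threshold and the model weights are all invented for illustration; in practice you would use the columns of x and tune the threshold and weights for your problem.

```python
import numpy as np

# Hypothetical class-1 probabilities: one row per observation, one column per model
probs = np.array([
    [0.90, 0.80, 0.30],
    [0.20, 0.45, 0.95],
    [0.55, 0.60, 0.05],
])
threshold = 0.5                      # assumed cut-off
weights = np.array([0.5, 0.3, 0.2])  # assumed model weights, summing to 1

# Option 1: weighted average of the probabilities, then threshold
weighted_avg = probs @ weights
weighted_class = (weighted_avg >= threshold).astype(int)

# Option 2: plain mean of the probabilities, then threshold
mean_class = (probs.mean(axis=1) >= threshold).astype(int)

# Option 3: threshold each model separately, then take the majority
per_model_class = (probs >= threshold).astype(int)
majority_class = (per_model_class.sum(axis=1) > probs.shape[1] / 2).astype(int)

print(weighted_class, mean_class, majority_class)
```

The three options agree on the first row but disagree on the other two. One very confident model can pull the plain mean over the threshold (second row), while two lukewarm models can win a majority vote even though the mean stays below it (third row). Which behaviour you want depends on how much you trust each model’s probabilities.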

In a regression problem, we could take the average of all the predictions.
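
As a rough sketch of what that could look like, the snippet below averages the predictions of two regressors. It uses synthetic data from scikit-learn’s make_regression as a stand-in for a real dataset, and the choice of a linear regression and a random forest is only an assumption for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real regression dataset
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = [LinearRegression(), RandomForestRegressor(n_estimators=50, random_state=42)]

# One column of predictions per model
preds = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in models])

# The ensemble prediction is the row-wise mean of the model predictions
ensemble_pred = preds.mean(axis=1)
print(ensemble_pred[:5])
```

If some models deserve more say than others, np.average(preds, axis=1, weights=...) gives a weighted mean instead.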

When it comes to ensemble learning, we don’t have to use different model types. We could simply take several instances of the same model type (e.g. random forest) and train each against different training data. We would then use one of the options above to decide which prediction to take forward.
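
This idea is commonly known as bagging (bootstrap aggregating). Below is a minimal sketch of it, assuming each model is a decision tree trained on a bootstrap sample (rows drawn with replacement) and that the final class is decided by majority vote. The synthetic data and the choice of five trees are only for illustration; scikit-learn also ships a BaggingClassifier that wraps this pattern.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data stands in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_models = 5  # odd, so a binary majority vote cannot tie
preds = []

for _ in range(n_models):
    # Bootstrap sample: draw training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    preds.append(tree.predict(X_test))

votes = np.column_stack(preds)  # shape: (n_test_rows, n_models)
bagged_pred = (votes.sum(axis=1) > n_models / 2).astype(int)
print(bagged_pred[:10])
```

Every tree uses the same settings, so the only thing that makes them differ is the sample of rows each one was trained on.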

The above is one way we can look into ensemble learning. We can also use a method called stacking (also known as stacked generalization, and closely related to blending). This is where we train a number of base models (level 0 models). The predictions of those models are then fed into a level 1 model as features. Effectively, we make the final prediction from the base models’ predictions.

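In the example above, we have an output dataframe called x. A dataframe like this can then be used as the input features for the stacked model. One caveat: x above holds predictions for the test rows, so to train the level 1 model we need the same columns built from predictions on the training rows (ideally predictions made on rows each base model did not see during fitting), paired with train_labels. Below, x_train is a hypothetical name for that dataframe.

# x_train: base model predictions for the training rows, built the same way as x
model = LogisticRegression()
model.fit(x_train, train_labels)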
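
Here is a fuller sketch of stacking from start to finish. It is one way to do it, not the only one. It assumes synthetic data, three base models chosen just for the example, and out-of-fold probabilities from cross_val_predict as the level 1 training features, so that the level 1 model never sees a prediction made by a base model that was fitted on that same row.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data stands in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
]

# Level 0: out-of-fold class-1 probabilities for the training rows
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on all training rows, then score the test rows
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])

# Level 1: the meta-model learns from the base-model outputs
meta_model = LogisticRegression()
meta_model.fit(train_meta, y_train)
stacked_pred = meta_model.predict(test_meta)
print(stacked_pred[:10])
```

scikit-learn also provides a StackingClassifier in sklearn.ensemble that handles the cross-validated level 0 predictions for you, which may be the simpler route once the idea is clear.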

Kodey