How To Implement A Random Forest Machine Learning Model On An Imbalanced Dataset

For this article, I wanted to demonstrate an end-to-end model implementation, including tuning for an imbalanced dataset.

First off, let’s ingest our data:

import pandas as pd
df = pd.read_csv('/home/Datasets/cc_fraud.csv')

Now, we can start to explore our dataset. First off, are there any null values that we need to worry about?

df.isnull().sum()

There are no nulls, a rare pleasure! Let’s now look at the shape of the data, limiting the view to only the numeric columns.

import seaborn as sns
df2 = df[['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest', 'isFraud']]
sns.pairplot(df2)

This outputs one of my favourite tools – a pairplot. This shows us:

  1. Some reasonable positive correlations between amount and the oldbalanceDest and newbalanceDest fields.
  2. Some extremely strong and even perfect correlations between the new and old balance fields.

All of this information is generally good to know about our dataset. That, along with understanding the spread of the data, is extremely useful. We can summarise the correlations in the below matrix, which is a nice way to view the relationships between fields.

# Use the numeric-only frame: recent pandas versions raise an error if corr() meets text columns
sns.heatmap(df2.corr(), annot=True)

Now, we normalize our data – we divide each numeric field by its maximum, which brings all values into the range between zero and one. As you may remember from previous posts, this is to reduce model bias induced by fields of significantly different magnitudes. We also encode the categorical column.

output = df[['isFraud']]
df = df[['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']]
df = pd.get_dummies(df, columns=['type'])
# Scale each numeric column by its own maximum so values fall between zero and one
for col in ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']:
    df[col] = df[col] / df[col].max()
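
The encoding and max-scaling steps can be tried out on a tiny frame before running them on millions of rows. This is only a sketch: the three rows and the `type` values below are invented for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; every value here is made up.
toy = pd.DataFrame({
    'type': ['TRANSFER', 'CASH_OUT', 'PAYMENT'],
    'amount': [50.0, 200.0, 100.0],
    'oldbalanceOrg': [50.0, 400.0, 1000.0],
})

# One-hot encode the categorical column into type_* indicator columns.
toy = pd.get_dummies(toy, columns=['type'])

# Divide each numeric column by its own maximum.
# This assumes the columns are non-negative and not all zero.
numeric_cols = ['amount', 'oldbalanceOrg']
toy[numeric_cols] = toy[numeric_cols] / toy[numeric_cols].max()

print(toy)
```

After this, the largest value in each scaled column is exactly 1.0 and the `type` column has been replaced by one indicator column per category.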

Next, we convert those dataframes into Numpy arrays, ready for ingestion by the model:

import numpy as np
df = np.array(df)
output = np.array(output).ravel()

Next, we split our testing and training data:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(df, output, test_size = 0.2, random_state = 42)
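
With so few positive cases, it can be worth asking the split to keep the class ratio identical in both halves. The sketch below is not what the code above does; it shows the optional `stratify` argument of `train_test_split` on synthetic labels (1,000 rows with 2% positives, numbers chosen only to keep the example small).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows, 20 of them positive.
features = np.arange(1000).reshape(-1, 1)
labels = np.array([1] * 20 + [0] * 980)

# stratify=labels keeps the share of positives the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print('train positive rate:', y_train.mean())
print('test positive rate:', y_test.mean())
```

Without stratification the split is purely random, so on a very rare class the test share can drift a little from the overall share.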

Finally, we fit the model, make some predictions and output our accuracy scores:

model = RandomForestClassifier(n_estimators = 20)
model.fit(train_features, train_labels)
predictions = model.predict(test_features)
from sklearn import metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# For 0/1 labels, MAE and MSE are both simply the share of misclassified rows
print('Mean Absolute Error:', metrics.mean_absolute_error(test_labels, predictions))
print('Mean Squared Error:', metrics.mean_squared_error(test_labels, predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(test_labels, predictions)))
print('Accuracy:', accuracy_score(test_labels, predictions))
print(classification_report(test_labels, predictions))
print(confusion_matrix(test_labels, predictions))

In the above, you’ll notice that we have an accuracy of 0.9996 or 99.96%, which seems unbelievable.

If you look at this dataset, you will notice that there is an extreme imbalance. Of the 6.3 million rows, only about 8,200 are fraudulent cases. That means only around 0.13% of our samples are fraudulent. So, even if our model were to predict that the transaction was not fraudulent every time, you would still get about 99.87% accuracy.

In our test set, 1,620 records are fraudulent and 1,270,904 are not, out of 1,272,524 records in total. That is also about 0.13%.
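
The arithmetic behind that baseline is worth writing out. The sketch below uses the 1,620 fraudulent test rows quoted above and the 1,270,865 + 39 = 1,270,904 non-fraudulent test rows that appear in the confusion matrix further down.

```python
# Test-set counts taken from the article's own figures.
fraud = 1620
legit = 1270865 + 39
total = fraud + legit

# Share of the test set that is fraudulent.
fraud_rate = fraud / total

# Accuracy of a "model" that labels every transaction as not fraudulent:
# it gets every legitimate row right and every fraudulent row wrong.
baseline_accuracy = legit / total

print(f'total test rows: {total}')
print(f'fraud rate: {fraud_rate:.2%}')
print(f'always-not-fraud accuracy: {baseline_accuracy:.2%}')
```

Any accuracy figure for this dataset has to be read against that baseline rather than against 50%.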

Precision in the above tells us how many of the positive identifications were actually correct, while recall tells us how many actual positives were correctly identified; confusing, yes.
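
To make the two definitions concrete, here is a small sketch that computes them for the fraud class. The counts are read off the confusion matrix shown in this article, assuming scikit-learn's layout where rows are actual classes and columns are predicted classes; the helper function names are my own.

```python
def precision(tp, fp):
    # Of everything we flagged as fraud, how much really was fraud?
    return tp / (tp + fp)


def recall(tp, fn):
    # Of everything that really was fraud, how much did we flag?
    return tp / (tp + fn)


# Fraud-class counts: correctly flagged, wrongly flagged, and missed.
tp, fp, fn = 1248, 39, 372

fraud_precision = precision(tp, fp)
fraud_recall = recall(tp, fn)

print(f'precision: {fraud_precision:.2%}')
print(f'recall: {fraud_recall:.2%}')
```

High precision with much lower recall is the typical signature of a model that rarely raises false alarms but misses a fair share of the rare class.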

Let’s look – 100% of actual non-fraudulent transactions were correctly identified as such, while only 77% of fraudulent transactions were. Which makes sense – we got 99.96% accuracy, while blindly assigning a status of not fraudulent to every transaction would already have scored almost as high. So, we have a small delta, which comes from the fraudulent transactions that were correctly identified as fraudulent, less the few legitimate ones that were wrongly flagged.

from sklearn.metrics import confusion_matrix
confusion_matrix(test_labels, predictions)

If we now look at the confusion matrix, we have roughly 1.27M records where we predicted the transaction was not fraudulent and we were correct, and 39 cases where we incorrectly stated that a legitimate transaction was fraudulent. We then have 372 cases where we incorrectly stated that a fraudulent transaction was not fraudulent, and 1,248 where we correctly stated the transaction was fraudulent. So, in total, we have a success rate of (1270865+1248)/(1270865+39+372+1248), which is the accuracy of just under 99.97% that we saw above.

Remember, scikit-learn lays out a binary confusion matrix as [[TN, FP], [FN, TP]]: rows are the actual classes and columns are the predicted classes.
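
The layout is easy to get wrong, so it helps to check it on a toy example where the answer is known. For labels 0 and 1, scikit-learn's `confusion_matrix` puts actual classes on the rows and predicted classes on the columns, so flattening it yields TN, FP, FN, TP in that order. The ten labels below are invented for the check.

```python
from sklearn.metrics import confusion_matrix

# Five actual negatives followed by five actual positives (made-up labels).
actual = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]

matrix = confusion_matrix(actual, predicted)

# ravel() flattens row by row: [[TN, FP], [FN, TP]] -> TN, FP, FN, TP
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
```

Here one negative is wrongly flagged and two positives are missed, which is exactly what the top-right and bottom-left cells report.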

So what does this tell us? It tells us that the majority of the error is caused by too many predictions of ‘not fraud’ where it should have been ‘fraud’. This could be due, in part, to the imbalanced dataset. We could then manually limit our dataset to 8,000 rows of fraudulent transactions and 8,000 of non-fraudulent ones to test the theory.

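# df was turned into a NumPy array above, so reload the raw data for this step
raw = pd.read_csv('/home/Datasets/cc_fraud.csv')
is_fraud = raw.loc[raw['isFraud'] == 1].head(8000)
is_not_fraud = raw.loc[raw['isFraud'] == 0].head(8000)
new_model_df = pd.concat([is_fraud, is_not_fraud])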
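
One caveat with `.head(8000)` is that it takes the first rows in file order, which may not be representative if the file is sorted in some way. A possible alternative, sketched here on an invented 20-row frame rather than the real data, is to draw the majority class at random and then shuffle the result.

```python
import pandas as pd

# Toy frame: 4 fraud rows and 16 legitimate rows (values invented).
toy = pd.DataFrame({
    'amount': range(20),
    'isFraud': [1] * 4 + [0] * 16,
})

fraud_rows = toy[toy['isFraud'] == 1]
n_fraud = len(fraud_rows)

# Randomly draw as many legitimate rows as there are fraud rows.
legit_rows = toy[toy['isFraud'] == 0].sample(n=n_fraud, random_state=42)

# Combine and shuffle so the classes are not grouped together.
balanced = pd.concat([fraud_rows, legit_rows]).sample(frac=1, random_state=42)

print(balanced['isFraud'].value_counts())
```

Fixing `random_state` keeps the sample reproducible between runs.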

If we look at the total number of mis-classified fraudulent transactions as a percentage of the total fraudulent transactions in the original, we would take (372/(372+1248))*100, which gives us a 22.96% failure rate for fraudulent transactions.

Using the balanced dataset, I get the below confusion matrix; which takes the failure rate for fraudulent transactions to (7/(1+1597))*100; which is 0.43%. So, just by balancing our dataset, we’ve taken our fraudulent transaction failure rate from 23% down to less than 1% – that’s a great success.

The interesting thing is that the overall model accuracy has dropped slightly, to 99.5% from 99.9%. In normal circumstances, I would look to tune parameters at this point, but with such a dataset it probably is not worth it.

Dude, there is no way you’re getting a 99% accuracy here…

Yeah, you’re right. We do get 99% accuracy, but it’s because there is a rule in our dataset that determines perfectly if the transaction is fraud. That is, if the transaction amount is equal to the original balance (i.e. the bank account is being emptied).

Take a look at the below:

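# df was normalised and turned into a NumPy array above, so reload the raw data
raw = pd.read_csv('/home/Datasets/cc_fraud.csv')

def is_all_money(row):
    # Flag transactions that move the entire original balance
    if row['amount'] == row['oldbalanceOrg']:
        return 'EMPTIED'
    else:
        return 'NOT EMPTIED'

raw['flag'] = raw.apply(is_all_money, axis=1)
raw.groupby(['isFraud', 'flag']).size()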

And the output supports this claim that the rule gives the model 99% accuracy.
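
One way to quantify how much a single rule like this explains is to treat it as a classifier in its own right and score it. The sketch below does that on six invented rows (the values are made up and do not come from the dataset), so the resulting figure only illustrates the method.

```python
import pandas as pd

# Invented rows: three frauds that empty the account, one legitimate
# transaction that also empties it, and two ordinary transactions.
toy = pd.DataFrame({
    'amount':        [100.0, 250.0, 80.0, 40.0, 500.0, 60.0],
    'oldbalanceOrg': [100.0, 250.0, 300.0, 900.0, 500.0, 60.0],
    'isFraud':       [1, 1, 0, 0, 1, 0],
})

# Vectorised version of the rule: did the transaction take the full balance?
emptied = toy['amount'] == toy['oldbalanceOrg']
toy['flag'] = emptied.map({True: 'EMPTIED', False: 'NOT EMPTIED'})

# Cross-tabulate the rule against the true label.
table = pd.crosstab(toy['isFraud'], toy['flag'])
print(table)

# Accuracy of predicting "fraud" whenever the account is emptied.
rule_accuracy = (emptied == (toy['isFraud'] == 1)).mean()
print('rule accuracy:', round(rule_accuracy, 4))
```

If the rule alone scores close to the model on the real data, the model has most likely learned little beyond that rule.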

So, what do we do? Well, we can drop those fields, which gives us a more realistic accuracy by removing the link between the value of the transaction and whether it is fraudulent. Of course, we have another link: if the new balance of the originating account is 0.00, that’s exactly the same thing as asking ‘did the bank account get emptied in the transaction’, so there is still a really strong link there.

It may be the case that these are legitimate rulesets that we are finding, and it may be the case that they’re not and won’t stand up to scrutiny on new data. The best thing to do is to test the model against a larger test dataset.

Kodey