ML Series: A Random Forest Implementation With An Imbalanced Dataset

For this article, I wanted to demonstrate an end-to-end model implementation, including tuning, for a dataset. When I chose the dataset, I did not realise that the initial, untuned implementation would achieve over 99% accuracy; as you can see, many people on Kaggle are reporting similar accuracies too! So tuning it is probably not necessary. However, let’s go through it anyway, as it can still provide some valuable insight.

First off, let’s ingest our data:

import pandas as pd
df = pd.read_csv('/home/Datasets/cc_fraud.csv')

Now, we can start to explore our dataset. First off, are there any null values that we need to worry about?
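One way to answer that question is to count the missing values per column. The snippet below is a minimal sketch on a small toy frame: the values are made up and only the column names mirror the dataset. On the real data, the same `isnull().sum()` call would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only
toy_df = pd.DataFrame({
    'amount': [100.0, 250.5, 75.0],
    'oldbalanceOrg': [1000.0, 250.5, 0.0],
    'isFraud': [0, 1, 0],
})

# Count the missing values in each column
null_counts = toy_df.isnull().sum()
print(null_counts)

# A single overall answer: does any column contain a null?
has_nulls = bool(null_counts.sum() > 0)
print('Any nulls?', has_nulls)
```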


There are no nulls, a rare pleasure! Let’s now look at the shape of the data, limiting ourselves to the numeric columns.

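import seaborn as sns
# Keep only the numeric amount and balance columns, plus the isFraud label
df2 = df[['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest', 'isFraud']]
# Plot every pair of columns against each other
sns.pairplot(df2)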

This outputs one of my favourite tools – a pairplot. This shows us:

  1. Some reasonable positive correlations between amount and the oldbalanceDest and newbalanceDest fields.
  2. Some extremely strong, and even perfect, correlations between the new and old balance fields.

All of this information is generally good to know about our dataset. That, along with understanding the spread of the data, is extremely useful. We can summarise the correlations in the matrix below, which is a nice way to view the relationships between fields.

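# numeric_only=True skips text columns such as 'type', which recent pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True), annot=True)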

Now, we normalise our data: we divide each numeric column by its maximum, which brings all values into the range zero to one. As you remember from previous posts, this is to reduce model bias induced by fields of significantly different magnitudes. We also encode the categorical column.

# Separate the label column from the feature columns
output = df[['isFraud']]
df = df[['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']]
# One-hot encode the categorical transaction type
df = pd.get_dummies(df, columns=['type'])
# Scale each numeric column into [0, 1] by dividing by its maximum
for col in ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']:
    df[col] = df[col] / df[col].max()
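To make the two preprocessing steps concrete, here is a minimal sketch on three made-up rows. Only the column names follow the dataset; the values and the `toy` frame itself are assumptions for illustration.

```python
import pandas as pd

# Toy rows; the values are made up, only the column names follow the dataset
toy = pd.DataFrame({
    'type': ['PAYMENT', 'TRANSFER', 'CASH_OUT'],
    'amount': [10.0, 50.0, 100.0],
})

# One-hot encode the categorical column: one indicator column per category
encoded = pd.get_dummies(toy, columns=['type'])

# Scale the numeric column into [0, 1] by dividing by its maximum
encoded['amount'] = encoded['amount'] / encoded['amount'].max()

print(list(encoded.columns))
print(encoded['amount'].tolist())
```

Note that dividing by the maximum only lands in the zero-to-one range when the column has no negative values, which is a reasonable assumption for amounts and balances.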

Next, we convert those dataframes into NumPy arrays, ready for ingestion by the model:

import numpy as np
df = np.array(df)
output = np.array(output).ravel()

Next, we split our data into training and testing sets:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(df, output, test_size = 0.2, random_state = 42)
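With a dataset this large, a plain random split tends to keep the class ratio roughly intact. On smaller or more skewed data, a stratified split is one way to guarantee it. The sketch below is an optional variation rather than what was run for this article; the toy `features` and `labels` arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 rows with 10 positives (1%); values are illustrative only
features = np.arange(1000).reshape(-1, 1)
labels = np.zeros(1000, dtype=int)
labels[:10] = 1

# stratify=labels keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print('train positives:', int(y_train.sum()), 'of', len(y_train))
print('test positives:', int(y_test.sum()), 'of', len(y_test))
```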

Finally, we fit the model, make some predictions and output our accuracy scores:

from sklearn import metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Build a forest of 20 trees and fit it on the training data
model = RandomForestClassifier(n_estimators=20)
model.fit(train_features, train_labels)
# Predict labels for the held-out test set
predictions = model.predict(test_features)
print('Mean Absolute Error:', metrics.mean_absolute_error(test_labels, predictions))
print('Mean Squared Error:', metrics.mean_squared_error(test_labels, predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(test_labels, predictions)))
print('Accuracy', accuracy_score(test_labels, predictions))
# Per-class precision and recall
print(classification_report(test_labels, predictions))
print(confusion_matrix(test_labels, predictions))

In the above, you’ll notice that we have an accuracy of 0.9996 or 99.96%, which seems unbelievable.

If you look at this dataset, you will notice that it is extremely imbalanced. Of the 6.3 million rows, only about 8,200 are fraudulent cases. That means only roughly 0.13% of our samples are fraudulent. So, even if our model were to predict that the transaction was not fraudulent every time, it would still reach about 99.87% accuracy.

The test set shows the same imbalance: 1,620 fraudulent records out of 1,272,524 in total (the remaining 1,270,904 are not fraudulent). That is also roughly 0.13%.

Precision, as reported by classification_report, tells us how many of the positive identifications were actually correct, while recall tells us how many of the actual positives were correctly identified; confusing, yes.

Let’s look: effectively 100% of actual non-fraudulent transactions were correctly identified as such, while only 77% of fraudulent transactions were. Which makes sense: we got 99.96% accuracy, while blindly assigning a status of not fraudulent to every transaction would have got us about 99.87%. So, we have a delta, which comes from the fraudulent transactions that were correctly identified as fraudulent.
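The "always predict not fraudulent" baseline can be checked with a few lines of arithmetic. This is a small sketch using only counts quoted in this article; the test-set total is taken as the sum of the four confusion-matrix cells reported for this run.

```python
# Number of fraudulent rows in the test set, as quoted in the article
fraud_in_test = 1620
# Sum of the four confusion-matrix cells reported in the article
total_in_test = 1270865 + 39 + 372 + 1248

# Share of fraudulent rows in the test set
fraud_share = fraud_in_test / total_in_test

# Accuracy of a "model" that always answers "not fraudulent"
baseline_accuracy = 1 - fraud_share

print(f'Fraud share: {fraud_share:.3%}')
print(f'Always-not-fraud accuracy: {baseline_accuracy:.3%}')
```

This is why accuracy alone says very little on a dataset like this one: the baseline is already within a tenth of a percentage point of the trained model.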

from sklearn.metrics import confusion_matrix
confusion_matrix(test_labels, predictions)

Now let’s look at the confusion matrix. We have 1,270,865 records (about 1.27M) where we predicted not fraudulent and were correct, and 39 cases where we incorrectly flagged a legitimate transaction as fraudulent. We then have 372 cases where we incorrectly stated that a fraudulent transaction was not fraudulent, and 1,248 where we correctly stated that the transaction was fraudulent. So, in total, we have a success rate of (1270865+1248)/(1270865+39+372+1248), which gives us our 99.96% from above.

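Remember, for binary 0/1 labels, scikit-learn lays out the confusion matrix as [[TN, FP], [FN, TP]]: rows are the actual classes and columns are the predicted classes.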
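Since the cell order is easy to mix up, it can help to check it on a handful of labels where the answer is obvious. In scikit-learn, rows are the actual classes and columns the predicted ones, so a binary matrix reads [[TN, FP], [FN, TP]]. The toy `actual` and `predicted` lists below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: 1 = fraud, 0 = not fraud
actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes, both ordered [0, 1]
matrix = confusion_matrix(actual, predicted, labels=[0, 1])

# Flattening row by row yields TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in matrix.ravel())
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
```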

So what does this tell us? It tells us that the majority of the error is caused by too many predictions of ‘not fraud’ where it should have been ‘fraud’. This could be due, in part, to the imbalanced dataset. To test the theory, we could manually limit our dataset to 8,000 rows of fraudulent transactions and 8,000 of non-fraudulent ones.

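# Reload the raw data, because df was overwritten with a NumPy array of features earlier
raw_df = pd.read_csv('/home/Datasets/cc_fraud.csv')
is_fraud = raw_df.loc[raw_df['isFraud'] == 1].head(8000)
is_not_fraud = raw_df.loc[raw_df['isFraud'] == 0].head(8000)
new_model_df = pd.concat([is_fraud, is_not_fraud])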
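Taking rows with `head` picks them in file order, which may not be representative if the file happens to be sorted (by time, for example). A random undersample is one alternative. The sketch below is a variation on a made-up toy frame, not what was run for this article; the names `toy`, `fraud_rows` and `balanced` are hypothetical.

```python
import pandas as pd

# Toy frame: 3 fraud rows among 20; values are illustrative only
toy = pd.DataFrame({
    'amount': [float(i) for i in range(20)],
    'isFraud': [1 if i % 7 == 0 else 0 for i in range(20)],
})

fraud_rows = toy.loc[toy['isFraud'] == 1]
non_fraud_rows = toy.loc[toy['isFraud'] == 0]

# Draw as many non-fraud rows as there are fraud rows, at random
n_per_class = len(fraud_rows)
sampled_non_fraud = non_fraud_rows.sample(n=n_per_class, random_state=42)

# Combine and shuffle so the two classes are interleaved
balanced = pd.concat([fraud_rows, sampled_non_fraud]).sample(frac=1, random_state=42)

print(balanced['isFraud'].value_counts().to_dict())
```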

If we look at the number of misclassified fraudulent transactions as a percentage of all fraudulent transactions in the original test set, we take (372/(372+1248))*100, which gives us a 22.96% failure rate for fraudulent transactions.
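The fraud-class rates for the imbalanced run can be recomputed directly from the four counts quoted earlier. This is a minimal sketch, assuming scikit-learn’s [[TN, FP], [FN, TP]] layout when assigning the 39 and the 372 to their cells.

```python
# Confusion-matrix cells reported in the article for the imbalanced run
tn, fp, fn, tp = 1270865, 39, 372, 1248

# Recall: share of actual fraud cases the model caught
recall = tp / (tp + fn)
# Miss rate: share of actual fraud cases the model let through
miss_rate = fn / (tp + fn)
# Precision: share of fraud predictions that really were fraud
precision = tp / (tp + fp)

print(f'Recall: {recall:.2%}')
print(f'Miss rate: {miss_rate:.2%}')
print(f'Precision: {precision:.2%}')
```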

Using the balanced dataset, I get a confusion matrix which takes the failure rate for fraudulent transactions to (7/(1+1597))*100, which is about 0.44%. So, just by balancing our dataset, we’ve taken our fraudulent transaction failure rate from 23% down to less than 1%; that’s a great success.

The interesting thing is that the overall model accuracy has dropped slightly, to 99.5% from 99.9%. In normal circumstances, I would look to tune parameters at this point, but with such a dataset it probably is not worth it.