Deep learning for beginners (Part 4): inspecting our Multi Layer Perceptron (MLP) model

This is the fourth part in the Deep Learning for Beginners series. The other parts of the series are:

  1. Deep learning for beginners (Part 1): neurons & activation functions
  2. Deep learning for beginners (Part 2): some key terminology and concepts of neural networks
  3. Deep learning for beginners (Part 3): implementing our first Multi Layer Perceptron (MLP) model
  4. Deep learning for beginners (Part 4): inspecting our Multi Layer Perceptron (MLP) model
  5. Deep learning for beginners (Part 5): our first foray into Keras
  6. Deep learning for beginners (Part 6): more terminology to optimise our Keras model
  7. Deep learning for beginners (Part 7): neural network design (layers & neurons)
  8. Deep learning for beginners (Part 8): Improving our tuning script & using the Keras tuner

In the previous article, we worked through building our first neural network using SKLearn. You will remember that we have:

  • 31 nodes input layer (31 features)
  • 20 node hidden layer 1
  • 10 node hidden layer 2
  • 1 node output layer

The first thing we might like to do is just to check the number of layers in the model aligns with this expectation. We expect it to be 4 (input, hidden1, hidden2, output).

mlp_clf.n_layers_
#out: 4

Let’s also quickly check the shape of the network. We can check how many features have been seen by the model.

mlp_clf.n_features_in_
#out: 31

We may also be interested to see the number of iterations (or epochs) that the solver has run through. We set a max when defining the model, but we can check what really happened.

mlp_clf.n_iter_
#out: 20

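Before we get into the meat of the investigation, let’s just verify the output activation function & the number of training samples seen by the solver during fitting (this is counted across every epoch, so it can be far larger than the size of the dataset).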

mlp_clf.out_activation_
#out: logistic

mlp_clf.t_
#out: 4556939
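
If you don’t have the Part 3 model to hand, the same attributes can be explored on a toy model. The sketch below is only an illustration: the synthetic dataset from make_classification, the name toy_clf and the choice of random_state are assumptions made for this example, not the setup used in the previous article.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the real dataset: 200 rows, 31 features (an assumption)
X, y = make_classification(n_samples=200, n_features=31, random_state=0)

toy_clf = MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=20,
                        activation='logistic', solver='sgd', random_state=0)
toy_clf.fit(X, y)  # a ConvergenceWarning is likely here because max_iter is small

print(toy_clf.n_layers_)        # input + 2 hidden + output layers
print(toy_clf.n_features_in_)   # number of columns seen during fit
print(toy_clf.n_iter_)          # epochs actually run, never more than max_iter
print(toy_clf.out_activation_)  # activation function of the output layer
print(toy_clf.t_)               # training samples seen by the solver across all epochs
```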

We can inspect the weights applied at each stage of the neural network by looking at the model coefficients. When you look at the output of mlp_clf.coefs_, it’ll look quite intimidating, but let’s break it down – it’s a list of NumPy arrays, one for each pair of adjacent layers, so we can look at each one in turn.

By printing the below, we see lengths of 31, 20 and 10 – which align with the structure of our neural network – coincidence? No! Each array has one row for every node in the layer that feeds it, and each row holds the weights leaving that node.

print(len(mlp_clf.coefs_[0]))
print(len(mlp_clf.coefs_[1]))
print(len(mlp_clf.coefs_[2]))

If we now inspect mlp_clf.coefs_[0] in more detail, we can see that it’s a two-dimensional array (a list of lists, if you like). The length of each inner row is 20 – which is, of course, the size of the second layer of our MLP (hidden layer 1). So each node in that layer gets a different weighting for the ingested feature.

array([[ 0.04831248, -0.03457129, -0.06804578, -0.06056679, -0.02949333,
        -0.00956451, -0.11401471, -0.11448489,  0.01421299, -0.01884066,
        -0.13625769, -0.17885363, -0.12397866, -0.16187424, -0.04780782,
        -0.18715104,  0.06372367, -0.00276321, -0.11435387, -0.13513219],
       [-0.17053891, -0.13909741,  0.14405762,  0.09187875,  0.09784029,
        -0.13698489,  0.0862132 ,  0.05888509, -0.05943338,  0.09844012,
         0.10069221, -0.09932501,  0.08861701, -0.17436223, -0.17951776,
        -0.1030934 ,  0.12217199, -0.11768607, -0.12760298, -0.12969404]

Similarly, if we look at mlp_clf.coefs_[1], we see rows with 10 values in each – the size of the second hidden layer – so it’s all starting to come together!

array([[ 0.28856533, -0.18428438,  0.11768056,  0.30265482,  0.00159602,
        -0.07851345,  0.13099072,  0.00999434,  0.17557096,  0.40800495],
       [ 0.06946405, -0.14787743,  0.19868054,  0.02940571, -0.23158076,
        -0.22717954, -0.17543366, -0.25345448,  0.00444168, -0.06535285]
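
The pattern generalises: each weight array connects one layer to the next, so its shape is (nodes in the sending layer, nodes in the receiving layer). A small sketch in plain Python, using the layer sizes listed at the top of this article, works out the shapes we should expect and how many weights that adds up to. The variable names are illustrative only.

```python
layer_sizes = [31, 20, 10, 1]  # input, hidden 1, hidden 2, output

# One weight array per pair of adjacent layers:
# rows = nodes in the sending layer, columns = nodes in the receiving layer
expected_coef_shapes = list(zip(layer_sizes[:-1], layer_sizes[1:]))

# Total number of individual weights across the whole network
n_weights = sum(rows * cols for rows, cols in expected_coef_shapes)

print(expected_coef_shapes)
print(n_weights)
```

On the fitted model, [w.shape for w in mlp_clf.coefs_] should give the same list of shapes.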

We can also look at the model bias (or intercept). Remember, bias is a way to move our curve. In a logistic or linear function, the coefficients determine the steepness of the line but the bias (or intercept) determines the location of the line (shifting it further along the X axis or up the Y axis).

mlp_clf.intercepts_
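
To make the "bias moves the curve" idea concrete, here is a small sketch for a single logistic node with one input. The weight and bias values are made up for illustration. The curve crosses 0.5 where weight * x + bias = 0, that is at x = -bias / weight, so changing the bias slides that crossing point along the X axis while the weight controls the steepness.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def midpoint(weight, bias):
    """Input value at which sigmoid(weight * x + bias) equals 0.5."""
    return -bias / weight

weight = 2.0  # illustrative value, not taken from the model
for bias in (-2.0, 1.0, 2.0):
    x_mid = midpoint(weight, bias)
    print(f"bias={bias:+.1f} -> curve crosses 0.5 at x={x_mid:+.1f}, "
          f"sigmoid there = {sigmoid(weight * x_mid + bias):.2f}")
```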

Okay, so let’s figure out what is going on here, by making it a bit simpler & walking through it. Let’s change the number of hidden layers to be 1, with 2 nodes and select only the top 3 features for the model. That means, our new Neural Network will have 3 input layer nodes, 2 hidden layer nodes and a single output node.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('/home/Datasets/creditcard.csv')

# Separate the label column from the feature columns
output = df['Class']
features = df.drop('Class', axis=1)

# Split the features (not the full dataframe, which still contains the label)
train_features, test_features, train_labels, test_labels = train_test_split(features, output, test_size = 0.2, random_state = 42)

# One hidden layer with 2 nodes; the trailing comma makes this a tuple
mlp_clf = MLPClassifier(hidden_layer_sizes=(2,),
                        max_iter = 20 ,activation = 'logistic',
                        solver = 'sgd')

# Use a single row and only three features so the arithmetic is easy to follow
t = train_features[['V1', 'V2', 'V3']].head(1)
t2 = train_labels.head(1)

mlp_clf.fit(t, t2)

By running this, we get the below values & coefficients. At the bottom right we have the resulting values for wx+b (weights * value + bias) which will be fed into the activation function.

Feature Value | Node 1 Coefficient | Node 2 Coefficient | Weighted Input Node 1 | Weighted Input Node 2 | Bias Node 1 | Bias Node 2 | Node 1 with Bias | Node 2 with Bias
1.955041 | 0.47502813 | -0.42382351 | 0.9286994703 | -0.8285923388 | -0.06805311 | -0.03438648 | 0.8606463603 | -0.8629788188
-0.380783 | -0.40288987 | -0.28477642 | 0.1534136134 | 0.1084380195 | -0.06805311 | -0.03438648 | 0.08536050337 | 0.07405153954
-0.315013 | -0.23599776 | 0.4126919 | 0.07434236237 | -0.1300033135 | -0.06805311 | -0.03438648 | 0.006289252371 | -0.1643897935
wx+b | | | | | | | 0.952296116 | -0.9533170728

Now we can step through the analysis.

  1. wx+b is (x * weight) + bias, totalled per node – this is the value that is fed into the activation function on each node.
  2. sigmoid L1 is the next line – the function is 1.0/(1.0 + e^-(wx+b)) – or in other words, where wx+b = 0.9, sigmoid will be 1.0/(1.0 + e^-0.9), which is roughly 0.71.
  3. So now, we have the value calculated by the sigmoid function. We move to the next layer in the network. The weights and bias for that layer are taken into account to create the L2 wx+b value, which again is passed into the sigmoid function.
  4. The output is 0.31, which is below 0.5, and hence the sample is in class 0, not class 1.

Feature Value | Node 1 Coefficient | Node 2 Coefficient | Weighted Input Node 1 | Weighted Input Node 2 | Bias Node 1 | Bias Node 2 | Node 1 with Bias | Node 2 with Bias
1.955041 | 0.47502813 | -0.42382351 | 0.9286994703 | -0.8285923388 | -0.06805311 | -0.03438648 | 0.8606463603 | -0.8629788188
-0.380783 | -0.40288987 | -0.28477642 | 0.1534136134 | 0.1084380195 | -0.06805311 | -0.03438648 | 0.08536050337 | 0.07405153954
-0.315013 | -0.23599776 | 0.4126919 | 0.07434236237 | -0.1300033135 | -0.06805311 | -0.03438648 | 0.006289252371 | -0.1643897935
wx+b | | | | | | | 0.952296116 | -0.9533170728
sigmoid L1 | | | | | | | 0.721576711 | 0.2782182214
L2 Weights | | | | | | | 0.39801396 | 0.09061815
L2 Bias | | | | | | | -0.55010368 | -0.55010368
L2 wx+b | | | | | | | -0.7877981353 |
sigmoid L2 | | | | | | | 0.3126416479 |
Binary Output | | | | | | | 0 |
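
As a rough cross-check, the walkthrough can be replayed in plain Python using the weights, biases and feature values from the table. One caveat: this sketch adds each node’s bias once, after the weighted inputs have been summed, which is how scikit-learn applies its intercepts. The table adds the bias on every feature row, so the intermediate values here will not match it exactly (the final probability comes out at roughly 0.44 rather than 0.31), although the predicted class is still 0. The helper names are illustrative and are not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the node's bias (added once), then sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Values copied from the table above
features = [1.955041, -0.380783, -0.315013]
hidden_weights = [
    [0.47502813, -0.40288987, -0.23599776],  # weights into hidden node 1
    [-0.42382351, -0.28477642, 0.4126919],   # weights into hidden node 2
]
hidden_biases = [-0.06805311, -0.03438648]
output_weights = [0.39801396, 0.09061815]
output_bias = -0.55010368

# Layer 1: one sigmoid output per hidden node
hidden = [node_output(features, w, b)
          for w, b in zip(hidden_weights, hidden_biases)]

# Layer 2: the hidden outputs become the inputs to the single output node
probability = node_output(hidden, output_weights, output_bias)
predicted_class = 1 if probability >= 0.5 else 0

print(hidden, probability, predicted_class)
```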

At this point, we can say that we’re quite comfortable with the conceptual design of the neural network. We can inspect feature importances using LIME or SHAP, but the models are inherently more complex & black box than more traditional models, so you won’t get 100% visibility.
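
If you would like to test that understanding against scikit-learn itself, one option is to rebuild the forward pass from coefs_ and intercepts_ and compare it with predict_proba. The sketch below does this on a synthetic three-feature dataset; the data, the random_state values and the variable names are assumptions made for illustration, not the credit card data used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary classification data with 3 features (an assumption)
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(2,), activation='logistic',
                    solver='sgd', max_iter=20, random_state=0)
clf.fit(X, y)  # a ConvergenceWarning is likely because max_iter is small

# Manual forward pass: weighted sum plus bias, then sigmoid, layer by layer
hidden = sigmoid(X @ clf.coefs_[0] + clf.intercepts_[0])
manual_proba = sigmoid(hidden @ clf.coefs_[1] + clf.intercepts_[1]).ravel()

# Probability of class 1 as reported by scikit-learn
sklearn_proba = clf.predict_proba(X)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

If the two agree, the "weights, bias, sigmoid, repeat" picture from the walkthrough is a fair description of what the fitted model is doing.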

Kodey