Deep learning for beginners (Part 3): implementing our first Multi Layer Perceptron (MLP) model

This is the third part in the Deep Learning for Beginners series. The other parts of the series are:

  1. Deep learning for beginners (Part 1): neurons & activation functions
  2. Deep learning for beginners (Part 2): some key terminology and concepts of neural networks
  3. Deep learning for beginners (Part 3): implementing our first Multi Layer Perceptron (MLP) model
  4. Deep learning for beginners (Part 4): inspecting our Multi Layer Perceptron (MLP) model
  5. Deep learning for beginners (Part 5): our first foray into Keras
  6. Deep learning for beginners (Part 6): more terminology to optimise our Keras model
  7. Deep learning for beginners (Part 7): neural network design (layers & neurons)
  8. Deep learning for beginners (Part 8): Improving our tuning script & using the Keras tuner

In this article, we’ll look at a simple implementation of a multi-layer perceptron neural network. It’s simple because it uses the scikit-learn (sklearn) library, which is not as flexible as other frameworks (e.g. TensorFlow). It is, however, easy to understand & good for building a baseline of knowledge around neural networks.

Key Hyperparameters

  1. hidden_layer_sizes is a tuple which defines how many nodes we want in each hidden layer of the neural network. For example, (5, 4, 2) would indicate three hidden layers, where the first has 5 nodes, the second has 4 nodes and the final one has 2 nodes. As a rule of thumb, a reasonable starting size for a hidden layer is somewhere between the size of the input layer & the output layer. If we have one hidden layer, 10 inputs and 4 outputs, then we could set the size of the hidden layer to 7.
  2. max_iter is the maximum number of iterations we will allow during the run. For stochastic solvers (sgd or adam), each iteration is one epoch (one complete pass through the entire training dataset), so max_iter is the maximum number of epochs. For the other solver (lbfgs), it is the maximum number of optimisation iterations.
  3. activation defines the activation function used in the hidden layers of the model. The options are ‘identity’, ‘logistic’, ‘tanh’ and ‘relu’ – all of which are discussed here. The points below are about output layers in general; in sklearn, the output activation is picked automatically from the type of problem.
    1. For regression problems, we may wish to use ReLU (which gives a numerical value of zero or more) or a linear function, which has a range of minus infinity to plus infinity
    2. For a binary classification problem, we may look to use the sigmoid/logistic function
    3. For a multi-label classification problem, we could use the sigmoid function with one output node per label (e.g. three output nodes for three labels), each giving a value between 0 and 1 for how likely the sample is to belong to that class
    4. If we want to predict a single label outcome (e.g. the one class the user belongs to, rather than likelihoods for multiple classes), then we may want to use a softmax function. This is not one of the options for the activation parameter in sklearn; MLPClassifier applies softmax at the output layer itself when the target has more than two classes.
  4. solver is the tool used for weight optimisation. The options are ‘lbfgs’, ‘sgd’, which is our stochastic gradient descent we discussed previously or ‘adam’ which is a stochastic gradient descent based optimiser.
  5. If solver is set to ‘sgd’ (stochastic gradient descent), then we also want to tune learning_rate, which defines how the step size in our gradient descent changes during training (see point 8).
  6. alpha is a regularization parameter (the strength of the L2 penalty). It combats overfitting by constraining the size of the weights. Increasing alpha may fix overfitting by encouraging smaller weights. Decreasing alpha may fix high bias (underfitting) by allowing larger weights.
  7. learning_rate_init – controls the initial step size when updating the weights (only used when solver is ‘sgd’ or ‘adam’)
  8. learning_rate – only used when the solver is ‘sgd’. It sets the learning rate schedule and can be ‘constant’, ‘invscaling’ or ‘adaptive’; the first and last are described below. Used in conjunction with learning_rate_init
    1. Constant – refers to the value you have set in learning_rate_init
    2. Adaptive – keeps the learning rate at the learning_rate_init value as long as the training loss continues to decrease. Each time 2 consecutive epochs fail to decrease the training loss by at least tol (or fail to increase the validation score by at least tol when early_stopping is on), the current learning rate is divided by 5.
  9. tol refers to the optimisation tolerance. If the loss does not improve by at least tol for n_iter_no_change consecutive iterations, then training will stop (unless learning_rate is set to ‘adaptive’). Used in conjunction with n_iter_no_change
  10. n_iter_no_change – maximum number of epochs allowed without meeting the tol improvement (when solver is ‘sgd’ or ‘adam’). Used in conjunction with tol
  11. early_stopping is a boolean parameter which terminates model training when the validation score isn’t improving by at least tol for the number of epochs defined in n_iter_no_change. If set to True, 10% of the training data will be set aside for validation; when that score stops improving, training will be stopped. If set to False, training will stop when the training loss does not improve by more than tol for n_iter_no_change consecutive passes. Used in conjunction with tol & n_iter_no_change (when solver is ‘sgd’ or ‘adam’)
  12. validation_fraction – by default, 10% of the training data is set aside for validation if early_stopping is set to True. You can alter that here. Used in conjunction with early_stopping.
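
As a rough sketch of how these settings fit together, the snippet below builds a small MLPClassifier with several of the parameters from the list. The synthetic data from make_classification, the variable names and the specific values are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 10 input features and a binary target (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

clf = MLPClassifier(
    hidden_layer_sizes=(7,),    # one hidden layer, sized between the inputs and the output
    activation="relu",          # activation for the hidden layer
    solver="sgd",
    learning_rate="adaptive",   # learning rate schedule (only used by 'sgd')
    learning_rate_init=0.01,    # starting step size
    alpha=0.0001,               # L2 penalty strength
    max_iter=50,                # maximum number of epochs for 'sgd'/'adam'
    tol=1e-4,
    n_iter_no_change=5,
    early_stopping=True,        # hold out validation_fraction of the training data
    validation_fraction=0.1,
    random_state=42,
)
clf.fit(X, y)

print("epochs run:", clf.n_iter_)
print("output activation:", clf.out_activation_)  # picked by sklearn, not by `activation`
print("validation scores recorded:", len(clf.validation_scores_))
```

On a binary target like this one, out_activation_ should report the logistic function, which is a quick way to see that the output layer is handled separately from the activation parameter.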

Note: we use the validation set to check that what the network learns through back propagation works for unseen data as well as seen data. Simply scoring on the training set over and over again cannot show this, & the model may not generalise well.
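
The stopping rule behind early_stopping, tol and n_iter_no_change can be sketched in a few lines of plain Python. This is a simplified illustration with a hypothetical helper name (stopping_epoch) and made-up validation scores; it is not sklearn's internal code, and the library's exact counting may differ slightly.

```python
def stopping_epoch(val_scores, tol=1e-4, n_iter_no_change=10):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    stale_epochs = 0
    for epoch, score in enumerate(val_scores, start=1):
        # An epoch only counts as an improvement if it beats the best score by at least tol.
        if score < best + tol:
            stale_epochs += 1
        else:
            stale_epochs = 0
        best = max(best, score)
        if stale_epochs >= n_iter_no_change:
            return epoch
    return None


# Hypothetical validation scores: improving at first, then flat.
scores = [0.5, 0.6, 0.7, 0.7, 0.7, 0.7]
print(stopping_epoch(scores, tol=0.01, n_iter_no_change=3))
```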

The Code:

The implementation of a neural network in SKLearn is very similar to any other algorithm – that’s the reason we’ve started with this library – it’ll help us to understand the concept, before moving into more complex implementations (e.g. Tensorflow).

In the below, we’re loading our data in from a credit card fraud dataset, creating our MLP classifier, fitting data to it & making predictions from it. Note that we have defined hidden_layer_sizes as (20,10). Our data has 31 columns: 30 features plus the Class label we are predicting. So, our model will have 30 input nodes, 20 nodes in the first hidden layer, 10 nodes in the second hidden layer and 1 output node.
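
To get a feel for the size of this network, we can count its trainable parameters by hand. The sketch below assumes 30 input columns (the file's 31 columns minus the Class label) and uses a hypothetical helper, count_parameters; every pair of adjacent layers contributes one weight per connection, and every non-input node has one bias.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    # One weight for every connection between adjacent layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias for every node outside the input layer.
    biases = sum(layer_sizes[1:])
    return weights, biases


# Assumed layout: 30 inputs, hidden layers of 20 and 10 nodes, 1 output node.
layer_sizes = (30, 20, 10, 1)
weights, biases = count_parameters(layer_sizes)
print(weights, biases, weights + biases)
```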

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('/home/Datasets/creditcard.csv')

# Separate the label we want to predict from the feature columns
output = df['Class']
features = df.drop('Class', axis=1)

# Split the features (not the full df, which still contains the label)
train_features, test_features, train_labels, test_labels = train_test_split(features, output, test_size = 0.2, random_state = 42)

mlp_clf = MLPClassifier(hidden_layer_sizes=(20,10),
                        max_iter = 10, activation = 'logistic',
                        solver = 'sgd')
mlp_clf.fit(train_features, train_labels)

y_pred = mlp_clf.predict(test_features)
print(accuracy_score(test_labels, y_pred))

Similarly to other sklearn models, we can check out the confusion matrix:

fig = ConfusionMatrixDisplay.from_estimator(mlp_clf, test_features, test_labels, display_labels=mlp_clf.classes_)
fig.figure_.suptitle("Confusion Matrix")
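
The imports above include StandardScaler, but the features are never scaled before fitting. Neural networks are usually sensitive to feature scale, so one possible way to add that step is sketched below. It uses synthetic data from make_classification rather than the credit card file, and the pipeline layout and parameter values are assumptions for illustration, not the setup that produced the results in this article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, with one feature blown up to a much larger scale.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fitted on the training data only, then applied to anything we predict on.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=200, random_state=42),
)
scaled_mlp.fit(X_train, y_train)

score = scaled_mlp.score(X_test, y_test)
print(score)
```

Wrapping both steps in a pipeline means the test data is transformed with the training set's mean and standard deviation, which avoids leaking information from the test set into the scaler.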

And we can print the classification report!

print(classification_report(test_labels, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.00      0.00      0.00        98

    accuracy                           1.00     56962
   macro avg       0.50      0.50      0.50     56962
weighted avg       1.00      1.00      1.00     56962
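
The report above is worth a second look: accuracy is 1.00, yet precision and recall for class 1 (fraud) are both 0.00, which suggests the model is not catching any fraud at all. The arithmetic below uses only the support counts from the report to show how accurate a model would be if it simply predicted class 0 for every row.

```python
# Support counts taken from the classification report above.
support_class_0 = 56864
support_class_1 = 98
total = support_class_0 + support_class_1

# Accuracy of a model that always predicts class 0.
baseline_accuracy = support_class_0 / total
print(f"{baseline_accuracy:.4f}")
```

With classes this imbalanced, accuracy on its own tells us very little, so the per-class recall and the confusion matrix are the numbers to watch.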

We can look for the optimal parameters using gridsearch, in a similar manner to other algorithms too. I have discussed gridsearch here.

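param_grid = {
    'hidden_layer_sizes': [(2, 3, 4), (4, 5, 6), (5, 10, 20)],
    'max_iter': [10, 20, 30],
    'activation': ['logistic', 'relu'],  # sklearn's name for the sigmoid is 'logistic'
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

# Try every combination with 5-fold cross-validation, using all CPU cores
grid = GridSearchCV(mlp_clf, param_grid, n_jobs= -1, cv=5)
grid.fit(train_features, train_labels)
print(grid.best_params_)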
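
Before running a grid search like this, it can help to work out how many models it will train. The sketch below mirrors the grid from the text (using 'logistic', which is sklearn's name for the sigmoid) and multiplies the number of options per parameter by the number of cross-validation folds; the variable names are illustrative.

```python
from math import prod

param_grid = {
    'hidden_layer_sizes': [(2, 3, 4), (4, 5, 6), (5, 10, 20)],
    'max_iter': [10, 20, 30],
    'activation': ['logistic', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],
}
cv = 5

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in param_grid.values())
# Each candidate is trained once per cross-validation fold.
n_fits = n_candidates * cv
print(n_candidates, n_fits)
```

On a dataset the size of the credit card file, that many fits can take a while, which is one reason to keep the grid small at first.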


We can also plot the loss curve (the below image is from here, where there is an excellent explanation). The below shows what the shape of your loss curve might mean – the loss being on the Y axis and either the epoch or the iteration on the X axis (epochs being preferable).

Wikipedia states that “A too high learning rate will make the learning jump over minima but a too low learning rate will either take too long to converge or get stuck in an undesirable local minimum” – so, there is an element of fiddling with the learning rate to achieve the optimum.
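
A toy example can make that trade-off concrete. The sketch below runs plain gradient descent on f(x) = x², which has its minimum at 0, with three learning rates. The function, the starting point and the learning rates are all chosen for illustration and have nothing to do with the credit card model.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimise f(x) = x**2 from `start` and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                  # derivative of x**2
        x = x - learning_rate * gradient  # step against the gradient
    return x


for lr in (0.001, 0.3, 1.1):
    print(lr, gradient_descent(lr))
```

The smallest rate has barely moved from the starting point after 20 steps, the middle one lands very close to 0, and the largest overshoots further on every step and ends up moving away from the minimum.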

  • An epoch is where we feed the entire dataset forward & backward through the neural network once. We want to do this many times, because on each epoch we adjust the weights in our neural network – potentially making it more accurate with each pass. One epoch will likely lead to underfitting & too many epochs can lead to overfitting – there is a happy medium somewhere.
  • An iteration is where we send one batch of data through the model – plotting across iterations only gives you the loss for that subset of your dataset.
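
The relationship between the two is simple arithmetic: the number of iterations in one epoch is the number of batches needed to cover the dataset. The sample count, batch size and helper name below are made up for illustration. For reference, MLPClassifier's default batch size with the ‘sgd’ and ‘adam’ solvers is the smaller of 200 and the number of samples.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to pass over the whole dataset once."""
    return math.ceil(n_samples / batch_size)


n_samples = 1000   # hypothetical training set size
batch_size = 200
epochs = 10

per_epoch = iterations_per_epoch(n_samples, batch_size)
total_iterations = per_epoch * epochs
print(per_epoch, total_iterations)
```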

So, when we plot our loss curve, we can see that it’s most like the ‘good learning rate’ line from above.

plt.plot(mlp_clf.loss_curve_)  # training loss after each epoch (available for 'sgd' and 'adam')
plt.title("Loss", fontsize=14)

This quick intro has hopefully helped to show the sort of parameters we can control in relation to neural networks. Next up, we will look at more flexible libraries and continue with a little theoretical discussion too.