Deep learning for beginners (Part 6): more terminology to optimise our Keras model

This is the sixth part in the Deep Learning for Beginners series. The other parts of the series are:

  1. Deep learning for beginners (Part 1): neurons & activation functions
  2. Deep learning for beginners (Part 2): some key terminology and concepts of neural networks
  3. Deep learning for beginners (Part 3): implementing our first Multi Layer Perceptron (MLP) model
  4. Deep learning for beginners (Part 4): inspecting our Multi Layer Perceptron (MLP) model
  5. Deep learning for beginners (Part 5): our first foray into Keras
  6. Deep learning for beginners (Part 6): more terminology to optimise our Keras model
  7. Deep learning for beginners (Part 7): neural network design (layers & neurons)
  8. Deep learning for beginners (Part 8): Improving our tuning script & using the Keras tuner

What is a Tensor?

A tensor is a multi-dimensional array.
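As a quick sketch (using NumPy here – TensorFlow's `tf.constant` behaves the same way), we can build tensors of increasing rank, where rank is simply the number of dimensions:

```python
import numpy as np

scalar = np.array(5)                 # rank-0 tensor (a single number)
vector = np.array([1, 2, 3])         # rank-1 tensor (1 dimension)
matrix = np.array([[1, 2], [3, 4]])  # rank-2 tensor (2 dimensions)
cube   = np.ones((2, 3, 4))          # rank-3 tensor (3 dimensions)

print(scalar.ndim, vector.ndim, matrix.ndim, cube.ndim)  # 0 1 2 3
```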

What is TensorFlow?

TensorFlow is a library used for deep learning, developed by Google.

How does TensorFlow relate to Keras?

Keras is a high-level interface that uses TensorFlow as its backend. Since TensorFlow 2, Keras has been tightly integrated into TensorFlow itself.

What is a dense layer in Keras?

There are a number of different types of layers that we can use in our Keras neural network. However, the one that is used most frequently is the Dense layer. Dense layers are standard, fully connected neural network layers: each node connects to every node in the subsequent layer.
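Under the hood, a Dense layer is essentially a matrix multiplication plus a bias vector, followed by an activation function. A minimal NumPy sketch of that idea (the weights and inputs below are made-up illustrative values, not anything Keras would produce):

```python
import numpy as np

def dense(inputs, weights, bias, activation=lambda x: np.maximum(x, 0)):
    """One Dense layer: every input feeds every output (ReLU activation by default)."""
    return activation(inputs @ weights + bias)

x = np.array([1.0, 2.0])              # 2 input features
W = np.array([[0.5, -1.0,  0.2],      # 2x3 weight matrix: each of the
              [0.3,  0.8, -0.5]])     # 2 inputs connects to all 3 outputs
b = np.array([0.1, 0.0, 0.2])

print(dense(x, W, b))                 # 3 output values
```

In Keras this whole computation is what `Dense(3, activation="relu")` does for you, with the weights learned during training rather than hand-written.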

What loss functions can we use?

  1. Binary Cross Entropy (binary_crossentropy): requires y_true and y_pred – where y_true is the real value (1 or 0) and y_pred is the floating-point predicted probability of belonging to the class. The output of cross entropy summarises the average difference between our true values and our predicted probabilities. A cross entropy of zero indicates no error.
  2. Categorical Cross Entropy (categorical_crossentropy): similar to Binary Cross Entropy but for multi-class problems; where labels are provided in one-hot encoded format (many columns with a binary response).
  3. Sparse Categorical Cross Entropy (sparse_categorical_crossentropy): similar to Categorical Cross Entropy, but where we have not used one-hot encoding for our output field – the labels are plain integers. We still pass the probabilities of each class, as below: we have three possible classes (0, 1, 2), so we pass three probabilities for each prediction.

    true = [0, 2]
    probs = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]

  4. Mean Absolute Error (MAE – mean_absolute_error): This takes the difference between the predicted value and the actual value for every prediction and averages the results. To stop positive and negative errors cancelling one another out, it takes the absolute value of each difference first.
  5. Mean Absolute Percentage Error (mean_absolute_percentage_error): With MAE and RMSE (Root Mean Squared Error) we sometimes run into issues interpreting the results. If you have an RMSE of 54, is that bad? Well, if you are trying to predict something where the actual value is 60, then being wrong by 54 points is quite bad; but if you were predicting something where the actual value is 500,000,000, then being wrong by 54 on average is excellent. For this reason we can use the percentage error, which gives us an output relative to the actual value, rather than an unscaled number.
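To make these losses concrete, here is a rough NumPy sketch computing binary cross entropy, MAE and MAPE by hand on made-up values (Keras's own implementations add clipping and other numerical-stability details, but the underlying maths is the same):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])  # predicted probabilities

# Binary cross entropy: average log-loss; 0 would mean a perfect fit
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# MAE and MAPE on a small regression-style example
actual    = np.array([60.0, 500.0])
predicted = np.array([54.0, 480.0])
mae  = np.mean(np.abs(actual - predicted))                   # 13.0
mape = np.mean(np.abs((actual - predicted) / actual)) * 100  # error relative to the actual values, in %

print(round(bce, 4), mae, round(mape, 2))
```

Notice how MAPE divides each error by the actual value, which is what makes it interpretable across very different scales.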

What optimisers can we use?

Optimisation is about minimising loss as much as possible. The most common approach is Gradient Descent. Effectively, the model's values (weights and biases) are randomly initialised, then updated iteratively to achieve the minimum loss possible – optimising the algorithm.

When the model updates the values, it does so using the learning rate. The learning rate determines how much the model updates by, and hence controls the step size towards the minimum loss. If your learning rate is too high, you might overshoot the lowest possible value; if it's too low, it might take a very long time to optimise – so it's a trade-off.
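The learning-rate trade-off is easy to see on a toy problem. A minimal sketch of gradient descent on a single parameter (the quadratic loss and all the values here are made up for illustration):

```python
# Minimise loss(w) = (w - 3)**2 with plain gradient descent.
def minimise(learning_rate, steps=50, w=0.0):
    for _ in range(steps):
        gradient = 2 * (w - 3)          # d/dw of (w - 3)**2
        w -= learning_rate * gradient   # step towards the minimum
    return w

print(minimise(0.1))   # converges close to the minimum at w = 3
print(minimise(1.1))   # too high: each step overshoots and the value diverges
```

A real network does the same thing, just with millions of parameters and a loss it can only estimate batch by batch.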

Stochastic Gradient Descent (SGD) randomly selects data points from the dataset rather than computing the error over the whole dataset at once, which reduces computational overhead significantly. Stochastic Gradient Descent maintains a constant learning rate throughout the entire optimisation process.

Adagrad is an enhancement of Stochastic Gradient Descent that removes the need to manually tune the learning rate: it provides a per-parameter (per dimension/feature) learning rate, rather than a single rate for the whole model.

Adam is another enhancement of Stochastic Gradient Descent: it combines per-parameter adaptive learning rates (as in Adagrad) with momentum, and is a popular default choice.
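As a rough sketch of the difference, here is one update step of SGD versus Adagrad on a made-up gradient vector (real Adagrad accumulates squared gradients across all steps; this shows just the very first step):

```python
import numpy as np

grad = np.array([0.1, 2.0])  # one small-gradient and one large-gradient parameter
lr = 0.5

# SGD: the same learning rate for every parameter,
# so the step size is proportional to the gradient
sgd_step = lr * grad

# Adagrad: divide by the root of the accumulated squared gradients,
# giving each parameter its own effective learning rate
accumulator = grad ** 2  # after the first step, the history is just grad**2
adagrad_step = lr * grad / (np.sqrt(accumulator) + 1e-8)

print(sgd_step)      # steps scale with gradient size
print(adagrad_step)  # roughly equal steps: large gradients are damped
```

This is why Adagrad copes better when some features produce much larger gradients than others.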

What are epochs and batches?

The number of epochs is the number of times the entire training dataset passes through the model and back (forward and backward propagation). Within each epoch we have a number of batches – the model's weights are updated each time a batch is processed.
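A quick sketch of how the two relate (all the values below are made up for illustration):

```python
import math

n_samples  = 1000   # size of the training set
batch_size = 32
epochs     = 10

# Batches per epoch: the last batch may be smaller, hence the ceiling
batches_per_epoch = math.ceil(n_samples / batch_size)

# The weights update once per batch, so total updates = batches x epochs
total_updates = batches_per_epoch * epochs

print(batches_per_epoch, total_updates)  # 32 320
```

In Keras, `batch_size` and `epochs` are both arguments to `model.fit`, and this batch count is what you see ticking by in the progress bar.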