Tuning our machine learning model is of course critical to its accuracy. There are two methods for tuning models: tuning the data and tuning the parameters.
We ‘tune’ the data to clean it up, select the right features and make sure we don’t have a class imbalance. After all, if you pump garbage into the model, you can’t expect anything other than garbage to come out the other side. We will discuss this in detail below.
We tune the hyperparameters to ensure our model is configured to run as accurately as possible, given our specific dataset and the goals we hope to achieve from it.
When we’re trying to tune our machine learning model, we start by looking at ways to reduce the number of features in our dataset. The reason for this is the spookily named ‘Curse of Dimensionality’. I’ve discussed it in detail here, but essentially, two datapoints that sit very close together in 1D space can split apart when we move to a multi-dimensional space. So in 1D space, two points might be right next to one another, but in 3D space, they may be very far apart. This makes it harder for a model to find relationships in the data.
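To make that concrete, here is a minimal sketch using made-up coordinates (the points themselves are purely illustrative). The two points agree almost exactly on the first feature, but once two more features are added they end up far apart.

```python
import math

# Two hypothetical datapoints described by a single feature.
point_a_1d = (1.0,)
point_b_1d = (1.1,)

# The same two datapoints once two extra (made-up) features are added.
point_a_3d = (1.0, 0.0, 5.0)
point_b_3d = (1.1, 4.0, 0.0)

# math.dist returns the Euclidean distance between two points.
distance_1d = math.dist(point_a_1d, point_b_1d)
distance_3d = math.dist(point_a_3d, point_b_3d)

print(f"1D distance: {distance_1d:.2f}")
print(f"3D distance: {distance_3d:.2f}")
```

The first feature has not changed at all. The extra dimensions alone are what push the points apart.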
We have quite a few methods to select features. I’ll discuss a few here which are easy to implement and understand.
First, we have Pearson’s correlation, which simply finds the features with the strongest correlation (positive or negative) to the target. We then retain the top N features, ranked by the absolute value of their correlation. Simple, right?
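As a rough sketch of the idea, the snippet below ranks features by the absolute value of their Pearson correlation with the target and keeps the top N. The feature names and values are invented for illustration, and `pearson` and `top_n_features` are hypothetical helper names, not part of any library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

def top_n_features(features, target, n):
    """Return the n feature names most strongly correlated with the target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Made-up example data: three candidate features and one target.
features = {
    "age": [25, 32, 47, 51, 62],
    "bmi": [22.0, 27.5, 24.1, 30.2, 26.3],
    "shoe_size": [9, 7, 10, 8, 9],
}
target = [110, 135, 150, 172, 180]

selected = top_n_features(features, target, n=2)
print(selected)  # ['age', 'bmi']
```

Using the absolute value matters here, because a strongly negative correlation is just as informative as a strongly positive one.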
Next, we have Recursive Feature Elimination (RFE), which trains a model with all the features in your dataset and extracts the coefficients or feature importances. The algorithm then removes the least important feature and retrains on what is left. It repeats these steps until it reaches the desired number of features. We can use cross validation to work out the best number of features.
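Now we have the lasso (L1 regularization), which fits a regression model on all the features at once while adding a penalty on the size of the coefficients. The stronger the penalty, the more the coefficients shrink, and a feature that does not improve the model enough has its coefficient set to exactly zero, which effectively removes it from the model.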
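To see how a penalty can push a coefficient all the way to zero, here is a small sketch of the soft-thresholding step that sits at the heart of the lasso. This is a simplification: it matches the lasso solution exactly only in the special case where the features are uncorrelated and on the same scale, and the coefficient values and penalty below are made up for illustration.

```python
def soft_threshold(coefficient, penalty):
    """Shrink a coefficient towards zero, clipping it to zero if it is small."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0

# Hypothetical unpenalised regression coefficients for three features.
unpenalised = {"age": 2.4, "bmi": -0.9, "shoe_size": 0.05}
penalty = 0.1  # assumed penalty strength

lasso_like = {name: soft_threshold(value, penalty) for name, value in unpenalised.items()}

# Features whose coefficient survived the penalty.
kept = [name for name, value in lasso_like.items() if value != 0.0]
print(lasso_like)
print(kept)  # ['age', 'bmi']
```

The weak feature is dropped entirely, while the strong ones are kept with slightly smaller coefficients. Raising the penalty would remove more features.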
Other Dimensionality Reduction Methods
Principal Component Analysis (PCA) is a linear transformation algorithm which sets out to simplify complex, high-dimensional data into fewer features, while retaining the trends and patterns in the data that we require for our analysis. We call this reduction in features dimensionality reduction.
We do this to improve computational efficiency and make it easier to visualize our datasets. Generally PCA captures most of the nuances in our data, but there is a trade-off between keeping our original, granular data and the performance gains of utilizing PCA.
TL;DR – Principal Component Analysis reduces the number of features in our dataset by squashing highly correlated features together. This is less accurate than having independent features but it makes it computationally more efficient.
PCA transforms our features so that we end up with fewer dimensions. Let’s say we have height and weight. Generally there is a strong relationship between these two features, so we could summarize them as a single new variable and remove the original height and weight columns. This would reduce our two dimensions to one.
Now, consider a dataset with 500+ features. Reducing dimensionality by 50% would be a huge improvement for processing times.
PCA is an unsupervised algorithm, a bit similar to clustering. PCA finds patterns in the data autonomously and projects those patterns into principal components (summarized features), with the goal of summarizing our data with the fewest principal components (PCs).
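The height and weight example can be sketched from scratch for the two-feature case. The measurements below are made up, and the closed-form eigenvalue formula used here only works for a 2x2 covariance matrix with some correlation between the two features. In practice you would reach for a library implementation.

```python
import math
from statistics import mean, stdev

# Made-up measurements for seven people.
heights_cm = [150, 160, 165, 170, 175, 180, 190]
weights_kg = [52, 58, 63, 68, 72, 80, 90]

def standardize(values):
    """Rescale to zero mean and unit (sample) standard deviation."""
    centre, spread = mean(values), stdev(values)
    return [(value - centre) / spread for value in values]

def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

z_height, z_weight = standardize(heights_cm), standardize(weights_kg)

# Entries of the 2x2 covariance matrix [[a, b], [b, c]].
a = covariance(z_height, z_height)
b = covariance(z_height, z_weight)
c = covariance(z_weight, z_weight)

# Closed-form eigenvalues of a symmetric 2x2 matrix.
centre = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
largest, smallest = centre + radius, centre - radius

# Unit-length eigenvector for the largest eigenvalue (assumes b != 0).
vx, vy = b, largest - a
length = math.hypot(vx, vy)
vx, vy = vx / length, vy / length

# Project each person onto the first principal component: two columns become one.
pc1 = [vx * h + vy * w for h, w in zip(z_height, z_weight)]
explained_ratio = largest / (largest + smallest)

print(f"Share of variance kept by one component: {explained_ratio:.3f}")
```

Because height and weight move together so closely in this toy data, the single combined column keeps nearly all of the variance of the original two.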
Implementing Principal Component Analysis in Python is super simple and hopefully this article has helped you to understand why we might choose to do it.
Choosing the number of components (n_components) is not a straightforward task and requires a bit of trial and error. You need to pick a number of components that explains enough of the variability in the data to make accurate predictions. We can use the explained variance ratio to support our decision making.
The explained variance ratio tells you the share of the total variance attributed to each of your principal components. So, if you have 3 components, you may find the ratios to be [0.7, 0.29, 0.01]. This means the first principal component contains 70% of the variance, the second contains 29% and the final one contains 1%. So here, n_components = 2 would suffice to cover 99% of the variance in the dataset.
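That decision can be automated with a running total. The sketch below uses the example ratios from above and an assumed target of 95% of the variance. Both `components_needed` and the 0.95 threshold are illustrative choices, not a rule.

```python
from itertools import accumulate

def components_needed(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    running_totals = accumulate(explained_variance_ratio)
    for count, cumulative in enumerate(running_totals, start=1):
        if cumulative >= target:
            return count
    # Target never reached: fall back to keeping every component.
    return len(explained_variance_ratio)

ratios = [0.7, 0.29, 0.01]
n_components = components_needed(ratios, target=0.95)
print(n_components)  # 2
```

The ratios need to be in descending order for this to work, which is how PCA implementations normally report them.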
Do you have a Class Imbalance (an imbalanced dataset)?
A class imbalance in machine learning is where our targets are, well, unbalanced. If we were training a binary classifier on 1,000 input datapoints, of which 895 were 1 and the remainder were 0, we would have an imbalance.
First off, why is this an issue? Well, imagine you’re a bank and you are trying to determine whether a transaction is fraudulent. You have a sample of 1,000,000 transactions. Of those, 950 are fraudulent and the rest are not.
In machine learning, we generally find that models produce poor classifiers when faced with an imbalanced dataset. In this example, even if our classifier always returned a result of ‘not fraudulent’, it would be correct more than 99.9% of the time, while catching no fraud at all. This kind of naive result is a direct consequence of the imbalanced dataset. After all, your model requires data to learn from, and with so few examples of the minority class, the model fails to recognise it.
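The arithmetic behind that claim is worth spelling out. This sketch scores a hypothetical classifier that always answers ‘not fraudulent’ on the bank example above.

```python
total_transactions = 1_000_000
fraudulent = 950
legitimate = total_transactions - fraudulent

# A naive classifier that labels every transaction as 'not fraudulent'.
true_negatives = legitimate   # legitimate transactions correctly passed
false_negatives = fraudulent  # fraudulent transactions that were missed
true_positives = 0            # it never flags anything as fraud
false_positives = 0

accuracy = (true_positives + true_negatives) / total_transactions
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.3%}")  # Accuracy: 99.905%
print(f"Fraud caught (recall): {recall:.0%}")  # Fraud caught (recall): 0%
```

Accuracy looks superb, yet the model is useless for the one thing we care about. That is why accuracy alone is a poor metric on imbalanced data.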
We can use our confusion matrices (discussed here) to help us understand whether we’re getting a large number of false readings.
We can work around this imbalance by making our sample intentionally biased. Instead of taking the last 1,000,000 transactions, we could force 50% of our transactions to be fraudulent and 50% to be legitimate. This would remove the minority class issue and give us a balanced dataset. This is a bit messy.
A more elegant solution is to over or under sample our dataset. Oversampling is where we artificially inflate the number of minority class examples by duplicating them, increasing their relative frequency. Undersampling is the process of removing some of the majority class. Rather than duplicating, we can also use SMOTE or ADASYN to synthesise new minority class examples.
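Here is a minimal sketch of random oversampling and undersampling for a binary problem, using only the standard library. The helper names and the toy data are made up, and this is plain duplication and removal, not SMOTE or ADASYN.

```python
import random

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [row for row, label in zip(rows, labels) if label == minority_label]
    majority_count = len(labels) - len(minority)
    extra = [rng.choice(minority) for _ in range(majority_count - len(minority))]
    return rows + extra, labels + [minority_label] * len(extra)

def random_undersample(rows, labels, minority_label, seed=0):
    """Drop random majority rows until both classes are the same size."""
    rng = random.Random(seed)
    pairs = list(zip(rows, labels))
    minority = [pair for pair in pairs if pair[1] == minority_label]
    majority = [pair for pair in pairs if pair[1] != minority_label]
    kept = minority + rng.sample(majority, len(minority))
    return [row for row, _ in kept], [label for _, label in kept]

# Toy dataset: 3 minority rows (label 1) and 17 majority rows (label 0).
rows = list(range(20))
labels = [1] * 3 + [0] * 17

over_rows, over_labels = random_oversample(rows, labels, minority_label=1)
under_rows, under_labels = random_undersample(rows, labels, minority_label=1)

print(over_labels.count(1), over_labels.count(0))    # 17 17
print(under_labels.count(1), under_labels.count(0))  # 3 3
```

Resampling should only ever be applied to the training data. If duplicated rows leak into the test set, the evaluation will look better than it really is.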
We can also weight our classes, providing a higher weight to the minority class and a lower weight to the majority class to help us fix the skew in our dataset.
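One common way to choose those weights is to make them inversely proportional to how often each class appears. The sketch below applies the formula n_samples / (n_classes * class_count), which is the heuristic behind scikit-learn’s "balanced" class weight option, to the 895 versus 105 example from earlier. The function name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# 1,000 datapoints: 895 of class 1 and 105 of class 0.
labels = [1] * 895 + [0] * 105
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"class {label}: weight {weight:.3f}")
```

With these weights, a mistake on the rare class costs the model several times more than a mistake on the common class, so it can no longer ignore the minority.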
Use That Confusion Matrix!
You’ve made a model, it’s getting 56% accuracy and you want to know why? This is the ideal time to check your confusion matrix to see which predictions the algorithm is getting wrong. Is it always failing on a particular class?
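As a small sketch of that check, the snippet below builds a confusion matrix as counts of (actual, predicted) pairs and then reports recall per class. The labels and predictions are invented for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def recall_per_class(actual, predicted):
    """Fraction of each actual class that the model predicted correctly."""
    matrix = confusion_matrix(actual, predicted)
    totals = Counter(actual)
    return {label: matrix[(label, label)] / totals[label] for label in totals}

# Made-up labels and predictions for a three-class problem.
actual = ["cat", "cat", "dog", "dog", "bird", "bird", "bird", "cat"]
predicted = ["cat", "cat", "dog", "cat", "cat", "dog", "cat", "cat"]

recalls = recall_per_class(actual, predicted)
print(recalls)  # {'cat': 1.0, 'dog': 0.5, 'bird': 0.0}
```

Here the overall accuracy hides the fact that the model never gets a single bird right, which is exactly the kind of pattern the confusion matrix exposes.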
Tune your hyperparameters
This is a big topic. A hyperparameter is a model parameter which is set before the model begins learning and is tuneable to improve model performance. Think of a hyperparameter like a setting on your road car. When you go to the fuel station, you can choose from Unleaded or SUPER Unleaded.
This is a parameter which you can change about your car, which may lead to improved performance. You can set the parameter to be Unleaded or SUPER Unleaded before you start driving your car. In the same way, we can set our hyperparameters before our model starts learning to help it learn better.
As I say, this is a big topic because each of the different models has different hyperparameters which we can tune. I have covered Random Forest parameters in quite a lot of detail here and will be covering logistic regression parameters as part of our Diabetes case study.
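To illustrate, here is a toy sketch of the simplest tuning approach: try a handful of candidate values and keep the one that scores best on held-out data. The model is a tiny hand-written k-nearest-neighbours classifier, where k is the hyperparameter we fix before any predictions are made. The data and candidate values are made up.

```python
def knn_predict(train, x, k):
    """Predict a label by majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = [label for _, label in neighbours]
    return max(set(votes), key=votes.count)

def accuracy(train, validation, k):
    """Share of validation points that are classified correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)

# Made-up one-feature dataset of (feature, label) pairs.
train = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.6, 0),
         (2.2, 1), (3.0, 1), (3.5, 1), (4.0, 1)]
validation = [(1.2, 0), (2.05, 0), (2.9, 1), (3.8, 1)]

# Try each candidate value of the hyperparameter and score it.
candidate_ks = [1, 3, 5]
scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)

print(scores)
print(f"Best k: {best_k}")
```

The important detail is that the candidates are scored on data the model has not seen. Tuning against the training data would simply reward overfitting.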