ML Scenario Part 3: Reducing dimensionality to improve accuracy

Reducing dimensionality can sometimes drastically improve your model accuracy. In the case of our sample dataset, we don't really expect it to, as there are not many features in the first place and none of them are particularly redundant.

In the feature importance chart below, you can see that none of our features contributes less than 6% to the model's predictions.

[Image: image-29.png]

Nevertheless, let’s run through the changes we could make, to get some exposure to how we would reduce dimensionality.

We can approach this in a couple of ways. We could reduce the number of dimensions by manually removing some columns from ‘data’, or we could use Principal Component Analysis – as discussed here. In short, Principal Component Analysis reduces the number of features in our dataset by squashing highly correlated features together. This is less accurate than having fully independent features, but it makes the model computationally more efficient.
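The first option, manually removing columns, is just a matter of dropping them from the DataFrame. A minimal sketch of that, using a toy stand-in for ‘data’ (the column names here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the 'data' DataFrame; column names are hypothetical.
data = pd.DataFrame({
    "feature_a": [1, 2, 3],
    "feature_b": [4, 5, 6],
    "feature_c": [7, 8, 9],
})

# Manual dimensionality reduction: drop the columns we don't want.
data = data.drop(columns=["feature_c"])
print(list(data.columns))  # ['feature_a', 'feature_b']
```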

In the below, I’ve created X_pc, which is the output of our PCA. Here, we define n_components, which is how many features we want the principal component analysis to produce.

[Image: image-26.png]
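As a rough sketch of what that cell might look like (assuming scikit-learn’s PCA and random stand-in data for the original feature matrix x):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the original feature matrix x (100 rows, 11 features).
rng = np.random.default_rng(42)
x = rng.normal(size=(100, 11))

# n_components is the number of features PCA will produce as output.
pca = PCA(n_components=2)
X_pc = pca.fit_transform(x)

print(X_pc.shape)  # (100, 2) – same rows, but only 2 principal components
```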

Next, we run our grid search again, but this time using our new principal components rather than the original x.

This results in the below optimal parameters. You can see from the feature importances that we do indeed now have just two features in this model.
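The original model and parameter grid aren’t shown here, so as an illustration this sketch uses a RandomForestRegressor with a made-up grid and random stand-in data; the key point is simply that GridSearchCV is fitted on X_pc rather than on the original x:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Random stand-in data for the original x and target y.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 11))
y = rng.normal(size=100)

# Reduce to 2 principal components, as before.
X_pc = PCA(n_components=2).fit_transform(x)

# Hypothetical parameter grid for illustration only.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4]}
grid = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
grid.fit(X_pc, y)  # fit on the principal components instead of x

print(grid.best_params_)
print(grid.best_estimator_.feature_importances_)  # just two importances now
```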

How do we choose n_components in PCA?

In PCA, a common approach is to capture 99% of the variance in our data with the fewest features. The explained variance ratio is the ratio between the variance of a given principal component and the total variance of the dataset.

[explained variance ratio is the] Percentage of variance explained by each of the selected components. If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.

scikit-learn documentation

If we take a cumulative sum of the explained variance ratios, we get a table like the below, where the second row is the running total of the explained variances. Of course, by the time we reach 11 features, we have explained 100% of the variance in the data.

If we were looking to capture 99% of the variance with our PCA, we would set n_components to 10.

[Image: image-30.png]
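That running total can be computed directly from the fitted PCA. A sketch, again using random stand-in data in place of the original features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the original 11-feature dataset.
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 11))

# Leave n_components unset so all components are kept
# and the explained variance ratios sum to 1.0.
pca = PCA().fit(x)

# Running total of explained variance, as in the table above.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)  # last entry is 1.0: all variance explained

# Smallest number of components capturing at least 99% of the variance.
n_components = int(np.searchsorted(cumulative, 0.99) + 1)
print(n_components)
```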