Principal Component Analysis (PCA) is a linear transformation algorithm which sets out to simplify complex, high-dimensional data into fewer features, while retaining the trends and patterns in the data that we require for our analysis. We call this reduction in features dimensionality reduction.
We do this to improve computational efficiency and make it easier to visualize our datasets. Generally PCA captures most of the nuances in our data, but there is a trade-off between keeping our original, granular data and the performance gains of utilizing PCA.
TL;DR: Principal Component Analysis reduces the number of features in our dataset by squashing highly correlated features together. This loses some information compared with keeping every original feature, but it makes our analysis computationally more efficient.
PCA transforms our features so that we end up with fewer dimensions. Let’s say we have height and weight. Generally there is a strong relationship between these two features, so we could summarize them as a single new variable (a combination of the two) and remove the original height and weight columns. This would reduce our two dimensions to one.
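To make the height and weight idea concrete, here is a minimal sketch using scikit-learn. The data is synthetic and made up purely for illustration, and the variable names (height_cm, weight_kg) and the choice to standardize first are our own assumptions rather than a required recipe.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, made-up data: weight loosely follows height plus some noise.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
weight_kg = 70 + 0.9 * (height_cm - 170) + rng.normal(0, 4, size=200)
X = np.column_stack([height_cm, weight_kg])

# Put both features on the same scale so neither dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Keep a single principal component: two columns become one.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_)
```

Because the two columns are strongly correlated here, the single component should retain most of the variance, which is exactly why dropping a dimension costs us so little in this case.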
Now consider a dataset with 500+ features: reducing its dimensionality by 50% would be a huge improvement for processing times.
PCA is an unsupervised algorithm, a bit similar to clustering (which we will discuss in an upcoming post). PCA finds patterns in the data autonomously, without needing any labels, and projects those patterns onto principal components (summarized features), with the goal of summarizing our data with the fewest principal components (PCs).
Implementing Principal Component Analysis in Python is super simple and hopefully this article has helped you to understand why we might choose to do it.
Choosing the number of components (n_components) is not a straightforward task and requires a bit of trial and error. You need to pick a number of components that explains enough of the variability in the data to make accurate predictions. We can use the explained variance ratio to support our decision making.
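One way to cut down on the trial and error is to let scikit-learn choose the number of components for a target share of variance. The sketch below assumes a 95% target and uses synthetic data built from three hidden factors; both are illustrative choices, not recommendations. Passing a float between 0 and 1 as n_components together with svd_solver="full" asks PCA to keep just enough components to reach that share.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by 3 hidden factors plus a little noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# A float n_components is read as "explain at least this fraction of the variance".
# The 0.95 target is an assumption for this example.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

The right target still depends on your problem, so treat the number this returns as a starting point to validate against your downstream model.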
The explained variance ratio tells you the proportion of the total variance attributed to each of your principal components. So, if you have 3 components, you may find the ratios to be [0.7, 0.29, 0.01]. This means the first principal component contains 70% of the variance, the second contains 29% and the final one contains 1%. So here, n_components = 2 would suffice to cover 99% of the variance in the dataset.
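The arithmetic in that example can be sketched with a cumulative sum. The ratios are the ones from the example above; the 95% target is a hypothetical threshold chosen only for illustration.

```python
import numpy as np

# Ratios from the worked example above.
explained_variance_ratio = np.array([0.7, 0.29, 0.01])

# Running total of variance explained as we add components one by one.
cumulative = np.cumsum(explained_variance_ratio)

# Hypothetical target: keep enough components to explain at least 95%.
target = 0.95

# argmax returns the index of the first True; add 1 to turn it into a count.
n_components = int(np.argmax(cumulative >= target)) + 1

print(cumulative)
print(n_components)
```

With these ratios the first component alone falls short of the target, while the first two together reach roughly 99%, so the sketch settles on two components.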