Detecting outliers in highly dimensional data is hard. There are so many observations across a large number of dimensions; so plotting it is often not possible and if you can plot it, interpreting it with the naked eye is extremely challenging.
Before we get into how we may identify the outliers using an isolation forest, first, let’s consider use-cases for doing so. In other words, why would we want to identify the outliers in a dataset?
Well, consider the case of Mastercard or Visa – they have billions of transactions running through their infrastructure every day and being able to identify the very small percentage of those that are fraudulent quickly, can save a significant amount of money.
Here then is the use-case for an isolation forest. It’s an unsupervised learning algorithm which seeks to find those transactions that are fraudulent, by identifying them as outliers.
It does this by partitioning (or splitting) your data randomly within the minimum and maximum value bounds of your features. The idea is to isolate the anomaly. Let’s look at an example.
In the below (left), we have quite an obvious outlier. Let’s see how we might isolate it. Firsr, we will make a random split of the data (centre) and you can see that it hasn’t isolated the datapoint. So we try again and make another split (right) – this time, we have isolated our anomaly.
The idea of the algorithm is to isolate the anomalies by creating decision trees over the randomly selected attributes. The anomaly score is based on the depth of the path, because, fewer instances (anomalies) result in smaller partitions. Imagine a decision tree on age: Age > 200 — YES/NO. Here, the YES would be very sparsely populated and will be comprised of outliers, because nobody lives for more than 200 years.
The idea is that the really distinguishable values in a dataset are going to split into their own partitions early on in the partitioning process. If we had a gender field (Male | Female | Boat), the likelihood is, this very different and unexpected value will appear on its own very short path within the tree.
We can demonstrate this – in the above, we have identified the anomaly in 2 splits (2 partitions). Whereas, if we decided that the below green point was an anomaly, it would take a lot more splits to isolate it – it’s an absolute minimum of 4 splits but as the splits are done at random, it’s likely to be significantly more, as shown below.
The number of splits determines the level of isolation and generates the anomaly score. We note the anomaly score for each datapoint, suggesting the likelihood that a datapoint is an anomaly. The function then gathers the top n points and declares them as anomalies.
We can sample our data to train the algorithm to reduce the computational overhead. Depending on how noisy your data is, you will need a bigger/smaller sample – that will be very much dependent on your dataset.