Outliers can be a real pain & can skew our datasets. In the below, I’ve added an obvious outlier on row 3 (company name = Talane). If we were to take an average of InvoiceValue, this outlier would skew the result massively.
Extract values within x standard deviations from the mean.
Statistically, 99.7% of normally distributed data falls within 3 standard deviations from the mean, so it is not unreasonable to take values outside of 3 standard deviations as outliers. In the below, we are:
- u = the mean of all invoice values. We then calculate the standard deviation of the field and multiply it by 3. When we add that to the mean we calculated, we now have 3 standard deviations above the mean calculated.
- l = the same as the above, but we’re calculating 3 standard deviations below (rather than above) the mean.
- We then create a new dataframe called data which is only the records from df that are above l and below u. That gives us only records within 3 standard deviations of the mean.
Skim the top and bottom 5% off the dataset
Another method we could use is to drop the top and bottom 5% of data, which will reduce skew but does also risk removing legitimate datapoints.
In the below, we create u, which is the 95th percentile and l which is the 5th percentile. We then create data, which is everything between those two values.
Set an upper or lower limit on your data
The final method we could use is to set a limit. Where we might set our u and l as we did above, to the 95th and 5th percentile. We then say ‘if the value falls below l, then set the value to be equal to l’ and ‘if the value exceeds u, then set the value to be equal to u’
Whichever method you use, you always run the risk of eliminating valid data points but sometimes this is a necessary evil.