Data leakage is when information accidentally crosses between the training and test sets, so our predictions look more accurate than they will ever be in the real world. If you’ve run your model against your test dataset and got 99% accuracy, you probably have data leakage. Maybe you don’t, but you probably do.
This article is a checklist of things we can do to reduce the risk of data leakage.
1. Normalizing our data at the right time
Let’s consider our normalization of data. Normalizing (standardizing) a field rescales it to have a mean of 0 and a standard deviation of 1. If you compute those statistics over the whole dataset before splitting, the test set’s mean and spread are baked into the values the model trains on, so you’re leaking information about the test data into the model.
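Here’s a minimal sketch of the split-before-scale ordering with scikit-learn; `make_classification` just stands in for your own features and labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data; in practice X and y are your own features and labels.
X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)

# Split first, so the test rows contribute nothing to the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training portion only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# ...then apply those same training-set statistics to the test portion.
X_test_scaled = scaler.transform(X_test)
```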
Similarly, if you’re going to use cross validation to tune your hyperparameters, the same rule applies: cross validation further divides your dataset into folds, so the normalization has to be fitted inside each fold, on that fold’s training portion only. Normalize the whole dataset up front and, just as above, you’re telling the model the answer.
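The easy way to get this right in scikit-learn is to put the scaler inside a pipeline, so the cross-validation machinery refits it per fold. A sketch, with `LogisticRegression` and the `C` grid as arbitrary stand-ins for your own model and hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)

# Because the scaler lives inside the pipeline, every CV fold refits it
# on that fold's training portion only; the held-out fold never leaks in.
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())])

search = GridSearchCV(pipe, param_grid={"model__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```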
2. Removing Duplication
This is obvious, right? If you have duplicated data between your test and training datasets, you are handing the model the correct answer. Duplication comes in two forms: exact duplication of the entire record, or duplication in certain fields.
Take email spam, for example: the body text would be identical across thousands of records, but each one probably starts with Dear XXX, where the name changes, and each carries a different timestamp. In cases like this, we can use a fuzzy match to determine whether a record is really a duplicate.
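One way to catch these near-duplicates, using only the standard library; the 0.9 threshold is an arbitrary choice you would tune for your own data:

```python
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two records as duplicates when their text similarity clears the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

# The bodies differ only in the greeting, so an exact-match check would
# miss this pair, but the fuzzy ratio flags it.
print(is_near_duplicate("Dear Alice, claim your exclusive prize today by clicking the link below!",
                        "Dear Bob, claim your exclusive prize today by clicking the link below!"))  # True
```

Bear in mind that comparing every pair of records is quadratic; on a large dataset you would bucket records first (for example, by hashing a normalized version of the body) and only fuzzy-match within buckets.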
3. Splitting data at the right time
You should split your dataset before any of your preprocessing steps, and run those steps independently on the two portions of the split, to remove as much of the leakage risk as possible.
The best thing to do is to leave your test split untouched until the very end, so you validate your model with a dataset that has the smallest possible risk of having been leaked.
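A sketch of that ordering end to end; the deduplication and scaling here just stand in for whatever preprocessing your project actually needs:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data; any real feature set works the same way.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1_000, 4)), columns=list("abcd"))
df["label"] = (df["a"] + rng.normal(size=1_000) > 0).astype(int)
features = list("abcd")

# 1. Split the raw data before any preprocessing.
train, test = train_test_split(df, test_size=0.2, random_state=42)

# 2. Preprocess each portion independently; anything that is *fitted*
#    (here, the scaler) is fitted on the training portion only.
train = train.drop_duplicates()
scaler = StandardScaler().fit(train[features])
model = LogisticRegression().fit(scaler.transform(train[features]), train["label"])

# 3. Touch the test split exactly once, at the very end.
print("held-out accuracy:", model.score(scaler.transform(test[features]), test["label"]))
```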
4. Choose features wisely
There is a risk that a variable in the dataset is tied directly to the outcome and will not actually be available when we come to make real predictions.
For example, suppose we have lots of data about someone, we are trying to predict whether they are diabetic, and the dataset has a ‘prescribed_insulin’ feature. If they are diabetic, they will have been prescribed insulin; if they aren’t, they won’t. This feature is effectively tied to the output and will give us near-perfect results.
In the real world, when we are trying to predict whether someone is diabetic, we will not know what they have been prescribed, because the prescription only exists once the diagnosis has been made. This is an example of using data from the future to predict the past. Hence, we should remove this feature.
Another example would be predicting annual salary given education level, background, age and so on. If the dataset also has a monthly salary field, you are essentially handing the model the trivial relationship between monthly and annual salary, so it will look extremely accurate while telling you nothing.
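A synthetic sketch of the effect, using the diabetes example: the `prescribed_insulin` column below is deliberately built as a copy of the label, the way a post-diagnosis field behaves, and the score collapses back to something honest once it is dropped:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2_000
blood_sugar = rng.normal(size=n)                        # a legitimate, noisy predictor
y = (blood_sugar + rng.normal(scale=1.5, size=n) > 0).astype(int)
prescribed_insulin = y.copy()                           # recorded after diagnosis: pure leakage

X = np.column_stack([blood_sugar, prescribed_insulin])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

with_leak = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
no_leak = LogisticRegression().fit(X_tr[:, :1], y_tr).score(X_te[:, :1], y_te)
print(f"with leaky feature:    {with_leak:.2f}")        # ~1.00
print(f"without leaky feature: {no_leak:.2f}")          # much lower, and honest
```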
5. Conduct some serious data exploration
It’s up to us as data scientists to hunt for suspiciously strong links between individual input features and the target. Understanding the data at that depth gives us a good idea of where leakage may be occurring: a feature that predicts the target almost perfectly usually deserves an explanation, not a celebration.
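One cheap check during exploration, assuming a pandas DataFrame with a numeric target column (the column names and the 0.95 threshold below are placeholders): rank features by absolute correlation with the target and eyeball anything close to 1.

```python
import numpy as np
import pandas as pd

def leakage_suspects(df: pd.DataFrame, target: str, threshold: float = 0.95) -> pd.Series:
    """Return numeric features whose |correlation| with the target looks too good to be true."""
    corr = df.corr(numeric_only=True)[target].drop(target).abs()
    return corr[corr > threshold].sort_values(ascending=False)

# Stand-in salary data: monthly_salary is annual_salary / 12, so it is
# flagged immediately, while the legitimate features fall below the bar.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(22, 65, size=500),
                   "years_education": rng.integers(10, 22, size=500)})
df["annual_salary"] = 20_000 + 2_500 * df["years_education"] + rng.normal(0, 5_000, size=500)
df["monthly_salary"] = df["annual_salary"] / 12
print(leakage_suspects(df, target="annual_salary"))
```

This only catches linear relationships in numeric columns, of course; it is a starting point for the exploration, not a substitute for it.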