Understanding ML Data Leakage: A Self-Fulfilling Prophecy

DERBEL MohamedAziz 21 Feb, 2023 • 8 min read

Introduction

"Data Leakage

Testing your machine learning model on an unseen dataset is a mandatory step for evaluating its performance and gaining insight into its overall behavior after the training stage. At this point, two major outcomes are possible: either you get imperfect results that lead you to further investigation, or you get remarkable answers with extremely optimistic evaluation metrics suggesting that your model is ready to be deployed. Even in the latter case, the model can still fail in production, and data leakage is one of the main reasons this happens. Yes, it is possible, and it has good odds of occurring, especially if you are just starting your journey into data science.

In this article, I will walk you through the definition of data leakage. Then I will cover its common sources and the different forms it can take, and finish with some techniques to detect it along with a practical example. As a machine learning practitioner, I recommend you follow this article, since it deals with one of the most overlooked machine learning issues, one that greatly impacts the outcome.

Learning Objectives

  1. Understanding the basics of data leakage
  2. Going through the use cases of data leakage
  3. Solving a real-life problem to understand the concept.

This article was published as a part of the Data Science Blogathon.


What is Data Leakage?

"

During the development of a machine learning model, it is common to make unintentional technical mistakes. Fortunately, you are likely to detect them as soon as you encounter unexpected behavior, because the consequences of most of these mistakes show up directly in the model's performance during the development stage.

Yet data leakage is sneakier: it does not show its effect until you deploy the model to the public, where it faces unseen real-life scenarios. Worse, it gives the modeler the illusion of having reached the optimal state they were searching for, through extremely high evaluation metrics on both the training and test sets. But once in production, the model performs worse than it did in your test runs, and you can spend hours inspecting and tuning a fancy algorithm that leverages all of the state-of-the-art machine learning tools while still getting paradoxical outcomes between the development and production stages.

If I were to give an informal definition of data leakage, I would say it is the hidden force that undermines your excellent model in the production stage. But of course, we need a more formal definition. Out on the internet, you will surely encounter dozens of definitions that use complex terms such as feature legitimacy, which only makes things more confusing. Therefore, I will formally define data leakage as the introduction of extra information into the training dataset about the very thing we are trying to predict, information that will not be available or observable in a real-life scenario.

Sources of Data Leakage

The introduction of this illegitimate information is unintentional and is facilitated by the data collection, aggregation, and preparation process. It is usually subtle and indirect, which makes it very hard to detect and eliminate. Consequently, during training, the model will latch onto the correlation, or strong relationship, between the extra information and the target value to learn how to make predictions, and it will fail once it is released into the wild, since this extra information is no longer available.

Some statistical transformations, such as imputation and data scaling, might be applied during data aggregation and preparation. These transformations rely on the statistical distribution of the data. Applying them to the whole dataset therefore does not give the same result as splitting the data into training and test sets first and then applying them, because in the former case the test portion of the data influences the statistics computed for the training set. As an analogy, think of a data series containing 100 values of a certain characteristic: statistical attributes such as the mean and the standard deviation would not be the same if we broke the series into two equal groups of 50 values. A leakage-free pipeline therefore splits first and fits the transformations on the training portion only, as in the sketch below.
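To make the split-then-transform rule concrete, here is a minimal sketch using scikit-learn's StandardScaler on a synthetic dataset; the array names, shapes, and split ratio are illustrative assumptions, not taken from any particular project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data purely for illustration: 100 records, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Leakage-free order: split FIRST, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler().fit(X_train)      # mean/std computed from training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # test rows reuse the training statistics

# The leaky alternative would be StandardScaler().fit(X) on the full dataset,
# letting the test rows influence the statistics the model is trained with.
```

The same pattern applies to imputers and any other fitted preprocessing step: fit on the training portion, then only transform the test portion.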

For time-series forecasting tasks, another example is applying shuffled k-fold cross-validation to the dataset. The shuffling process can place past instances in the validation set and future instances in the training set, so the model is effectively trained on information from the future. I leave it to your imagination to picture the outcome the ML modeler would get at the final stage; the time-aware split sketched below avoids the problem.
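As a small sketch, here is scikit-learn's TimeSeriesSplit on a synthetic series; the series length and fold count are arbitrary choices, only meant to show that each validation fold comes strictly after the rows it is trained on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 24 consecutive time steps standing in for a real time series.
X = np.arange(24).reshape(-1, 1)

# A shuffled KFold(shuffle=True) would mix past and future rows across folds;
# TimeSeriesSplit always validates on steps that come after every training step.
for fold, (train_idx, val_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    print(f"fold {fold}: train up to t={train_idx.max()}, "
          f"validate on t={val_idx.min()}..{val_idx.max()}")
```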

Real-world Analogy for Data Leakage

"

My real-life analogy for data leakage involves two students preparing for an international certification exam through local exams. The first student, student A, managed to get some insight into the local exams' content in addition to the course material, so he devoted most of his time to preparing the specific notions covered in those exams.

On the other hand, the second student, student B, prepared for the local exams using only his course materials. When the local exam results were announced, student A had passed with perfect marks, making him a strong candidate to pass the final certification exam. In contrast, student B's marks were just fine; they allowed him to reach the final certification exam, but with seemingly lower odds than student A.

Now, for the final certification exam, and from a probability point of view over 100 attempts, I would say student B has a higher chance of passing than student A, even though the latter had higher results in the local exams. This is simply because student B gained the generalization ability needed to perform well on unseen exercises.

The same thing applies to machine learning models in the production environment: leakage-free models will perform better than models that show higher test scores but have somehow been affected by data leakage.

Data Leakage Formats

Data leakage can take different forms in the training dataset, mainly originating in the data collection and aggregation phase. For the sake of simplicity, in this section I will focus on the most prevalent type of data leakage: the NTMC violation.

NTMC stands for the no-time-machine condition. This requirement dictates a simple rule that must be respected when building predictive models: if X happened after Y, you should not build a model that uses X to predict Y. In other words, if a feature in the feature vector X was recorded after the target value Y, data leakage is present in the dataset. A quick example would be trying to predict whether a football match starts at 8:00 pm as a function of whether the referee blew the final whistle after 9:30 pm.

However, an NTMC breach is harder to detect in a real-life dataset, since it could contain hundreds of features that may or may not overlap with the target value. When recording timestamps are available, a rough check like the one sketched below can help.
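As an illustration, and assuming you are lucky enough to have a timestamp for when each feature and each label was recorded (the column names and dates below are hypothetical, echoing the football example), an NTMC check could look like this:

```python
import pandas as pd

# Hypothetical metadata: when each feature and the target were recorded.
records = pd.DataFrame({
    "match_id": [1, 2, 3],
    "kickoff_after_8pm_recorded_at": pd.to_datetime(
        ["2023-02-01 19:55", "2023-02-02 19:58", "2023-02-03 20:01"]),
    "final_whistle_after_930pm_recorded_at": pd.to_datetime(
        ["2023-02-01 21:40", "2023-02-02 21:35", "2023-02-03 21:38"]),
    "target_recorded_at": pd.to_datetime(
        ["2023-02-01 20:00", "2023-02-02 20:00", "2023-02-03 20:00"]),
})

# Any feature recorded after the target breaks the no-time-machine condition.
feature_time_cols = ["kickoff_after_8pm_recorded_at",
                     "final_whistle_after_930pm_recorded_at"]
for col in feature_time_cols:
    violations = int((records[col] > records["target_recorded_at"]).sum())
    print(f"{col}: {violations} rows recorded after the target")
```

Real datasets rarely make the timeline this explicit, which is exactly why business knowledge about how each feature is produced matters so much.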

How Can Data Leakage be Spotted?


At this point, your perspective on training sets may have changed if you had never heard about data leakage and its consequences before reading this article. You would now pay more attention to the data understanding process in an attempt to locate the different spots where data leakage might happen. Intuitively, data understanding is the key element of any investigation into possible data leakage. It involves various tasks, from understanding how the data was collected and aggregated to making sense of its features, which requires business expertise.

Even if you cannot figure it out from the very beginning or the early stages of the exploratory data analysis, there are strong indicators that may reveal the presence of data leakage in your training sets. First, the model's performance on the training and test sets is too good to be true, in a rarely-misses-the-target kind of way. Here you must get suspicious, because even the most powerful ML engines make mistakes sometimes, especially on a first deployment.

Second, as a modeler you do not want to reach this point of progress while data leakage is still hiding in the data, only to find yourself obliged to step back and re-investigate it. Feature/target correlation is an efficient check for any suspicious overlap between a feature and the target value: an extremely high correlation is where you should start your data leakage investigation, as in the quick screen below.
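Here is a sketch of that correlation screen, assuming your data sits in a pandas DataFrame with a column named target; the feature names, the planted leaky column, and the 0.95 threshold are illustrative assumptions, not fixed rules.

```python
import numpy as np
import pandas as pd

# Synthetic data with a deliberately leaked copy of the label.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
df["target"] = (df["feature_a"] > 0).astype(int)
df["leaky_feature"] = df["target"]   # mirrors the label exactly

# Absolute correlation of every feature with the target, highest first.
corr_with_target = (
    df.drop(columns="target")
      .corrwith(df["target"])
      .abs()
      .sort_values(ascending=False)
)

# Flag anything suspiciously close to a perfect correlation.
print(corr_with_target[corr_with_target > 0.95])   # only leaky_feature shows up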

Practical Case

In this section, I set up a concrete example of one of the data leakage forms mentioned above, the NTMC violation. Let's say we want to build a predictive model that determines whether a person has been affected by a disease X, based on a set of features such as viral and blood test results. Let's also suppose that there is only one way for an affected person to recover from disease X: getting an anti-X vaccine. After aggregating the collected data from different sources, we end up with a dataset of thousands of records of randomly diagnosed patients.

For simplification purposes, I sampled just 5 instances from the training set, organized as follows:

| Hemoglobin rate (g/dL) | Hematocrit (%) | Adenovirus | Monocytes (/microliter) | Anti-X vaccine | Got disease X |
|---|---|---|---|---|---|
| 13.80 | 41.50 | Yes | 250 | No | No |
| 14.02 | 45.77 | No | 300 | Yes | Yes |
| 16.85 | 50.39 | No | 562 | Yes | Yes |
| 16.01 | 42.22 | Yes | 412 | Yes | Yes |
| 13.44 | 44.12 | Yes | 498 | No | No |

At first glance, we can detect an NTMC violation, since the 'Anti-X vaccine' column is updated after the target value 'Got disease X' is recorded. Owing to the fact that the 'Anti-X vaccine' column belongs to the feature set, it constitutes information leaked from the target value.

Therefore, the model will catch the high correlation between those two variables during the training phase and build its generalization on an association that will not be visible in real-life scenarios. Consequently, the evaluation metrics during the training and testing phases will be utterly optimistic, yet in the production phase the model will fail its mission. The small sketch below reproduces the table and shows how the leaked column mirrors the target.
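The following sketch rebuilds the toy table in pandas and shows why the 'Anti-X vaccine' column is so dangerous; the snake_case column names are just convenient identifiers chosen for this illustration.

```python
import pandas as pd

# The 5 sampled records from the table above.
df = pd.DataFrame({
    "hemoglobin_g_dl": [13.80, 14.02, 16.85, 16.01, 13.44],
    "hematocrit_pct": [41.50, 45.77, 50.39, 42.22, 44.12],
    "adenovirus": ["Yes", "No", "No", "Yes", "Yes"],
    "monocytes_per_ul": [250, 300, 562, 412, 498],
    "anti_x_vaccine": ["No", "Yes", "Yes", "Yes", "No"],
    "got_disease_x": ["No", "Yes", "Yes", "Yes", "No"],
})

# The leaked feature agrees with the target on every row, so a model would
# simply memorize it and ignore the legitimate clinical features.
print((df["anti_x_vaccine"] == df["got_disease_x"]).mean())   # 1.0

# Dropping the target-derived column before training is the straightforward fix.
X = df.drop(columns=["anti_x_vaccine", "got_disease_x"])
y = df["got_disease_x"]
```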

Conclusion

To conclude, data leakage is an undesirable state in which a machine learning model is tricked into learning associations that will not be available in real-life scenarios. Good data understanding is an effective way to detect and avoid the problem before promoting the model to production. In addition, techniques such as removing target-derived features, splitting the data based on time, and using only data from a specific period for training and testing go a long way toward building leakage-free models.

Key Takeaways

  1. Data leakage is sneaky: it does not show its effect until you deploy the model to the public, where it faces unseen real-life scenarios.
  2. The most prevalent type of data leakage is the NTMC violation.
  3. An NTMC breach is harder to detect in real-life datasets, since they can contain hundreds of features that may or may not overlap with the target value.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
