Have you ever wondered why a machine learning model that gives excellent results during testing performs poorly when applied in real life?
This is a common problem in the field of machine learning and data science, known as data leakage.
In July 2021, MIT Technology Review published an article titled "Hundreds of AI Tools Have Been Built to Catch Covid. None of Them Helped." The article highlights several examples where machine learning models that performed well during evaluation failed to be useful in actual production settings.
In one instance, scientists trained their model on a mix of scans taken when patients were lying down and standing up. Since patients scanned while lying down were more likely to be seriously ill, the model learned to predict serious COVID-19 risk from a person's position.
In other cases, models were picking up on the font used by certain hospitals to label the scans, resulting in fonts from hospitals with more severe caseloads becoming predictors of COVID-19 risk.
Both of these are examples of data leakage.
Data leakage happens when a form of the label "leaks" into the set of features used for making predictions, even though that information is not available during inference. Data leakage is a challenging problem because it is often non-obvious and can cause models to fail spectacularly, even after extensive evaluation and testing.
Understanding Data Leakage in Machine Learning
Data leakage can seriously compromise the effectiveness of machine learning models. Here, we shed light on various causes for this pervasive issue and provide guidelines on how to sidestep them.
1. Time-Correlated Data and Leakage
Machine learning research often promotes randomly splitting data into training, validation, and test sets. However, this approach can be flawed when data has time dependencies.
Stock Prices: Prices of similar stocks move in tandem. Training a model on data that includes future prices can inadvertently reveal market conditions on that future day.
Music Recommendations: Listening patterns aren't solely based on musical preference but may also be influenced by daily events, like an artist's death. A model trained with data from a day with a significant event can gain undue advantage from that day's listening patterns.
Solution: Prioritize time-based data splitting. For instance, with five weeks of data, use the first four for training and allocate the fifth week for validation and testing.
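The split above can be sketched as follows. This is a minimal illustration with made-up hourly data; the variable names and the five-week horizon are assumptions, not part of any real pipeline.

```python
import numpy as np

# Hypothetical setup: five weeks of hourly observations, with a timestamp
# column measured in hours since the start of week 1.
rng = np.random.default_rng(0)
n = 35 * 24
timestamps = np.arange(n)
features = rng.normal(size=(n, 3))
labels = rng.integers(0, 2, size=n)

# Time-based split: the first four weeks train the model, and the fifth
# week is held out, so no future rows ever leak into training.
cutoff = 28 * 24  # end of week 4, in hours
train_mask = timestamps < cutoff
X_train, y_train = features[train_mask], labels[train_mask]
X_holdout, y_holdout = features[~train_mask], labels[~train_mask]
```

Note that every timestamp in the training set precedes every timestamp in the holdout set, which is exactly the property a random split would destroy.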
2. Proper Scaling to Avoid Leakage
While scaling is crucial, executing it prematurely can lead to leakage.
Pitfall: Generating global statistics (mean, variance) for scaling using the entire dataset prior to splitting.
Solution: First, split the data, and then use the statistics from the training set to scale all data subsets. It's even advisable to split before any exploratory data analysis to prevent accidental insights about the test data.
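A minimal sketch of the correct ordering, using a made-up feature matrix. The point is only the order of operations: split first, fit statistics on the training split, apply everywhere.

```python
import numpy as np

# Illustrative data; in practice X would be your real feature matrix.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))

# Split FIRST...
X_train, X_test = X[:80], X[80:]

# ...then compute scaling statistics from the training split alone.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the training-set statistics to every subset, including test data.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

scikit-learn's StandardScaler follows the same discipline when used correctly: call fit (or fit_transform) on the training split only, then transform the validation and test splits.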
3. Handling Missing Data
Filling missing data values is common, but using inappropriate data sources for this task can introduce leakage.
Pitfall: Filling gaps using the mean or median calculated from the entire dataset.
Solution: Only use statistics from the training data to address missing values in all subsets.
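The same principle applies to imputation, sketched here on a toy numeric column. The values are invented; only the pattern matters: the fill statistic is derived from training data and reused on every subset.

```python
import numpy as np

# Toy example: one numeric column with missing values in both splits.
X_train = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
X_test = np.array([3.0, np.nan, 6.0])

# The fill value comes from the training split only...
fill_value = np.nanmean(X_train)  # mean of 1, 2, 4, 5 -> 3.0

# ...and is applied to every subset, so no test-set information leaks in.
X_train_filled = np.where(np.isnan(X_train), fill_value, X_train)
X_test_filled = np.where(np.isnan(X_test), fill_value, X_test)
```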
4. The Threat of Data Duplication
Duplication can introduce significant biases if not managed correctly.
Popular datasets like CIFAR-10 and CIFAR-100 had test data that duplicated training data.
In a 2021 COVID-19 detection study, public datasets that duplicated one another were unintentionally merged, so identical scans could appear in both training and test data.
Solution: Vigilantly check for duplicates both before and after splitting. If you oversample, do it after the split.
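One cheap check for exact duplicates across splits is to hash the raw bytes of each training row, sketched below on tiny invented arrays.

```python
import numpy as np

train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
test = np.array([[3.0, 4.0], [7.0, 8.0]])  # row 0 duplicates a training row

# Hash each training row's raw bytes so exact duplicates are cheap to find.
train_rows = {row.tobytes() for row in train}
overlap = [i for i, row in enumerate(test) if row.tobytes() in train_rows]

print(overlap)  # [0]: the first test row also appears in training
```

Note that byte-exact matching only catches identical rows; duplicated images that were resized or re-encoded need fuzzier techniques such as perceptual hashing.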
5. Addressing Group Leakage
Group leakage occurs when strongly correlated examples end up in different splits.
Example: Multiple CT scans from the same patient look alike. If they are placed in different subsets, the model can inadvertently learn patient-specific details rather than signs of disease.
Solution: Split by group (for example, by patient) so that all related examples stay in the same subset. Awareness of how the data was generated and divided helps detect such issues.
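A patient-level split can be sketched with the standard library alone. The patient and scan IDs below are invented; the key idea is that the shuffle and split happen over patients, never over individual scans.

```python
import random

# Hypothetical records: (patient_id, scan_id), three scans per patient.
records = [(pid, scan) for pid in range(10) for scan in range(3)]

# Split by PATIENT, not by scan, so every scan from one patient stays
# in the same subset.
patients = sorted({pid for pid, _ in records})
random.Random(0).shuffle(patients)
train_patients = set(patients[:8])

train = [r for r in records if r[0] in train_patients]
test = [r for r in records if r[0] not in train_patients]
```

scikit-learn offers the same behavior out of the box via GroupKFold and GroupShuffleSplit, which take a groups argument.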
6. Potential Leakage from Data Collection
At times, the data generation process itself can introduce leakage.
Example: Different CT scan machines might have different output qualities, making it easier for a model to distinguish between them, leading to biases.
Solution: Understand data sources and collection processes. Normalizing data across sources ensures uniformity. Collaborate with subject matter experts to glean insights into data collection, providing a holistic understanding of potential pitfalls.
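As a rough sketch of per-source normalization, the snippet below standardizes invented intensity readings from two hypothetical scanners so neither is identifiable from its raw value range. In a real pipeline, the per-source statistics should themselves come from the training split.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical intensities from two scanners with different output ranges.
machine_a = rng.normal(loc=100.0, scale=10.0, size=50)
machine_b = rng.normal(loc=300.0, scale=40.0, size=50)

def normalize(x, mean, std):
    """Standardize one source to zero mean and unit variance so a model
    cannot identify the machine from raw intensity alone."""
    return (x - mean) / std

a_norm = normalize(machine_a, machine_a.mean(), machine_a.std())
b_norm = normalize(machine_b, machine_b.mean(), machine_b.std())
```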
Strategies to Detect Data Leakage in ML Projects
Data leakage can make a machine learning model's measured performance misleading. Detecting it is crucial to ensure genuine and effective results. Here's a guide on how to stay vigilant throughout your project's lifecycle.
1. Assessing Feature Predictive Power
Why: Some features might have an unusually strong correlation with the target variable, hinting at potential data leakage.
Analyze the correlation between each feature and the target variable.
If a feature or combination of features exhibits an unusually high predictive power, delve deeper into its origins.
Take note of combined features. Individually, they might not reveal much, but together, they can be revealing. For instance, an employee's start and end dates might, in combination, indicate their tenure.
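A simple screen for suspicious features is to compute each feature's correlation with the target and flag anything implausibly strong. The features and the 0.95 threshold below are illustrative assumptions, not fixed rules.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
target = rng.integers(0, 2, size=n).astype(float)

# Two made-up features: one unrelated, one that is almost a copy of the
# label -- the kind of thing leakage produces.
features = {
    "age": rng.normal(40.0, 10.0, size=n),
    "suspicious": target + rng.normal(0.0, 0.01, size=n),
}

flagged = []
for name, values in features.items():
    corr = np.corrcoef(values, target)[0, 1]
    if abs(corr) > 0.95:  # an arbitrary "too good to be true" threshold
        flagged.append(name)

print(flagged)  # only "suspicious" is flagged
```

A flagged feature is not automatically leaky, but it deserves an investigation into where its values come from and whether they would exist at inference time.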
2. Conducting Ablation Studies
Why: Determining the significance of each feature helps in identifying suspiciously influential ones, which might be potential sources of leakage.
Measure the model's performance with and without specific features.
If excluding a feature leads to a dramatic drop in performance, investigate its importance further.
While it might be impractical to study every feature in vast datasets, focusing on suspicious or new ones can be illuminating.
Utilize downtime effectively by running these studies offline.
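An ablation can be sketched end to end on synthetic data. The nearest-centroid classifier here is a tiny stand-in for whatever model you actually use, and the three features (one deliberately leaky) are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
y = rng.integers(0, 2, size=n)

# Three made-up features: a leaky near-copy of the label, a weak genuine
# signal, and pure noise.
X = np.column_stack([
    y + rng.normal(0.0, 0.05, size=n),  # suspiciously leaky
    y + rng.normal(0.0, 2.0, size=n),   # weak real signal
    rng.normal(size=n),                 # noise
])
train, test = slice(0, 300), slice(300, n)

def accuracy(cols):
    """Nearest-centroid classifier as a tiny stand-in for a real model."""
    Z = X[:, cols]
    mu, sd = Z[train].mean(axis=0), Z[train].std(axis=0)
    Z = (Z - mu) / sd  # standardize with training statistics
    c0 = Z[train][y[train] == 0].mean(axis=0)
    c1 = Z[train][y[train] == 1].mean(axis=0)
    pred = (((Z[test] - c1) ** 2).sum(axis=1)
            < ((Z[test] - c0) ** 2).sum(axis=1)).astype(int)
    return (pred == y[test]).mean()

full = accuracy([0, 1, 2])
ablated = accuracy([1, 2])  # same model without the suspect feature
print(f"all features: {full:.2f}, without suspect: {ablated:.2f}")
```

The dramatic drop when the suspect feature is removed is exactly the signal an ablation study looks for: performance that depends overwhelmingly on a single feature warrants an investigation into that feature's origin.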
3. Monitoring New Feature Additions
Why: A sudden and significant performance boost after adding a new feature might be a red flag.
Observe the impact on model performance after introducing a new feature.
A substantial improvement can hint at two scenarios: the feature is genuinely beneficial, or it has inadvertently introduced leakage.
4. Handling the Test Split with Care
Why: Inadvertently using the test set for anything other than the final performance evaluation can introduce leakage.
Ensure the test set remains untouched throughout the model's development phase.
Avoid using the test split for brainstorming features or tuning hyperparameters. Its sole purpose should be the model's final assessment.
Data leakage can be a subtle foe. But with awareness, understanding, and a structured approach, it is possible to sidestep this pitfall and ensure that your machine learning models are robust, genuine, and effective in real-world applications.
Reference: Huyen, C. (2022). Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications. O'Reilly Media.