Despite tabular data being the bread and butter of industry data science, data shifts are often overlooked when analyzing model performance.
We’ve all been there: You develop a machine learning model, achieve great results on your validation set, and then deploy it (or test it) on a new, real-world dataset. Suddenly, performance drops.
So, what is the problem?
Usually, we point the finger at Covariate Shift: the distribution of features in the new data is different from the training data. We use this as a “Get Out of Jail Free” card: “The data changed, so naturally, the performance is lower. It’s the data’s fault, not the model’s.”
But what if we stopped using covariate shift as an excuse and started using it as a tool?
I believe there is a better way to handle this and to create a “gold standard” for analyzing model performance. That method will allow us to estimate performance accurately, even when the ground shifts beneath our feet.
The Problem: Comparing Apples to Oranges
Let’s look at a simple example from the medical world.
Imagine we trained a model on patients aged 40-89. However, in our new target test data, the age range is stricter: 50-80.
If we simply run the model on the test data and compare it to our original validation scores, we are misleading ourselves. To compare “apples to apples,” a good data scientist would go back to the validation set, filter for patients aged 50-80, and recalculate the baseline performance.
But let’s make it harder
Suppose our test dataset contains millions of records aged 50-80, and one single patient aged 40.
- Do we compare our results to the validation 40-80 range?
- Do we compare to the 50-80 range?
If we ignore the specific age distribution (which most standard analyses do), that single 40-year-old patient theoretically shifts the definition of the cohort. In practice, we might just delete that outlier. But what if there were 100 or 1,000 patients aged below 50? Can we do better? Can we automate this process to handle differences in multiple variables simultaneously, without manually filtering data? Furthermore, filtering is not a good solution in the first place: it only matches the correct range but ignores the distribution shift within that range.
The Solution: Inverse Probability Weighting
The solution is to mathematically re-weight our validation data to look like the test data. Instead of binary inclusion/exclusion (keeping or dropping a row), we assign a continuous weight to each record in our validation set. You can think of it as an extension of the simple filtering above, which only matched the age range.
- Weight = 1: Standard analysis.
- Weight = 0: Exclude the record (filtering).
- Weight is a non-negative float: Down-weight or up-weight the record’s influence.
The Intuition
In our example (Test: Age 50-80 + one 40yo), the solution is to mimic the test cohort within our validation set. We want our validation set to “pretend” it has the exact same age distribution as the test set.
Note: While it is possible to transform these weights into binary inclusion/exclusion via random sub-sampling, this generally offers no statistical advantage over using the weights directly. Sub-sampling is primarily useful for intuition or if your specific performance analysis tools cannot handle weighted data.
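For intuition only, here is a minimal sketch of that sub-sampling idea on a toy validation set (the data and weights below are illustrative stand-ins, not part of the author’s pipeline):
import numpy as np
import pandas as pd

# Toy validation set and illustrative IPW weights (in practice these come from Pt/Pv).
rng = np.random.default_rng(0)
val_df = pd.DataFrame({"Age": rng.integers(40, 90, 1000)})
w = np.where(val_df["Age"].between(50, 80), 1.0, 0.0)  # e.g., test cohort is strictly 50-80

# Weighted resampling turns continuous weights back into a binary-inclusion cohort.
idx = rng.choice(len(val_df), size=len(val_df), replace=True, p=w / w.sum())
resampled_val = val_df.iloc[idx].reset_index(drop=True)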
The Math
Let’s formalize this. We need to define two probabilities:
- Pt(x): The probability of seeing feature value x (e.g., Age) in the Target Test data.
- Pv(x): The probability of seeing feature value x in the Validation data.
The weight w for any given record with feature x is the ratio of these probabilities:
w(x) := Pt(x) / Pv(x)
This is intuitive. If 60-year-olds are rare in the validation data (Pv is low) but common in the target test data (Pt is high), the ratio is large, so we weight these records up in our evaluation to match reality. On the other hand, in our example where the test set is strictly aged 50-80, any validation patients outside this range receive a weight of 0 (since Pt(Age) = 0). This is effectively the same as excluding them, exactly as needed.
This is a statistical technique often called Importance Sampling or Inverse Probability Weighting (IPW).
By applying these weights when calculating metrics (like Accuracy, AUC, or RMSE) on your validation set, you create a synthetic cohort that matches the test domain. You can now compare apples to apples without complaining about the shift.
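As a minimal sketch of this step (my illustration, not the author’s code): most scikit-learn metrics accept a sample_weight argument, so the weighted evaluation can look like this, with toy stand-ins for the validation labels, model scores, and IPW weights:
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

# Toy stand-ins: validation labels, model scores, and IPW weights (all illustrative).
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 1000)
y_scores = np.clip(0.3 * y_val + 0.7 * rng.random(1000), 0, 1)  # noisy scores correlated with the labels
w = rng.uniform(0.2, 3.0, 1000)                                 # weights from Pt(x) / Pv(x)

# Weighted metrics: each validation record counts in proportion to its weight.
weighted_auc = roc_auc_score(y_val, y_scores, sample_weight=w)
weighted_acc = accuracy_score(y_val, (y_scores > 0.5).astype(int), sample_weight=w)
print(f"Weighted AUC: {weighted_auc:.3f}, Weighted Accuracy: {weighted_acc:.3f}")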
The Extension: Handling High-Dimensional Shifts
Doing this for one variable (Age) is easy: you can just use histograms/bins. But what if the data shifts across dozens of variables simultaneously? We cannot build a dozen-dimensional histogram. The solution is a clever trick using a binary classifier.
We train a new model (a “Propensity Model,” let’s call it Mp) to distinguish between the two datasets.
- Input: The features of the record (Age, BMI, Blood Pressure, etc.), or whichever variables we want to control for.
- Target: 0 if the record is from Validation, 1 if the record is from the Test set.
If this model can easily tell the data apart (AUC > 0.5), it means there is a covariate shift. The AUC of Mp also serves as a diagnostic: it indicates how different your test data is from the validation set, and how important it was to account for the shift. Crucially, the probabilistic output of this model gives us exactly what we need to calculate the weights.
Using Bayes’ theorem, the weight for a sample x becomes the odds that the sample belongs to the test set:
w(x) := Mp(x) / (1 - Mp(x))
These odds equal Pt(x) / Pv(x) up to a constant factor that reflects the relative sizes of the two datasets; in practice that constant cancels in weighted metrics, which normalize by the total weight.
- If Mp(x) ~ 0.5, the data points are indistinguishable, and the weight is 1.
- If Mp(x) -> 1, the model is very sure this looks like Test data, and the weight increases.
Image by author (created with Mermaid).
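Here is a minimal, self-contained sketch of this propensity-weighting step (my own illustration with synthetic data and an arbitrary classifier choice, not the author’s production code):
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the validation and target test feature tables.
rng = np.random.default_rng(0)
X_val = pd.DataFrame({"Age": rng.integers(40, 90, 5000), "BMI": rng.normal(27, 4, 5000)})
X_test = pd.DataFrame({"Age": rng.normal(65, 8, 5000).round(), "BMI": rng.normal(29, 4, 5000)})

# Label the origin of each record: 0 = Validation, 1 = Test.
X = pd.concat([X_val, X_test], ignore_index=True)
y = np.concatenate([np.zeros(len(X_val)), np.ones(len(X_test))])

# Diagnostic: cross-validated AUC of Mp measures how separable the two datasets are.
auc = cross_val_score(GradientBoostingClassifier(), X, y, cv=5, scoring="roc_auc").mean()
print(f"Propensity AUC: {auc:.3f}")  # ~0.5 would mean no detectable shift

# Fit Mp, score the validation records, and convert probabilities to odds = weights.
mp = GradientBoostingClassifier().fit(X, y)
p = np.clip(mp.predict_proba(X_val)[:, 1], 1e-6, 1 - 1e-6)  # guard against division by zero
w = p / (1 - p)
w *= len(X_val) / len(X_test)  # optional constant correction for unequal dataset sizes
The resulting array w can then be passed as sample_weight when computing the validation metrics, as in the earlier snippet.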
Note: Applying these weights does not necessarily lead to a drop in the expected performance. In some cases, the test distribution might shift toward subgroups where your model is actually more accurate. In that scenario, the method will up-weight those instances and your estimated performance will reflect that.
Does it work?
Yes, like magic. If you take your validation set, apply these weights, and then plot the distributions of your variables, they will perfectly overlay the distributions of your target test set.
It is even more powerful than that: it aligns the joint distribution of all variables, not just their individual distributions. Your weighted validation data becomes practically indistinguishable from the target test data when the propensity model is optimal.
This is a generalization of the single-variable case we saw earlier and yields exactly the same result for a single variable. Intuitively, Mp learns the differences between our test and validation datasets, and we then use this learned ‘understanding’ to mathematically counter the difference.
For example, the code snippet below generates two age distributions, a uniform one (validation set) and a normal one (target test set), and computes our weights.
Image by author (created by the code snippet).
Code Snippet
import pandas as pd
import numpy as np
import plotly.graph_objects as go

# Validation set: uniform ages 40-89. Target test set: normal(65, 10), rounded and clipped to 40-89.
df = pd.DataFrame({"Age": np.random.randint(40, 90, 10000)})
df2 = pd.DataFrame({"Age": np.random.normal(65, 10, 10000)})
df2["Age"] = df2["Age"].round().astype(int)
df2 = df2[df2["Age"].between(40, 89)].reset_index(drop=True)
df3 = df.copy()

def get_fig(df: pd.DataFrame, title: str):
    # Build a weighted age histogram (percentage of total weight per age) as a Plotly bar trace.
    if "weight" not in df.columns:
        df["weight"] = 1
    age_count = df.groupby("Age")["weight"].sum().reset_index().sort_values("Age")
    tot = df["weight"].sum()
    age_count["Percentage"] = 100 * age_count["weight"] / tot
    f = go.Bar(x=age_count["Age"], y=age_count["Percentage"], name=title)
    return f, age_count

f1, age_count1 = get_fig(df, "ValidationSet")
f2, age_count2 = get_fig(df2, "TargetTestSet")

# Per-age weight = Pt(Age) / Pv(Age), estimated from the two histograms.
age_stats = age_count1[["Age", "Percentage"]].merge(
    age_count2[["Age", "Percentage"]].rename(columns={"Percentage": "Percentage2"}), on=["Age"])
age_stats["weight"] = age_stats["Percentage2"] / age_stats["Percentage"]
df3 = df3.merge(age_stats[["Age", "weight"]], on=["Age"])
f3, _ = get_fig(df3, "ValidationSet-Weighted")

fig = go.Figure(layout={"title": "Age Distribution"})
fig.add_trace(f1)
fig.add_trace(f2)
fig.add_trace(f3)
fig.update_xaxes(title_text="Age")  # x-axis title
fig.update_yaxes(title_text="Percentage")  # y-axis title
fig.show()
Limitations
While this is a powerful technique, it doesn’t always work. There are three main statistical limitations:
- Hidden Confounders: If the shift is caused by a variable you didn’t measure (e.g., a genetic marker you don’t have in your tabular data), you cannot weight for it. However, as model developers, we usually try to include the most predictive features in our model when possible.
- Lack of Overlap (Positivity): You cannot divide by zero. If Pv(x) is zero (e.g., your validation data has no patients over 90, but the test set does), the weight explodes to infinity.
- The Fix: Identify these non-overlapping groups. If your validation set literally contains zero information about a specific sub-population, you must explicitly exclude that sub-population from the comparison and flag it as “unknown territory”.
- Propensity Model Quality: Since we rely on a model (Mp) to estimate weights, any inaccuracies or poor calibration in this model will introduce noise. For low-dimensional shifts (like a single ‘Age’ variable), this is negligible, but for high-dimensional complex shifts, ensuring Mp is well-calibrated is critical.
Even though the propensity model is not perfect in practice, applying these weights significantly reduces the distribution shift. This provides a much more accurate proxy for real-world performance than doing nothing at all.
A Note on Statistical Power
Be aware that using weights changes your Effective Sample Size. High variance weights reduce the stability of your estimates.
- Bootstrapping: If you use bootstrapping, you are safe as long as you incorporate the weights into the resampling process itself.
- Power Calculations: Do not use the raw number of rows (N). Refer instead to the Effective Sample Size formula (Kish’s ESS) to understand the true power of your weighted analysis; a small sketch follows below.
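As a rough, illustrative sketch (with toy data standing in for real validation labels, scores, and weights), Kish’s ESS and a weighted bootstrap could look like this:
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy validation labels, model scores, and IPW weights (illustrative stand-ins).
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 2000)
y_scores = np.clip(0.5 * y_val + 0.5 * rng.random(2000), 0, 1)
w = rng.lognormal(mean=0.0, sigma=0.7, size=2000)

# Kish's effective sample size: (sum w)^2 / sum(w^2). High-variance weights shrink it.
ess = w.sum() ** 2 / (w ** 2).sum()
print(f"Rows: {len(w)}, Effective sample size: {ess:.0f}")

# Weighted bootstrap: resample rows with probability proportional to the weights,
# then compute the (now unweighted) metric on each replicate.
boot_aucs = []
for _ in range(200):
    idx = rng.choice(len(w), size=len(w), replace=True, p=w / w.sum())
    boot_aucs.append(roc_auc_score(y_val[idx], y_scores[idx]))
low, high = np.percentile(boot_aucs, [2.5, 97.5])
print(f"Bootstrapped AUC 95% CI: [{low:.3f}, {high:.3f}]")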
What about images and texts?
The propensity model method works in those domains as well. However, from a practical perspective, the main issue is often a complete lack of overlap: the validation set and the target test set are entirely separable, which makes it impossible to counter the shift. That doesn’t mean our model will perform poorly on those datasets; it simply means we cannot estimate its performance based on our current validation set, which is completely different.
Summary
The best practice for evaluating model performance on tabular data is to strictly account for covariate shift. Instead of using the shift as an excuse for poor performance, use Inverse Probability Weighting to estimate how your model should perform in the new environment.
This allows you to answer one of the hardest questions in deployment: “Is the performance drop due to the data changing, or is the model actually broken?”
If you utilize this method, you can explain the gap between training and production metrics.
If you found this useful, let’s connect on LinkedIn