You collected the data, cleaned it, made a few transformations, built a model, and then deployed it to be used by the client.
That’s a lot of work for a data scientist. But the job is not completed once the model hits the real world.
Everything looks perfect on your dashboard. But under the hood, something’s wrong. Most models don’t fail loudly. They don’t “crash” like a buggy app. Instead, they just… drift.
Remember, you still need to monitor the model to ensure its results stay accurate.
One of the simplest ways to do that is by checking if the data is drifting.
In other words, you will measure if the distribution of the new data hitting your model is similar to the distribution of the data used to train it.
Why Models Don’t Scream
When you deploy a model, you’re betting that the future looks like the past. You expect the new data to show patterns similar to the data used to train it.
Let’s think about that for a minute: if I trained my model to recognize apples and oranges, what would happen if suddenly all it receives are pineapples?
Real-world data is messy. User behavior changes. Economic shifts happen. Even a small change in your data pipeline can mess things up.
If you wait for metrics like accuracy or RMSE to drop, you’re already behind. Why? Because labels often take weeks or months to arrive. You need a way to catch trouble before the damage is done.
PSI: The Data Smoke Detector
The Population Stability Index (PSI) is a classic tool. It was born in the credit risk world to monitor loan models.
Population stability index (PSI) is a statistical measure, with a basis in information theory, that quantifies the difference of one probability distribution from a reference probability distribution. [1]
It doesn’t care about your model’s accuracy. It only cares about one thing: Is the data coming in today different from the data used during training?
This metric is a way to quantify how much “mass” moved between buckets. If your training data had 10% of users in a certain age group, but production has 30%, PSI will flag it.
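In formula form, if expected_i is the share of the training data in bucket i and actual_i is the share of the production data in the same bucket, then:

PSI = Σ_i (actual_i − expected_i) · ln(actual_i / expected_i)

Each term is symmetric, so swapping the two shares gives the same value, and that is exactly what the function shown later computes.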
Interpreting It: What the Numbers Are Telling You
We usually follow these rule-of-thumb thresholds:
- PSI < 0.10: Everything is fine. Your data is stable.
- 0.10 ≤ PSI < 0.25: Something’s changing. You should probably investigate.
- PSI ≥ 0.25: Major shift. Your model might be making bad guesses.
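Before we get to the full calculation, here is a tiny helper that encodes these cut-offs. This is just a sketch of the rule of thumb above, not part of any library:

def psi_status(psi_value):
    # Rule-of-thumb interpretation of a PSI score
    if psi_value < 0.10:
        return "stable"          # everything is fine
    elif psi_value < 0.25:
        return "moderate drift"  # worth investigating
    else:
        return "major drift"     # the model might be making bad guesses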
Code
The Python script in this exercise performs the following steps:
- Break the data into buckets (quantiles).
- Calculate the percentage of data in each bucket for both the training set and the production set.
- Compare these percentages: if they’re nearly identical, the PSI stays near zero; the more they diverge, the higher the score climbs.
Here is the code for the PSI calculation function.
import numpy as np

def psi(ref, new, bins=10):
    # Convert the inputs to arrays
    ref, new = np.array(ref), np.array(new)

    # Generate `bins` equal-frequency buckets based on the reference data
    quantiles = np.linspace(0, 1, bins + 1)
    breakpoints = np.quantile(ref, quantiles)

    # Count the number of samples in each bucket
    ref_counts = np.histogram(ref, breakpoints)[0]
    new_counts = np.histogram(new, breakpoints)[0]

    # Convert counts to percentages
    ref_pct = ref_counts / len(ref)
    new_pct = new_counts / len(new)

    # Replace empty buckets with a very small number
    # to avoid division by zero and log(0)
    ref_pct = np.where(ref_pct == 0, 1e-6, ref_pct)
    new_pct = np.where(new_pct == 0, 1e-6, new_pct)

    # Calculate PSI and return
    return np.sum((ref_pct - new_pct) * np.log(ref_pct / new_pct))
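A quick sanity check of the function, using made-up data just to illustrate the call (the numbers here are illustrative, not from the exercise below):

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 5000)    # reference sample
b = rng.normal(0, 1, 5000)    # same distribution -> PSI near zero
c = rng.normal(1.5, 1, 5000)  # shifted distribution -> PSI well above 0.25

print(psi(a, b))  # small value
print(psi(a, c))  # large value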
It’s fast, cheap, and doesn’t require “true” labels, meaning you don’t have to wait weeks for enough ground truth to calculate metrics such as RMSE. That’s why it’s a production favorite.
PSI checks whether your model’s current data has changed too much compared to the data used to build it. By comparing today’s data to a baseline, it helps ensure your model remains stable and reliable.
Where PSI Shines
- It’s cheap and easy to automate.
- You can run it daily on every feature, as in the sketch below.
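For instance, a minimal daily check could loop over the feature columns and flag anything above the alert threshold. This is only a sketch: `train_df` and `today_df` are hypothetical DataFrames with the same feature columns, and `psi` is the function defined above.

def scan_features(train_df, today_df, threshold=0.25):
    # Return the features whose PSI exceeds the alert threshold
    flagged = {}
    for col in train_df.columns:
        score = psi(train_df[col], today_df[col])
        if score >= threshold:
            flagged[col] = round(score, 4)
    return flagged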
Where It Doesn’t
- It can be sensitive to how you choose your buckets.
- It doesn’t tell you why the data changed, only that it did.
- It looks at features one by one, so it can miss subtle interactions between multiple variables.
How Pro Teams Use It
Mature teams don’t just look at a single PSI value. They track the trend over time.
A single spike might be a glitch. A steady upward crawl is a sign that it’s time to retrain your model. Pair PSI with other signals, like good old summary statistics (mean, variance), for a fuller picture.
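As an example of that pairing, a small per-feature report can sit next to the PSI score. This is a sketch only, reusing the `psi` function above; the two DataFrames play the role of the reference and production feature sets (like the `X_ref` and `X_new` built in the example below).

import pandas as pd

def drift_report(ref_df, new_df):
    # Combine PSI with basic summary statistics per feature
    rows = []
    for col in ref_df.columns:
        rows.append({
            "feature": col,
            "psi": psi(ref_df[col], new_df[col]),
            "mean_ref": ref_df[col].mean(),
            "mean_new": new_df[col].mean(),
            "std_ref": ref_df[col].std(),
            "std_new": new_df[col].std(),
        })
    return pd.DataFrame(rows)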
Let’s quickly look at this toy example of data that drifted. First, we generate some random data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# 1. Generate Reference Data
X, y = make_regression(n_samples=1000, n_features=3, noise=5, random_state=42)
df = pd.DataFrame(X, columns=['var1', 'var2', 'var3'])
df['y'] = y

# Separate X and y
X_ref, y_ref = df.drop('y', axis=1), df.y

# View data head
df.head()
Reference data generated for a regression model. Image by the author.
Then, we train the model.
# 2. Train Regression Model
model = LinearRegression().fit(X_ref, y_ref)
Now, let’s generate some drifted data.
# 3. Generate the Drifted Data
X, y = make_regression(n_samples=500, n_features=3, noise=5, random_state=42)
df2 = pd.DataFrame(X, columns=['var1', 'var2', 'var3'])
df2['y'] = y

# Add the drift to var1: shift and rescale the reference values, plus noise
# (df2 has only 500 rows, so we take the first 500 reference values)
df2['var1'] = 5 + 1.5 * X_ref.var1[:len(df2)] + np.random.normal(0, 5, len(df2))

# Separate X and y
X_new, y_new = df2.drop('y', axis=1), df2.y

# View
df2.head()
Next, we can use our function to calculate the PSI for each feature. You should notice the huge PSI score for variable 1.
# 4. Calculate PSI for each feature
for v in df.columns[:-1]:
    psi_value = psi(X_ref[v], X_new[v])
    print(f"PSI Score for Feature {v}: {psi_value:.4f}")
PSI Score for Feature var1: 2.3016
PSI Score for Feature var2: 0.0546
PSI Score for Feature var3: 0.1078
Let’s also check the impact the drift has on the estimated y.
# 5. Generate Estimates to see the impact
preds_ref = model.predict(X_ref[:5])
preds_drift = model.predict(X_new[:5])
print("\nSample Predictions (Reference vs Drifted):")
print(f"Ref Preds: {preds_ref.round(2)}")
print(f"Drift Preds: {preds_drift.round(2)}")
Sample Predictions (Reference vs Drifted):
Ref Preds: [-104.22 -57.58 -32.69 -18.24 24.13]
Drift Preds: [ 508.33 621.61 -241.88 13.19 433.27]
We can also visualize the differences by variable. We create a simple function to plot the histograms overlaid.
import matplotlib.pyplot as plt

def drift_plot(ref, new):
    # Overlay the reference and new distributions
    plt.hist(ref, alpha=0.7, label='reference')
    plt.hist(new, color='r', alpha=0.5, label='new')
    plt.legend()
    plt.show()

# Calculate PSI and plot the distributions for each feature
for v in df.columns[:-1]:
    psi_value = psi(X_ref[v], X_new[v])
    print(f"PSI Score for Feature {v}: {psi_value:.4f}")
    drift_plot(X_ref[v], X_new[v])
Here are the results.
Data drift for the 3 variables. Image by the author.
The difference is huge for variable 1!
Before You Go
We saw how simple it is to calculate PSI, and how it shows us where the drift is happening: we quickly identified var1 as the problematic variable. Monitoring your model without monitoring your data leaves a huge blind spot.
We have to make sure the data distribution seen when the model was trained still holds, so the patterns learned from the reference data remain valid for estimating over new data.
Production ML is less about building the “perfect” model and more about maintaining alignment with reality.
The best models don’t just predict well. They know when the world has changed.
If you liked this content, find me on my website. https://gustavorsantos.me
GitHub Repository
The code for this exercise.
https://github.com/gurezende/Studying/blob/master/Python/statistics/data_drift/Data_Drift.ipynb
References
[1] PSI Definition: https://arize.com/blog-course/population-stability-index-psi/
[2] Numpy Histogram: https://numpy.org/doc/2.2/reference/generated/numpy.histogram.html
[3] Numpy Linspace: https://numpy.org/devdocs/reference/generated/numpy.linspace.html
[4] Numpy Where: https://numpy.org/devdocs/reference/generated/numpy.where.html
[5] Make Regression data: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html