Notes on LLM RecSys Product – Edition 2 of a newsletter focused on building LLM-powered products.
The central thesis of this newsletter is that we are moving from deterministic workflows to **model-centered products** – products built around LLM-powered recommender systems as the core primitive.
We’re not fully in that future yet. Most AI products today are still tools that tackle narrow use cases better – generate text, summarize content, answer questions. They are improvements, but they are not yet systems that help people get things done.
Still, the direction of change is clear. As model capabilities continue to improve, user expectations will shift across domains – writing, presentations, job search, customer support – from “give me the tool” to “help me make progress.”
That is the real promise of agents and applied AI products. Not magical autonomy, but systems that understand context, infer intent, and assist meaningfully. In other words, they will need a recommender system at the core.
But not the recommender systems we’re used to. The open question this creates is how to build predictable, trustworthy systems when the output is inherently probabilistic.
Why traditional recommender systems break
For years, recommender systems were constrained in two fundamental ways.
First, they relied on structured data. They required predefined taxonomies and candidate sets, even though real product data – messages, resumes, notes, goals, intent – is overwhelmingly unstructured.
Second, they were black boxes steered by blunt reward functions. Behavior was optimized indirectly through metrics like clicks, dwell time, or engagement. Improving the system meant tweaking the reward, not improving understanding or reasoning.
That architecture was sufficient for optimizing engagement. It is poorly suited for helping users achieve goals.
LLM-powered recommender systems are semantic and teacher-supervised
The breakthrough in LLM-powered recommender systems isn’t simply replacing parts of the old stack with LLMs. It’s the combination of semantic understanding with teacher supervision.
LLMs can ingest and produce semantic input and output, reason over unstructured data, and operate across large context windows. That dramatically expands what recommender systems can do. But capability alone still leaves blind spots: when does the system work, when does it fail, and why?
The other half is our ability to pair production models with a high-quality “teacher model” – a large model used to evaluate outputs at scale. The teacher is too expensive to run in the critical path, but it is ideal for judging behavior, surfacing errors, and identifying quality gaps.
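To make this concrete, here is a minimal sketch of what teacher-as-judge evaluation can look like, assuming an offline batch job over logged production traffic. The model names, rubric, and helper functions are illustrative placeholders rather than a prescribed implementation, and `complete()` stands in for whatever inference API you actually use.

```python
# Minimal sketch of teacher-as-judge evaluation, run offline (not in the request path).
# Model names, rubric, and field names are illustrative assumptions.
import json

TEACHER_MODEL = "frontier-teacher"      # large, expensive, offline only
PRODUCTION_MODEL = "small-prod-model"   # cheap, fast, serves live traffic

JUDGE_PROMPT = """You are grading a recommendation produced for a user.
User context:
{context}

Recommendation:
{output}

Return JSON: {{"score": 1-5, "failure_mode": "<none|missed_intent|bad_ranking|hallucination>", "reason": "<one sentence>"}}"""

def complete(model: str, prompt: str) -> str:
    """Placeholder: call your model-serving API here and return the text response."""
    raise NotImplementedError

def judge_output(context: str, output: str) -> dict:
    """Ask the teacher model to grade one production output against the rubric."""
    raw = complete(TEACHER_MODEL, JUDGE_PROMPT.format(context=context, output=output))
    return json.loads(raw)

def evaluate_batch(logged_requests: list[dict]) -> list[dict]:
    """Grade a sample of logged production traffic; verdicts feed dashboards and training data."""
    return [
        {**req, "verdict": judge_output(req["context"], req["output"])}
        for req in logged_requests
    ]
```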
This structure creates two powerful improvement levers:
- Improve the teacher → the system’s judgment gets sharper
- Improve the training data → production models inherit that sharper judgment
The introduction of the teacher makes the system self-aware and coachable. Instead of guessing whether the system is improving, you can now see it. In sum, LLMs with semantic understanding give your stack *capability*, but the teacher model enables governance.
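To illustrate the second lever – improving the training data – here is a rough sketch of how teacher verdicts can be turned into a fine-tuning set for the production model. The score threshold and field names are assumptions that build on the judge sketch above, not a fixed recipe.

```python
# Sketch of the second lever: turn teacher verdicts into training data so the
# production model inherits the teacher's judgment. Thresholds and field names
# are illustrative; `evaluate_batch` is the helper sketched above.

def build_finetune_set(graded: list[dict], min_score: int = 4) -> list[dict]:
    """Keep only examples the teacher rated highly, formatted for supervised fine-tuning."""
    return [
        {"prompt": ex["context"], "completion": ex["output"]}
        for ex in graded
        if ex["verdict"]["score"] >= min_score
    ]

# graded = evaluate_batch(sample_of_logged_traffic)
# finetune_data = build_finetune_set(graded)
# -> periodically fine-tune the production model on `finetune_data`
```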
The demo-to-production gap: cost makes your system imperfect but self-aware and coachable
This is also the point where demos and real products diverge.
In a demo, you can run a large, frontier model on every request and get something that looks magical. At scale, that approach collapses quickly under latency and GPU cost. The moment you ship to production, you’re forced to use smaller, cheaper models to keep the system fast and scalable.
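A rough back-of-envelope shows the shape of the problem. Every number in this sketch is an illustrative assumption – request volumes and per-token prices vary widely by product and provider – so treat it as a way to see why the economics flip, not as a cost model.

```python
# Back-of-envelope sketch of why "frontier model on every request" breaks at scale.
# All numbers below are illustrative assumptions, not quoted prices.

requests_per_day = 5_000_000
tokens_per_request = 3_000          # prompt + completion, assumed

frontier_cost_per_1k_tokens = 0.01   # hypothetical $/1K tokens for a large model
small_cost_per_1k_tokens = 0.0005    # hypothetical $/1K tokens for a small model

def daily_cost(cost_per_1k: float) -> float:
    """Total daily spend if this model serves every request."""
    return requests_per_day * tokens_per_request / 1_000 * cost_per_1k

print(f"frontier model in the critical path: ${daily_cost(frontier_cost_per_1k_tokens):,.0f}/day")
print(f"small model in the critical path:    ${daily_cost(small_cost_per_1k_tokens):,.0f}/day")
# The gap is what pushes production systems toward smaller, imperfect models,
# with the expensive teacher reserved for offline evaluation.
```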
And that’s when imperfection shows up. As you push costs down, different parts of the stack start making mistakes that weren’t visible in the demo. Your production-grade AI-powered system is noticeably imperfect.
That’s the bad news.
From stumbling in the dark to painful self-awareness
The good news is that teacher supervision turns imperfection into something you can work with.
Before teacher-supervised systems, building AI products felt like stumbling in the dark. Teams relied on coarse engagement metrics and small-scale human review to infer how models behaved.
With a teacher model, teams live in painful self-awareness.
You can now evaluate millions of outputs every day. You will know what’s wrong at scale, where it’s wrong, and why. This self-awareness is painful because you won’t be able to fix everything immediately – cost, latency, and model-size constraints are very real – but you can work deliberately, one acute failure mode at a time.
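In practice, that diagnosis can start as something very simple: aggregate teacher verdicts by failure mode and fix the biggest bucket first. A sketch, reusing the illustrative fields from the judge example above:

```python
# Sketch of turning a day's worth of teacher verdicts into a prioritized list of
# failure modes. Field names follow the judge sketch above and are illustrative.
from collections import Counter

def failure_mode_report(graded: list[dict], bad_score_threshold: int = 3) -> list[tuple[str, int]]:
    """Count how often each failure mode appears among low-scoring outputs."""
    modes = Counter(
        ex["verdict"]["failure_mode"]
        for ex in graded
        if ex["verdict"]["score"] <= bad_score_threshold
    )
    return modes.most_common()

# for mode, count in failure_mode_report(graded):
#     print(f"{mode}: {count} bad outputs")  # tackle the top failure mode first
```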
That is the shift: from chasing metrics to diagnosing behavior, from intuition to evaluation, from shiny demos to systems that can actually improve.
Evaluation is the mechanism that makes outcome-oriented, probabilistic systems tractable under real-world constraints – and the foundation for how AI-native teams will operate going forward.
All of this brings us to two counter-intuitive takeaways –
(1) Existing recommender systems are deeply limited, but the magic moment doesn’t arrive simply by replacing them with LLM-powered recommender systems. As soon as you confront scaling costs, you’re forced to use cheaper, imperfect models – and that is where reality diverges from the shiny demo.
(2) Semantic models unlock capability, but teacher supervision unlocks governance. LLMs make systems powerful; evaluation makes them understandable and improvable. And while the system might be imperfect, a teacher model enables painful self-awareness and **coachability** – the kind that reveals problems at scale and lets teams improve deliberately, one acute failure mode at a time.
That sets us up for the next edition – building painful self-awareness and coachability via the evaluation loop.