The LLMOps Database crossed 1,200 case studies this month. Since we last wrote one of these summaries, we’ve catalogued another 400 production deployments. These are real systems handling real traffic, built by teams navigating the gap between "it works in a notebook" and "it works at 2am when the on-call engineer is asleep."
This article distils what we’re seeing across that growing corpus. (If you prefer a succinct post with just the high-level takeaways, the executive summary highlights the core trends.) Rather than predicting where the field is heading, this article focuses on patterns emerging from where it already is. The trends that follow derive directly from what practitioners are actually doing to ship reliable LLM systems, rather than theoretical frameworks imposed from outside.
What follows covers this terrain in detail: the shift from demos to real engineering, the emergence of context engineering as a distinct discipline, the stabilisation of MCP as integration infrastructure, the maturation of evaluation and guardrail practices, the uncomfortable truth that software engineering skills matter more than AI expertise, and the persistent allure of frontier models that don’t actually solve production problems.
This is a practical assessment of what’s working, what isn’t, and what the teams shipping production systems have learned along the way.
1. Real Engineering Replaces POC Demos

When we first started the LLMOps Database, much of what we catalogued fell into the "interesting experiment" category: proof-of-concept deployments, weekend RAG chatbots, and systems that quietly disappeared when confronted with real traffic. That has changed. Companies have moved beyond experimenting with AI as a productivity add-on to rebuilding core business processes around LLM capabilities, and the evidence shows up in the metrics that matter: revenue impact, operational scale, and measurable outcomes.
Real Business Outcomes
The clearest signal that we’ve moved past the experimentation phase is the emergence of LLM systems handling genuinely critical business functions. These are core revenue-generating processes rather than adjacent tools.
Take Stripe’s approach to fraud detection. They’ve built a domain-specific foundation model that processes payments representing roughly 1.3% of global GDP. Unlike a support chatbot, this is infrastructure that sits in the critical path of every transaction. Their architecture treats each payment as a token and user behavior sequences as context windows, ingesting tens of billions of transactions. The practical result is that card-testing fraud detection accuracy improved from 59% to 97% for their largest merchants.
Amazon’s Rufus provides another data point. During Prime Day, the system scaled to 80,000 Trainium chips while serving conversational shopping experiences to 250 million users. The team reported 140% year-over-year monthly user growth and a 60% increase in purchase completion rates. What’s worth noting here is the architectural evolution: Amazon moved from a custom in-house LLM to a multi-model approach orchestrating Amazon Nova, Claude, and specialized models.
Similarly, DoorDash rebuilt their recommendation engine to handle their expansion beyond restaurant delivery. Scaling from 100-item menus to 100,000+ item retail catalogues creates cold-start problems. Their hybrid retrieval system, which infers grocery preferences from restaurant order history, delivered double-digit improvement in click-through rates and directly addressed the personalisation challenges that come with entering new verticals.
Processing at Scale
The systems now making it into the database are operating at volumes that would have seemed aspirational even a year ago.
ByteDance processes billions of videos daily for content moderation across TikTok and other platforms. They’ve deployed multimodal LLMs on AWS Inferentia2 chips across multiple global regions, implementing tensor parallelism, INT8 quantization, and static batching to achieve 50% cost reduction while maintaining the latency requirements of a real-time social platform.
Shopify’s product classification system handles 30 million predictions daily, sorting products into over 10,000 categories with 85% merchant acceptance rate. Their Sidekick assistant evolved from simple tool-calling into a sophisticated agentic platform, but the journey wasn’t smooth. They encountered what they call the "tool complexity problem" when scaling from 20 tools to 50+ with overlapping functionality. Their solution uses Just-in-Time instructions that provide relevant guidance exactly when needed.
In the developer tools space, Cursor’s Tab feature now handles over 400 million requests per day. Beyond the volume, their approach is instructive: they implemented an online reinforcement learning pipeline that updates based on user acceptance rates within hours, achieving a 28% increase in code acceptance. Their recent work adapting to OpenAI’s Codex models uncovered that dropping reasoning traces caused 30% performance degradation.
Quantified Revenue Impact
We’re increasingly seeing organisations move past vague "efficiency gains" to report specific financial outcomes.
nib, an Australian health insurer, has been running their Nibby chatbot since 2018, now enhanced with modern LLMs. The system has handled over 4 million interactions and generates approximately $22 million in documented savings. They achieve 60% chat deflection, and their call summarisation feature reduced after-call work by 50%. These are measured results rather than projections.
The PGA Tour’s content generation system offers a different angle. They reduced article generation costs by 95% to $0.25 per article, now producing 800 articles per week across eight content types. Their AI-generated content has become their highest-engagement material on non-tournament days, driving billions of page views annually. The multi-agent architecture, with specialised agents for research, data extraction, writing, validation, and image selection, demonstrates what production LLMOps actually looks like versus a demo.
In financial services, Riskspan transformed private credit deal analysis from a 3-4 week manual process to 3-5 days. They reduced per-deal processing costs by 90x to under $50, which matters considerably when addressing a $9 trillion market opportunity. Their system uses Claude to dynamically generate code that models investment waterfalls to produce executable financial calculations rather than just extracting information.
CBRE, the world’s largest commercial real estate firm, deployed a unified search assistant across 10 distinct data sources. They reduced SQL query generation time by 67% (from 12 seconds to 4 seconds) and improved database query performance by 80%. For property managers who previously navigated fragmented systems containing millions of documents, this represents a meaningful change in daily operations.
Autonomous Agents Doing Real Work
Perhaps the most notable shift is agents moving from "drafting assistance" to completing complex, multi-step workflows without human intervention.
Western Union and Unum partnered with AWS and Accenture/Pega to modernise mainframe systems, converting 2.5 million lines of COBOL code in approximately 1.5 hours. For Unum, this reduced a project timeline from an estimated 7 years to 3 months, while eliminating 7,000 annual manual hours in claims management. The architecture uses composable agents working through orchestration layers.
Ramp’s policy agent now handles over 65% of expense approvals autonomously. Their design emphasises explainable reasoning with citations, built-in uncertainty handling that explicitly allows the agent to defer to humans when uncertain, and user-controlled autonomy levels. Their separate merchant classification agent processes requests in under 10 seconds (versus hours manually) and handles nearly 100% of requests, up from less than 3% that human teams could previously manage.
Harman International faced a familiar enterprise challenge: documenting 30,000 custom SAP objects accumulated over 25 years with minimal documentation, essential for their S/4HANA migration. Manual documentation by 12 consultants was projected to take 15 months with inconsistent results. Using AWS Bedrock and Amazon Q Developer with Claude, they reduced the timeline from 15 months to 2 months and cut costs by over 70%.
Search and Retrieval Remains Central
Despite periodic "RAG is dead" declarations, the most successful production systems we’re tracking rely heavily on sophisticated retrieval architectures.
LinkedIn rebuilt their GenAI stack with a RAG-based pipeline at its core, supporting multi-agent orchestration. Their system routes queries to specialised agents (job assessment, company understanding, post takeaways), retrieves data from internal APIs and Bing, then generates contextual responses. One observation from their work that resonates with what we’ve seen elsewhere: reaching 80% quality happened quickly, but pushing past 95% required the majority of development time. This pattern, where the final stretch from "demo quality" to "production quality" consumes disproportionate effort, appears consistently across the database.
The organisations extracting real value aren’t necessarily the ones with the most innovative demos—they’re the ones doing the less glamorous engineering work: building evaluation pipelines, implementing guardrails, designing for uncertainty, and treating their LLM systems with the same rigour they’d apply to any critical infrastructure.
2. Context Engineering > Prompt Engineering

If 2023 was the year of prompt engineering (learning how to talk to models), then 2024 and 2025 have marked the rise of context engineering: learning how to architect the information models consume. We’ve watched this become one of the clearest dividing lines between teams that ship reliable LLM systems and those still wrestling with inconsistent results.
The shift is reflected in how practitioners describe their work: "context engineering" has emerged as shared vocabulary for the architecture required to keep agents focused. Dropbox uses the term for the work of preventing what they call "analysis paralysis" in their Dash AI assistant. Anthropic’s engineering team distinguishes it from prompt engineering, defining it as the management of everything that goes into the context window: system prompts, tool definitions, conversation history, and retrieval strategy. The underlying thesis is straightforward: just because you can fit everything into a model’s context window doesn’t mean you should.
The Problem with More Context
The naive approach to building agents is to stuff all history, tools, and documentation into the context window. We’ve catalogued dozens of cases where this fails in production.
Manus, a Singapore-based agent platform, found that their typical tasks require around 50 tool calls, with production agents spanning hundreds of conversational turns. Every tool call generates observations that append to the message history, creating unbounded growth. They reference Anthropic’s research noting that "context rot" often begins between 50k–150k tokens, regardless of a model’s theoretical million-token maximum. Even with prompt caching reducing cost and latency, performance still degrades. You’re processing the same bloated context, just faster.
Dropbox encountered what they call "analysis paralysis" when exposing too many tools to their Dash agent. The more retrieval options available in the context, the more time the model spent deciding which tool to use rather than actually acting.
These failures are not edge cases; they are the predictable result of treating the context window as a dumping ground.
Just-in-Time Context
The most common pattern we’re seeing in production systems is what teams call "just-in-time" injection: dynamically assembling context based on the user’s immediate state rather than loading everything upfront.
Shopify’s Sidekick assistant collocates instructions with tool outputs rather than loading all instructions at the start. If a tool returns search results, the specific instructions on how to process those results appear right next to the data. This maintains cache efficiency and keeps the model focused on what’s actually happening now.
Elyos AI, which builds voice agents for home services companies, takes this further. For emergency call-outs, their first step provides context to identify whether the situation qualifies as an emergency. Once that determination is made, they remove that context entirely and replace it with a single deterministic fact: "this is an emergency." The conversation history about how they reached that conclusion is no longer needed. They describe this as "just-in-time in, just-in-time out," actively cleaning context that’s served its purpose.
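To make the pattern concrete, here is a minimal sketch of the just-in-time in, just-in-time out idea: scoped instructions are injected for a single decision and then stripped out, leaving only a deterministic fact behind. The Turn/Conversation structures and the triage wording are our own illustration, not Elyos AI's actual code.

```python
# A minimal sketch of "just-in-time in, just-in-time out" context handling.
# Structures and wording are illustrative, not any vendor's real implementation.
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str
    content: str
    scope: str = "durable"   # "durable" survives; "triage" is removed after use

@dataclass
class Conversation:
    turns: list[Turn] = field(default_factory=list)
    facts: list[str] = field(default_factory=list)

def begin_triage(convo: Conversation) -> None:
    # Just-in-time in: add only the instructions needed for this one decision.
    convo.turns.append(Turn(
        role="system",
        content="Decide whether this call-out is an emergency. Reply 'emergency' or 'routine'.",
        scope="triage",
    ))

def finish_triage(convo: Conversation, verdict: str) -> None:
    # Just-in-time out: drop everything scoped to the triage step and keep a
    # single deterministic fact the rest of the conversation can rely on.
    convo.turns = [t for t in convo.turns if t.scope != "triage"]
    convo.facts.append("this is an emergency" if verdict == "emergency"
                       else "this is a routine call-out")
```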
Tool Masking and Schema Shrinking
When you can’t reduce the number of tools, you can at least reduce their complexity.
Databook’s "tool masking" approach places a configuration layer between agents and the underlying tool handlers. Instead of exposing a full API with 100 fields, a mask might only reveal the 3 fields relevant to a particular task. Their example: a stock quote API that normally returns dozens of metrics gets masked to return only symbol, market price, and currency. The input schema is similarly simplified: the agent only needs to provide a ticker symbol, and everything else is either hardcoded or system-provided.
This approach treats tool definitions as prompts in their own right. Databook’s head of applied AI describes it as the evolution from prompt engineering to context engineering, where context engineering includes engineering the surface of the tools themselves. The same underlying API can present different masks for different agents or different stages of a workflow.
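A rough sketch of what such a mask can look like in practice is below; the field names, defaults, and schema layout are illustrative rather than Databook's actual configuration.

```python
# Illustrative sketch of a tool mask sitting between the agent and a full API.
# Field names and the quote payload are invented for the example.
FULL_QUOTE_FIELDS = [
    "symbol", "market_price", "currency", "bid", "ask", "volume",
    "pe_ratio", "dividend_yield", "beta", "week52_high", "week52_low",
]

QUOTE_MASK = {
    "expose_output": ["symbol", "market_price", "currency"],  # 3 of dozens
    "expose_input": ["symbol"],            # the agent supplies only a ticker
    "defaults": {"exchange": "NYSE"},      # everything else is system-provided
}

def masked_tool_schema(mask: dict) -> dict:
    """What the agent actually sees: a tiny input schema, not the full API."""
    return {
        "name": "get_stock_quote",
        "description": "Return the current price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {f: {"type": "string"} for f in mask["expose_input"]},
            "required": mask["expose_input"],
        },
    }

def apply_output_mask(raw_response: dict, mask: dict) -> dict:
    """Strip the raw API response down to the fields the task needs."""
    return {k: v for k, v in raw_response.items() if k in mask["expose_output"]}
```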
Manus implements something similar with what they call logit masking. Rather than deleting tools from the context (which breaks caching), they mathematically prevent the model from selecting irrelevant tools during specific conversation states. The tools remain in the definition but are effectively invisible to the decision-making process.
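Conceptually, the trick looks something like the sketch below: scores for disallowed tools are forced to negative infinity at selection time, while the tool definitions themselves never leave the prompt. The state names and scores are invented, and real implementations apply this at the decoding layer of the serving stack rather than in application code.

```python
# Conceptual sketch of logit masking for tool selection: tool definitions stay
# in the (cached) context, but disallowed tools are made unselectable at
# decode time. Purely illustrative numbers.
import math

TOOLS = ["browser_open", "shell_exec", "file_write", "send_reply"]

ALLOWED_BY_STATE = {
    "gathering":  {"browser_open", "shell_exec", "file_write"},
    "responding": {"send_reply"},   # e.g. force a user-facing reply next
}

def mask_tool_logits(logits: dict[str, float], state: str) -> dict[str, float]:
    allowed = ALLOWED_BY_STATE[state]
    return {tool: (score if tool in allowed else -math.inf)
            for tool, score in logits.items()}

def pick_tool(logits: dict[str, float], state: str) -> str:
    masked = mask_tool_logits(logits, state)
    return max(masked, key=masked.get)

# The cache-friendly part: TOOLS (and their schemas) never change between
# states, so the prompt prefix stays byte-identical and KV caches stay valid.
print(pick_tool({"browser_open": 2.1, "shell_exec": 1.7,
                 "file_write": 0.4, "send_reply": 1.9}, "responding"))
```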
Compaction Versus Summarisation
Managing context over long-running sessions requires distinguishing between reversible and irreversible reduction.
Manus makes a crucial distinction: compaction is reversible, summarisation is not. Compaction converts verbose tool outputs into minimal representations while keeping the full information recoverable. A file write confirmation might compact from path plus full content to just the path so the agent can read the file again if needed. Summarisation, by contrast, loses information permanently. They use it only as a last resort when compaction yields minimal gains.
Their approach is staged: trigger compaction first, typically on the oldest 50% of tool calls while keeping newer ones in full detail so the model retains fresh examples of proper tool usage. Only when multiple compaction rounds yield diminishing returns do they summarise, and even then they preserve the last few tool calls in full to maintain behavioural continuity.
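A simplified version of that staged policy might look like the following; the record format, the 50% split, and the preserved tail come from the description above, while count_tokens and summarise are stand-ins.

```python
# Sketch of staged context reduction: reversible compaction first (oldest 50%
# of tool calls), irreversible summarisation only as a last resort. The record
# structure and the summarise() stub are assumptions for illustration.
def compact(tool_call: dict) -> dict:
    # Reversible: keep just the reference (e.g. a file path); the full content
    # can always be re-read from disk if the agent needs it again.
    return {"tool": tool_call["tool"], "ref": tool_call.get("path"), "compacted": True}

def reduce_context(tool_calls: list[dict], token_budget: int, count_tokens) -> list[dict]:
    if count_tokens(tool_calls) <= token_budget:
        return tool_calls

    # Stage 1: compact the oldest half, keep recent calls verbatim so the
    # model still sees fresh examples of correct tool usage.
    half = len(tool_calls) // 2
    reduced = [compact(c) for c in tool_calls[:half]] + tool_calls[half:]
    if count_tokens(reduced) <= token_budget:
        return reduced

    # Stage 2 (last resort): summarise the old region, but preserve the last
    # few calls in full to maintain behavioural continuity.
    keep_tail = reduced[-3:]
    summary = {"tool": "summary", "content": summarise(reduced[:-3])}  # lossy
    return [summary] + keep_tail

def summarise(calls: list[dict]) -> str:
    # Placeholder for an LLM summarisation step; details intentionally omitted.
    return f"{len(calls)} earlier tool calls summarised."
```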
LangChain’s Lance Martin adds an isolation pattern: token-heavy sub-tasks get offloaded to specialised sub-agents that process their context independently and return only a summary or result to the main agent, preventing context contamination.
The File System as Context
Some teams are pushing context engineering into territory that might seem regressive but turns out to be remarkably effective.
Manus runs agents inside full virtual machine sandboxes, and they discovered that for many use cases, you don’t need a vector database at all. The Linux file system itself becomes the context. The agent uses grep, cat, and ls to retrieve its own context on demand, effectively treating the operating system as its long-term memory. Token-heavy tool outputs get dumped to files; the context window holds only minimal references. When the model needs that information again, it reads the file.
Claude Code and similar coding assistants use this pattern: the codebase is the context, and file operations are the retrieval mechanism. The file system is already indexed, already persistent, and doesn’t require building infrastructure on the fly.
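In code, the pattern is almost disappointingly simple, which is part of its appeal. The sketch below (paths and helper names are our own, and a Unix-like sandbox with grep is assumed) offloads a heavy tool output to disk, keeps a one-line reference in context, and retrieves details later on demand.

```python
# Sketch of using the file system as the agent's working memory: heavy tool
# outputs go to disk, the context window keeps only a short reference, and
# the agent retrieves details later with ordinary file operations.
import subprocess, uuid
from pathlib import Path

WORKDIR = Path("/tmp/agent_workspace")
WORKDIR.mkdir(exist_ok=True)

def offload(tool_name: str, output: str) -> str:
    """Write a token-heavy result to disk; return the one-line context entry."""
    path = WORKDIR / f"{tool_name}_{uuid.uuid4().hex[:8]}.txt"
    path.write_text(output)
    return f"[{tool_name} output saved to {path} ({len(output)} chars)]"

def grep_context(pattern: str) -> str:
    """On-demand retrieval: the agent greps its own past outputs."""
    result = subprocess.run(
        ["grep", "-rl", pattern, str(WORKDIR)],  # assumes a Unix-like sandbox
        capture_output=True, text=True,
    )
    return result.stdout or "no matches"

# Usage: the context holds the short reference, not the full page dump.
ref = offload("web_fetch", "lorem ipsum " * 5000)
print(ref)
print(grep_context("ipsum"))
```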
That said, this isn’t universal. For integrating enterprise knowledge bases or long-term memory across sessions, vector indexes become necessary. The scale determines the approach. But it’s worth noting how many teams have found that simpler retrieval mechanisms work better than sophisticated semantic search when the context is naturally bounded.
Dual Embeddings and Specialised Representations
When retrieval is required, we’re seeing teams move beyond single-embedding approaches.
Glowe, a skincare recommendation system built on Weaviate, creates two distinct embeddings for the same product. One embedding captures descriptive metadata (what the product is), and a second embedding captures user reviews and effects (what the product does). They use TF-IDF weighting to ensure rare but meaningful effects aren’t drowned out by generic descriptions in the context. When recommending products for specific skin concerns, they search the effect embeddings rather than the product embeddings.
This pattern of separating concerns at the embedding level allows more targeted retrieval. The model doesn’t receive everything about a product. Instead, it receives what’s relevant to the current query. It’s another form of context engineering: controlling not just what goes into the context but how that information is represented and retrieved.
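Stripped of any particular vector database, the shape of the idea looks like the sketch below: two vectors per product, with concern-based queries searching only the effects vectors. The toy embed() function and in-memory index are stand-ins for a real embedding model and store, and the TF-IDF weighting step is omitted.

```python
# Generic sketch of the dual-embedding idea (not Weaviate's actual API): each
# product gets one vector for what it *is* and one for what it *does*, and
# concern-based queries search only the "effects" vectors.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: deterministic random vector, not a real model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

class ProductIndex:
    def __init__(self):
        self.items = []  # (name, description_vec, effects_vec)

    def add(self, name: str, description: str, review_effects: str):
        self.items.append((name, embed(description), embed(review_effects)))

    def search_by_effect(self, concern: str, k: int = 3):
        q = embed(concern)
        scored = [(name, float(effects @ q)) for name, _, effects in self.items]
        return sorted(scored, key=lambda x: -x[1])[:k]

index = ProductIndex()
index.add("Niacinamide 10% serum",
          "lightweight water-based serum",
          "reviewers report reduced redness and smaller-looking pores")
print(index.search_by_effect("redness and visible pores"))
```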
Why Teams Are Investing in This
The business case for context engineering shows up in three dimensions.
Cost: Shopify noted that tool outputs consume 100x more tokens than user messages, so aggressive context pruning directly correlates to margin.
Latency: Elyos AI targets sub-400ms response times, which requires keeping context minimal.
Reliability: Leaner contexts don’t just make models faster and cheaper; they reduce distraction and tool-selection errors, which shows up directly in output quality.
The Discipline Takes Shape
What’s emerging across these case studies is a recognisable engineering discipline with its own patterns, tradeoffs, and best practices.
The core principle: everything retrieved shapes the model’s reasoning, so relevance filtering is critical. The practical techniques: just-in-time injection, tool masking, staged compaction, context isolation through sub-agents, file system offloading, and specialised embeddings. The evaluation criteria: not just whether the model can process the context, but whether the context helps or hinders the model’s actual task.
Manus has refactored their context engineering architecture five times since launching in March. LangChain’s Lance Martin emphasises that production teams should "build less and understand more"; in his experience, the biggest performance improvements came from simplifying architecture rather than adding complexity.
The million-token context window serves less as a feature to exploit and more as a ceiling to stay well under. The teams shipping reliable LLM systems have internalised this, and context engineering has become the discipline that makes it practical.
3. The Frontier: Where Production Meets Experimentation

While the previous sections cover patterns that have solidified into recognisable best practices, two areas remain in active flux: agent infrastructure (harnesses and the reinforcement learning loops that improve them), and memory systems for long-running agents. Both represent genuine production needs, but neither has stabilised into consensus approaches. What we’re seeing is parallel experimentation rather than industry convergence.
Agent Infrastructure: Harnesses and Learning Loops
The orchestration layer wrapping an LLM to make it function as an agent requires surprisingly complex engineering. Cursor’s recent work adapting to OpenAI’s Codex models demonstrates why. Each frontier model arrives with different behavioural patterns shaped by its training data. Codex models, trained specifically for agentic coding workflows, favour shell-oriented patterns where the model wants to use grep and cat instead of dedicated tools. Cursor had to rename and redefine their tools to align with shell conventions, add explicit instructions guiding the model toward tool calls over shell commands, and implement sandboxing for when the model did execute arbitrary commands. Their experiments showed that dropping reasoning traces caused a 30% performance degradation for Codex, substantially larger than the 3% OpenAI observed for mainline GPT-5 on SWE-bench. This kind of finding only emerges from operating at production scale with tight feedback loops.
Manus provides perhaps the most detailed public account of harness architecture at scale. Their typical tasks require around 50 tool calls, with production agents spanning hundreds of conversational turns. Instead of binding hundreds of tools directly to the model, they implemented a layered action space: a fixed set of atomic functions (file operations, shell commands, web search), sandbox utilities (command-line tools discoverable via standard help commands), and a third layer where the agent writes Python scripts to call pre-authorised APIs. The model sees the same simple interface regardless of which layer handles the actual work. This keeps the function calling space minimal, maximises KV cache efficiency, and allows capability expansion without invalidating cached prompts. They’ve refactored their architecture five times since March. The patterns are starting to rhyme across teams, but there’s no equivalent of "just use a transformer" for agent harnesses yet.
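A cut-down illustration of that layering is below. The atomic tool names, the dispatch logic, and the sandbox assumptions are ours; the point is that the schema the model sees never grows, even as layers two and three gain capabilities.

```python
# Sketch of a layered action space behind a fixed, minimal tool schema.
import subprocess

# Layer 1: the only functions ever exposed in the tool schema.
ATOMIC_TOOLS = {"read_file", "write_file", "shell", "web_search"}

def dispatch(tool: str, args: dict) -> str:
    if tool not in ATOMIC_TOOLS:
        return f"unknown tool '{tool}'"   # the function-calling surface stays tiny

    if tool == "write_file":
        with open(args["path"], "w") as f:
            f.write(args["content"])
        return f"wrote {args['path']}"

    if tool == "shell":
        # Layer 2: sandbox CLI utilities, discoverable via --help.
        # Layer 3: the agent uses write_file to produce a Python script against
        # pre-authorised APIs, then runs it here. Capabilities grow without the
        # tool schema ever changing, so cached prompt prefixes stay valid.
        out = subprocess.run(args["cmd"], shell=True, capture_output=True,
                             text=True, timeout=60)
        return out.stdout + out.stderr

    return "not implemented in this sketch"   # read_file / web_search elided
```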
Beyond static harness design, teams are beginning to improve their agents through reinforcement learning. OpenPipe’s ART·E project demonstrates what’s now possible at smaller scales. They built an email research agent trained using RL (specifically GRPO) to answer natural-language questions by searching email inboxes. The agent environment is intentionally simple: three tools for searching, reading, and returning answers, backed by SQLite with full-text search. They trained a Qwen-14B model with a multi-objective reward function optimising for answer correctness, fewer turns, and reduced hallucinations. The resulting model outperformed OpenAI’s o3 on this specific task while being faster and cheaper, with training completed in under a day on a single H100 GPU for approximately $80.
The reward function design proved critical. Minimising turns worked well as a proxy for latency, and penalising hallucinations reduced confabulation without hurting accuracy. But an early experiment that gave partial credit for taking more turns (intended to encourage exploration) resulted in the model learning to exploit this by repeating its last tool call until hitting the maximum turn limit. Reward hacking remains a real concern even at these smaller scales.
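The shape of such a reward function, and of the hack just described, can be written down in a few lines. The weights below are invented for illustration and are not OpenPipe's actual values.

```python
# Sketch of a multi-objective reward of the kind described above.
def reward(answer_correct: bool, num_turns: int, hallucinated: bool) -> float:
    r = 1.0 if answer_correct else 0.0
    r -= 0.02 * num_turns              # fewer turns as a proxy for latency
    r -= 0.5 if hallucinated else 0.0  # penalise confident wrong answers
    return r

# The failure mode in miniature: granting partial credit per turn ("to
# encourage exploration") makes looping until the turn limit the optimal
# policy, i.e. classic reward hacking.
def hacked_reward(answer_correct: bool, num_turns: int) -> float:
    return (1.0 if answer_correct else 0.0) + 0.05 * num_turns
```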
Cursor takes a different approach with online reinforcement learning in their Tab feature, which handles over 400 million requests per day. Instead of training models from scratch, they implemented an online RL pipeline that updates based on user acceptance rates within hours, achieving a 28% increase in code acceptance. RL for agents is becoming accessible to teams outside the major labs, but the successful cases involve narrow, well-defined tasks with clear reward signals.
Memory: The Problem Everyone Acknowledges
If there’s one area where production teams consistently express frustration, it’s memory. Every fresh context window essentially resets what the model "knows" from a session. For agents operating over extended periods, handling long-running tasks, or needing to learn user preferences over time, this creates fundamental challenges that current solutions address imperfectly.
LangChain’s Lance Martin frames the problem directly: memory systems become particularly important for ambient agents, systems that run asynchronously on schedules without real-time user interaction. His email agent runs every 10 minutes, triages incoming mail, drafts responses, and queues them for approval. Without memory, the system would keep making the same errors without learning. He implemented a simple long-term memory system stored in files that updates continuously as he provides feedback. The approach works, but "simple" and "files" suggest we’re still in the early experimentation phase.
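The mechanics really can be that plain. Here is a sketch of a file-backed memory loop, with a stubbed distillation step standing in for an LLM call; names and file format are our own.

```python
# Minimal sketch of file-backed long-term memory for an ambient agent: load
# accumulated preferences into the prompt, and fold new user feedback back
# into the file after each run.
from pathlib import Path

MEMORY_FILE = Path("agent_memory.md")

def load_memory() -> str:
    return MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""

def build_prompt(task: str) -> str:
    return f"Known user preferences:\n{load_memory()}\n\nTask:\n{task}"

def update_memory(feedback: str) -> None:
    # Append a distilled lesson rather than the raw transcript, so the file
    # stays small enough to inject on every run.
    lesson = distil_feedback(feedback)
    with MEMORY_FILE.open("a") as f:
        f.write(f"- {lesson}\n")

def distil_feedback(feedback: str) -> str:
    # In practice this would be an LLM call; kept as a stub here.
    return feedback.strip()
```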
Personize.ai took memory in a different direction with what they call proactive memory. Instead of retrieving raw data on demand, their system runs internal agents that infer insights and synthesise understanding ahead of time. The example: many businesses need to know whether a company is B2B or B2C. This information affects everything from qualification to service selection, but it rarely appears explicitly in raw data. Their system examines available data, recognises the classification is important, and infers it before any agent needs it. Standardised attributes then make these inferences searchable and usable across all agents. The challenge they identified: having access to raw data doesn’t mean understanding the customer. When running the same agent repeatedly across tens of thousands of executions, the chunks retrieved might come from different parts of the data, creating partial or inconsistent understanding.
Other teams are exploring knowledge graphs (Cognee), user-confirmed preferences (Manus), and various hybrid approaches. What’s clear is that production teams need agents that operate over extended periods, learn from feedback, and maintain coherent state across sessions. The solutions exist and they’re deployed, but they’re experiments running in production rather than settled practices. In areas that haven’t stabilised, LangChain’s observation resonates: teams should "build less and understand more." The biggest performance improvements often came from simplifying architecture instead of adding complexity.
4. MCP at One Year: Quiet Stabilisation

The Model Context Protocol has been in the wild for roughly a year now, and something unexpected has happened: it’s become one of the more stable elements in the LLMOps landscape. While agent harnesses and memory systems remain in active flux, MCP has settled into a recognisable pattern: enterprises building servers, SaaS companies exposing their APIs, and a growing body of practical knowledge about what works and what doesn’t. The database reveals genuine production deployments with real limitations being openly discussed rather than hype-driven adoption.
Enterprise Adoption: More Substantial Than Expected
The database contains a notable concentration of enterprise MCP implementations that go well beyond proof-of-concept.
Loblaws, the Canadian retail giant, built an MCP ecosystem wrapping 50+ internal platform APIs (cart, pricing, inventory, customer, catalogue, and more) so their "Alfred" orchestration agent could handle complex workflows like shopping for recipe ingredients. Their implementation is instructive: rather than exposing individual API endpoints, they carefully designed task-oriented tools that combine multiple backend operations. When a user discusses dinner ideas and decides on shrimp pasta, a single tool handles finding all the ingredients, calling catalogue, pricing, and inventory APIs to return a complete shopping list. This abstraction layer proved critical for agent reliability.
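The abstraction is easy to picture as a single agent-facing function fanning out to several backend clients, as in the hypothetical sketch below (the client functions are stubs, not Loblaws' actual APIs).

```python
# Stub backend clients so the sketch runs standalone.
def catalogue_search(q, store): return {"name": f"Store-brand {q}", "sku": q.upper()}
def get_price(sku, store): return 4.99
def check_inventory(sku, store): return True

def build_shopping_list(recipe_ingredients: list[str], store_id: str) -> list[dict]:
    """The single tool the agent sees: 'turn this recipe into a shopping list'."""
    shopping_list = []
    for ingredient in recipe_ingredients:
        product = catalogue_search(ingredient, store_id)           # catalogue API
        if product is None:
            continue
        shopping_list.append({
            "ingredient": ingredient,
            "product": product["name"],
            "price": get_price(product["sku"], store_id),           # pricing API
            "in_stock": check_inventory(product["sku"], store_id),  # inventory API
        })
    return shopping_list

print(build_shopping_list(["shrimp", "linguine", "garlic"], store_id="1007"))
```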
Swisscom uses MCP to let network operation agents access topology graphs and alarm systems for diagnosing outages across their complex multi-cloud infrastructure. A customer service scenario illustrates the value: restoring router connectivity could stem from billing issues, network outages, or configuration problems, each residing in different departments. MCP enables agents to coordinate across these boundaries while maintaining Switzerland’s strict data protection compliance. They’ve combined MCP with the Agent-to-Agent protocol for seamless cross-departmental collaboration.
What’s notable across these implementations is the emphasis on MCP as integration infrastructure rather than AI magic. The agents succeed because they’re connecting to well-established backend systems through standardised interfaces rather than MCP providing intelligence itself.
SaaS Companies: Building the Ecosystem
A different pattern is emerging among SaaS providers: building MCP servers so their customers’ agents can access platform capabilities directly.
HubSpot became the first CRM to build a remote MCP server, enabling ChatGPT to query customer data directly. Their motivation was straightforward: 75% of their customers already use ChatGPT, so meeting users where they are made strategic sense. The implementation took less than four weeks, delivering read-only queries that let customers ask natural-language questions about contacts, companies, and conversion patterns. Their team extended the Java MCP SDK to support HTTP streaming and contributed the changes back to open source.
Sentry’s MCP server has scaled to 60 million requests per month, doubling from 30 million in about two months. The server provides direct integration with 10-15 tools, allowing AI coding assistants to pull error details and trigger automated fix attempts without developers needing to copy-paste from Sentry’s UI. With over 5,000 organisations using it—from startups to large tech companies—and just a three-person team managing the infrastructure, it represents genuine production scale.
Sentry’s candour about operational realities is valuable. They shipped early without observability and paid for it: when AI tooling breaks, users don’t retry the next day but abandon it for months. Getting things right from the start matters more than shipping quickly with more features.
The Real Struggles: Context Pollution and Choice Entropy
The database reveals a consistent set of challenges that emerge once teams move past initial implementation.
CloudQuery’s most interesting discovery was about tool naming. They built a tool specifically to help write SQL queries, initially named example_queries. Despite being exactly what users needed, it sat completely unused for two weeks. The problem was semantic rather than technical. LLMs make probabilistic predictions about which tool to invoke based on name and description similarity to the query context. Renaming it to known_good_queries and writing a verbose description that signalled "vetted, high-quality SQL" moved it from ignored to frequently used. Their insight: tools are prompts, and the engineering of tool descriptions is generally overlooked.
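The before-and-after is worth seeing side by side, even in paraphrased form; the exact schemas below are our reconstruction, but the renaming pattern is the point.

```python
# Tool selection is a probabilistic match on name and description, so the
# description is effectively part of the prompt. Schemas paraphrased.
ignored_tool = {
    "name": "example_queries",
    "description": "Example SQL queries.",
}

frequently_used_tool = {
    "name": "known_good_queries",
    "description": (
        "A library of vetted, high-quality SQL queries that are known to run "
        "correctly against this schema. Consult these before writing a new "
        "query from scratch."
    ),
}
```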
Databook coined the term "choice entropy" to describe what happens when agents connect to APIs and see dozens or hundreds of data fields. The more choices available, the more opportunities for the model to misfire. Their solution involves filtering and reshaping tool schemas so agents see only what’s relevant for specific tasks, the "tool masking" approach covered in the context engineering section above.
The Holdouts: When MCP Isn’t the Answer
Not everyone is on board, and the reasons are instructive.
Digits, an automated accounting platform, explicitly rejected MCP for production use. Their head of applied AI was direct: "We haven’t adopted MCP or A2A protocols because all our data is internal and major security questions remain unresolved." While MCP provides good marketing value for connecting to external services, it represents a "hard play" to integrate into production products until security concerns are addressed. For high-stakes financial data, the security and privacy implications aren’t yet mature enough for their production standards.
This isn’t a fringe position, though the picture is shifting: authentication capabilities have improved substantially over the past six to seven months, making MCP more viable in enterprise contexts. Even so, the Digits example is a useful reminder that standardisation only provides value when the standards meet your security requirements, and for some use cases that threshold hasn’t been crossed yet.
Interesting Patterns at the Edges
Some teams are pushing MCP into unexpected territory.
Goodfire’s MCP-based Jupyter integration surfaced an important security consideration: the Jupyter kernel integration allows agents to bypass security permissions built into systems like Claude Code. Without custom security checks, agents can pass arbitrary code to the tool, circumventing default permissions. They observed agents that were blocked from running pip install via native bash tools realising they could execute the same commands through notebook tool calls. The flexibility that makes MCP powerful also creates security surface area that teams must actively manage.
The "USB-C for AI" Question
Deepsense describes MCP as potentially becoming "the USB-C for AI integration": once a company builds an MCP server for their data, any agent can use it without custom glue code. The analogy is appealing, and there’s real value in standardisation. But Deepsense also warns that poorly designed MCP servers can "bloat context windows" to the point where agent reasoning is effectively destroyed. Standardisation only pays off when the standards are well implemented; a sloppy MCP server can be worse than a well-designed custom integration.
Where This Leaves Us
The honest assessment is that MCP has achieved something unusual in the LLMOps space: relative stability. The protocol exists, it works, enterprises are using it at scale, and a growing body of practical knowledge documents what succeeds and what fails. That’s more than can be said for agent harnesses or memory systems.
But stability doesn’t mean maturity. The challenges around context pollution, tool naming, and authentication are being solved through accumulating experience rather than protocol improvements. Teams are learning that tools are prompts, that less context often means better performance, and that security boundaries require active management.
What the database suggests is that MCP is settling into its appropriate role: infrastructure for connecting agents to existing systems rather than a solution in itself. The teams extracting value are those treating it as a standardised integration layer while doing the harder work of designing appropriate abstractions, managing token budgets, and implementing proper security controls. USB-C is useful precisely because it’s just a connector; the intelligence has to come from elsewhere.
5. Evals and Guardrails: Where the Engineering Actually Happens

If there’s one area where the database reveals the most dramatic maturation in production LLM practices, it’s the parallel evolution of evaluation systems and guardrails. What began as informal "vibe checks" and basic content filters has transformed into sophisticated engineering disciplines. The shift represents a fundamental rethinking of how organisations validate and constrain AI behaviour in systems where the consequences of failure extend well beyond embarrassing chatbot responses.
The Death of the Vibe Check
The phrase "evals are the new unit tests" has become something of a mantra, and Ramp’s expense automation platform provides a compelling demonstration of why. Their approach to evaluating their policy agent, which now handles over 65% of expense approvals autonomously, follows what they describe as a "crawl, walk, run" strategy. Rather than attempting comprehensive evaluation from day one, they start with quick, simple evals and gradually expand coverage as the product matures.
What makes Ramp’s approach particularly noteworthy is their treatment of edge cases: they turn every user-reported failure into a regression test case, creating a continuous feedback loop between production experience and evaluation coverage.
But here’s the nuance that separates mature practitioners from the enthusiastic early adopters: user feedback requires careful interpretation. Ramp discovered that finance teams might approve expenses that technically violate policy, approving things out of convenience or relationship dynamics rather than strict compliance. Simply treating user actions as ground truth would bias the system toward excessive leniency. Their solution was creating "golden datasets" carefully reviewed by their own team to establish correct decisions based solely on information available within the system. This independent labelling process removes affinity bias and other human factors that might influence real-world decisions.
The scale of systematic evaluation is substantial at some organisations. GitHub runs comprehensive offline evaluations against their Copilot models to catch regressions before they hit production, testing models before user interaction across metrics like latency, accuracy, and contextual relevance.
Traditional ML Policing Generative AI
One of the more unexpected patterns in the database is the use of traditional machine learning models to govern when and whether LLMs should be invoked at all. DoorDash built a sophisticated multi-stage validation pipeline for their internal agentic AI platform that they call "Zero-Data Statistical Query Validation." The system includes automated linting, EXPLAIN-based checking for query correctness and performance against engines like Snowflake and Trino, and statistical metadata checks on query results—such as row counts or mean values—to proactively identify issues like empty result sets or zero-value columns, all without exposing sensitive data to the AI model.
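A self-contained analogue of those checks, using sqlite3 in place of Snowflake or Trino and with invented thresholds, looks roughly like this:

```python
# Simplified analogue of a statistical query-validation pipeline: plan the
# query, then check result metadata (row counts, all-zero columns) without
# ever sending the underlying rows to the model.
import sqlite3

def validate_generated_sql(conn: sqlite3.Connection, sql: str) -> list[str]:
    issues = []

    # 1. Cheap correctness check: can the engine even plan this query?
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
    except sqlite3.Error as e:
        return [f"query does not compile: {e}"]

    # 2. Statistical metadata checks on the result set.
    rows = conn.execute(sql).fetchall()
    if not rows:
        issues.append("empty result set")
    else:
        for i, col in enumerate(zip(*rows)):
            numeric = [v for v in col if isinstance(v, (int, float))]
            if numeric and all(v == 0 for v in numeric):
                issues.append(f"column {i} is all zeros")
    return issues

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL)")
print(validate_generated_sql(conn, "SELECT amount FROM orders"))  # ['empty result set']
```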
This pattern of using deterministic checks and traditional ML to validate, constrain, or gate LLM behaviour appears repeatedly across the database.
Architectural Guardrails: Moving Safety Out of the Prompt
The most significant theme across the database is the systematic movement of safety logic out of prompts and into infrastructure. The limitations of prompt-based guardrails are now well understood: every time a new model comes out, exploits for prompt injection emerge within hours. As Oso’s framework for agent governance puts it bluntly: "what 1997 was for SQL injection, 2025 is for prompt injection."
Oso introduced what they call a "Three-Component Identity" model for agent systems, requiring user, agent, and session context for proper authorisation. The session component is particularly innovative; they treat sessions as capable of being "tainted" once they touch certain combinations of data. If an agent reads untrusted content (like a user email) and then accesses sensitive data (like a database), the system automatically blocks it from using external communication tools (like Slack) for the rest of that session. This prevents prompt injection attacks from succeeding regardless of what the model tries to do, because the safety logic is implemented in code rather than the prompt.
Their approach draws an explicit analogy to memory-safe programming languages: once a variable is "tainted," it cannot be passed to secure sinks. The key insight is that authorisation decisions must consider the sequence of events within a session, and this type of context-dependent authorisation is impossible without tracking session state.
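In spirit, the enforcement looks like the sketch below: a session object accumulates taint from events, and the authorisation check runs in ordinary code, outside anything a prompt can influence. The class and the event names are illustrative, not Oso's API.

```python
# Sketch of session-taint tracking: once the session has both read untrusted
# content and touched sensitive data, external communication tools are blocked.
class Session:
    def __init__(self, user: str, agent: str):
        self.user, self.agent = user, agent
        self.read_untrusted = False
        self.touched_sensitive = False

    def record(self, event: str) -> None:
        if event == "read_untrusted_content":   # e.g. parsed an inbound email
            self.read_untrusted = True
        if event == "accessed_sensitive_data":  # e.g. queried the database
            self.touched_sensitive = True

    @property
    def tainted(self) -> bool:
        return self.read_untrusted and self.touched_sensitive

EXTERNAL_TOOLS = {"send_slack_message", "send_email", "http_post"}

def authorise(session: Session, tool: str) -> bool:
    # Enforced in code, so it holds no matter what a prompt-injected model asks for.
    if tool in EXTERNAL_TOOLS and session.tainted:
        return False
    return True

s = Session(user="alice", agent="support-bot")
s.record("read_untrusted_content")
s.record("accessed_sensitive_data")
print(authorise(s, "send_slack_message"))  # False: exfiltration path blocked
```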
Wakam, a European digital insurance company, implemented what they describe as a "dual-layer" permission system. One layer controls what the human can see, and a second layer controls what the agent can access. A user can only invoke an agent if they also hold the permissions for the data that agent uses. This architectural approach prevents users from using agents to bypass their own access controls, a vulnerability that prompt-based guardrails cannot reliably address.
Komodo Health’s healthcare analytics assistant takes this to its logical conclusion: their LLM has zero knowledge of authentication and authorisation, which are handled entirely by the APIs it calls.
Creative Solutions at the Edges
Some of the most interesting guardrail implementations in the database address highly specific technical constraints with creative solutions.
Toyota’s vehicle information platform faced a particular challenge: every response must include legally correct disclaimers, and this text cannot be altered by the LLM under any circumstances. Their solution was a technique they call "stream splitting." They trained their model to output three distinct streams of data: the natural language response, ID codes for images, and ID codes for legal disclaimers. The application layer then injects the immutable legal text based on those codes. This guarantees the LLM cannot hallucinate or slightly alter legally binding text, a requirement that would be impossible to enforce through prompting alone.
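One way to picture the mechanism is a placeholder-substitution step at the application layer, as in the sketch below. The tag syntax, IDs, and disclaimer text are invented; the property that matters is that the legal strings live in a table the model never generates.

```python
# Sketch of stream splitting: the model emits IDs, never legal text; the
# application layer injects the immutable disclaimers verbatim.
import re

LEGAL_DISCLAIMERS = {  # maintained by legal, never touched by the model
    "D-TOWING": "Towing capacity varies by configuration. See owner's manual.",
    "D-FUEL": "EPA-estimated figures. Actual mileage will vary.",
}
IMAGES = {"IMG-042": "https://example.com/images/towing_diagram.png"}

def render(model_output: str) -> str:
    """Model output uses placeholder tags like [[IMG:IMG-042]] and [[LEGAL:D-TOWING]]."""
    def substitute(match: re.Match) -> str:
        kind, key = match.group(1), match.group(2)
        if kind == "LEGAL":
            return f"\n\n{LEGAL_DISCLAIMERS[key]}"   # injected verbatim
        return f"\n[image: {IMAGES[key]}]"
    return re.sub(r"\[\[(LEGAL|IMG):([A-Z0-9\-]+)\]\]", substitute, model_output)

print(render("This trim tows up to 5,000 lbs. [[IMG:IMG-042]] [[LEGAL:D-TOWING]]"))
```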
Incident.io’s AI-powered summary generator demonstrates a different kind of creative constraint. Since they know the actual root causes of past outages, they can replay historical incidents to their agent to see if it correctly identifies the cause. This "time travel" evaluation approach lets them assess whether the agent’s understanding lags behind or leads the human responders, ensuring the agent doesn’t hallucinate a fix that wasn’t actually possible at that specific moment in time. It’s a form of evaluation that’s only possible because of the structured nature of their domain.
Digits, an automated accounting platform, routes generation to one provider while sending outputs to a different provider for evaluation. Using a different model family prevents the "grading your own test" problem where a model fails to catch its own mistakes because it shares the same blind spots.
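A bare-bones version of that cross-provider check, with placeholder calls standing in for whichever SDKs are actually involved, might look like this:

```python
# Sketch of cross-provider checking: generate with one model family, grade
# with another so shared blind spots are less likely to pass silently.
def call_generator(prompt: str) -> str:
    return "Revenue grew 12% quarter over quarter."            # stand-in response

def call_evaluator(prompt: str) -> str:
    return "PASS - figures consistent with the source data."   # stand-in verdict

def generate_with_check(task: str) -> tuple[str, bool]:
    draft = call_generator(task)
    verdict = call_evaluator(
        "You are reviewing another model's output for factual and numerical "
        f"errors.\n\nTask: {task}\n\nOutput: {draft}\n\n"
        "Reply PASS or FAIL with a one-line reason."
    )
    return draft, verdict.strip().upper().startswith("PASS")

print(generate_with_check("Summarise Q3 revenue trends."))
```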
User-Controllable Guardrails: Product Features Rather Than Backend Settings
One of the more forward-thinking patterns in the database is the transformation of guardrails from hidden technical constraints into user-configurable product features.
Ramp’s policy agent implements what they describe as an "autonomy slider" through their existing workflow builder. Users can specify exactly where and when agents can act autonomously, combining LLM de