Teams that have run Software as a Service (SaaS) products know the routine: An alert goes off, and everyone scrambles to the observability stack first. Metrics, logs and traces have been the signals that help engineers figure out what broke, why users are stuck or whether a service-level agreement (SLA) has slipped. And for years, these tools have worked well enough.
But then AI showed up.
Behind all the hype and potential surrounding copilots, chat interfaces, and intelligent assistants, engineering teams have quietly run into something more complicated: Large language model (LLM)-powered applications don’t behave like traditional software, and the tools we’ve relied on can’t always fully explain what’s happening under the hood.
Why LLMs break traditional observability
If microservices are like puzzle pieces that fit together, LLMs are more like improv actors. They take direction, but the outcomes aren’t entirely predictable. This unpredictability changes the entire equation for reliability.
LLM workloads are:
- **Probabilistic.** The same inputs don’t always produce the same output.
- **Transient and multistep.** A single user request might trigger retrieval, multiple model calls, tool execution, parsing and retries.
- **Constantly evolving.** Prompt templates change weekly, model versions get swapped out and quality fluctuates without warning.
A simple user search can trigger a cascade of steps, so when something goes wrong, where do you even start? Logs don’t explain why the model hesitated or how a prompt drifted over time. Metrics can’t tell you if a hallucination slipped into a response that ended up on a customer’s screen.
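To make that cascade concrete, here is a minimal sketch of per-step tracing for one user search. Everything in it is illustrative: `retrieve`, `call_model` and `handle_search` are hypothetical stand-ins for real pipeline stages, not any particular SDK; the point is that each step gets its own recorded span, which is what a log line alone doesn't give you.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    step: str
    latency_ms: float
    metadata: dict = field(default_factory=dict)

class RequestTrace:
    """Collects one span per pipeline step for a single request."""

    def __init__(self):
        self.spans: list[Span] = []

    def record(self, step, fn, **metadata):
        # Time the step, keep its result, and stash a span for later analysis.
        start = time.perf_counter()
        result = fn()
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.spans.append(Span(step, elapsed_ms, metadata))
        return result

# Stubbed pipeline stages (assumptions, not a real retrieval system or model).
def retrieve(query):
    return ["doc-1", "doc-2"]

def call_model(prompt):
    return {"text": "answer", "tokens": len(prompt.split())}

def handle_search(query):
    trace = RequestTrace()
    docs = trace.record("retrieval", lambda: retrieve(query))
    prompt = f"Context: {' '.join(docs)}\nQuestion: {query}"
    reply = trace.record("generation", lambda: call_model(prompt))
    answer = trace.record("parsing", lambda: reply["text"].strip())
    return answer, trace

answer, trace = handle_search("reset my password")
for span in trace.spans:
    print(f"{span.step}: {span.latency_ms:.2f} ms")
```

With spans like these, "where do you even start?" at least has an answer: you can see which step in the chain was slow or failed, even if the spans still can't tell you *why* the model produced what it did.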
It’s not that the legacy tools are bad; they just weren’t built for systems that reason, adapt and change this quickly.
What teams actually end up monitoring
Once LLMs move into production, teams quickly realize they are watching a new set of signals every day:
- **Token usage**, because cost is directly tied to prompt and response size, often chosen by developers with little visibility into what those choices cost.
- **Latency**, especially when AI sits in the critical path of a customer-facing API.
- **Error rates**, from model failures, tool calls or upstream integrations.
- **Response quality**, including correctness and hallucinations, which traditional telemetry cannot measure.
These are reliability concerns, but they do not map cleanly to CPU, memory or request counts.
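A rough sketch of what rolling up the first three signals might look like. The call records, the `summarize` helper and the per-1K-token prices are all assumptions for illustration; real pricing varies by model, and real gateways emit richer schemas.

```python
# Hypothetical per-call records, as an LLM gateway or middleware might emit them.
calls = [
    {"prompt_tokens": 420, "completion_tokens": 180, "latency_ms": 950, "error": None},
    {"prompt_tokens": 1800, "completion_tokens": 600, "latency_ms": 2400, "error": None},
    {"prompt_tokens": 510, "completion_tokens": 0, "latency_ms": 12000, "error": "timeout"},
]

# Assumed prices in USD per 1,000 tokens; check your provider's actual rates.
PRICE_PER_1K_PROMPT = 0.003
PRICE_PER_1K_COMPLETION = 0.006

def summarize(calls):
    """Roll token spend, tail latency and error rate into one snapshot."""
    total_prompt = sum(c["prompt_tokens"] for c in calls)
    total_completion = sum(c["completion_tokens"] for c in calls)
    cost = (total_prompt / 1000) * PRICE_PER_1K_PROMPT \
         + (total_completion / 1000) * PRICE_PER_1K_COMPLETION
    latencies = sorted(c["latency_ms"] for c in calls)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    error_rate = sum(1 for c in calls if c["error"]) / len(calls)
    return {"cost_usd": round(cost, 4), "p95_latency_ms": p95, "error_rate": error_rate}

print(summarize(calls))
```

Note what's missing: nothing in this rollup can flag the fourth signal, response quality. A call can be cheap, fast and error-free while still hallucinating, which is exactly the gap traditional telemetry leaves open.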