Troubleshooting software requires observability: We need to collect and analyze telemetry to formulate, disprove or validate hypotheses about why our software is behaving differently than we wanted.
Generative AI is growing to accompany us through the journey, and has the potential for it to take over more toil — especially with troubleshooting.
The Iterative Nature of Observability
A system is observable if we can figure out what it is doing based on data (telemetry) it emits. There’s many types of telemetry, called signals. The most commonly used are logs, metrics and traces.
Telemetry does not just happen: Our systems must generate it as part of their normal operations. The runtimes that host our applications can be configured to generate a wealth of telemetry out of the box, and so can our container orchestrators, operating systems and so on. We can also add dedicated logic to our applications, called instrumentation, that creates additional telemetry. I think of it as application logic we pay forward to debug other application logic.
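To make this concrete, here is a minimal sketch of what such dedicated instrumentation can look like, using the OpenTelemetry Python API (the span name and attributes are purely illustrative):

```python
from opentelemetry import trace

# Acquire a tracer from the globally configured OpenTelemetry SDK.
tracer = trace.get_tracer("checkout-service")

def place_order(item_count: int, currency: str) -> None:
    # Wrap the business operation in a span so the backend can see how long
    # it took, whether it failed and in which context it ran.
    with tracer.start_as_current_span("place_order") as span:
        # Application-specific attributes we pay forward to future debugging.
        span.set_attribute("app.cart.item_count", item_count)
        span.set_attribute("app.cart.currency", currency)
        # ... business logic goes here ...
```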

The iterative process of troubleshooting systems.
The telemetry generated by our applications is not always perfect and needs processing: We need to (spam) filter telemetry, because a lot of it is actually not that useful. We need to add context to telemetry, as the application that generates it may not have access to enough information to provide all the necessary metadata. We also may need to ensure that the right telemetry is forwarded to the right observability backend, in case we use different ones depending on the use case or the signal.
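As a purely illustrative sketch (not any particular product's API), such a processing step might filter, enrich and route records along these lines:

```python
from typing import Iterable

# Assumed record shape: a dict with "body", "signal" and arbitrary metadata keys.
NOISE_MARKERS = ("health check", "heartbeat")

def process(records: Iterable[dict], environment: str) -> dict[str, list[dict]]:
    """(Spam) filter, enrich and route telemetry records to per-signal backends."""
    routed: dict[str, list[dict]] = {"logs-backend": [], "metrics-backend": []}
    for record in records:
        # 1. Filter: drop telemetry that is unlikely to ever be useful.
        if any(marker in record.get("body", "").lower() for marker in NOISE_MARKERS):
            continue
        # 2. Enrich: add context the application could not know about itself.
        record.setdefault("deployment.environment", environment)
        # 3. Route: send each signal to the backend that handles it.
        backend = "metrics-backend" if record.get("signal") == "metric" else "logs-backend"
        routed[backend].append(record)
    return routed
```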
Once the telemetry gets to the observability backend, we must detect anomalies by looking for signs that something is amiss with our systems. And when anomalies are detected, we must troubleshoot the system.
And in each of these steps, AI is either already a helpful, powerful companion or has great potential to become one.
Artificial Intelligence in Instrumentation
AI coding assistants have great potential to treat observability as what it should be: a first-class functional requirement of our systems. Unfortunately, to date, that potential seems to be effectively untapped.
It’s not that AI is not capable of adding instrumentation: When you ask for it, it does a passable job. Yet code assistant tools do not generally add instrumentation by default, and they do not seem to know what telemetry is going to be useful given the kind of applications they work on.
In a sense, the invention is imitating the questionable habits of the inventor: Source code that humans write seldom comes with observability as a functional requirement. This is largely why we have many ways of automatically collecting telemetry from applications at runtime by adding instrumentation. And automatic instrumentation is perfectly fine: Much of the instrumentation related to the technologies we use does not need to be invented anew every time. The world needs exactly one set of metrics about Java garbage collection, and exactly one set of metadata about how to describe HTTP requests and responses.
In other words, automatic, out-of-the-box, generic instrumentation covers about 80 to 90% of what you need and is the best place to start your observability journey, but the remaining 10 to 20% should be ad hoc, application-specific telemetry that reflects the business aspects of your system.
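For instance, that application-specific slice can be as small as enriching the span that an auto-instrumentation already created with business context (the attribute names below are made up):

```python
from opentelemetry import trace

def apply_discount(order_total: float, campaign: str) -> float:
    # The web framework's auto-instrumentation already opened a span for this
    # request; we only add the business context that no generic library knows.
    span = trace.get_current_span()
    span.set_attribute("app.campaign.name", campaign)
    span.set_attribute("app.order.total", order_total)
    return round(order_total * 0.9, 2)  # illustrative business logic
```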
Artificial Intelligence in Telemetry Processing
After telemetry is generated, it must be processed and routed for analysis. There are several things AI can help with in terms of processing telemetry:
- **(Spam) filter telemetry:** Not all telemetry is equally valuable. In particular, the telemetry generated by auto-instrumentations is not consistently useful and tends to become indispensable only to explain anomalies detected elsewhere. I have not yet seen a system that uses AI to select which telemetry to keep beyond short-term storage, but I am very much looking forward to it.
- **Redact information:** There are few systems that have never sent sensitive data over logs or telemetry metadata. AI should be able to detect many of these situations and act accordingly, though I have not seen this in practice yet.
- **Improve telemetry:** Adding missing context, filling metadata gaps (like fixing missing severities in logs) and extracting important information as attributes that can be queried separately (for example, by automatically detecting log patterns).
- **Aggregate telemetry:** Metrics are not a silver bullet: They are a way to frugally (with relatively few data points) represent important aspects of a system, losing a lot of information in the process.
Telemetry collection is the most likely area in observability where AI can shine. A lot of what observability looks like today is due to limitations we have as humans: Compared to software, we are slow, we mostly do one complex thing at a time, and we are in one place at one time. We collect swaths of telemetry and are limited in how much of it we can analyze. It can take us seconds or minutes to realize that something is amiss. We might not have the time to jump on a bug until next week, so we store a lot of telemetry for a long time.
But software scales way more than humans do. If (and that’s a big “if”) AI can both write and operate our systems autonomously, we will see a shift in which telemetry is collected and for how long. We’ll see dramatically less reliance on metrics and other pre-aggregated information, and much more event-like telemetry (logs, spans, etc.). We’ll see more collection on demand and telemetry stored for much less time.
There is, however, one qualitative difference between humans and AI consuming telemetry: AI needs radically more consistency. As humans, we can remember that we messed up the metadata and called the same thing by three different names. If we come across `team.id` and `team.identifier` in the same troubleshooting session, we know that something is up.
AI takes information at face value, since it lacks intuition and, to a large extent, the ability to amass experience. Moreover, AI generally does not come back with clarifying questions, although that may change. And this is why semantic conventions are so crucial for AI agents: They do not come with the healthy realism about human fallibility that experienced developers have accumulated one disappointment at a time.
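A minimal, illustrative sketch of what enforcing that consistency can look like: mapping ad hoc attribute names onto a single agreed-upon convention before an agent ever sees them (the mapping below is made up).

```python
# Illustrative aliases observed in the wild, mapped to the agreed-upon names.
CANONICAL_NAMES = {
    "team.identifier": "team.id",
    "teamId": "team.id",
    "http.status": "http.response.status_code",
}

def normalize_attributes(attributes: dict) -> dict:
    """Rename known aliases so the same concept always uses the same key."""
    return {CANONICAL_NAMES.get(key, key): value for key, value in attributes.items()}

# Both records now expose the team under one consistent key, `team.id`.
print(normalize_attributes({"team.identifier": "platform", "region": "eu-west-1"}))
```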
AI in Detecting Anomalies
In terms of observability, we live in captivating times. AI is poised to drastically change the way we generate and consume insights about what is wrong with our systems. It is a paradigm shift that goes well beyond “AI troubleshoots for you.” After a decade of unkept promises, it finally feels real.
For a long time AI has done a pretty good job of detecting anomalies, and I don’t see that changing much. Anomaly detection is a profoundly analytical, statistical and largely deterministic field.
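As a toy illustration of how deterministic this can be, a simple z-score check over a recent baseline flags a latency outlier without any generative model involved:

```python
import statistics

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `threshold` standard deviations away
    from the mean of `history` (a classic, fully deterministic check)."""
    if len(history) < 2:
        return False
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > threshold

# A latency spike against a stable baseline (values in milliseconds).
baseline = [102.0, 99.5, 101.2, 100.8, 98.9, 100.1]
print(is_anomalous(baseline, 180.0))  # True
```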
The potential of generative AI here is largely to reduce false positives by running ad-hoc, additional sanity checks. That blends with the next step, and what everybody is currently excited about: troubleshooting.
AI in Troubleshooting
Troubleshooting is where AI truly is unlocking the next level of observability. Modern models with access to retrieval-augmented generation (RAG) and advanced, deterministic diagnostic tools can debug in a couple of minutes an issue that has left some of the most talented technologists stumped for half an hour.
GenAI can generate queries, dashboards or alerts, relieving the cognitive load of human operators during outages. This can democratize troubleshooting: It greatly lowers the bar, empowering all developers to be more effective toward issue resolution. This can free up time for the most experienced developers when problems can be solved without taking them away from other work.
The potential for AI to do a lot of the heavy lifting in troubleshooting cannot be overstated. But the most exciting part is that we gain an entirely new paradigm for consuming observability insights.
Observability tool dashboards present a lot of numbers and charts in bright colors jostling for your attention. It is invariably overwhelming. Custom dashboards are only slightly more flexible. This is where the conversational aspect of GenAI is at its best: When wielded well, it can tell the user in plain language exactly what they need to know. I yearn for the day that I will open my dashboard and read:
“The product catalog service has been having issues since the last deployment at 12:45, 2 minutes ago. The FindProduct API is consistently failing to retrieve information for a handful of product IDs. It does not look like a database issue. It is affecting on average 1024 unique users every minute and preventing them from completing the Checkout user flow.”
Imagine reading this, followed by a dynamically generated list of relevant visualizations presented as supporting evidence in a logical sequence. It could show hypotheses that were formulated and discarded, with narration explaining its reasoning just one mouse-click away. That future is not far away.
This does not mean that dashboards will go away entirely, but in a world where a narrative about an ongoing issue is available, a static dashboard seems a relic of the past.
It could even make observability a good experience on the small screens of mobile phones. Because GenAI can explain things sequentially, we will consume troubleshooting reports like we read post-mortem blogs.
Once a track record of reliability is built up, we might even eventually trust AI to make changes independently.
Thoughts About Design for Observability in the Age of AI
Interestingly enough, there are unexpected synergies between designing AI for observability and improving the observability experience for humans.
AI troubleshoots like humans, but at an industrial scale. Large language models, because they are trained on human content, emulate the way we do things. They can just do infinitely more of it. This means that the better the primitives humans have to troubleshoot problems, the better AI gets at troubleshooting. (These primitives, in the current world of AI, are usually tools in an MCP server.) But the opposite is also true: If our observability tools are missing some advanced capability for AI, humans are likely missing it too.
AI is a power user. Troubleshooting complex systems almost always falls to a few knowledgeable people, making them highly sought after (and stressed). AI has the potential to explain, enable and educate people to further the spread of advanced knowledge.
AI can reduce cognitive load. Instead of dashboards full of charts and numbers, AI can present concise analysis, ideally in plain language and offer supporting evidence on demand.
So observability tools must also be designed for AI as a consumer:
Accessibility for AIs. An increasing number of observability tools are introducing built-in AI agents, some built on Model Context Protocol (MCP) servers, others using proprietary APIs not available to the outside world.
In the future, we could have networks of specialized agents that collaborate (for instance, using the A2A protocol) on solving issues: The observability agent troubleshoots, collaborates with the GitHub agent to open a pull request and with the Linear agent to document the progress of handling the incident.
I am very curious to find out which level of openness we, as an industry, will settle on as “table stakes” in the agentic world. The answer is probably further toward open than the current state of APIs: Compared to “normal” software, the integration cost for an AI agent to use new tools is effectively zero, so there will be much higher expectations that agentic AI will eagerly use the APIs available to it.
AI-driven troubleshooting must be grounded in determinism. Large language models are not deterministic: Given the same inputs, they will generate different outputs, which opens the door to hallucinations. However, observability has structure to help humans cope with avalanches of telemetry from complex systems: We have signals, semantic conventions, documentation and capabilities to analyze data that are effectively math deployed at massive scale. The more advanced, deterministic tools we give to GenAI, for example via an MCP server, the fewer bad things happen.
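As a sketch of what that can look like, here is a deterministic capability exposed to an agent through the FastMCP helper of the official MCP Python SDK; the tool and its stubbed data source are hypothetical:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("observability-tools")

def query_request_counts(service: str, minutes: int) -> tuple[int, int]:
    # Stand-in for a real query against the observability backend.
    return 1200, 18

@mcp.tool()
def error_rate(service: str, minutes: int = 15) -> float:
    """Percentage of failed requests for `service` over the last `minutes`
    minutes, computed deterministically from stored telemetry."""
    total, failed = query_request_counts(service, minutes)
    return 0.0 if total == 0 else 100.0 * failed / total

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so an agent can call it
```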
A Personal Retrospective
I have been working in observability for the last two decades. I have witnessed moments of intense excitement, like when Prometheus and OpenTelemetry became a thing, or when Google showed the world that continuous production profiling was both possible and viable at massive scale.
However, little compares with the realistic, pragmatic potential of AI to advance our practice of observability, lifting many of the limitations we have come to accept and taking over toil that we have been chafing under.
The potential for AI is intoxicating.