TLDR: Most RAG observability focuses on the model. This experiment focuses on the source. By instrumenting a Docusaurus-backed RAG pipeline with FastAPI, Prometheus, and trace IDs, I show how “hallucinations” can be measured, traced, and turned into actionable documentation work.
Over the past few months, I’ve been experimenting with building production-grade AI documentation systems using RAG (Retrieval-Augmented Generation), observing the performance and metrics of pipelines. I previously wrote a blog post about the experience here.
When a system gives a wrong answer, the instinctive response is to tweak the model side: “Is retrieval working?” “Adjust the temperature,” “Try a different embedding,” or “Rewrite the prompt.”
These metrics and subsequent parameter tweaks help engineers properly identify and optimize the RAG system to produce better answers.
At least that’s the assumption.
But if the problem is the source (documentation), no amount of model tuning will fix the output.
This post documents an experiment built around that idea. It’s not a polished solution, nor is it a product pitch. In a way, it’s a proof of concept that treats documentation as a system with failure modes and asks a simple question:
What would it look like if we could observe documentation health the same way we observe infrastructure health?
Before building anything, I wrote down the specific failure modes I wanted to detect, which I started calling “content debt.”
A common assumption in RAG observability is that you need an “LLM-as-a-judge” layer to detect every issue. That approach introduces latency, cost, and non-determinism. For this proof of concept, I deliberately avoided it.
Instead, I focused on a small set of signals that directly represent documentation failure modes:
unanswered questions
internal contradictions across versions
unsupported feature demand
weak or missing grounding
vocabulary gaps between queries and content
Each of these signals maps directly to a type of documentation work. I felt that if I could observe how the AI struggles with the docs, I could generate an evidence-based roadmap for the writers.
Where possible, I relied on deterministic metadata checks rather than model-based evaluation.
**Version Conflicts:** Detected programmatically when retrieved chunks belong to mutually exclusive product versions (e.g., v1.0 and v2.0) within the same context window.
**Unsupported Features:** Identified by mapping user queries against a known “unsupported” list in the application logic (see the sketch after this list).
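To keep the deterministic side concrete, here is a minimal sketch of both checks. It assumes each retrieved chunk carries a version label in its metadata and that the unsupported-feature list is maintained by hand; the names (detect_version_conflict, UNSUPPORTED_FEATURES) are illustrative, not the actual PoC code.

```python
# Minimal sketch; assumes each chunk's metadata includes a "version" field.
UNSUPPORTED_FEATURES = {"feature x"}  # hypothetical hand-maintained list


def detect_version_conflict(chunks: list[dict]) -> bool:
    """Flag a conflict when chunks from more than one product version
    end up in the same context window."""
    versions = {chunk["metadata"]["version"] for chunk in chunks}
    return len(versions) > 1


def detect_unsupported_feature(query: str) -> bool:
    """Flag queries that ask about a feature on the known-unsupported list."""
    normalized = query.lower()
    return any(feature in normalized for feature in UNSUPPORTED_FEATURES)
```

Both checks are pure string and set operations: no model call, no added latency, and the result is deterministic for a given retrieval.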
Beyond binary failures like version conflicts, I needed to catch “soft” failures where the RAG system thinks it has an answer, but the evidence is weak.
To catch these cases, I implemented three lightweight heuristics (see the sketch after this list):
Weak Evidence: If the distance of the top citation exceeds a strict threshold (e.g., 0.55), the system flags it. This usually means the user asked a question that is completely outside the domain of the documentation.
Low Relevance: If the average distance of the top k results is high (e.g., > 0.65), it implies the retriever is struggling to find a cluster of relevant information, likely returning scattered noise.
Low Coverage: This is a keyword-based check. I tokenize the user’s query (removing stopwords like “how”, “does”, “work”) and check if the remaining meaningful terms (e.g., filter_area) actually appear in the retrieved text. If they don’t, the system flags a missing definition.
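Here is a minimal sketch of the three heuristics, assuming the retriever returns a distance for each chunk (lower is better) along with its text. The thresholds mirror the numbers above; the stopword list and function name are illustrative.

```python
WEAK_EVIDENCE_THRESHOLD = 0.55   # best citation farther than this => weak evidence
LOW_RELEVANCE_THRESHOLD = 0.65   # high average distance => scattered noise
STOPWORDS = {"how", "does", "do", "what", "is", "the", "a", "an", "in", "work"}


def detect_soft_failures(query: str, chunks: list[dict]) -> list[str]:
    """Return the soft-failure signals triggered by one query."""
    signals: list[str] = []
    distances = [chunk["distance"] for chunk in chunks]

    # Weak evidence: even the single best citation is a poor match.
    if distances and min(distances) > WEAK_EVIDENCE_THRESHOLD:
        signals.append("weak_evidence")

    # Low relevance: the top-k results as a whole sit far from the query.
    if distances and sum(distances) / len(distances) > LOW_RELEVANCE_THRESHOLD:
        signals.append("low_relevance")

    # Low coverage: a meaningful query term never appears in the retrieved text.
    terms = {t.strip("?.,!") for t in query.lower().split()} - STOPWORDS
    retrieved_text = " ".join(chunk["text"].lower() for chunk in chunks)
    if any(term and term not in retrieved_text for term in terms):
        signals.append("low_coverage")

    return signals
```

For a query like “How does filter_area work?”, the stopwords fall away, filter_area is the only term left, and the coverage check fires the moment the retrieved chunks never mention it.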
A dashboard can tell you that something is wrong, but not what to fix.
I separated the two concerns:
Metrics (Prometheus): Answer the question, “Is there a problem?” They provide aggregated views and trends (e.g., version_conflicts_total).
Structured Logs (ELK/JSON): Answer the question, “Where, exactly, is the problem?” They offer the granular detail needed for deep dive diagnosis.
Every request generates a unique query_id (trace ID), linking a metric spike in Grafana to the exact log entry containing the retrieved sections and detected signals (a minimal sketch of this wiring follows the list below).
Each log contains:
the original query and requested version (the latest version is used as the default unless otherwise specified)
retrieved documents and sections
triggered issue signals
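Here is a minimal sketch of how that wiring could look with FastAPI and prometheus_client. The single labelled counter, the field names, and the run_pipeline helper are assumptions for illustration; the PoC’s actual metric names (e.g., version_conflicts_total) and handler differ in detail.

```python
import json
import logging
import uuid

from fastapi import FastAPI
from prometheus_client import Counter, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # scraped by Prometheus

# Aggregated view: "Is there a problem?"
ISSUE_SIGNALS = Counter(
    "doc_issue_signals_total",
    "Documentation issue signals detected per request",
    ["issue_type", "version"],
)

logger = logging.getLogger("rag.audit")


def run_pipeline(query: str, version: str):
    """Placeholder for retrieval + generation + the signal checks above."""
    raise NotImplementedError


@app.post("/query")
async def query_docs(payload: dict) -> dict:
    query_id = str(uuid.uuid4())              # trace ID shared by metric and log
    query = payload["query"]
    version = payload.get("version", "1.1")   # latest version by default

    chunks, answer, signals = run_pipeline(query, version)

    for signal in signals:
        ISSUE_SIGNALS.labels(issue_type=signal, version=version).inc()

    # Granular view: "Where, exactly, is the problem?"
    logger.info(json.dumps({
        "query_id": query_id,
        "query": query,
        "requested_version": version,
        "issue_types": signals,
        "top_citations": [
            {"source": c["source"], "distance": c["distance"]} for c in chunks
        ],
    }))

    return {"query_id": query_id, "answer": answer, "issue_types": signals}
```

A counter shaped like this can be charted in Grafana with a PromQL query such as sum by (issue_type) (increase(doc_issue_signals_total[1h])), while the query_id in the log line lets you jump from a spike straight to the offending request.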
To prove this works, I intentionally set up a flawed documentation corpus designed to trigger my failure signals (its layout is sketched after this list). For example:
**Versioned Directories (v1.0/ & v1.1/):** I split the docs into versioned folders. v1.0 explicitly lacks “Feature X”, while v1.1 introduces it. This setup guarantees Version Conflicts when users ask generic questions about the feature.
**Explicit Knowledge Gaps:** v1.0/tls.md states outright that “configuration details are not yet documented.” This forces the system to trigger an Unanswered signal when asked for configuration steps, rather than hallucinating.
**Vocabulary Gaps:** I defined filter_mode in v1.1/introduction.md but intentionally omitted the term filter_area. This allows me to verify that the Low Coverage detector correctly flags queries for terms that don’t exist in the corpus.
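For orientation, the files that show up in the citations and logs throughout this post sit in a layout roughly like this (not an exhaustive listing of the corpus):

```
data/docs/
├── v1.0/
│   └── tls.md             # "configuration details are not yet documented"
└── v1.1/
    ├── introduction.md    # defines filter_mode, deliberately omits filter_area
    ├── commands.md
    ├── configuration.md
    ├── security.md
    └── tls.md
```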
This setup allowed me to issue targeted queries and watch the Grafana dashboard light up with content-level failures rather than system errors.
Here is what happened when I threw specific curveballs at the system, as shown in the Grafana dashboard image below:
Interestingly, most of the queries I ran fell into the same pattern:
the system returned an answer
citations looked reasonable
but the evidence was weak, incomplete, or missing the exact term the user asked about
And just like that, I was able to identify a class of failure that quietly increases support load and makes AI assistants feel untrustworthy.
Below are a few real queries from the PoC, along with the issue signals that were triggered.
**Weak evidence:** I asked: “How do I rotate API keys?”
The system returned answer_mode=answered, but the top citations all had high distances (~0.63). In other words, it retrieved “security-adjacent” text, but nothing strongly matched the question. The query sounds legitimate, but the docs don’t actually support it.
It fired the weak_evidence signal.
{
  "query": "How do I rotate API keys?",
  "issue_types": ["weak_evidence"],
  "requested_version": "1.1",
  "top_citations": [
    {"source": "data/docs/v1.1/security.md", "distance": 0.6289},
    {"source": "data/docs/v1.1/configuration.md", "distance": 0.6369},
    {"source": "data/docs/v1.1/tls.md", "distance": 0.6398}
  ],
  "answer_mode": "answered"
}
From a writer’s perspective, this doesn’t mean “retrieval is broken.” It usually means: *users expect API key rotation to exist, but the docs don’t explicitly cover it.*
**Low coverage:**
I asked: “How does filter_area work?”
This is the cleanest example of vocabulary gaps. Retrieval found plausible pages (introduction, security, commands), but the term filter_area wasn’t present in the retrieved context. The system responded with “I don’t know… the documentation does not mention it,” and low_coverage was logged.
{
  "query": "How does filter_area work?",
  "issue_types": ["low_coverage"],
  "requested_version": "1.1",
  "answer_mode": "answered"
}
This is exactly the kind of scenario where many RAG systems hallucinate confidently because “filter” and “area” sound like concepts the model can invent around. In this PoC, the detector turns it into a documentation task: define the term, or remove it from user-facing surfaces if it’s not real.
**Low relevance:**
I asked: “describe_index”
This triggered both weak_evidence and low_relevance, with distances in the 0.82–0.85 range, meaning the retriever is scraping the bottom of the barrel:
{
  "query": "describe_index",
  "issue_types": ["weak_evidence", "low_relevance"],
  "requested_version": "1.1",
  "top_citations": [
    {"source": "data/docs/v1.1/commands.md", "distance": 0.8235},
    {"source": "data/docs/v1.1/introduction.md", "distance": 0.8397},
    {"source": "data/docs/v1.1/security.md", "distance": 0.8465}
  ]
}
This is a strong indicator of either:
a missing command reference page, or
a mismatch between how users phrase the command and how docs name it.
**Version-specific drift:** This is where the experiment started to feel most useful. I asked:
“What does compact do?” (v1.1)
“What does compact do in v1.0?” (v1.0)
For v1.1, the system produced a specific answer and cited relevant sections:
{
  "answer": "According to the documentation for v1.1, the `compact` command ...",
  "requested_version": "1.1",
  "citations": [
    {"source": "data/docs/v1.1/commands.md", "distance": 0.5034}
  ]
}
For v1.0, the system essentially admitted it couldn’t answer:
{
  "answer": "I don't know what \"compact\" does in v1.0.",
  "requested_version": "1.0"
}
That is documentation drift made visible: either compact doesn’t exist in v1.0, or the v1.0 docs never defined it. Either way, the assistant becomes a probe for version coverage.
**Unsupported feature demand**
I asked: “Is Feature X supported in v1.0?”
The system answered “no” based on the v1.0 feature list, but it also emitted an unsupported_feature issue type. That distinction matters: it becomes measurable demand rather than a one-off question.
In the image provided earlier, you can see the trend spikes in Grafana.
However, to make this data truly actionable for non-technical users, I introduced a simple aggregation layer: a dedicated /issues endpoint. It summarizes the detected problems over a specified time window into a human-readable report:
| COUNT | TYPE | VERSION | SOURCE | HEADING |
| --- | --- | --- | --- | --- |
| 16 | low_coverage | 1.1 | data/docs/v1.1/introduction.md | Introduction (v1.1) |
| 15 | low_coverage | 1.1 | data/docs/v1.1/security.md | Security (v1.1) |
| 15 | weak_evidence | 1.1 | data/docs/v1.1/introduction.md | Introduction (v1.1) |
| 15 | weak_evidence | 1.1 | data/docs/v1.1/security.md | Security (v1.1) |
| 8 | weak_evidence | 1.1 | data/docs/v1.1/tls.md | TLS (v1.1) |
| 8 | low_coverage | 1.1 | data/docs/v1.1/tls.md | TLS (v1.1) |
| 8 | weak_evidence | 1.1 | data/docs/v1.1/commands.md | Supported Commands (v1.1) |
| 6 | low_coverage | 1.1 | data/docs/v1.1/commands.md | Supported Commands (v1.1) |
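Here is a minimal sketch of what such an aggregation endpoint could look like, assuming the issue events recorded at query time land in some queryable store (an in-memory list below; the field names are illustrative, and in practice this could just as well be a log query or a small database table).

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

from fastapi import FastAPI

app = FastAPI()

# Assumed store of issue events appended by the /query handler.
ISSUE_EVENTS: list[dict] = []


@app.get("/issues")
async def issues_summary(hours: int = 24) -> list[dict]:
    """Summarize detected documentation issues over a time window,
    grouped by issue type, version, source file, and section heading."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    recent = (e for e in ISSUE_EVENTS if e["timestamp"] >= cutoff)

    tally = Counter(
        (e["issue_type"], e["version"], e["source"], e["heading"]) for e in recent
    )
    return [
        {"count": count, "type": t, "version": v, "source": s, "heading": h}
        for (t, v, s, h), count in tally.most_common()
    ]
```

Sorting by count turns raw signals into a ranked to-do list: the pages that confuse the assistant most often float to the top, which is exactly where a writer should start.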
This experiment confirmed something I had suspected for a while: many so-called RAG failures are not AI failures at all. They are documentation failures that only become visible once users interact with the content through an AI interface.
By treating documentation bugs as system signals — version conflicts, weak evidence, unanswered questions, vocabulary gaps — we close the loop between users and writers. We stop guessing why an AI answer feels wrong and start fixing the underlying source text.
To put it simply: the real black box in AI documentation systems is not the neural network. It is the state of the knowledge base itself.
By instrumenting the content, AI stops behaving like a magic trick and starts behaving like what it should have been all along: the most effective documentation linter we’ve ever had.
The complete AI docs observability repo is available here.