Quick Reference: Terms You'll Encounter
Technical Acronyms:
- RAG (Retrieval-Augmented Generation): enhancing LLM responses with retrieved context
- LLM (Large Language Model): a transformer-based text generation system
- RAGAS (RAG Assessment): a popular open-source evaluation framework
- BLEU (Bilingual Evaluation Understudy): an n-gram overlap metric
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): a summary comparison metric
Statistical & Mathematical Terms:
- Precision: Relevant items retrieved / Total items retrieved
- Recall: Relevant items retrieved / Total relevant items
- F1 Score: Harmonic mean of precision and recall
- Ground Truth: Known correct answers for evaluation
- Inter-rater Reliability: Agreement between human evaluators
Introduction: You Can't Improve What You Can't Measure
Imagine you're a restaurant owner. A customer complains: "The food was bad." That's useless feedback. Was it too salty? Undercooked? Wrong dish entirely? You need specific, measurable criteria to improve.
RAG systems face the same challenge. "The answer was wrong" doesn't tell you whether:
- The retrieval failed (wrong documents)
- The generation failed (right documents, wrong interpretation)
- The question was ambiguous
- The knowledge base was incomplete
RAG evaluation is like a medical diagnosis. You don't just ask "is the patient sick?" You measure temperature, blood pressure, heart rate, and specific biomarkers. Each metric isolates a different potential problem, guiding treatment.
Here's another analogy: Evaluation metrics are quality control checkpoints on an assembly line. You don't just inspect the final car; you check the engine, the transmission, the electrical system separately. A failed brake test tells you exactly where to look.
A third way to think about it: Metrics are unit tests for AI systems. Just as you wouldn't ship code without tests, you shouldn't ship RAG without evaluation. The difference is that AI "tests" are probabilistic, not deterministic.
The RAG Evaluation Stack: Four Layers of Quality
RAG quality breaks down into four distinct layers, each requiring different metrics:
┌────────────────────────────────────────────────┐
│ Layer 4: End-to-End Quality                    │
│ "Did we solve the user's actual problem?"      │
│ Metrics: Task success, user satisfaction       │
├────────────────────────────────────────────────┤
│ Layer 3: Answer Quality                        │
│ "Is the final answer correct and useful?"      │
│ Metrics: Correctness, completeness, relevance  │
├────────────────────────────────────────────────┤
│ Layer 2: Faithfulness                          │
│ "Does the answer match the retrieved context?" │
│ Metrics: Faithfulness, hallucination rate      │
├────────────────────────────────────────────────┤
│ Layer 1: Retrieval Quality                     │
│ "Did we find the right documents?"             │
│ Metrics: Precision, recall, MRR, nDCG          │
└────────────────────────────────────────────────┘
Critical insight: Problems cascade upward. Bad retrieval guarantees bad answers. But good retrieval doesn't guarantee good answers; the generation can still fail. You need metrics at every layer.
Layer 1: Retrieval Metrics - Did We Find the Right Documents?
Context Precision
What it measures: Of the documents we retrieved, how many were actually relevant?
Why it matters: Low precision means you're stuffing the context window with noise. The LLM has to work harder to find the signal, increasing hallucination risk.
The analogy: You're researching a legal case. Context precision asks: "Of the 10 documents your assistant pulled, how many are actually relevant to this case?" If 3 are relevant, precision is 30%.
Calculation:
Context Precision = Relevant chunks retrieved / Total chunks retrieved
Target: > 0.7 for most applications. Below 0.5 suggests retrieval needs work.
Context Recall
What it measures: Of all the relevant documents that exist, how many did we find?
Why it matters: Low recall means you're missing important information. The answer might be technically accurate but incomplete.
The analogy: You're studying for an exam on World War II. Context recall asks: "Of all the important facts you need to know, how many did your study materials cover?" Missing the D-Day invasion means low recall.
Calculation:
Context Recall = Relevant chunks retrieved / Total relevant chunks in corpus
Target: > 0.8 for comprehensive answers. Can be lower for simple factual queries.
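To make the two formulas concrete, here is a minimal sketch over plain Python sets of chunk IDs; `retrieved_ids` and `relevant_ids` are hypothetical names for whatever identifiers your pipeline tracks.
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of all relevant chunks that made it into the retrieved set."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in relevant_ids if chunk_id in set(retrieved_ids))
    return hits / len(relevant_ids)

# Example: 3 of 5 retrieved chunks are relevant, and we found 3 of the 4 relevant chunks.
retrieved = ["c1", "c2", "c3", "c4", "c5"]
relevant = {"c1", "c3", "c5", "c9"}
print(context_precision(retrieved, relevant))  # 0.6
print(context_recall(retrieved, relevant))     # 0.75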
Mean Reciprocal Rank (MRR)
What it measures: How high does the first relevant document appear in results?
Why it matters: If the best document is ranked #47, the LLM might never see it (context window limits). Position matters enormously.
The analogy: You Google a question. MRR measures whether the answer is in the first result (score: 1.0), the second (score: 0.5), the tenth (score: 0.1), or buried on page 5 (score: ~0).
Calculation:
MRR = Average of (1 / rank of first relevant result) across queries
Target: > 0.6. Below 0.4 means relevant results are buried too deep.
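A matching MRR sketch under the same assumptions: each query contributes the reciprocal rank of its first relevant hit, or 0 if nothing relevant was retrieved.
def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """runs: (retrieved_ids in rank order, relevant_ids) per query."""
    reciprocal_ranks = []
    for retrieved_ids, relevant_ids in runs:
        rr = 0.0
        for rank, chunk_id in enumerate(retrieved_ids, start=1):
            if chunk_id in relevant_ids:
                rr = 1.0 / rank  # the first relevant hit determines the score
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Query 1: first relevant doc at rank 1 -> 1.0; Query 2: at rank 4 -> 0.25
runs = [(["a", "b"], {"a"}), (["x", "y", "z", "a"], {"a"})]
print(mean_reciprocal_rank(runs))  # 0.625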
Normalized Discounted Cumulative Gain (nDCG)
What it measures: Overall ranking quality, accounting for position and graded relevance.
Why it matters: Not all relevant documents are equally relevant. nDCG captures whether highly relevant documents rank above somewhat relevant ones.
The analogy: You're ranking restaurants. nDCG rewards putting the 5-star restaurant first, the 4-star second, and the 3-star third, not just having all three somewhere in the list.
Target: > 0.7 for production systems.
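A sketch of nDCG assuming you assign each retrieved chunk a relevance grade (for example 0-3). It normalizes against the ideal ordering of the retrieved grades, a common simplification; a full implementation would normalize against the best possible ranking over all relevant documents.
import math

def ndcg(retrieved_grades: list[float]) -> float:
    """retrieved_grades: graded relevance (e.g. 0-3) of each retrieved chunk, in rank order."""
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(retrieved_grades))
    # Normalize by the DCG of the ideal ordering of the same grades.
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(sorted(retrieved_grades, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0

# The highly relevant chunk (grade 3) ranked last scores worse than the ideal ordering.
print(round(ndcg([1, 2, 3]), 2))  # 0.79
print(round(ndcg([3, 2, 1]), 2))  # 1.0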
Layer 2: Faithfulness - Does the Answer Match the Context?
Faithfulness is the hallucination detector. It measures whether the generated answer is actually supported by the retrieved documents.
The Faithfulness Problem
Consider this scenario:
- Retrieved context: "The company was founded in 2015 in Austin, Texas."
- Generated answer: "The company was founded in 2015 in Austin, Texas by John Smith."
The founder's name is hallucinated; it's not in the context. Faithfulness metrics catch this.
Measuring Faithfulness
Claim decomposition approach:
- Break the answer into atomic claims
- For each claim, check if it's supported by the context
- Faithfulness = Supported claims / Total claims
Example:
Answer: "Python was created by Guido van Rossum in 1991. It's the most popular programming language."
Claims:
1. "Python was created by Guido van Rossum" β Check context β Supported β
2. "Python was created in 1991" β Check context β Supported β
3. "Python is the most popular programming language" β Check context β NOT FOUND β
Faithfulness = 2/3 = 0.67
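The loop itself is simple once you have claim extraction and verification; the sketch below stubs those out as hypothetical `extract_claims` and `is_supported` callables, which in practice are usually LLM or NLI calls (roughly how frameworks such as RAGAS approach it).
from typing import Callable

def faithfulness_score(
    answer: str,
    context: str,
    extract_claims: Callable[[str], list[str]],  # hypothetical: answer -> atomic claims
    is_supported: Callable[[str, str], bool],    # hypothetical: (claim, context) -> bool
) -> dict:
    """Faithfulness = supported claims / total claims."""
    claims = extract_claims(answer)
    if not claims:
        return {"faithfulness": 1.0, "unsupported_claims": []}
    unsupported = [c for c in claims if not is_supported(c, context)]
    return {
        "faithfulness": (len(claims) - len(unsupported)) / len(claims),
        "unsupported_claims": unsupported,
    }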
Hallucination Categories
Not all hallucinations are equal:
| Type | Severity | Example |
|---|---|---|
| Fabricated facts | High | Inventing statistics, names, dates |
| Exaggeration | Medium | "Always" when context says "often" |
| Conflation | Medium | Mixing details from different sources |
| Extrapolation | Low | Reasonable inference not explicitly stated |
Target faithfulness: > 0.9 for factual applications. Financial, medical, and legal domains should aim for > 0.95.
Layer 3: Answer Quality - Is the Response Actually Good?
Answer Relevance
What it measures: Does the answer actually address the question asked?
The problem it catches: The answer might be faithful to the context but completely miss the point.
Example:
- Question: "What is the return policy?"
- Retrieved: Company FAQ about returns
- Answer: "Our company was founded in 2010 and has grown to serve millions of customers."
Technically faithful (if that's in the FAQ), but completely irrelevant.
Measurement approach:
- Generate questions that the answer would address
- Compare to the original question
- Higher similarity = higher relevance
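A sketch of that approach, assuming a question-generation step and an embedding function are available; `generate_questions_from_answer` and `embed` are hypothetical stand-ins for whatever model calls your stack provides.
import numpy as np

def answer_relevance(
    question: str,
    answer: str,
    generate_questions_from_answer,  # hypothetical: answer -> list of plausible questions
    embed,                           # hypothetical: text -> 1-D numpy vector
) -> float:
    """Mean cosine similarity between the original question and questions the answer implies."""
    implied_questions = generate_questions_from_answer(answer)
    q_vec = embed(question)
    sims = []
    for iq in implied_questions:
        v = embed(iq)
        sims.append(float(np.dot(q_vec, v) / (np.linalg.norm(q_vec) * np.linalg.norm(v))))
    return sum(sims) / len(sims) if sims else 0.0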
Answer Correctness
What it measures: Is the answer factually correct according to ground truth?
When you can measure it: Only when you have known correct answers (golden dataset).
The challenge: Ground truth is expensive to create and maintain. But without it, you're flying blind.
Answer Completeness
What it measures: Does the answer cover all aspects of the question?
Example:
- Question: "What are the pros and cons of React?"
- Incomplete answer: "React has a large ecosystem and component reusability."
- Complete answer: Lists both advantages AND disadvantages
Measurement approach: Compare answer coverage against a reference answer or checklist of expected points.
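A minimal checklist-coverage sketch, assuming each golden example carries a list of expected points; `covers_point` is a hypothetical helper, typically another LLM or embedding-similarity check.
def answer_completeness(answer: str, expected_points: list[str], covers_point) -> dict:
    """Fraction of expected points the answer covers. covers_point(answer, point) -> bool."""
    if not expected_points:
        return {"completeness": 1.0, "missing_points": []}
    missing = [p for p in expected_points if not covers_point(answer, p)]
    return {
        "completeness": (len(expected_points) - len(missing)) / len(expected_points),
        "missing_points": missing,
    }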
Layer 4: End-to-End Metrics - Did We Actually Help?
Task Completion Rate
What it measures: Did the user accomplish their goal?
Why it's the ultimate metric: All other metrics are proxies. This is the outcome that matters.
How to measure:
- Explicit signals: User clicks "resolved," completes purchase, etc.
- Implicit signals: User doesn't ask follow-up questions, doesn't contact support
User Satisfaction
What it measures: Subjective quality as perceived by users.
Methods:
- Thumbs up/down on responses
- Follow-up surveys
- Implicit signals (session length, return rate)
The challenge: Low response rates and selection bias. Users who bother to rate skew negative.
Building Golden Evaluation Sets
A golden set is your ground truth: questions with known correct answers that you use to benchmark your system.
What Makes a Good Golden Set
Diversity: Cover different question types, topics, and difficulty levels.
Question Types to Include:
├── Factual lookup ("What is X's revenue?")
├── Comparison ("How does A differ from B?")
├── Procedural ("How do I configure X?")
├── Reasoning ("Why did X happen?")
├── Multi-hop ("What's the CEO's alma mater's mascot?")
└── Unanswerable ("What will revenue be in 2030?")
Realistic distribution: Match your production traffic. If 60% of questions are "how do I," your golden set should reflect that.
Edge cases: Include the hard stuff - ambiguous questions, questions requiring multiple documents, questions with no good answer.
Golden Set Size Guidelines
| Use Case | Minimum Size | Recommended |
|---|---|---|
| Quick sanity check | 20-50 | 50 |
| Development iteration | 100-200 | 200 |
| Pre-release validation | 300-500 | 500 |
| Comprehensive benchmark | 500-1000 | 1000+ |
Rule of thumb: More is better, but 200 well-chosen examples beat 1,000 random ones.
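One lightweight way to store a golden set is a version-controlled JSONL file, one example per line. The schema below is illustrative, not a standard; adapt the fields to your domain.
import json
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    question: str
    ground_truth: str                                   # reference answer written or approved by a human
    relevant_docs: list = field(default_factory=list)   # IDs of chunks that should be retrieved
    question_type: str = "factual"                      # factual, comparison, procedural, ...
    created_from: str = "expert"                        # expert, user_feedback, synthetic, adversarial

def load_golden_set(path: str) -> list[GoldenExample]:
    with open(path, encoding="utf-8") as f:
        return [GoldenExample(**json.loads(line)) for line in f if line.strip()]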
Creating Ground Truth
Option 1: Expert annotation
- Have domain experts write ideal answers
- Most accurate, most expensive
- Best for high-stakes domains
Option 2: User feedback mining
- Extract from support tickets, chat logs
- "Real" questions with known resolutions
- Watch for privacy concerns
Option 3: Synthetic generation
- Use LLMs to generate Q&A pairs from your documents
- Scale easily but quality varies
- Always human-validate a sample
Option 4: Adversarial generation
- Deliberately create hard cases
- Questions that sound similar but have different answers
- Edge cases that have broken the system before
LLM-as-Judge: Using AI to Evaluate AI
When you can't afford human evaluation at scale, LLMs can serve as automated judges.
How It Works
Prompt to Judge LLM:
"You are evaluating a RAG system response.
Question: {question}
Retrieved Context: {context}
Generated Answer: {answer}
Rate the following on a scale of 1-5:
1. Faithfulness: Is the answer supported by the context?
2. Relevance: Does the answer address the question?
3. Completeness: Does the answer cover all aspects?
Provide scores and brief justification."
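A sketch of how the judge call might be wired up. `call_llm` is a hypothetical wrapper around whatever provider you use; the only real logic is filling the template and parsing scores, and asking the judge for JSON output makes parsing much less fragile than free text.
import json

JUDGE_PROMPT = """You are evaluating a RAG system response.
Question: {question}
Retrieved Context: {context}
Generated Answer: {answer}

Rate faithfulness, relevance, and completeness from 1-5.
Respond with JSON: {{"faithfulness": n, "relevance": n, "completeness": n, "justification": "..."}}"""

def judge_response(question: str, context: str, answer: str, call_llm) -> dict:
    """call_llm: hypothetical function that takes a prompt string and returns the model's text."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"error": "judge returned non-JSON output", "raw": raw}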
Strengths and Weaknesses
Strengths:
- Scales cheaply to any volume
- Consistent (no inter-rater variability)
- Can evaluate nuanced criteria
Weaknesses:
- Biased toward verbose, confident-sounding answers
- May miss subtle factual errors
- Can't catch errors the judge LLM would also make
Calibrating LLM Judges
Critical step: Validate LLM judgments against human judgments.
Process:
1. Have humans rate 100-200 examples
2. Have LLM judge the same examples
3. Calculate correlation
4. If correlation < 0.7, adjust prompts or criteria
5. Document known blind spots
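A sketch of step 3, using Spearman rank correlation from SciPy to compare human and LLM scores on the same examples (Pearson or Cohen's kappa are reasonable alternatives depending on your scale).
from scipy.stats import spearmanr

def judge_agreement(human_scores: list[float], llm_scores: list[float]) -> float:
    """Rank correlation between human and LLM-judge scores for the same examples."""
    correlation, _p_value = spearmanr(human_scores, llm_scores)
    return float(correlation)

# If agreement falls below ~0.7, revisit the judge prompt or scoring rubric.
print(round(judge_agreement([5, 4, 2, 1, 3], [5, 3, 2, 1, 4]), 2))  # 0.9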
Best practice: Use a stronger model as judge than the model being evaluated. GPT-4 judging GPT-3.5, Claude Opus judging Claude Haiku, etc.
Evaluation Pipeline Architecture
┌──────────────────────────────────────────────────────────────┐
│                     Evaluation Pipeline                      │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│   │   Golden    │───▶│     RAG     │───▶│   Metrics   │      │
│   │   Dataset   │    │   System    │    │   Compute   │      │
│   └─────────────┘    └─────────────┘    └─────────────┘      │
│          │                                     │             │
│          ▼                                     ▼             │
│   ┌─────────────┐                       ┌─────────────┐      │
│   │   Ground    │                       │   Results   │      │
│   │    Truth    │                       │    Store    │      │
│   └─────────────┘                       └─────────────┘      │
│                                                │             │
│                                                ▼             │
│                                         ┌─────────────┐      │
│                                         │  Dashboard  │      │
│                                         │  & Alerts   │      │
│                                         └─────────────┘      │
│                                                              │
└──────────────────────────────────────────────────────────────┘
Key Components
Golden Dataset Store: Version-controlled, with metadata about when/how each example was created.
Metrics Compute: Calculates all metrics for each evaluation run. Should be deterministic (same inputs = same outputs).
Results Store: Historical record of all evaluation runs. Enables trend analysis and regression detection.
Dashboard & Alerts: Visualize metrics over time. Alert when metrics drop below thresholds.
Metric Selection: What to Track When
For Development (Daily)
Fast metrics that catch obvious regressions:
| Metric | Target | Why |
|---|---|---|
| Context Precision@5 | > 0.6 | Quick retrieval sanity check |
| Faithfulness (sampled) | > 0.85 | Catch hallucination spikes |
| Answer Relevance | > 0.7 | Ensure answers address questions |
For Release Validation (Weekly)
Comprehensive evaluation before deployments:
| Metric | Target | Why |
|---|---|---|
| Full retrieval suite | Various | Complete retrieval quality picture |
| Faithfulness (full) | > 0.9 | No hallucination regressions |
| Answer Correctness | > 0.85 | Accuracy against ground truth |
| Latency p95 | < 3s | Performance hasnβt degraded |
For Production Monitoring (Continuous)
Lightweight signals that work without ground truth:
| Signal | Alert Threshold | Why |
|---|---|---|
| User feedback ratio | < 0.7 thumbs up | Direct user sentiment |
| Follow-up question rate | > 0.3 | Users arenβt getting answers |
| "I donβt know" rate | Significant change | Retrieval may be failing |
| Avg response length | Significant change | Generation behavior shift |
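A sketch of threshold-based alerting over these signals. The signal names and thresholds mirror the table above, and `send_alert` is a hypothetical hook into your paging or chat tool; the "significant change" signals would additionally need a rolling baseline, which is omitted here.
def check_production_signals(signals: dict, send_alert) -> list[str]:
    """signals: rolling-window aggregates, e.g. {'thumbs_up_ratio': 0.65, 'followup_rate': 0.35}."""
    alerts = []
    if signals.get("thumbs_up_ratio", 1.0) < 0.7:
        alerts.append("User feedback ratio below 0.7")
    if signals.get("followup_rate", 0.0) > 0.3:
        alerts.append("Follow-up question rate above 0.3")
    for message in alerts:
        send_alert(message)  # hypothetical notification hook
    return alerts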
Common Evaluation Pitfalls
Pitfall 1: Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure."
If you optimize purely for faithfulness, the system learns to give vague, hedged answers that are technically faithful but useless.
Solution: Balance multiple metrics. No single metric should dominate.
Pitfall 2: Test Set Leakage
Your golden set accidentally overlaps with training data or retrieval corpus in ways that inflate scores.
Solution: Strict separation. Date-based splits where possible. Regular audits for overlap.
Pitfall 3: Distribution Shift
Your golden set was created six months ago. User questions have evolved. Metrics look great but users complain.
Solution: Continuously add new examples from production traffic. Retire stale examples.
Pitfall 4: Over-Reliance on Automatic Metrics
BLEU, ROUGE, and embedding similarity are cheap to compute but poorly correlated with human judgment for open-ended generation.
Solution: Always include some human evaluation. Use automatic metrics for quick feedback, not final decisions.
Pitfall 5: Ignoring Confidence Calibration
Your system says it's 90% confident but is only right 60% of the time.
Solution: Track calibration (accuracy at each confidence level). Well-calibrated confidence enables smart escalation.
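A sketch of a basic calibration check: bucket predictions by stated confidence and compare each bucket's average confidence to its observed accuracy. A well-calibrated system shows small gaps; the bucket edges below are arbitrary.
def calibration_report(confidences: list[float], correct: list[bool], n_buckets: int = 5) -> list[dict]:
    """Compare stated confidence to observed accuracy per confidence bucket."""
    buckets = [[] for _ in range(n_buckets)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_buckets), n_buckets - 1)
        buckets[idx].append((conf, ok))
    report = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        report.append({
            "bucket": f"{i / n_buckets:.1f}-{(i + 1) / n_buckets:.1f}",
            "avg_confidence": round(avg_conf, 2),
            "accuracy": round(accuracy, 2),
            "gap": round(avg_conf - accuracy, 2),  # large positive gap = overconfident
        })
    return report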
Implementing an Evaluation Framework
Here's a conceptual framework; the GitHub repo will have the full implementation:
class RAGEvaluator:
    """
    Core evaluation framework structure.
    Evaluates: Retrieval quality, faithfulness, answer quality
    Supports: Golden sets, LLM-as-judge, custom metrics
    """
    def __init__(self, config: EvalConfig, rag_system):
        self.rag_system = rag_system  # the system under test
        self.retrieval_metrics = RetrievalMetrics()
        self.faithfulness_checker = FaithfulnessChecker()
        self.answer_evaluator = AnswerEvaluator()
        self.golden_set = GoldenDataset(config.golden_set_path)

    def evaluate_retrieval(self, query, retrieved, relevant) -> dict:
        """Layer 1: Did we find the right documents?"""
        return {
            "precision": self.retrieval_metrics.precision(retrieved, relevant),
            "recall": self.retrieval_metrics.recall(retrieved, relevant),
            "mrr": self.retrieval_metrics.mrr(retrieved, relevant),
            "ndcg": self.retrieval_metrics.ndcg(retrieved, relevant),
        }

    def evaluate_faithfulness(self, answer, context) -> dict:
        """Layer 2: Is the answer grounded in context?"""
        claims = self.faithfulness_checker.extract_claims(answer)
        supported = self.faithfulness_checker.verify_claims(claims, context)
        return {
            "faithfulness": len(supported) / len(claims) if claims else 1.0,
            "unsupported_claims": [c for c in claims if c not in supported],
        }

    def evaluate_answer(self, question, answer, ground_truth=None) -> dict:
        """Layer 3: Is the answer good?"""
        result = {
            "relevance": self.answer_evaluator.relevance(question, answer)
        }
        if ground_truth:
            result["correctness"] = self.answer_evaluator.correctness(answer, ground_truth)
        return result

    def run_full_evaluation(self) -> EvalReport:
        """Run evaluation on the entire golden set."""
        results = []
        for example in self.golden_set:
            # Run the RAG system under test
            retrieved, answer = self.rag_system.query(example.question)
            # Evaluate all layers
            results.append({
                "retrieval": self.evaluate_retrieval(
                    example.question, retrieved, example.relevant_docs
                ),
                "faithfulness": self.evaluate_faithfulness(answer, retrieved),
                "answer": self.evaluate_answer(
                    example.question, answer, example.ground_truth
                ),
            })
        return EvalReport(results)
Data Engineer's ROI Lens: The Business Impact
The Cost of Not Measuring
| Failure Mode | Business Impact | Detection Without Metrics |
|---|---|---|
| Retrieval degradation | Wrong answers increase 40% | Weeks (user complaints) |
| Hallucination spike | Trust erosion, potential liability | Days to weeks |
| Relevance drift | User satisfaction drops | Months (gradual) |
| Completeness issues | Support tickets increase | Weeks |
The Value of Good Evaluation
Scenario: E-commerce product Q&A system handling 50,000 queries/day.
Without evaluation:
- Undetected hallucination rate: 8%
- Bad answers per day: 4,000
- Support tickets generated (assume 10% of bad answers become tickets): 400/day
- Cost per ticket: $15
- Daily cost: $6,000
- Monthly cost: $180,000
With evaluation:
- Hallucination detected in 2 days, fixed in 1 week
- Hallucination rate after fix: 1%
- Bad answers per day: 500
- Support tickets (same 10% assumption): 50/day
- Daily cost: $750
- Monthly cost: $22,500
Monthly savings: $157,500
Evaluation system cost: ~$5,000/month (compute + maintenance)
Net monthly benefit: $152,500
ROI Calculation
def calculate_eval_roi(
    daily_queries: int,
    error_rate_without_eval: float,
    error_rate_with_eval: float,
    cost_per_error: float,
    eval_system_monthly_cost: float
) -> dict:
    monthly_queries = daily_queries * 30
    errors_without = monthly_queries * error_rate_without_eval
    errors_with = monthly_queries * error_rate_with_eval
    cost_without = errors_without * cost_per_error
    cost_with = errors_with * cost_per_error + eval_system_monthly_cost
    return {
        "monthly_savings": cost_without - cost_with,
        "error_reduction": f"{(1 - error_rate_with_eval/error_rate_without_eval)*100:.0f}%",
        "roi": f"{(cost_without - cost_with) / eval_system_monthly_cost:.0f}x"
    }

# Example, matching the scenario above: a $15 ticket for 10% of bad answers
# gives an expected cost of $1.50 per bad answer.
roi = calculate_eval_roi(
    daily_queries=50000,
    error_rate_without_eval=0.08,
    error_rate_with_eval=0.01,
    cost_per_error=1.50,
    eval_system_monthly_cost=5000
)
# Result: ~$152,500 monthly savings, roughly 30x ROI
Key Takeaways
1. Evaluate every layer: Retrieval, faithfulness, answer quality, and end-to-end outcomes each require different metrics.
2. Faithfulness is non-negotiable: Hallucination detection must be part of every RAG evaluation.
3. Golden sets are investments: Spend time building high-quality evaluation data. It pays dividends forever.
4. LLM-as-judge scales, humans validate: Use AI for volume, humans for calibration.
5. Multiple metrics prevent gaming: No single metric captures quality. Balance retrieval, generation, and outcome metrics.
6. Continuous evaluation catches drift: Production quality degrades silently. Regular evaluation makes it visible.
7. The ROI is clear: Catching errors before users do saves orders of magnitude more than evaluation costs.
Start with a 50-example golden set and three metrics (precision, faithfulness, relevance). Expand as you learn what breaks in your specific domain.
Next in this series: Production AI: Monitoring, Cost Optimization, and Operations - building observable, efficient AI systems that scale reliably.