Building an Evaluation Harness for Financial RAG: What I Learned About LLM-as-Judge Calibration

Covers LLM Evals: Everything You Need to Know – Hamel’s Blog. Discussed on DEV.

I built a RAG system for financial document Q&A. It answers questions about SEC filings (revenue, margins, debt ratios) using 84 public company documents from the FinanceBench benchmark. After running 100 queries, my LLM judge said 74% of answers were correct. The actual number was 27%. This post is about how I found that gap, why it exists, and what I did about it.

The setup

The pipeline is straightforward: embed 84 SEC filings (10-K, 10-Q, earnings reports) into Qdrant with text-embedding-3...
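
A gap like the one in the opening paragraph (a judge-reported 74% against an actual 27%) only shows up when judge verdicts are compared with human labels on the same answers. The following is a minimal sketch of that comparison, not the author's harness: the `compare_judge_to_humans` function, the `JudgeReport` fields, and the eight-item toy data are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgeReport:
    judge_accuracy: float   # share of answers the LLM judge marked correct
    human_accuracy: float   # share of answers a human marked correct
    agreement: float        # share of answers where judge and human agree
    false_pass_rate: float  # share of human-rejected answers the judge passed


def compare_judge_to_humans(judge: list[bool], human: list[bool]) -> JudgeReport:
    """Compare per-answer judge verdicts with human labels (hypothetical helper)."""
    if not judge or len(judge) != len(human):
        raise ValueError("judge and human must be non-empty and the same length")
    n = len(judge)
    # Judge verdicts restricted to the answers a human marked incorrect.
    judged_on_wrong = [j for j, h in zip(judge, human) if not h]
    return JudgeReport(
        judge_accuracy=sum(judge) / n,
        human_accuracy=sum(human) / n,
        agreement=sum(j == h for j, h in zip(judge, human)) / n,
        false_pass_rate=(
            sum(judged_on_wrong) / len(judged_on_wrong) if judged_on_wrong else 0.0
        ),
    )


# Toy verdicts, invented for this example (not data from the post).
judge_verdicts = [True, True, True, False, True, False, True, True]
human_labels = [True, False, False, False, True, False, False, True]

report = compare_judge_to_humans(judge_verdicts, human_labels)
print(report)
```

The false-pass rate is the figure to watch here: a lenient judge can look accurate on aggregate while passing most of the answers a human would reject.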
