Building an Evaluation Harness for Financial RAG: What I Learned About LLM-as-Judge Calibration
I built a RAG system for financial document Q&A. It answers questions about SEC filings (revenue, margins, debt ratios) using 84 public company documents from the FinanceBench benchmark. After running 100 queries, my LLM judge said 74% of answers were correct. The actual number was 27%. This post is about how I found that gap, why it exists, and what I did about it.

The setup

The pipeline is straightforward: embed 84 SEC filings (10-K, 10-Q, earnings reports) into Qdrant with text-embedding-3...