Production RAG: what I learned from processing 5M+ documents

October 20, 2025 • 3 min read

I’ve spent the past 8 months the RAG trenches, I want to share what actually worked vs. wasted our time. We built RAG for Usul AI (9M pages) and an unnamed legal AI enterprise (4M pages).

Langchain + Llamaindex

We started out with youtube tutorials. First Langchain -> Llamaindex. Got to a working prototype in a couple of days and were optimistic with the progress. We run tests on subset of the data (100 documents) and the results looked great. We spend the next few days running the pipeline on the production dataset and got everything working in a week — incredible.

Except it wasn’t, the results were subpar and only the end users could tell. We spent the following few months rewriting pieces of the system, one at a time, until the perfo…

October 20, 2025 • 3 min read

I’ve spent the past 8 months the RAG trenches, I want to share what actually worked vs. wasted our time. We built RAG for Usul AI (9M pages) and an unnamed legal AI enterprise (4M pages).

Langchain + Llamaindex

Except it wasn’t, the results were subpar and only the end users could tell. We spent the following few months rewriting pieces of the system, one at a time, until the performance was at the level we wanted. Here are things we did ranked by ROI.

What moved the needle

Query Generation: not all context can be captured by the user’s last query. We had an LLM review the thread and generate a number of semantic + keyword queries. We processed all of those queries in parallel, and passed them to a reranker. This made us cover a larger surface area and not be dependent on a computed score for hybrid search.
Reranking: the highest value 5 lines of code you’ll add. The chunk ranking shifted a lot. More than you’d expect. Reranking can many times make up for a bad setup if you pass in enough chunks. We found the ideal reranker set-up to be 50 chunk input -> 15 output.
Chunking Strategy: this takes a lot of effort, you’ll probably be spending most of your time on it. We built a custom flow for both enterprises, make sure to understand the data, review the chunks, and check that a) chunks are not getting cut mid-word or sentence b) ~each chunk is a logical unit and captures information on its own
Metadata to LLM: we started by passing the chunk text to the LLM, we ran an experiment and found that injecting relevant metadata as well (title, author, etc.) improves context and answers by a lot.
Query routing: many users asked questions that can’t be answered by RAG (e.g. summarize the article, who wrote this). We created a small router that detects these questions and answers them using an API call + LLM instead of the full-blown RAG set-ups.

Our stack

Vector database: Azure -> Pinecone -> Turbopuffer (cheap, supports keyword search natively)
Document Extraction: Custom
Chunking: Unstructured.io by default, custom for enterprises (heard that Chonkie is good)
Embedding: text-embedding-large-3, haven’t tested others
Reranker: None -> Cohere 3.5 -> Zerank (less known but actually good)
LLM: GPT 4.1 -> GPT 5 -> GPT 4.1, covered by Azure credits

Going Open-source

We put all our learning into an open-source project: agentset-ai/agentset under an MIT license. Feel free to reach out if you have any questions.

Langchain + Llamaindex

Langchain + Llamaindex

What moved the needle

Our stack

Going Open-source

Similar Posts