A Guide to a Production RAG Pipeline
You spent months building a Retrieval-Augmented Generation (RAG) pipeline. You carefully selected a vector database, integrated a shiny LLM, split documents into fixed-size chunks, generated embeddings, stored them in a vector store, and retrieved the top-K nearest neighbors using cosine similarity.
On paper you followed every step, so the system should work. But in reality, it quickly breaks down. As query volume increases, you start seeing:
- **Hallucinations** caused by semantically similar but contextually incorrect chunks
- **Increased latency** due to large top-K retrievals and oversized prompts
- **Context window overflow** when multiple chunks are blindly concatenated
- **High embedding and inference costs** driven by inefficient retrieval strategies
What went wrong?
Most tutorials, courses, and “how-to” articles walk you through the same copy-paste LangChain code for building RAG systems. That is fine for a demo project with sample documents, but it breaks down in the real enterprise world. I’ve seen people waste months on the same problem, swapping embedding, chunking, and retrieval algorithms, hoping the next tool will solve their data quality issues. It rarely does.
The reason is that the underlying failure is not the choice of algorithm; it is the absence of engineering discipline around ingestion, retrieval, evaluation, and observability. Without these foundations, even the most advanced models behave unpredictably, cost too much, and are impossible to improve systematically.
After weeks of debugging hallucinations, memory pressure, slow responses, and frequent timeouts, a realization sets in: choosing the right technology stack can help you retrieve and generate answers, but it does not automatically make your system efficient, scalable, or cost-effective.
At this point, it becomes clear that a naïve chunk → embed → top-K retrieve approach is a proof of concept, not a production-ready architecture.
Real-World RAG System
In a real-world RAG system:
- **Chunking** is not a preprocessing step, it’s a retrieval optimization problem. Chunks need to be logically coherent units of information that preserve context. Fixed token windows fail to preserve semantic boundaries such as sections, clauses, tables, or logical hierarchies. Chunking strategy is not a one-time decision; it requires experimentation with size, overlap, and structure to optimize retrieval quality.
- General-purpose embeddings are often insufficient. You need domain-specific embedding strategies that truly capture the semantic meaning of your data. Whether it’s insurance policies, financial transactions, or legal text, capturing precise meaning requires specialized approaches such as fine-tuned models or hybrid lexical-semantic representations.
- Vector database selection should be aligned with your performance, scaling, and cost constraints, based on real workload patterns rather than marketing benchmarks. Approximate nearest-neighbor algorithms, index types, and storage tiers materially impact production behavior.
- Top-K similarity search is only the first stage. Production systems require hybrid search, reranking, and filtering pipelines to balance relevance, latency, and context-window constraints.
RAG systems fail in production not because the tools are wrong, but because the retrieval architecture is underspecified.
Because building RAG is easy. Building RAG that works in production is not.
From Naive RAG to a Production-Grade RAG Pipeline
After multiple attempts, the key realization is that RAG is fundamentally a retrieval-systems problem, not an LLM problem. If retrieval fails, generation quality collapses, no matter how good the model is.
So let’s walk step by step through a **production-grade RAG pipeline**, where I’ll explain:
- Why each stage exists
- What breaks if you skip it
- How to implement it
- How to evaluate and improve it
1. Multi-Format Document Ingestion
Real enterprise data is messy. It does not live in a single format; it is scattered across PDFs, DOCX, CSV or ODS files, HTML or HTM files, JSON sources, and many other file formats. That’s why a production RAG system should normalize all formats into clean, structured text while preserving metadata.
Your production pipeline implementation should have:
- a dedicated loader for each file type
- extracted text normalized into a common representation, and
- metadata such as file name, source type, and a stable document identifier attached
The pipeline should compute a content-based document ID using a hash of the file’s contents. This avoids brittle dependencies on filenames and enables robust evaluation even when documents are renamed or versioned.
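As a rough illustration, here is a minimal ingestion sketch. The loader functions, the dict-based document representation, and the limited file-type coverage are simplifications for this article; real PDF and DOCX loaders would wrap libraries such as pypdf or python-docx.

```python
import hashlib
import json
from pathlib import Path

def load_text(path: Path) -> str:
    # Plain-text-ish formats: read and lightly normalize.
    return path.read_text(encoding="utf-8", errors="ignore")

def load_json(path: Path) -> str:
    # Re-serialize JSON so nested structures become readable text.
    data = json.loads(path.read_text(encoding="utf-8"))
    return json.dumps(data, indent=2, ensure_ascii=False)

# One dedicated loader per file type. PDF/DOCX/ODS loaders are omitted here;
# they would follow the same pattern using the appropriate parsing library.
LOADERS = {".txt": load_text, ".html": load_text, ".htm": load_text,
           ".csv": load_text, ".json": load_json}

def ingest(path: Path) -> dict:
    text = LOADERS[path.suffix.lower()](path)
    doc_id = hashlib.sha1(path.read_bytes()).hexdigest()  # stable, content-based ID
    return {
        "document_id": doc_id,
        "text": text,
        "metadata": {"file_name": path.name, "source_type": path.suffix.lstrip(".")},
    }
```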
2. Stable Document Identity
In a production RAG system, filenames are not reliable identifiers. Files can be renamed, copied, versioned, and re-uploaded, and evaluation, citations, and debugging can all break if identity depends on filenames.
A RAG system needs a stable, content-based identity.
That’s why, during ingestion, a SHA-1 hash of the file content should be computed and used as the document_id (a short sketch follows the list below). This identifier will:
- uniquely identify the document’s content
- survive renaming
- enable robust citations and evaluation, and
- allow regression testing as documents evolve
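A minimal sketch of such an identifier, assuming SHA-1 over the raw file bytes as described above:

```python
import hashlib
from pathlib import Path

def document_id(path: Path) -> str:
    # Identity is derived from the file's bytes, never from its name or location.
    return hashlib.sha1(path.read_bytes()).hexdigest()

# Renaming, moving, or re-uploading the same file yields the same ID,
# so citations, evaluation sets, and regression tests keep pointing at it.
```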
3. Logical Structured Chunking
Chunking is the most underestimated part of a RAG pipeline and one of the most important. It determines whether retrieval can succeed: if chunks are poorly constructed, no retriever or re-ranker can compensate.
Naive chunking (Fixed Token Windows) splits text purely by token count. This often destroys semantic meaning as:
- clauses are separated from their conditions
- tables are broken into fragments
- headers lose their associated content
Instead, chunking must respect document structure. RAG pipeline should chunk along logical boundaries such as paragraphs, sections, table rows, and list items, while still enforcing token limits to stay within model constraints.
It should also introduce overlap between chunks to preserve continuity across boundaries. This dramatically improves retrieval quality because each chunk carries enough context to stand on its own.
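A simplified structure-aware chunker might look like the sketch below. It splits on paragraph boundaries and carries a small paragraph overlap forward; the 4-characters-per-token estimate and the default limits are illustrative assumptions, and a real pipeline would also handle sections, tables, and list items explicitly.

```python
def chunk_by_paragraphs(text: str, max_tokens: int = 400, overlap: int = 1) -> list[str]:
    """Greedily pack paragraphs into chunks, carrying `overlap` paragraphs forward."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    est_tokens = lambda s: len(s) // 4  # rough heuristic: ~4 characters per token

    chunks, current = [], []
    for para in paragraphs:
        if current and est_tokens("\n\n".join(current + [para])) > max_tokens:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]  # overlap preserves continuity across boundaries
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single oversized paragraph can still exceed the limit here, which is exactly what the defensive truncation in step 4 is for.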
4. Token-Aware Embedding
Embedding APIs impose strict limits on how many tokens can be processed per request. When embedding large document collections, it’s easy to exceed these limits, causing runtime failures.
To make embedding reliable at scale, the RAG pipeline should:
- estimate token counts per chunk
- batch embedding requests under a safe token budget
- truncate pathological chunks defensively
This ensures predictable costs, stable ingestion, and zero surprises when scaling to thousands of documents.
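A sketch of that batching logic, with the token estimate, the budget values, and the `embed_fn` callable all standing in for whatever embedding provider you actually use:

```python
def batch_embed(chunks: list[str], embed_fn, max_request_tokens: int = 8000,
                max_chunk_tokens: int = 2000) -> list[list[float]]:
    """Group chunks into batches that stay under the provider's per-request token limit."""
    est_tokens = lambda s: len(s) // 4  # rough heuristic; tiktoken gives exact counts

    embeddings, batch, batch_tokens = [], [], 0
    for chunk in chunks:
        # Defensively truncate pathological chunks that would blow the budget on their own.
        if est_tokens(chunk) > max_chunk_tokens:
            chunk = chunk[: max_chunk_tokens * 4]
        if batch and batch_tokens + est_tokens(chunk) > max_request_tokens:
            embeddings.extend(embed_fn(batch))  # embed_fn wraps the embedding API call
            batch, batch_tokens = [], 0
        batch.append(chunk)
        batch_tokens += est_tokens(chunk)
    if batch:
        embeddings.extend(embed_fn(batch))
    return embeddings
```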
5. Persistent Vector Storage with Chroma
Vector stores are infrastructure, not caches. Re-embedding documents on every restart is expensive and unnecessary. Persistence is essential for:
- reproducibility
- faster iteration
- stable evaluation
Using a vector store implementation with disk persistence (I am using Chroma), where chunks, embeddings, and metadata are stored permanently, allows the system to resume instantly after restarts.
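A minimal Chroma sketch, assuming `chunk_texts`, `chunk_embeddings`, `doc_id`, and `query_embedding` come from the earlier steps; the storage path and collection name are arbitrary:

```python
import chromadb

# PersistentClient writes the index to disk, so restarts do not trigger re-embedding.
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(name="documents")

# Store chunks with their embeddings and metadata (IDs must be unique per chunk).
collection.add(
    ids=[f"{doc_id}-{i}" for i in range(len(chunk_texts))],
    documents=chunk_texts,
    embeddings=chunk_embeddings,
    metadatas=[{"document_id": doc_id, "chunk_index": i} for i in range(len(chunk_texts))],
)

# After a restart, the same collection is loaded from disk and queried directly.
results = collection.query(query_embeddings=[query_embedding], n_results=5)
```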
6. Hybrid Retrieval (BM25 + Vector) Instead of Pure Vector Search
Vector search is powerful for semantic similarity, but it struggles with:
- exact terms and identifiers
- legal or regulatory language
- numeric constraints
- domain-specific phrasing
On the other hand, lexical search (BM25) handles these cases very well but lacks semantic flexibility.
To address this, use hybrid retrieval, combining lexical search (BM25) with dense vector search (k-NN). Lexical search captures exact matches and domain terms, while vector search handles semantic similarity.
Both a BM25 query and a vector k-NN query run against the same corpus, and their candidate sets are merged before reranking.
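Here is one possible sketch, using the rank_bm25 package for the lexical side and the Chroma collection from step 5 for the dense side. The naive union merge and the `{doc_id}-{index}` ID convention are assumptions made for illustration; production systems often use weighted or reciprocal-rank fusion instead.

```python
from rank_bm25 import BM25Okapi

# Build a BM25 index over the same chunk texts that were embedded.
tokenized = [text.lower().split() for text in chunk_texts]
bm25 = BM25Okapi(tokenized)

def hybrid_retrieve(query: str, query_embedding: list[float], k: int = 20) -> list[int]:
    # Lexical side: BM25 scores for every chunk, keep the top k.
    scores = bm25.get_scores(query.lower().split())
    bm25_top = sorted(range(len(chunk_texts)), key=lambda i: scores[i], reverse=True)[:k]

    # Dense side: k-NN over the persisted Chroma collection.
    dense = collection.query(query_embeddings=[query_embedding], n_results=k)
    dense_top = [int(cid.rsplit("-", 1)[1]) for cid in dense["ids"][0]]

    # Union of candidates from both retrievers, deduplicated, order-preserving.
    seen, merged = set(), []
    for idx in bm25_top + dense_top:
        if idx not in seen:
            seen.add(idx)
            merged.append(idx)
    return merged
```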
This approach significantly increases recall and ensures that critical facts are not missed simply because they were phrased differently.
7. Cross-Encoder Reranking for Precision
Even with hybrid retrieval, the top-K results often contain noise. Vector similarity and BM25 are recall-oriented techniques. They are designed to retrieve many candidates, not necessarily the best ones.
To improve precision, add a cross-encoder reranker. Unlike bi-encoder embeddings, a cross-encoder scores the query and candidate text together, producing a much more accurate relevance score.
I would prefer using a local, open-source cross-encoder model rather than an API-based re-ranker. This avoids additional latency, cost, and vendor lock-in, which is an important consideration in regulated domains like insurance and finance.
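A sketch using the CrossEncoder class from sentence-transformers; the MS MARCO checkpoint below is simply a commonly used open model, not a requirement of the pipeline:

```python
from sentence_transformers import CrossEncoder

# A small cross-encoder that runs locally on CPU or GPU.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # The cross-encoder scores each (query, chunk) pair jointly, unlike bi-encoders.
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]
```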
8. Context Window Optimization
LLMs have limited context windows. Passing too much context increases cost, latency, and cognitive load for the model. Quality improves when only the most relevant chunks are passed. That’s why instead of passing everything:
- rank retrieved chunks by relevance
- include only the highest-utility chunks
- enforce a strict context token budget
This keeps prompts concise, focused, and cost-efficient while maximizing signal-to-noise ratio for the LLM.
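A minimal sketch of a context budgeter; the token heuristic and the 3,000-token default are illustrative values, not recommendations:

```python
def build_context(ranked_chunks: list[str], max_context_tokens: int = 3000) -> str:
    """Pack the highest-ranked chunks first, stopping at a hard token budget."""
    est_tokens = lambda s: len(s) // 4  # rough heuristic; swap in tiktoken for exact counts

    selected, used = [], 0
    for chunk in ranked_chunks:          # chunks arrive already sorted by relevance
        cost = est_tokens(chunk)
        if used + cost > max_context_tokens:
            break
        selected.append(chunk)
        used += cost
    return "\n\n---\n\n".join(selected)
```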
9. Guardrails and Grounded Answer Generation
In general, LLMs attempt to answer even when evidence is weak. No matter how strong retrieval is, they can still overgeneralize, and this becomes a problem in real-world production systems.
That’s why it’s important to enforce guardrails such as:
- requiring answers to be grounded in retrieved context
- mandatory citations for every answer
- explicit refusal when evidence is insufficient
- hooks for PII masking and compliance checks
These guardrails shift the system from being merely “helpful” to being trustworthy and auditable.
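As a rough sketch, grounding and refusal can start with the prompt itself plus a cheap post-check. The template wording, the citation format, and the refusal phrase below are assumptions for illustration; real deployments layer PII masking and compliance checks on top.

```python
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
Cite the document_id of every chunk you rely on, like [doc:<document_id>].
If the context does not contain enough evidence, reply exactly:
"I don't have enough information in the provided documents to answer that."

Context:
{context}

Question: {question}
Answer:"""

def is_acceptable(answer: str) -> bool:
    # Cheap post-check: reject answers that neither cite a document nor explicitly refuse.
    refused = "don't have enough information" in answer.lower()
    return refused or "[doc:" in answer
```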
10. Observability and Debuggability
Without observability, RAG systems are impossible to improve systematically.
You must instrument the pipeline with tracing for:
- retrieval latency
- reranking time
- token usage
- generation latency
This makes RAG behave like a proper distributed system, where performance bottlenecks and failures can be diagnosed rather than guessed.
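A lightweight sketch of stage-level tracing; in production the same spans would be exported to a tracing backend such as OpenTelemetry rather than kept in a local dict:

```python
import time
from contextlib import contextmanager

TRACE: dict[str, float] = {}

@contextmanager
def traced(stage: str):
    """Record wall-clock latency per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        TRACE[stage] = time.perf_counter() - start

# Wrapped around each stage of the pipeline, e.g.:
# with traced("retrieval"):  candidates = hybrid_retrieve(query, query_embedding)
# with traced("reranking"):  top_chunks = rerank(query, candidate_texts)
# with traced("generation"): answer = generate(prompt)
```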
11. Evaluate RAG Pipeline
The final and often missing piece is evaluation. Without metrics, you cannot tell whether a change improved or degraded the system. I am using RAGAs (Retrieval Augmented Generation Assessment) to evaluate my RAG system. RAGAs provides a principled way to evaluate RAG pipelines using metrics such as:
- faithfulness (does the answer stick to context?)
- answer relevancy (does it answer the question?)
- context precision (how much retrieved context is useful?)
- context recall (did we retrieve the necessary evidence?)
These metrics turn RAG tuning from guesswork into an engineering discipline.
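A minimal RAGAs sketch; the `evaluate` entry point and column names follow the commonly used Ragas API, but exact names shift between versions, the metrics need an LLM judge configured (typically via an API key), and the sample row is purely illustrative.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One row per evaluated question; `contexts` holds the chunks actually passed to the LLM.
eval_data = Dataset.from_dict({
    "question": ["What is the waiting period for pre-existing conditions?"],
    "answer": ["Pre-existing conditions are covered after a 24-month waiting period. [doc:3f2a...]"],
    "contexts": [["Clause 4.2: Pre-existing conditions are covered after 24 months of continuous cover."]],
    "ground_truth": ["Pre-existing conditions are covered after a 24-month waiting period."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores for the evaluation set
```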
Making Evaluation Robust Over Time
Evaluation datasets must survive document renames, versioning, and content changes. To achieve this:
- rely on stable document identifiers
- support clause-level and semantic matching
- allow multiple evaluation modes depending on data maturity
This ensures that evaluation remains meaningful as the system evolves.
Final Thoughts
A production-grade RAG pipeline is not about picking the “best” vector database or the “largest” LLM or any fancy tools. It is about engineering a retrieval system that consistently delivers the right context for real-world data.
A production-grade RAG pipeline requires careful ingestion, intelligent chunking, hybrid retrieval, reranking, appropriate guardrails, observability, and rigorous evaluation. When built correctly, RAG becomes:
- robust
- explainable
- auditable
- scalable
- measurable
And most importantly, it becomes something you can trust in production.
To access the complete source code from this article, please refer to the GitHub link.
Thank you for reading until the end. Before you go:
- 👏 **Clap** 50 times to show your support!
- 💬 Leave a comment with your feedback or questions.
- ⭐ **Follow me on** Medium for more deep dives in Data and AI.
- 💻 **Follow me on** GitHub for hands-on code and real-world projects.
- 🔁 **Share this article** with someone who might benefit.
- 📬 Subscribe for free to get notified whenever I publish something new.
- 🔗 **Let’s connect on** LinkedIn; I’d love to continue the conversation.