A Guide to a Production RAG Pipeline
You spent months building a Retrieval-Augmented Generation (RAG) pipeline. You carefully selected a vector database, integrated a shiny LLM, split documents into fixed-size chunks, generated embeddings, stored them in a vector store, and retrieved the top-K nearest neighbors using cosine similarity.
On paper you followed every step, so the system should work. But in reality, it quickly breaks down. As query volume increases, you start seeing:
- **Hallucinations** caused by semantically similar but contextually incorrect chunks
- **Increased latency** due to large top-K retrievals and oversized prompts
- **Context window overflow** when multiple chunks are blindly concatenated
- **High embedding and inference costs** driven by inefficient retrieval strategies
What went wrong?
Most tutorials, courses, and “how-to” articles walk you through the same copy-paste LangChain code for building RAG systems. That is fine for a demo project with sample documents, but it breaks down in the real enterprise world. I’ve seen people waste months on the same problem, swapping embedding, chunking, and retrieval algorithms, hoping the next tool will solve their data quality issues. It rarely does.
The reason is that the underlying failure is not the choice of algorithm; it is the absence of engineering discipline around ingestion, retrieval, evaluation, and observability. Without these foundations, even the most advanced models behave unpredictably, cost too much, and are impossible to improve systematically.
After weeks of debugging hallucinations, memory pressure, slow responses, and frequent timeouts, a realization sets in: choosing the right technology stack can help you retrieve and generate answers, but it does not automatically make your system efficient, scalable, or cost-effective.
At this point, it becomes clear that a naïve chunk → embed → top-K retrieve approach is a proof of concept, not a production-ready architecture.
Real-World RAG System
In a real-world RAG system:
- **Chunking** is not a preprocessing step, it’s a retrieval optimization problem. Chunks need to be logically coherent units of information that preserve context. Fixed token windows fail to preserve semantic boundaries such as sections, clauses, tables, or logical hierarchies. Chunking strategy is not a one-time decision; it requires experimentation with size, overlap, and structure to optimize retrieval quality.
- General-purpose embeddings are often insufficient. You need domain-specific embedding strategies that truly capture the semantic meaning of your data. Whether it’s insurance policies, financial transactions, or legal text, capturing precise meaning requires specialized approaches such as fine-tuned models or hybrid lexical-semantic representations.
- Vector database selection should be aligned with your performance, scaling, and cost constraints, based on real workload patterns rather than marketing benchmarks. Approximate nearest-neighbor algorithms, index types, and storage tiers materially impact production behavior.
- Top-K similarity search is only the first stage. Production systems require hybrid search, reranking, and filtering pipelines to balance relevance, latency, and context-window constraints.
RAG systems fail in production not because the tools are wrong, but because the retrieval architecture is underspecified.
Because building RAG is easy. Building RAG that works in production is not.
From Naive RAG to a Production-Grade RAG Pipeline
After multiple attempts, the key realization is that RAG is fundamentally a retrieval-systems problem, not an LLM problem. If retrieval fails, generation quality collapses, no matter how good the model is.
So let’s walk step by step through a **production-grade RAG pipeline**, where I’ll explain:
- Why each stage exists
- What breaks if you skip it
- How to implement it
- How to evaluate and improve it
1. Multi-Format Document Ingestion
Real enterprise data is messy. It does not live in a single format; it is scattered across PDFs, DOCX, CSV or ODS files, HTML or HTM files, JSON sources, and many other file formats. That’s why a production RAG system should normalize all formats into clean, structured text while preserving metadata.
Your production pipeline implementation should have:
- a dedicated loader for each file type
- extracted text normalized into a common representation, and
- metadata such as file name, source type, and a stable document identifier attached
The pipeline should compute a content-based document ID using a hash of the file’s contents. This avoids brittle dependencies on filenames and enables robust evaluation even when documents are renamed or versioned.
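As a rough illustration, here is a minimal ingestion sketch. The loader functions, the dict-based document representation, and the limited file-type coverage are simplifications for this article; real PDF and DOCX loaders would wrap libraries such as pypdf or python-docx.

```python
import hashlib
import json
from pathlib import Path

def load_text(path: Path) -> str:
    # Plain-text-ish formats: read and lightly normalize.
    return path.read_text(encoding="utf-8", errors="ignore")

def load_json(path: Path) -> str:
    # Re-serialize JSON so nested structures become readable text.
    data = json.loads(path.read_text(encoding="utf-8"))
    return json.dumps(data, indent=2, ensure_ascii=False)

# One dedicated loader per file type. PDF/DOCX/ODS loaders are omitted here;
# they would follow the same pattern using the appropriate parsing library.
LOADERS = {".txt": load_text, ".html": load_text, ".htm": load_text,
           ".csv": load_text, ".json": load_json}

def ingest(path: Path) -> dict:
    text = LOADERS[path.suffix.lower()](path)
    doc_id = hashlib.sha1(path.read_bytes()).hexdigest()  # stable, content-based ID
    return {
        "document_id": doc_id,
        "text": text,
        "metadata": {"file_name": path.name, "source_type": path.suffix.lstrip(".")},
    }
```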
2. Stable Document Identity
In a production RAG system, filenames are not reliable identifiers. Files can be renamed, copied, versioned, and re-uploaded, and evaluation, citations, and debugging can all break if identity depends on filenames.
A RAG system needs a stable, content-based identity.
That’s why, during ingestion, a SHA-1 hash of the file content should be computed and used as the document_id (a short sketch follows the list below). This identifier will:
- uniquely identify the document’s content
- survive renaming
- enable robust citations and evaluation, and
- allow regression testing as documents evolve
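A minimal sketch of such an identifier, assuming SHA-1 over the raw file bytes as described above:

```python
import hashlib
from pathlib import Path

def document_id(path: Path) -> str:
    # Identity is derived from the file's bytes, never from its name or location.
    return hashlib.sha1(path.read_bytes()).hexdigest()

# Renaming, moving, or re-uploading the same file yields the same ID,
# so citations, evaluation sets, and regression tests keep pointing at it.
```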
3. Logical Structured Chunking
Chunking is the most underestimated part of a RAG pipeline and one of the most important. It determines whether retrieval can succeed: if chunks are poorly constructed, no retriever or re-ranker can compensate.
Naive chunking (Fixed Token Windows) splits text purely by token count. This often destroys semantic meaning as:
- clauses are separated from their conditions
- tables are broken into fragments
- headers lose their associated content
Instead, chunking must respect document structure. RAG pipeline should chunk along logical boundaries such as paragraphs, sections, table rows, and list items, while still enforcing token limits to stay within model constraints.
It should also introduce overlap between chunks to preserve continuity across boundaries. This dramatically improves retrieval quality because each chunk carries enough context to stand on its own.
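A simplified structure-aware chunker might look like the sketch below. It splits on paragraph boundaries and carries a small paragraph overlap forward; the 4-characters-per-token estimate and the default limits are illustrative assumptions, and a real pipeline would also handle sections, tables, and list items explicitly.

```python
def chunk_by_paragraphs(text: str, max_tokens: int = 400, overlap: int = 1) -> list[str]:
    """Greedily pack paragraphs into chunks, carrying `overlap` paragraphs forward."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    est_tokens = lambda s: len(s) // 4  # rough heuristic: ~4 characters per token

    chunks, current = [], []
    for para in paragraphs:
        if current and est_tokens("\n\n".join(current + [para])) > max_tokens:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]  # overlap preserves continuity across boundaries
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single oversized paragraph can still exceed the limit here, which is exactly what the defensive truncation in step 4 is for.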
4. Token-Aware Embedding
Embedding APIs impose strict limits on how many tokens can be processed per request. When embedding large document collections, it’s easy to exceed these limits, causing runtime failures.
To make embedding reliable at scale, the RAG pipeline should:
- estimate token counts per chunk
- batch embedding requests under a safe token budget
- truncate pathological chunks defensively
This ensures predictable costs, stable ingestion, and zero surprises when scaling to thousands of documents.
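A sketch of that batching logic, with the token estimate, the budget values, and the `embed_fn` callable all standing in for whatever embedding provider you actually use:

```python
def batch_embed(chunks: list[str], embed_fn, max_request_tokens: int = 8000,
                max_chunk_tokens: int = 2000) -> list[list[float]]:
    """Group chunks into batches that stay under the provider's per-request token limit."""
    est_tokens = lambda s: len(s) // 4  # rough heuristic; tiktoken gives exact counts

    embeddings, batch, batch_tokens = [], [], 0
    for chunk in chunks:
        # Defensively truncate pathological chunks that would blow the budget on their own.
        if est_tokens(chunk) > max_chunk_tokens:
            chunk = chunk[: max_chunk_tokens * 4]
        if batch and batch_tokens + est_tokens(chunk) > max_request_tokens:
            embeddings.extend(embed_fn(batch))  # embed_fn wraps the embedding API call
            batch, batch_tokens = [], 0
        batch.append(chunk)
        batch_tokens += est_tokens(chunk)
    if batch:
        embeddings.extend(embed_fn(batch))
    return embeddings
```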
5. Persistent Vector Storage with Chroma
Vector stores are infrastructure, not caches. Re-embedding documents on every restart is expensive and unnecessary. Persistence is essential for:
- reproducibility
- faster iteration
- stable evaluation
Using a vector store implementation with disk persistence (I am using Chroma), where chunks, embeddings, and metadata are stored permanently, allows the system to resume instantly after restarts.
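A minimal Chroma sketch, assuming `chunk_texts`, `chunk_embeddings`, `doc_id`, and `query_embedding` come from the earlier steps; the storage path and collection name are arbitrary:

```python
import chromadb

# PersistentClient writes the index to disk, so restarts do not trigger re-embedding.
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(name="documents")

# Store chunks with their embeddings and metadata (IDs must be unique per chunk).
collection.add(
    ids=[f"{doc_id}-{i}" for i in range(len(chunk_texts))],
    documents=chunk_texts,
    embeddings=chunk_embeddings,
    metadatas=[{"document_id": doc_id, "chunk_index": i} for i in range(len(chunk_texts))],
)

# After a restart, the same collection is loaded from disk and queried directly.
results = collection.query(query_embeddings=[query_embedding], n_results=5)
```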
6. Hybrid Retrieval (BM25 + Vector) Instead of Pure Vector Search
Vector search is powerful for semantic similarity, but it struggles with:
- exact terms and identifiers
- legal or regulatory language
- numeric constraints
- domain-specific phrasing
On the other hand, lexical search (BM25) handles these cases very well but lacks semantic flexibility.
To address this, use hybrid retrieval, combining lexical search (BM25) with dense vector search (k-NN). Lexical search captures exact matches and domain terms, while vector search handles semantic similarity.
Both a BM25 query and a vector k-NN query run against the same corpus, and their candidate sets are merged before reranking.
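Here is one possible sketch, using the rank_bm25 package for the lexical side and the Chroma collection from step 5 for the dense side. The naive union merge and the `{doc_id}-{index}` ID convention are assumptions made for illustration; production systems often use weighted or reciprocal-rank fusion instead.

```python
from rank_bm25 import BM25Okapi

# Build a BM25 index over the same chunk texts that were embedded.
tokenized = [text.lower().split() for text in chunk_texts]
bm25 = BM25Okapi(tokenized)

def hybrid_retrieve(query: str, query_embedding: list[float], k: int = 20) -> list[int]:
    # Lexical side: BM25 scores for every chunk, keep the top k.
    scores = bm25.get_scores(query.lower().split())
    bm25_top = sorted(range(len(chunk_texts)), key=lambda i: scores[i], reverse=True)[:k]

    # Dense side: k-NN over the persisted Chroma collection.
    dense = collection.query(query_embeddings=[query_embedding], n_results=k)
    dense_top = [int(cid.rsplit("-", 1)[1]) for cid in dense["ids"][0]]

    # Union of candidates from both retrievers, deduplicated, order-preserving.
    seen, merged = set(), []
    for idx in bm25_top + dense_top:
        if idx not in seen:
            seen.add(idx)
            merged.append(idx)
    return merged
```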
This approach significantly increases recall and ensures that critical facts are not missed simply because they were phrased differently.
7. Cross-Encoder Reranking for Precision
Even with hybrid retrieval, the top-K results often contain noise. Vector similarity and BM25 are recall-oriented techniques. They are designed to retrieve many candidates, not necessarily the best ones.
To improve precision, add a cross-encoder reranker. Unlike bi-encoder embeddings, a cross-encoder scores the query and candidate text together, producing a much more accurate relevance score.
I would prefer using a local, open-source cross-encoder model rather than an API-based re-ranker. This avoids additional latency, cost, and vendor lock-in, which is an important consideration in regulated domains like insurance and finance.
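A sketch using the CrossEncoder class from sentence-transformers; the MS MARCO checkpoint below is simply a commonly used open model, not a requirement of the pipeline:

```python
from sentence_transformers import CrossEncoder

# A small cross-encoder that runs locally on CPU or GPU.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # The cross-encoder scores each (query, chunk) pair jointly, unlike bi-encoders.
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]
```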
8. Context Window Optimization
LLMs have limited context windows. Passing too much context increases cost, latency, and cognitive load for the model. Quality improves when only the most relevant chunks are passed. That’s why instead of passing everything:
- rank retrieved chunks by relevance
- include only the highest-utility chunks
- enforce a strict context token budget
This keeps prompts concise, focused, and cost-efficient while maximizing signal-to-noise ratio for the LLM.
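A minimal sketch of a context budgeter; the token heuristic and the 3,000-token default are illustrative values, not recommendations:

```python
def build_context(ranked_chunks: list[str], max_context_tokens: int = 3000) -> str:
    """Pack the highest-ranked chunks first, stopping at a hard token budget."""
    est_tokens = lambda s: len(s) // 4  # rough heuristic; swap in tiktoken for exact counts

    selected, used = [], 0
    for chunk in ranked_chunks:          # chunks arrive already sorted by relevance
        cost = est_tokens(chunk)
        if used + cost > max_context_tokens:
            break
        selected.append(chunk)
        used += cost
    return "\n\n---\n\n".join(selected)
```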
9. Guardrails and Grounded Answer Generation
In general, LLMs attempt to answer even when evidence is weak. No matter how strong retrieval is, they can still overgeneralize, and this becomes a problem in real-world production systems.
That’s why it’s important to enforce guardrails such as:
- requiring answers to be grounded in retrieved context
- mandatory citations for every answer
- explicit refusal when evidence is insufficient
- hooks for PII masking and compliance checks
These guardrails shift the system from being merely “helpful” to being trustworthy and auditable.
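As a rough sketch, grounding and refusal can start with the prompt itself plus a cheap post-check. The template wording, the citation format, and the refusal phrase below are assumptions for illustration; real deployments layer PII masking and compliance checks on top.

```python
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
Cite the document_id of every chunk you rely on, like [doc:<document_id>].
If the context does not contain enough evidence, reply exactly:
"I don't have enough information in the provided documents to answer that."

Context:
{context}

Question: {question}
Answer:"""

def is_acceptable(answer: str) -> bool:
    # Cheap post-check: reject answers that neither cite a document nor explicitly refuse.
    refused = "don't have enough information" in answer.lower()
    return refused or "[doc:" in answer
```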
10. Observability and Debuggability
Without observability, RAG systems are impossible to improve systematically.
You must instrument the pipeline with tracing for:
- retrieval latency
- reranking time
- token usage
- generation latency
This makes RAG behave like a proper distributed system, where performance bottlenecks and failures can be diagnosed rather than guessed.
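A lightweight sketch of stage-level tracing; in production the same spans would be exported to a tracing backend such as OpenTelemetry rather than kept in a local dict:

```python
import time
from contextlib import contextmanager

TRACE: dict[str, float] = {}

@contextmanager
def traced(stage: str):
    """Record wall-clock latency per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        TRACE[stage] = time.perf_counter() - start

# Wrapped around each stage of the pipeline, e.g.:
# with traced("retrieval"):  candidates = hybrid_retrieve(query, query_embedding)
# with traced("reranking"):  top_chunks = rerank(query, candidate_texts)
# with traced("generation"): answer = generate(prompt)
```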
11. Evaluate RAG Pipeline
The final and often missing piece is evaluation. Without metrics, you cannot tell whether a change improved or degraded the system. I am using RAGAs (Retrieval Augmented Generation Assessment) to evaluate my RAG system. RAGAs provides a principled way to evaluate RAG pipelines using metrics such as:
- faithfulness (does the answer stick to context?)
- answer relevancy (does it answer the question?)
- context precision (how much retrieved context is useful?)
- context recall (did we retrieve the necessary evidence?)
These metrics turn RAG tuning from guesswork into an engineering discipline.
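A minimal RAGAs sketch; the `evaluate` entry point and column names follow the commonly used Ragas API, but exact names shift between versions, the metrics need an LLM judge configured (typically via an API key), and the sample row is purely illustrative.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One row per evaluated question; `contexts` holds the chunks actually passed to the LLM.
eval_data = Dataset.from_dict({
    "question": ["What is the waiting period for pre-existing conditions?"],
    "answer": ["Pre-existing conditions are covered after a 24-month waiting period. [doc:3f2a...]"],
    "contexts": [["Clause 4.2: Pre-existing conditions are covered after 24 months of continuous cover."]],
    "ground_truth": ["Pre-existing conditions are covered after a 24-month waiting period."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores for the evaluation set
```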
Making Evaluation Robust Over Time
Evaluation datasets must survive document renames, versioning, and content changes. To achieve this:
- rely on stable document identifiers
- support clause-level and semantic matching
- allow multiple evaluation modes depending on data maturity
This ensures that evaluation remains meaningful as the system evolves.
Final Thoughts
A production-grade RAG pipeline is not about picking the “best” vector database or the “largest” LLM or any fancy tools. It is about engineering a retrieval system that consistently delivers the right context for real-world data.
A production-grade RAG pipeline requires careful ingestion, intelligent chunking, hybrid retrieval, reranking, appropriate guardrails, observability, and rigorous evaluation. When built correctly, RAG becomes:
- robust
- explainable
- auditable
- scalable
- measurable
And most importantly, it becomes something you can trust in production.
To access the complete source code from this article, please refer to the GitHub link.
Thank you for reading until the end. Before you go:
- 👏 **Clap** 50 times to show your support!
- 💬 Leave a comment with your feedback or questions.
- ⭐ **Follow me on** Medium for more deep dives in Data and AI.
- 💻 **Follow me on** GitHub for hands-on code and real-world projects.
- 🔁 **Share this article** with someone who might benefit.
- 📬 Subscribe for free to get notified whenever I publish something new.
- 🔗 **Let’s connect on** LinkedIn; I’d love to continue the conversation.