One thing you’ll stumble upon when doing AI engineering work is that there’s no real blueprint to follow.
Yes, for the most basic parts of retrieval (the “R” in RAG), you can chunk documents, use semantic search on a query, re-rank the results, and so on. This part is well known.
But once you start digging into this area, you begin to ask questions like: how can we call a system intelligent if it’s only able to read a few chunks here and there in a document? So, how do we make sure it has enough information to actually answer intelligently?
Soon, you’ll find yourself going down a rabbit hole, trying to discern what others are doing in their own orgs, because none of this is properly documented, and people are still building their own setups.
This will lead you to implement various optimization strategies: building custom chunkers, rewriting user queries, using different search methods, filtering with metadata, and expanding context to include neighboring chunks.

That’s why I’ve built a rather bloated retrieval system to show you how it works. Let’s walk through it step by step, looking at the results of each stage and discussing the trade-offs.
To demo this system in public, I decided to embed 150 recent ArXiv papers (2,250 pages) that mention RAG. This means the system we’re testing here is designed for scientific papers, and all the test queries will be RAG-related.
I have collected the raw outputs for each step for a few queries in this repository, if you want to look at the whole thing in detail.
For the tech stack, I’m using Qdrant and Redis to store data, and Cohere and OpenAI for the LLMs. I don’t rely on any framework to build the pipelines, as frameworks make them harder to debug.
As always, I do a quick review of what we’re doing for beginners, so if RAG is already familiar to you, feel free to skip the first section.
Recap: retrieval & RAG
When you work with AI knowledge systems like Copilot (where you feed it your custom docs to answer from), you’re working with a RAG system.
RAG stands for Retrieval-Augmented Generation and is split into two parts: retrieval and generation.
Retrieval refers to the process of fetching information from your files, using keyword and semantic matching, based on a user query. The generation part is where **the LLM comes in and answers** based on the provided context and the user query.

For anyone new to RAG it may seem like a chunky way to build systems. Shouldn’t an LLM do most of the work on its own?
Unfortunately, LLMs are static, and we need to engineer systems so that each time we call on them, we give them everything they need upfront so they can answer the question.
I have written about building RAG bots for Slack before. This one uses standard chunking methods, if you’re keen to get a sense of how people build something simple.
This article goes a step further and tries to rebuild the entire retrieval pipeline without any frameworks, to do some fancy stuff like build a multi-query optimizer, fuse results, and expand the chunks to build better context for the LLM.
As we’ll see though, **all of these fancy additions come at a cost in latency and additional work.**
Processing different documents
As with any data engineering problem, your first hurdle will be to architect how to store data. With retrieval, we focus on something called chunking, and how you do it and what you store with it is essential to building a well-engineered system.
When we do retrieval, we search text, and to do that we need to separate the text into different chunks of information. These pieces of text are what we’ll later search to find a match for a query.
Most simple systems use general chunkers, simply splitting the full text by length, paragraph, or sentence.

But every document is different, so by doing this you risk losing context.
To understand this, you should look at different documents to see how they all follow different structures. You’ll have an HR document with clear section headers, and API docs with unnumbered sections using code blocks and tables.
If you applied the same chunking logic to all of these, you’d risk splitting each text the wrong way. This means that once the LLM gets the chunks, the information will be incomplete, which may cause it to fail at producing an accurate answer.
Furthermore, for each chunk of information, you also need to think about the data you want it to hold.
Should it contain certain metadata so the system can apply filters? Should it link to similar information so it can connect data? Should it hold context so the LLM understands where the information comes from?
This means the architecture of how you store data becomes the most important part. If you start storing information and later realize it’s not enough, you’ll have to redo it. If you realize you’ve complicated the system, you’ll have to start from scratch.
This system will ingest Excel and PDFs, focusing on adding context, keys, and neighbors. This will allow you to see what this looks like when doing retrieval later.
For this demo, I have stored data in Redis and Qdrant. We use Qdrant to do semantic, BM25, and hybrid search, and to expand content we fetch data from Redis.
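If you want a rough idea of the setup, here is a minimal sketch of the two stores: a Qdrant collection holding one dense and one sparse vector per chunk, and a Redis client for document-level records. Names like arxiv_chunks are placeholders for this demo, not the system’s actual identifiers.
```python
# Minimal sketch of the storage layer (collection/field names are placeholders).
from qdrant_client import QdrantClient, models
import redis

qdrant = QdrantClient(url="http://localhost:6333")
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

COLLECTION = "arxiv_chunks"

qdrant.create_collection(
    collection_name=COLLECTION,
    vectors_config={
        # dense vectors for semantic search (3072 dims for text-embedding-3-large)
        "dense": models.VectorParams(size=3072, distance=models.Distance.COSINE),
    },
    sparse_vectors_config={
        # sparse vectors for BM25-style lexical matching
        "bm25": models.SparseVectorParams(),
    },
)
```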
Ingesting tabular files
First we’ll go through how you can chunk tabular data, add context, and keep information connected with keys.
When dealing with already structured tabular data, like in Excel files, it might seem like the obvious approach is to let the system search it directly. But semantic matching is actually quite effective for messy user queries.
SQL or direct queries only work if you already know the schema and exact fields. For instance, if you get a query like “Mazda 2023 specs” from a user, semantically matching rows will give us something to go on.
I’ve talked to companies that wanted their system to match documents across different Excel files. To do this, we can store keys along with the chunks (without going full KG).
So for instance, if we’re working with Excel files containing purchase data, we could ingest data for each row like so:
{
"chunk_id": "Sales_Q1_123::row::1",
"doc_id": "Sales_Q1_123:1234"
"location": {"sheet_name": "Sales Q1", "row_n": 1},
"type": "chunk",
"text": "OrderID: 1001234f67 \n Customer: Alice Hemsworth \n Products: Blue sweater 4, Red pants 6",
"context": "Quarterly sales snapshot",
"keys": {"OrderID": "1001234f67"},
}
If we decide later in the retrieval pipeline to connect information, we can do standard search using the keys to find connecting chunks. This allows us to make quick hops between documents without adding another router step to the pipeline.
Very simplified — connecting keys between tabular documents | Image by author
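A minimal sketch of what such a key lookup could look like using Qdrant’s payload filters (the function name is illustrative; the keys payload field follows the chunk format above):
```python
from qdrant_client import models

def chunks_for_key(qdrant, collection, key_name, key_value, limit=10):
    """Fetch chunks from any document whose payload carries the same key value."""
    points, _next_page = qdrant.scroll(
        collection_name=collection,
        scroll_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key=f"keys.{key_name}",  # nested payload field, e.g. keys.OrderID
                    match=models.MatchValue(value=key_value),
                )
            ]
        ),
        limit=limit,
        with_payload=True,
    )
    return points

# e.g. chunks_for_key(qdrant, "arxiv_chunks", "OrderID", "1001234f67")
```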
We can also set a summary for each document. This acts as a gatekeeper to chunks.
{
"chunk_id": "Sales_Q1::summary",
"doc_id": "Sales_Q1_123:1234"
"location": {"sheet_name": "Sales Q1"},
"type": "summary",
"text": "Sheet tracks Q1 orders for 2025, type of product, and customer names for reconciliation.",
"context": ""
}
The gatekeeper summary idea might be a bit complicated to understand at first, but it also helps to have the summary stored at the document level if you need it when building the context later.
When the LLM sets up this summary (and a brief context string), it can also suggest the key columns (e.g. order IDs).
As a note, always set the key columns manually if you can. If that’s not possible, add some validation logic to make sure the keys aren’t just random (an LLM will sometimes pick odd columns to store while ignoring the most vital ones).
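As an illustration, a simple guardrail could look like the sketch below: only keep suggested columns whose values actually behave like identifiers. The thresholds are arbitrary.
```python
def validate_key_columns(rows: list[dict], suggested: list[str]) -> list[str]:
    """Keep only suggested key columns whose values are mostly unique and short."""
    valid = []
    for col in suggested:
        values = [str(row.get(col, "")).strip() for row in rows]
        non_empty = [v for v in values if v]
        if not non_empty:
            continue
        uniqueness = len(set(non_empty)) / len(non_empty)
        avg_len = sum(len(v) for v in non_empty) / len(non_empty)
        if uniqueness > 0.9 and avg_len < 64:
            valid.append(col)
    return valid
```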
For this system with the ArXiv papers, I’ve ingested two Excel files that contain information at the title and author level.
The chunks will look something like this:
{
"chunk_id": "titles::row::8817::250930134607",
"doc_id": "titles::250930134607",
"location": {
"sheet_name": "titles",
"row_n": 8817
},
"type": "chunk",
"text": "id: 2507 2114\ntitle: Gender Similarities Dominate Mathematical Cognition at the Neural Level: A Japanese fMRI Study Using Advanced Wavelet Analysis and Generative AI\nkeywords: FMRI; Functional Magnetic Resonance Imaging; Gender Differences; Machine Learning; Mathematical Performance; Time Frequency Analysis; Wavelet\nabstract_url: https://arxiv.org/abs/2507.21140\ncreated: 2025-07-23 00:00:00 UTC\nauthor_1: Tatsuru Kikuchi",
"context": "Analyzing trends in AI and computational research articles.",
"keys": {
"id": "2507 2114",
"author_1": "Tatsuru Kikuchi"
}
}
These Excel files weren’t strictly necessary (the PDF files would have been enough), but they’re a way to demo how the system can look up keys to find connecting information.
I created summaries for these files too.
{
"chunk_id": "titles::summary::250930134607",
"doc_id": "titles::250930134607",
"location": {
"sheet_name": "titles"
},
"type": "summary",
"text": "The dataset consists of articles with various attributes including ID, title, keywords, authors, and publication date. It contains a total of 2508 rows with a rich variety of topics predominantly around AI, machine learning, and advanced computational methods. Authors often contribute in teams, indicated by multiple author columns. The dataset serves academic and research purposes, enabling catego",
}
We also store information in Redis at document level, which tells us what it’s about, where to find it, who is allowed to see it, and when it was last updated. This will allow us to update stale information later.
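As a sketch (the field names here are just what I’d reach for, not a fixed schema), the document-level record in Redis can be a simple hash with a timestamp for staleness checks:
```python
import json
import time

def store_doc_record(r, doc_id, summary, source_path, allowed_groups):
    # one hash per document: what it is, where it lives, who can see it, when it changed
    r.hset(f"doc:{doc_id}", mapping={
        "summary": summary,
        "source": source_path,
        "allowed_groups": json.dumps(allowed_groups),
        "updated_at": int(time.time()),
    })

def is_stale(r, doc_id, max_age_days=90):
    updated = int(r.hget(f"doc:{doc_id}", "updated_at") or 0)
    return (time.time() - updated) > max_age_days * 86400
```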
Now let’s turn to PDF files, which are the worst monster you’ll deal with.
Ingesting PDF docs
To process PDF files, we do similar things as with tabular data, but chunking them is much harder, and we store neighbors instead of keys.
To start processing PDFs, we have several frameworks to work with, such as LlamaParse and Docling, but none of them are perfect, so we have to build out the system further.
PDF documents are very hard to process, as most don’t follow the same structure. They also often contain figures and tables that most systems can’t handle correctly.
Nevertheless, a tool like Docling can help us at least parse normal tables properly and map out each element to the correct page and element number.
From here, we can create our own programmatic logic by mapping sections and subsections for each element, and smart-merging snippets so chunks read naturally (i.e. don’t split mid-sentence).
We also make sure to group chunks by section, keeping them together by linking their IDs in a field called neighbors.

This allows us to keep the chunks small but still expand them after retrieval.
The end result will be something like below:
{
"chunk_id": "S3::C02::251009105423",
"doc_id": "2507.18910v1",
"location": {
"page_start": 2,
"page_end": 2
},
"type": "chunk",
"text": "1 Introduction\n\n1.1 Background and Motivation\n\nLarge-scale pre-trained language models have demonstrated an ability to store vast amounts of factual knowledge in their parameters, but they struggle with accessing up-to-date information and providing verifiable sources. This limitation has motivated techniques that augment generative models with information retrieval. Retrieval-Augmented Generation (RAG) emerged as a solution to this problem, combining a neural retriever with a sequence-to-sequence generator to ground outputs in external documents [52]. The seminal work of [52] introduced RAG for knowledge-intensive tasks, showing that a generative model (built on a BART encoder-decoder) could retrieve relevant Wikipedia passages and incorporate them into its responses, thereby achieving state-of-the-art performance on open-domain question answering. RAG is built upon prior efforts in which retrieval was used to enhance question answering and language modeling [48, 26, 45]. Unlike earlier extractive approaches, RAG produces free-form answers while still leveraging non-parametric memory, offering the best of both worlds: improved factual accuracy and the ability to cite sources. This capability is especially important to mitigate hallucinations (i.e., believable but incorrect outputs) and to allow knowledge updates without retraining the model [52, 33].",
"context": "Systematic review of RAG's development and applications in NLP, addressing challenges and advancements.",
"section_neighbours": {
"before": [
"S3::C01::251009105423"
],
"after": [
"S3::C03::251009105423",
"S3::C04::251009105423",
"S3::C05::251009105423",
"S3::C06::251009105423",
"S3::C07::251009105423"
]
},
"keys": {}
}
When we set up data like this, we can consider these chunks as seeds. We are searching for where there may be relevant information based on the user query, and expanding from there.
The difference from simpler RAG systems is that we try to take advantage of the LLM’s growing context window to send in more information (but there are obviously trade-offs to this).
You’ll be able to see a messy solution of what this looks like when building the context in the retrieval pipeline later.
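To make the seed-and-expand idea concrete, here is a rough sketch of the expansion step, assuming each chunk is also mirrored in Redis under a chunk:<chunk_id> key (that key format is my assumption for this sketch):
```python
import json

def expand_chunk(r, chunk):
    """Stitch a retrieved seed chunk together with its section neighbours."""
    neighbours = chunk.get("section_neighbours", {})
    ordered_ids = (
        neighbours.get("before", []) + [chunk["chunk_id"]] + neighbours.get("after", [])
    )
    texts = []
    for chunk_id in ordered_ids:
        raw = r.get(f"chunk:{chunk_id}")  # assumes chunks are mirrored in Redis
        if raw is None:
            continue
        texts.append(json.loads(raw)["text"])
    return "\n\n".join(texts)
```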
Building the retrieval pipeline
Since I’ve built this pipeline piece by piece, we can test each part and go through why we make certain choices in how we retrieve and transform information before handing it over to the LLM.
We’ll go through semantic, hybrid, and BM25 search, building a multi-query optimizer, re-ranking results, expanding content to build the context, and then handing the results to an LLM to answer.
We’ll end the section with some discussion on latency, unnecessary complexity, and what to cut to make the system faster.
If you want to look at the output of several runs of this pipeline, go to this repository.
Semantic, BM25 and hybrid search
The first part of this pipeline is to make sure we are getting back relevant documents for a user query. To do this, we work with semantic, BM25, and hybrid search.
For simple retrieval systems, people will usually just use semantic search. To perform semantic search, we embed dense vectors for each chunk of text using an embedding model.
If this is new to you, note that embeddings represent each piece of text as a point in a high-dimensional space. The position of each point reflects how the model understands its meaning, based on patterns it learned during training.

Texts with similar meanings will then end up close together.
This means that if the model has seen many examples of similar language, it becomes better at placing related texts near each other, and therefore better at matching a query with the most relevant content.
*I have written about this before, using clustering with various embedding models to see how they performed for a use case, if you’re keen to learn more.*
To create dense vectors, I used OpenAI’s Large embedding model, since I’m working with scientific papers.
This model is more expensive than their small one and perhaps not ideal for this use case.
I would look into specialized models for specific domains, or consider fine-tuning your own. Remember, if the embedding model hasn’t seen many examples similar to the texts you’re embedding, it will have a harder time matching them to relevant documents.
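For reference, creating the dense vectors is a single call. A sketch using the OpenAI client (swap in another model if you go the specialized or fine-tuned route):
```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    resp = openai_client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
    )
    return [item.embedding for item in resp.data]
```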
To support hybrid and BM25 search, we also build a lexical index (sparse vectors). BM25 works on exact tokens (for example, “ID 826384”) instead of returning “similar-meaning” text the way semantic search does.
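During ingestion, each chunk then gets upserted with both vectors. A sketch, reusing the embed() helper from above; how you produce the sparse indices and weights is up to you (e.g. a BM25 tokenizer of your choice):
```python
import uuid
from qdrant_client import models

def upsert_chunk(qdrant, collection, chunk, sparse_indices, sparse_values):
    dense_vec = embed([chunk["text"]])[0]
    qdrant.upsert(
        collection_name=collection,
        points=[
            models.PointStruct(
                # deterministic UUID derived from the chunk_id
                id=str(uuid.uuid5(uuid.NAMESPACE_URL, chunk["chunk_id"])),
                vector={
                    "dense": dense_vec,
                    "bm25": models.SparseVector(indices=sparse_indices, values=sparse_values),
                },
                payload=chunk,  # chunk_id, doc_id, text, context, keys, neighbours, ...
            )
        ],
    )
```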
To test semantic search, we’ll set up a query that I think the papers we’ve ingested can answer, such as: “Why do LLMs get worse with longer context windows and what to do about it?”
[1] score=0.5071 doc=docs_ingestor/docs/arxiv/2508.15253.pdf chunk=S3::C02::251009131027
text: 1 Introduction This challenge is exacerbated when incorrect yet highly ranked contexts serve as hard negatives. Conventional RAG, i.e. , simply appending * Corresponding author 1 https://github.com/eunseongc/CARE Figure 1: LLMs struggle to resolve context-memory conflict. Green bars show the number of questions correctly answered without retrieval in a closed-book setting. Blue and yellow bars show performance when provided with a positive or negative context, respectively. Closed-book w/ Positive Context W/ Negative Context 1 8k 25.1% 49.1% 39.6% 47.5% 6k 4k 1 2k 4 Mistral-7b LLaMA3-8b GPT-4o-mini Claude-3.5 retrieved context to the prompt, struggles to discriminate between incorrect external context and correct parametric knowledge (Ren et al., 2025). This misalignment leads to overriding correct internal representations, resulting in substantial performance degradation on questions that the model initially answered correctly. As shown in Figure 1, we observed significant performance drops of 25.149.1% across state-of-the-
[2] score=0.5022 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S3::C03::251009132038
text: 1 Introductions Despite these advances, LLMs might underutilize accurate external contexts, disproportionately favoring internal parametric knowledge during generation [50, 40]. This overreliance risks propagating outdated information or hallucinations, undermining the trustworthiness of RAG systems. Surprisingly, recent studies reveal a paradoxical phenomenon: injecting noise-random documents or tokens-to retrieved contexts that already contain answer-relevant snippets can improve the generation accuracy [10, 49]. While this noise-injection approach is simple and effective, its underlying influence on LLM remains unclear. Furthermore, long contexts containing noise documents create computational overhead. Therefore, it is important to design more principled strategies that can achieve similar benefits without incurring excessive cost.
[3] score=0.4982 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S6::C18::251009132038
text: 4 Experiments 4.3 Analysis Experiments Qualitative Study In Table 4, we analyze a case study from the NQ dataset using the Llama2-7B model, evaluating four decoding strategies: GD(0), CS, DoLA, and LFD. Despite access to groundtruth documents, both GD(0) and DoLA generate incorrect answers (e.g., '18 minutes'), suggesting limited capacity to integrate contextual evidence. Similarly, while CS produces a partially relevant response ('Texas Revolution'), it exhibits reduced factual consistency with the source material. In contrast, LFD demonstrates superior utilization of retrieved context, synthesizing a precise and factually aligned answer. Additional case studies and analyses are provided in Appendix F.
[4] score=0.4857 doc=docs_ingestor/docs/arxiv/2507.23588.pdf chunk=S6::C03::251009122456
text: 4 Results Figure 4: Change in attention pattern distribution in different models. For DiffLoRA variants we plot attention mass for main component (green) and denoiser component (yellow). Note that attention mass is normalized by the number of tokens in each part of the sequence. The negative attention is shown after it is scaled by λ . DiffLoRA corresponds to the variant with learnable λ and LoRa parameters in both terms. BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY 0 0.2 0.4 0.6 BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY BOS CONTEXT 1 MAGIC NUMBER CONTEXT 2 QUERY Llama-3.2-1B LoRA DLoRA-32 DLoRA, Tulu-3 perform similarly as the initial model, however they are outperformed by LoRA. When increasing the context length with more sample demonstrations, DiffLoRA seems to struggle even more in TREC-fine and Banking77. This might be due to the nature of instruction tuned data, and the max_sequence_length = 4096 applied during finetuning. LoRA is less impacted, likely because it diverges less
[5] score=0.4838 doc=docs_ingestor/docs/arxiv/2508.15253.pdf chunk=S3::C03::251009131027
text: 1 Introduction To mitigate context-memory conflict, existing studies such as adaptive retrieval (Ren et al., 2025; Baek et al., 2025) and the decoding strategies (Zhao et al., 2024; Han et al., 2025) adjust the influence of external context either before or during answer generation. However, due to the LLM's limited capacity in detecting conflicts, it is susceptible to misleading contextual inputs that contradict the LLM's parametric knowledge. Recently, robust training has equipped LLMs, enabling them to identify conflicts (Asai et al., 2024; Wang et al., 2024). As shown in Figure 2(a), it enables the LLM to dis-
[6] score=0.4827 doc=docs_ingestor/docs/arxiv/2508.05266.pdf chunk=S27::C03::251009123532
text: B. Subclassification Criteria for Misinterpretation of Design Specifications Initially, regarding long-context scenarios, we observed that directly prompting LLMs to generate RTL code based on lengthy contexts often resulted in certain code segments failing to accurately reflect high-level requirements. However, by manually decomposing the long context-retaining only the key descriptive text relevant to the erroneous segments while omitting unnecessary details-the LLM regenerated RTL code that correctly matched the specifications. As shown in Fig 23, after manual decomposition of the long context, the LLM successfully generated the correct code. This demonstrates that redundancy in long contexts is a limiting factor in LLMs' ability to generate accurate RTL code.
[7] score=0.4798 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S3::C02::251009132038
text: 1 Introductions Figure 1: Illustration for layer-wise behavior in LLMs for RAG. Given a query and retrieved documents with the correct answer ('Real Madrid'), shallow layers capture local context, middle layers focus on answer-relevant content, while deep layers may over-rely on internal knowledge and hallucinate (e.g., 'Barcelona'). Our proposal, LFD fuses middle-layer signals into the final output to preserve external knowledge and improve accuracy. Shallow Layers Middle Layers Deep Layers Who has more la liga titles real madrid or barcelona? …Nine teams have been crowned champions, with Real Madrid winning the title a record 33 times and Barcelona 25 times … Query Retrieved Document …with Real Madrid winning the title a record 33 times and Barcelona 25 times … Short-context Modeling Focus on Right Answer Answer is barcelona Wrong Answer LLMs …with Real Madrid winning the title a record 33 times and Barcelona 25 times … …with Real Madrid winning the title a record 33 times and Barcelona 25 times … Internal Knowledge Confou
From the results above, we can see that it’s able to match some interesting passages where they discuss topics that can answer the query.
If we try BM25 (which matches exact tokens) with the same query, we get back these results:
[1] score=22.0764 doc=docs_ingestor/docs/arxiv/2507.20888.pdf chunk=S4::C27::251009115003
text: 3 APPROACH 3.2.2 Project Knowledge Retrieval Similar Code Retrieval. Similar snippets within the same project are valuable for code completion, even if they are not entirely replicable. In this step, we also retrieve similar code snippets. Following RepoCoder, we no longer use the unfinished code as the query but instead use the code draft, because the code draft is closer to the ground truth compared to the unfinished code. We use the Jaccard index to calculate the similarity between the code draft and the candidate code snippets. Then, we obtain a list sorted by scores. Due to the potentially large differences in length between code snippets, we no longer use the top-k method. Instead, we get code snippets from the highest to the lowest scores until the preset context length is filled.
[2] score=17.4931 doc=docs_ingestor/docs/arxiv/2508.09105.pdf chunk=S20::C08::251009124222
text: C. Ablation Studies Ablation result across White-Box attribution: Table V shows the comparison result in methods of WhiteBox Attribution with Noise, White-Box Attrition with Alternative Model and our current method Black-Box zero-gradient Attribution with Noise under two LLM categories. We can know that: First, The White-Box Attribution with Noise is under the desired condition, thus the average Accuracy Score of two LLMs get the 0.8612 and 0.8073. Second, the the alternative models (the two models are exchanged for attribution) reach the 0.7058 and 0.6464. Finally, our current method Black-Box Attribution with Noise get the Accuracy of 0.7008 and 0.6657 by two LLMs.
[3] score=17.1458 doc=docs_ingestor/docs/arxiv/2508.05100.pdf chunk=S4::C03::251009123245
text: Preliminaries Based on this, inspired by existing analyses (Zhang et al. 2024c), we measure the amount of information a position receives using discrete entropy, as shown in the following equation: which quantifies how much information t i receives from the attention perspective. This insight suggests that LLMs struggle with longer sequences when not trained on them, likely due to the discrepancy in information received by tokens in longer contexts. Based on the previous analysis, the optimization of attention entropy should focus on two aspects: The information entropy at positions that are relatively important and likely contain key information should increase.
Here, the results are lackluster for this query — but sometimes queries include specific keywords we need to match, where BM25 is the better choice.
We can test this by changing the query to *“papers from Anirban Saha Anik”* using BM25.
[1] score=62.3398 doc=authors.csv chunk=authors::row::1::251009110024
text: author_name: Anirban Saha Anik n_papers: 2 article_1: 2509.01058 article_2: 2507.07307
[2] score=56.4007 doc=titles.csv chunk=titles::row::24::251009110138
text: id: 2509.01058 title: Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL keywords: Controlled-Literacy; Health Misinformation; Public Health; RAG; RL; Reinforcement Learning; Retrieval Augmented Generation abstract_url: https://arxiv.org/abs/2509.01058 created: 2025-09-10 00:00:00 UTC author_1: Xiaoying Song author_2: Anirban Saha Anik author_3: Dibakar Barua author_4: Pengcheng Luo author_5: Junhua Ding author_6: Lingzi Hong
[3] score=56.2614 doc=titles.csv chunk=titles::row::106::251009110138
text: id: 2507.07307 title: Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation keywords: Evidence Enhancement; Health Misinformation; LLMs; Large Language Models; RAG; Response Refinement; Retrieval Augmented Generation abstract_url: https://arxiv.org/abs/2507.07307 created: 2025-07-27 00:00:00 UTC author_1: Anirban Saha Anik author_2: Xiaoying Song author_3: Elliott Wang author_4: Bryan Wang author_5: Bengisu Yarimbas author_6: Lingzi Hong
All the results above mention “Anirban Saha Anik,” which is exactly what we’re looking for.
If we ran this with semantic search, it would return not just the name “Anirban Saha Anik” but similar names as well.
[1] score=0.5810 doc=authors.csv chunk=authors::row::1::251009110024
text: author_name: Anirban Saha Anik n_papers: 2 article_1: 2509.01058 article_2: 2507.07307
[2] score=0.4499 doc=authors.csv chunk=authors::row::55::251009110024
text: author_name: Anand A. Rajasekar n_papers: 1 article_1: 2508.0199
[3] score=0.4320 doc=authors.csv chunk=authors::row::59::251009110024
text: author_name: Anoop Mayampurath n_papers: 1 article_1: 2508.14817
[4] score=0.4306 doc=authors.csv chunk=authors::row::69::251009110024
text: author_name: Avishek Anand n_papers: 1 article_1: 2508.15437
[5] score=0.4215 doc=authors.csv chunk=authors::row::182::251009110024
text: author_name: Ganesh Ananthanarayanan n_papers: 1 article_1: 2509.14608
This is a good example of how semantic search isn’t always the ideal method — similar names don’t necessarily mean they’re relevant to the query.
So, there are cases where semantic search is ideal, and others where BM25 (token matching) is the better choice.
We can also use hybrid search, which combines semantic and BM25.
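With both vectors in place, hybrid search can be a single query where Qdrant prefetches dense and sparse candidates and fuses them server-side. A sketch, assuming a recent qdrant-client with the Query API:
```python
from qdrant_client import models

def hybrid_search(qdrant, collection, dense_vec, sparse_vec, limit=20):
    return qdrant.query_points(
        collection_name=collection,
        prefetch=[
            models.Prefetch(query=dense_vec, using="dense", limit=limit),
            models.Prefetch(query=sparse_vec, using="bm25", limit=limit),
        ],
        # reciprocal rank fusion of the two candidate lists
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=limit,
        with_payload=True,
    ).points
```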
You’ll see the results below from running hybrid search on the original query: “why do LLMs get worse with longer context windows and what to do about it?”
[1] score=0.5000 doc=docs_ingestor/docs/arxiv/2508.15253.pdf chunk=S3::C02::251009131027
text: 1 Introduction This challenge is exacerbated when incorrect yet highly ranked contexts serve as hard negatives. Conventional RAG, i.e. , simply appending * Corresponding author 1 https://github.com/eunseongc/CARE Figure 1: LLMs struggle to resolve context-memory conflict. Green bars show the number of questions correctly answered without retrieval in a closed-book setting. Blue and yellow bars show performance when provided with a positive or negative context, respectively. Closed-book w/ Positive Context W/ Negative Context 1 8k 25.1% 49.1% 39.6% 47.5% 6k 4k 1 2k 4 Mistral-7b LLaMA3-8b GPT-4o-mini Claude-3.5 retrieved context to the prompt, struggles to discriminate between incorrect external context and correct parametric knowledge (Ren et al., 2025). This misalignment leads to overriding correct internal representations, resulting in substantial performance degradation on questions that the model initially answered correctly. As shown in Figure 1, we observed significant performance drops of 25.149.1% across state-of-the-
[2] score=0.5000 doc=docs_ingestor/docs/arxiv/2507.20888.pdf chunk=S4::C27::251009115003
text: 3 APPROACH 3.2.2 Project Knowledge Retrieval Similar Code Retrieval. Similar snippets within the same project are valuable for code completion, even if they are not entirely replicable. In this step, we also retrieve similar code snippets. Following RepoCoder, we no longer use the unfinished code as the query but instead use the code draft, because the code draft is closer to the ground truth compared to the unfinished code. We use the Jaccard index to calculate the similarity between the code draft and the candidate code snippets. Then, we obtain a list sorted by scores. Due to the potentially large differences in length between code snippets, we no longer use the top-k method. Instead, we get code snippets from the highest to the lowest scores until the preset context length is filled.
[3] score=0.4133 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S3::C03::251009132038
text: 1 Introductions Despite these advances, LLMs might underutilize accurate external contexts, disproportionately favoring internal parametric knowledge during generation [50, 40]. This overreliance risks propagating outdated information or hallucinations, undermining the trustworthiness of RAG systems. Surprisingly, recent studies reveal a paradoxical phenomenon: injecting noise-random documents or tokens-to retrieved contexts that already contain answer-relevant snippets can improve the generation accuracy [10, 49]. While this noise-injection approach is simple and effective, its underlying influence on LLM remains unclear. Furthermore, long contexts containing noise documents create computational overhead. Therefore, it is important to design more principled strategies that can achieve similar benefits without incurring excessive cost.
[4] score=0.1813 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S6::C18::251009132038
text: 4 Experiments 4.3 Analysis Experiments Qualitative Study In Table 4, we analyze a case study from the NQ dataset using the Llama2-7B model, evaluating four decoding strategies: GD(0), CS, DoLA, and LFD. Despite access to groundtruth documents, both GD(0) and DoLA generate incorrect answers (e.g., '18 minutes'), suggesting limited capacity to integrate contextual evidence. Similarly, while CS produces a partially relevant response ('Texas Revolution'), it exhibits reduced factual consistency with the source material. In contrast, LFD demonstrates superior utilization of retrieved context, synthesizing a precise and factually aligned answer. Additional case studies and analyses are provided in Appendix F.
I found semantic search worked best for this query, which is why it can be useful to run multi-queries with different search methods to fetch the first chunks (though this also adds complexity).
So, let’s turn to building something that can transform the original query into several optimized versions and fuse the results.
Multi-query optimizer
For this part we look at how we can optimize messy user queries by generating multiple targeted variations and selecting the right search method for each. It can improve recall but it introduces trade-offs.
All the agent abstraction systems you see usually transform the user query when performing search. For example, when you use the QueryTool in LlamaIndex, it uses an LLM to optimize the incoming query.

We can rebuild this part ourselves, but give it the ability to create multiple queries and to set the search method for each. When you’re working with more documents, you could also have it set filters at this stage.
As for creating a lot of queries, I would try to keep it simple, as issues here will cause low-quality outputs in retrieval. The more unrelated queries the system generates, the more noise it introduces into the pipeline.
The function I’ve created here will generate 1–3 academic-style queries, along with the search method to be used, based on a messy user query.
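A stripped-down sketch of that optimizer (the prompt and JSON schema here are illustrative, not the exact ones I use):
```python
import json
from openai import OpenAI

openai_client = OpenAI()

OPTIMIZER_PROMPT = """Rewrite the user's question into 1-3 short, focused search
queries for a corpus of scientific papers. For each query pick a method:
"semantic", "bm25" (exact tokens like names or IDs) or "hybrid".
Return JSON: {"queries": [{"text": "...", "method": "..."}]}"""

def optimize_query(user_query: str) -> list[dict]:
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # a small, fast model is enough here
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": OPTIMIZER_PROMPT},
            {"role": "user", "content": user_query},
        ],
    )
    return json.loads(resp.choices[0].message.content)["queries"]
```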
Original query:
why is everyone saying RAG doesn't scale? how are people fixing that?
Generated queries:
- hybrid: RAG scalability issues
- hybrid: solutions to RAG scaling challenges
We will get back results like these:
Query 1 (hybrid) top 20 for query: RAG scalability issues
[1] score=0.5000 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to maintain large knowledge corpora and efficient retrieval indices. Systems must handle millions or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and cost management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (such as cascaded retrieval) become essential at scale, especially in large deployments like web search engines.
[2] score=0.5000 doc=docs_ingestor/docs/arxiv/2507.07695.pdf chunk=SDOC::SUM::251104135247
text: This paper proposes the KeyKnowledgeRAG (K2RAG) framework to enhance the efficiency and accuracy of Retrieval-Augment-Generate (RAG) systems. It addresses the high computational costs and scalability issues associated with naive RAG implementations by incorporating techniques such as knowledge graphs, a hybrid retrieval approach, and document summarization to reduce training times and improve answer accuracy. Evaluations show that K2RAG significantly outperforms traditional implementations, achieving greater answer similarity and faster execution times, thereby providing a scalable solution for companies seeking robust question-answering systems.
[...]
Query 2 (hybrid) top 20 for query: solutions to RAG scaling challenges
[1] score=0.5000 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to maintain large knowledge corpora and efficient retrieval indices. Systems must handle millions or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and cost management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (such as cascaded retrieval) become essential at scale, especially in large deployments like web search engines.
[2] score=0.5000 doc=docs_ingestor/docs/arxiv/2508.05100.pdf chunk=S3::C06::251104155301
text: Introduction Empirical analyses across multiple real-world benchmarks reveal that BEE-RAG fundamentally alters the entropy scaling laws governing conventional RAG systems, which provides a robust and scalable solution for RAG systems dealing with long-context scenarios. Our main contributions are summarized as follows: We introduce the concept of balanced context entropy, a novel attention reformulation that ensures entropy invariance across varying context lengths, and allocates attention to important segments. It addresses the critical challenge of context expansion in RAG.
[...]
We can also test the system with specific keywords like names and IDs to make sure it chooses BM25 rather than semantic search.
Original query:
any papers from Chenxin Diao?
Generated queries:
- BM25: Chenxin Diao
This will pull up results where Chenxin Diao is clearly mentioned.
*I should note, BM25 may cause issues when users misspell names, such as asking for “Chenx Dia” instead of “Chenxin Diao.” So in reality you may just want to slap hybrid search on all of them (and later let the re-ranker take care of weeding out irrelevant results).*
If you want to do this even better, you can build a retrieval system that generates a few example queries based on the input, so when the original query comes in, you fetch examples to help guide the optimizer.
This helps because smaller models aren’t great at transforming messy human queries into ones with more precise academic phrasing.
To give you an example, when a user asks why the LLM is lying, the optimizer may transform the query into something like “causes of inaccuracies in large language models” rather than searching directly for “hallucinations.”
After we fetch results in parallel, we fuse them. The result will look something like this:
RRF Fusion top 38 for query: why is everyone saying RAG doesn't scale? how are people fixing that?
[1] score=0.0328 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to maintain large knowledge corpora and efficient retrieval indices. Systems must handle millions or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and cost management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (such as cascaded retrieval) become essential at scale, especially in large deployments like web search engines.
[2] score=0.0313 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C42::251104142800
text: 7 Challenges of RAG 7.5.5 Scalability Scalability challenges arise as knowledge corpora expand. Advanced indexing, distributed retrieval, and approximate nearest neighbor techniques facilitate efficient handling of large-scale knowledge bases [57]. Selective indexing and corpus curation, combined with infrastructure improvements like caching and parallel retrieval, allow RAG systems to scale to massive knowledge repositories. Research indicates that moderate-sized models augmented with large external corpora can outperform significantly larger standalone models, suggesting parameter efficiency advantages [10].
[3] score=0.0161 doc=docs_ingestor/docs/arxiv/2507.07695.pdf chunk=SDOC::SUM::251104135247
text: This paper proposes the KeyKnowledgeRAG (K2RAG) framework to enhance the efficiency and accuracy of Retrieval-Augment-Generate (RAG) systems. It addresses the high computational costs and scalability issues associated with naive RAG implementations by incorporating techniques such as knowledge graphs, a hybrid retrieval approach, and document summarization to reduce training times and improve answer accuracy. Evaluations show that K2RAG significantly outperforms traditional implementations, achieving greater answer similarity and faster execution times, thereby providing a scalable solution for companies seeking robust question-answering systems.
[4] score=0.0161 doc=docs_ingestor/docs/arxiv/2508.05100.pdf chunk=S3::C06::251104155301
text: Introduction Empirical analyses across multiple real-world benchmarks reveal that BEE-RAG fundamentally alters the entropy scaling laws governing conventional RAG systems, which provides a robust and scalable solution for RAG systems dealing with long-context scenarios. Our main contributions are summarized as follows: We introduce the concept of balanced context entropy, a novel attention reformulation that ensures entropy invariance across varying context lengths, and allocates attention to important segments. It addresses the critical challenge of context expansion in RAG.
[...]
We see that there are some good matches, but also a few irrelevant ones that we’ll need to filter out further.
As a note before we move on, this is probably the step you’ll cut or optimize once you’re trying to reduce latency.
I find LLMs aren’t great at creating queries that actually pull up useful information, so if this step isn’t done right, it just adds more noise.
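For reference, the fusion step we ran above is just reciprocal rank fusion over the per-query result lists. A bare-bones version looks something like this:
```python
def rrf_fuse(result_lists, k=60, top_n=40):
    """Fuse several ranked result lists; each hit is a dict with a chunk_id."""
    scores, best_hit = {}, {}
    for results in result_lists:
        for rank, hit in enumerate(results):
            chunk_id = hit["chunk_id"]
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
            best_hit.setdefault(chunk_id, hit)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return [{**best_hit[cid], "rrf_score": score} for cid, score in ranked]
```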
Adding a re-ranker
We do get results back from the retrieval system, and some of these are good while others are irrelevant, so most retrieval systems will use a re-ranker of some sort.
A re-ranker takes in several chunks and gives each one a relevancy score based on the original user query. You have several choices here, including using something smaller, but I’ll use Cohere’s re-ranker.
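The call itself is small. A sketch using Cohere’s Python SDK (the exact client setup may differ in your environment; the threshold filtering is our own addition on top of the returned relevance scores):
```python
import cohere

co = cohere.Client()  # reads CO_API_KEY from the environment

def rerank(query: str, chunks: list[dict], threshold: float = 0.35, top_n: int = 10):
    resp = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["text"] for c in chunks],
        top_n=top_n,
    )
    kept = []
    for result in resp.results:
        if result.relevance_score >= threshold:
            kept.append({**chunks[result.index], "rerank_score": result.relevance_score})
    return kept
```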
We can test this re-ranker on the first question we used in the previous section: “Why is everyone saying RAG doesn’t scale? How are people fixing that?”
[... optimizer... retrieval... fuse...]
Rerank summary:
- strategy=cohere
- model=rerank-english-v3.0
- candidates=32
- eligible_above_threshold=4
- kept=4 (reranker_threshold=0.35)
Reranked Relevant (4/32 kept ≥ 0.35) top 4 for query: why is everyone saying RAG doesn't scale? how are people fixing that?
[1] score=0.7920 doc=docs_ingestor/docs/arxiv/2507.07695.pdf chunk=S4::C08::251104135247
text: 1 Introduction Scalability: Naive implementations of Retrieval-Augmented Generation (RAG) often rely on 16-bit floating-point large language models (LLMs) for the generation component. However, this approach introduces significant scalability challenges due to the increased memory demands required to host the LLM as well as longer inference times due to using a higher precision number type. To enable more efficient scaling, it is crucial to integrate methods or techniques that reduce the memory footprint and inference times of generator models. Quantized models offer more scalable solutions due to less computational requirements, hence when developing RAG systems we should aim to use quantized LLMs for more cost effective deployment as compared to a full fine-tuned LLM whose performance might be good but is more expensive to deploy due to higher memory requirements. A quantized LLM's role in the RAG pipeline itself should be minimal and for means of rewriting retrieved information into a presentable fashion for the end users
[2] score=0.4749 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C42::251104142800
text: 7 Challenges of RAG 7.5.5 Scalability Scalability challenges arise as knowledge corpora expand. Advanced indexing, distributed retrieval, and approximate nearest neighbor techniques facilitate efficient handling of large-scale knowledge bases [57]. Selective indexing and corpus curation, combined with infrastructure improvements like caching and parallel retrieval, allow RAG systems to scale to massive knowledge repositories. Research indicates that moderate-sized models augmented with large external corpora can outperform significantly larger standalone models, suggesting parameter efficiency advantages [10].
[3] score=0.4304 doc=docs_ingestor/docs/arxiv/2507.18910.pdf chunk=S22::C05::251104142800
text: 7 Challenges of RAG 7.2.1 Scalability and Infrastructure Deploying RAG at scale requires substantial engineering to maintain large knowledge corpora and efficient retrieval indices. Systems must handle millions or billions of documents, demanding significant computational resources, efficient indexing, distributed computing infrastructure, and cost management strategies [21]. Efficient indexing methods, caching, and multi-tier retrieval approaches (such as cascaded retrieval) become essential at scale, especially in large deployments like web search engines.
[4] score=0.3556 doc=docs_ingestor/docs/arxiv/2509.13772.pdf chunk=S11::C02::251104182521
text: 7. Discussion and Limitations Scalability of RAGOrigin: We extend our evaluation by scaling the NQ dataset's knowledge database to 16.7 million texts, combining entries from the knowledge database of NQ, HotpotQA, and MS-MARCO. Using the same user questions from NQ, we assess RAGOrigin's performance under larger data volumes. As shown in Table 16, RAGOrigin maintains consistent effectiveness and performance even on this significantly expanded database. These results demonstrate that RAGOrigin remains robust at scale, making it suitable for enterprise-level applications requiring large
Remember, at this point, we’ve already transformed the user query, done semantic or hybrid search, and fused the results before passing the chunks to the re-ranker.
If you look at the results, we can clearly see that it’s able to identify a few relevant chunks that we can use as seeds.
*Remember, it only has 150 docs to go on in the first place.*
You can also see that it returns multiple chunks from the same document. We’ll set this up later in the context construction, but if you want unique documents fetched, you can add some custom logic here to set the limit for unique docs rather than chunks.
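That custom logic can be as simple as keeping the top-scoring chunk (or two) per document, something like:
```python
def dedupe_by_doc(reranked_chunks, max_per_doc=1):
    """Cap how many chunks a single document can contribute (input sorted by score)."""
    counts, kept = {}, []
    for chunk in reranked_chunks:
        doc_id = chunk["doc_id"]
        if counts.get(doc_id, 0) < max_per_doc:
            kept.append(chunk)
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return kept
```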
We can try this with another question: “hallucinations in RAG vs normal LLMs and how to reduce them”
[... optimizer... retrieval... fuse...]
Rerank summary:
- strategy=cohere
- model=rerank-english-v3.0
- candidates=35
- eligible_above_threshold=12
- kept=5 (threshold=0.2)
Reranked Relevant (5/35 kept ≥ 0.2) top 5 for query: hallucinations in rag vs normal llms and how to reduce them
[1] score=0.9965 doc=docs_ingestor/docs/arxiv/2508.19614.pdf chunk=S7::C03::251104164901
text: 5 Related Work Hallucinations in LLMs Hallucinations in LLMs refer to instances where the model generates false or unsupported information not grounded in its reference data [42]. Existing mitigation strategies include multi-agent debating, where multiple LLM instances collaborate to detect inconsistencies through iterative debates [8, 14]; self-consistency verification, which aggregates and reconciles multiple reasoning paths to reduce individual errors [53]; and model editing, which directly modifies neural network weights to correct systematic factual errors [62, 19]. While RAG systems aim to ground responses in retrieved external knowledge, recent studies show that they still exhibit hallucinations, especially those that contradict the retrieved content [50]. To address this limitation, our work conducts an empirical study analyzing how LLMs internally process external knowledge
[2] score=0.9342 doc=docs_ingestor/docs/arxiv/2508.05509.pdf chunk=S3::C01::251104160034
text: Introduction Large language models (LLMs), like Claude (Anthropic 2024), ChatGPT (OpenAI 2023) and the Deepseek series (Liu et al. 2024), have demonstrated remarkable capabilities in many real-world tasks (Chen et al. 2024b; Zhou et al. 2025), such as question answering (Allam and Haggag 2012), text comprehension (Wright and Cervetti 2017) and content generation (Kumar 2024). Despite the success, these models are often criticized for their tendency to produce hallucinations, generating incorrect statements on tasks beyond their knowledge and perception (Ji et al. 2023; Zhang et al. 2024). Recently, retrieval-augmented generation (RAG) (Gao et al. 2023; Lewis et al. 2020) has emerged as a promising solution to alleviate such hallucinations. By dynamically leveraging external knowledge from textual corpora, RAG enables LLMs to generate more accurate and reliable responses without costly retraining (Lewis et al. 2020; Figure 1: Comparison of three paradigms. LAG exhibits greater lightweight properties compared to GraphRAG while
[3] score=0.9030 doc=docs_ingestor/docs/arxiv/2509.13702.pdf chunk=S3::C01::251104182000
text: ABSTRACT Hallucination remains a critical barrier to the reliable deployment of Large Language Models (LLMs) in high-stakes applications. Existing mitigation strategies, such as Retrieval-Augmented Generation (RAG) and post-hoc verification, are often reactive, inefficient, or fail to address the root cause within the generati