Vector databases are often introduced as tools for semantic similarity search. In practice, that understanding breaks down the moment you try to build a real RAG system. In this article, I explain what vector databases actually do inside modern retrieval pipelines, why pure semantic search is insufficient, and why hybrid search is not an optimization but a requirement for production systems. You will see why semantic search fails silently, keyword search fails noisily, and why hybrid retrieval is the only reliable compromise. We then build a hybrid-search RAG system step by step using Qdrant as the vector database, focusing on design decisions, trade-offs, and failure cases rather than surface-level APIs. The complete source code and technical references are given at the end of the article.

Large language models are excellent at generating text. What they are not good at is knowing your data: internal documents, proprietary knowledge, or information that changes every week. RAG solves this by grounding model outputs in external data sources that you control, turning fluent text generation into answers that are actually useful.

In real-world systems, retrieval is rarely powered by a single technique. Semantic search with vectors is usually combined with keyword search, metadata filters, and application-level rules. This hybrid setup helps keep results not only relevant, but also predictable, easier to debug, and aligned with actual product needs.

Term-Based and Embedding-Based Retrieval

Before diving into implementation details, let's step back and understand how vector databases actually work. This section sets the foundation needed to reason about hybrid search, not just copy an implementation. If you are already familiar with vector databases, feel free to skip ahead and jump straight to the project implementation.

Vector Databases

A vector database is a retrieval system designed to store vector representations of data and efficiently rank them by semantic relevance to a query. Instead of matching documents on exact terms, it operates on embeddings and retrieves results based on similarity in a high-dimensional vector space. In practice, a vector database is responsible for indexing, storing, and searching embeddings using nearest-neighbor algorithms, enabling fast and scalable embedding-based retrieval. This makes it a foundational component in modern RAG systems, where relevance is determined by meaning rather than lexical overlap.

If that sounds abstract, that's okay. The core idea is simple: vector databases are optimized for similarity search at scale. They let you ask questions like "find content most related to this input" and return fast, relevant results, especially in cases where traditional keyword-based search falls short.

These databases can be grouped by licensing model and deployment option.

By licensing

Open-source vector databases expose their internals and can be self-hosted and customized. They are chosen when teams need control over deployment, data locality, or retrieval behavior. Qdrant is a common example.

Managed vector databases hide the implementation and run as hosted services. They prioritize fast setup, low operational overhead, and built-in scalability, at the cost of control.

By deployment

In-memory setups keep vectors in RAM for very low-latency search. They work well for small datasets and experimentation, but do not scale easily.

Disk-backed and cloud deployments persist data and scale horizontally. Qdrant supports these modes, making it suitable for production RAG systems that need durability and growth.

These categories overlap. Systems often start in memory and move to disk or cloud in production. What really matters is the indexing strategy and how similarity search scales in practice.
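To make "similarity in a high-dimensional vector space" concrete, here is a deliberately tiny brute-force sketch. The three-dimensional "embeddings" and document names are made up for illustration; real embeddings have hundreds of dimensions, and the indexing methods described next exist precisely to avoid this kind of exhaustive scan once the corpus grows.

import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (illustrative values only).
docs = {
    "refund policy": [0.90, 0.10, 0.10],
    "shipping times": [0.10, 0.90, 0.10],
    "warranty claims": [0.50, 0.40, 0.60],
}
query = [0.85, 0.15, 0.10]  # pretend embedding of "how do I get my money back?"

# Brute-force nearest-neighbor search: score every document, sort, take the top hit.
ranked = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
print(ranked[0][0])  # -> "refund policy"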
Common Indexing Methods

Indexing methods define how vectors are organized, stored, and searched efficiently. Two vector databases storing the same embeddings can perform very differently depending on the indexing strategy they use.

Locality-Sensitive Hashing (LSH)

LSH groups similar vectors into the same buckets so search only looks at a small part of the space. It speeds things up by relying on probability: nearby vectors usually land together, but there is no guarantee. Because the hash functions are fixed and do not adapt to the data, LSH stays fast, but recall is limited.

Hierarchical Navigable Small World (HNSW)

HNSW organizes vectors into a multi-layer graph where each vector is a node connected to nearby ones. Search starts in sparse upper layers and moves down into denser layers, quickly narrowing in on the nearest neighbors. HNSW keeps improving as the index grows: adding more vectors often makes navigation easier rather than harder, which is why HNSW delivers both high recall and low latency at scale. This is the index used in Qdrant.

Product Quantization (PQ)

PQ speeds up search by compressing vectors instead of comparing them at full precision. Each vector is split into smaller parts, and those parts are approximated using learned codebooks, so distance calculations run on compact representations. PQ trades precision for scale, not speed alone. It unlocks massive memory savings, but recall depends heavily on how well the codebooks match the data.

Inverted File Index (IVF)

The Inverted File Index (IVF) clusters vectors, usually using k-means, and searches only the clusters closest to the query instead of the full dataset. This cuts down the search space dramatically. The key trade-off is cluster size: too few clusters slow queries, too many hurt recall.

Annoy (Approximate Nearest Neighbors Oh Yeah)

Annoy uses a tree-based approach to approximate nearest-neighbor search. It builds multiple binary trees by recursively splitting vectors with randomly chosen hyperplanes. During querying, these trees are traversed to collect a candidate set of nearby vectors. This method is well suited to read-heavy workloads and static datasets. Annoy was open-sourced by Spotify and is widely used for recommendation and similarity search tasks.

Other Notable Approaches

These include Microsoft's SPTAG, which combines tree- and graph-based techniques, and FLANN, a library focused on fast approximate nearest-neighbor search using multiple indexing strategies.

For this system, here's why I chose Qdrant.

Qdrant

The diagram above represents a high-level overview of some of the main components of Qdrant.

I chose Qdrant for four core reasons:

First, it is vector-first by design. Embeddings are the primary data type, not an add-on. This keeps embedding ingestion fast and the retrieval pipeline simple and predictable.

Second, it delivers consistently low-latency search at scale. HNSW indexing, vector quantization, and built-in metadata filtering ensure performance stays stable as the dataset grows.

Third, it supports native hybrid search. Dense vectors, sparse signals, and filters work together in a single query, which is critical for production-grade RAG systems where semantic search alone is not enough.

Fourth, it works well as a dedicated, production-ready vector service. Vector workloads can scale independently, making it easier to operate, debug, and evolve without coupling them to the rest of the system.
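As a rough illustration of the second point, HNSW and quantization behavior can be tuned when a collection is created. This is a hedged sketch with illustrative parameter values and a placeholder collection name, not the settings used in this project (the project's own collection setup later in the article relies on Qdrant's defaults).

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="indexing_demo",  # placeholder name for illustration
    vectors_config=models.VectorParams(
        size=384,  # embedding dimension of all-MiniLM-L6-v2
        distance=models.Distance.COSINE,
    ),
    # HNSW graph parameters: more links and a larger build beam raise recall at the cost of memory.
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=128),
    # Scalar quantization compresses vectors to int8, cutting memory for a small recall cost.
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,
        )
    ),
)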
Now that the core ideas are clear, we can move into the implementation. In the next section, I walk through building a hybrid-search RAG system using Qdrant and FastEmbed. The system combines semantic vector search with keyword-based matching over a PDF corpus to produce more accurate and context-aware retrieval. The retrieved context is then used to augment responses generated by an LLM, such as GPT-4o-mini.

Hybrid search

Now let's look at the workflow we will follow to build this system.

Before moving further, it is important to note that Qdrant stores hybrid vectors in a single collection. Qdrant represents data internally using points. Each point has a unique ID, a vectors section that stores both semantic (dense) and lexical (sparse) embeddings, and a payload that holds the original text and metadata. Storing dense and sparse vectors together in one collection enables hybrid search without splitting data across multiple systems.

Now, let's start building.

Prerequisites: Docker 20.10+ and Python 3.9+
Code repo: GitHub - anjaliikakde/hybrid-search-rag

Step 1: Set up the project locally

Go to the GitHub repository for this project and follow the instructions below to run it locally.

Clone the repository
Create a virtual environment
Activate the virtual environment
Install the required dependencies

This prepares the application environment needed to interact with Qdrant and run the RAG pipeline.

Step 2: Run Qdrant in Docker

Open the Docker Desktop application and ensure it is running. Then pull and start Qdrant using the following commands:

docker pull qdrant/qdrant
docker run -p 6333:6333 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant

Once the container starts successfully, you should see logs similar to the following:

... [2021-02-05T00:08:51Z INFO actix_server::builder] Starting 12 workers
[2021-02-05T00:08:51Z INFO actix_server::builder] Starting "actix-web-service-0.0.0.0:6333" service on 0.0.0.0:6333

You can verify that Qdrant is running by opening http://localhost:6333 in your browser. The page should display the Qdrant version information. All data uploaded to Qdrant is stored in the ./qdrant_storage directory and will persist even if the container is stopped or recreated.

Step 3: Understanding why certain components are used

Go through the code snippets below to understand why certain components are used and how they fit into the overall system. These snippets do not follow the exact execution order of the notebook. Instead, they highlight the major building blocks of the workflow. Smaller setup details, such as configuration checks and environment variable loading, can be reviewed directly in the notebook.
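The snippets that follow read their settings from a config dictionary. The exact loading code lives in the notebook; as a rough sketch, the parameters.toml file shown in the next step could be loaded like this (tomllib requires Python 3.11+, so treat the module choice as an assumption; the third-party toml package behaves similarly on older versions).

import tomllib  # standard library on Python 3.11+; use the "toml" package on 3.9/3.10

# Load all pipeline settings from the project's parameters.toml file.
with open("parameters.toml", "rb") as f:
    config = tomllib.load(f)

print(config["vector_store"]["url"])   # http://localhost:6333
print(config["rag"]["top_k"])          # 5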
Step (i)

The process begins by loading the document and counting its pages. Text is then extracted one page at a time, with basic metadata such as page numbers and source attached. Keeping the content page-wise makes the data easier to track, filter, and debug later, especially when it is used in a RAG pipeline.

import fitz
from pathlib import Path
from typing import List, Dict

class PDFDocumentLoader:
    def __init__(self, pdf_path: str):
        self.pdf_path = Path(pdf_path)
        if not self.pdf_path.exists():
            raise FileNotFoundError(f"PDF not found at path: {self.pdf_path}")

    def get_total_pages(self) -> int:
        """Return total number of pages in the PDF."""
        with fitz.open(self.pdf_path) as pdf_doc:
            return pdf_doc.page_count

    def extract_documents(self) -> List[Dict]:
        documents = []
        with fitz.open(self.pdf_path) as pdf_doc:
            total_pages = pdf_doc.page_count
            for page_index in range(total_pages):
                page = pdf_doc.load_page(page_index)
                text = page.get_text("text")

                # Skip empty / non-text pages
                if not text or not text.strip():
                    continue

                documents.append(
                    {
                        "text": text.strip(),
                        "metadata": {
                            "source": self.pdf_path.name,
                            "page": page_index + 1,
                            "char_count": len(text)
                        }
                    }
                )
        return documents

The parameters used for this process are loaded from the parameters.toml file, which looks like this:

[llm]
provider = "openai"
chat_model = "gpt-4o-mini"

[rag]
corpus_path = "../data/AI Engineering.pdf"
chunk_size = 512
chunk_overlap = 64
top_k = 5

[vector_store]
type = "qdrant"
mode = "local"
url = "http://localhost:6333"
collection_name = "knowledge_base_chunks"
storage_path = "./qdrant_storage"

[dense_vector]
name = "dense"
model = "sentence-transformers/all-MiniLM-L6-v2"
distance = "cosine"

[sparse_vector]
enabled = true
name = "sparse"
model = "Qdrant/bm25"

Step (ii)

Before chunking, let's inspect how the extracted documents look.

from typing import List, Dict

class DocumentInspector:
    def __init__(self, documents: List[Dict]):
        if not documents:
            raise ValueError("No documents provided for inspection.")
        self.documents = documents

    def preview(
        self,
        sample_size: int = 3,
        max_chars: int = 500
    ) -> None:
        print(f"\nInspecting {min(sample_size, len(self.documents))} document(s):\n")

        for idx, doc in enumerate(self.documents[:sample_size], start=1):
            text_preview = doc["text"][:max_chars]

            print("=" * 80)
            print(f"Sample #{idx}")
            print(f"Source : {doc['metadata'].get('source')}")
            print(f"Page   : {doc['metadata'].get('page')}")
            print(f"Chars  : {doc['metadata'].get('char_count')}")
            print("-" * 80)
            print(text_preview)
            print("..." if len(doc["text"]) > max_chars else "")
            print("=" * 80)
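To see these two pieces in action, a minimal usage sketch could look like the following. The wiring and variable names are illustrative; the path comes from corpus_path in parameters.toml, and the notebook may wire things slightly differently.

# Load the PDF page by page, then preview a couple of extracted documents.
loader = PDFDocumentLoader("../data/AI Engineering.pdf")
print("Total pages:", loader.get_total_pages())

documents = loader.extract_documents()

inspector = DocumentInspector(documents)
inspector.preview(sample_size=2, max_chars=300)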
Step (iii)

The extracted text contains many unnecessary characters, so we apply text normalization to clean it. This step is crucial because normalized text leads to better chunking, more consistent embeddings, and improved retrieval quality during search.

import re
from typing import List, Dict

class TextNormalizer:
    _MULTIPLE_NEWLINES = re.compile(r"\n{3,}")
    _MULTIPLE_SPACES = re.compile(r"[ \t]{2,}")
    _SPACE_BEFORE_NEWLINE = re.compile(r"[ \t]+\n")
    _LINE_WRAP = re.compile(r"(?<!\n)\n(?!\n)")  # single newlines inside a paragraph

    def normalize_text(self, text: str) -> str:
        if not text:
            return text

        # Normalize line endings
        text = text.replace("\r\n", "\n").replace("\r", "\n")

        # Remove trailing spaces before newlines
        text = self._SPACE_BEFORE_NEWLINE.sub("\n", text)

        # Remove line-wrapped newlines (inside paragraphs)
        text = self._LINE_WRAP.sub(" ", text)

        # Collapse excessive newlines (keep paragraphs)
        text = self._MULTIPLE_NEWLINES.sub("\n\n", text)

        # Collapse repeated spaces
        text = self._MULTIPLE_SPACES.sub(" ", text)

        return text.strip()

    def normalize_documents(self, documents: List[Dict]) -> List[Dict]:
        return [
            {
                "text": self.normalize_text(doc["text"]),
                "metadata": doc["metadata"]
            }
            for doc in documents
        ]

normalizer = TextNormalizer()
normalized_documents = normalizer.normalize_documents(documents)
normalized_documents[1]

Step (iv)

Before chunking, we filter out low-value or noisy pages using a document filtering step. This removes pages with very little content and common front-matter such as copyright notices or tables of contents. Filtering early helps reduce noise, improves embedding quality, and ensures that only meaningful content is indexed and used during retrieval.

class DocumentFilter:
    """
    Filters out low-value or noisy pages before chunking and embedding.
    """

    def __init__(self, min_char_count: int = 200):
        self.min_char_count = min_char_count

    def is_useful(self, doc: dict) -> bool:
        text = doc["text"].lower()

        # Filter very small pages
        if doc["metadata"]["char_count"] < self.min_char_count:
            return False

        # Filter common front-matter patterns
        noise_markers = [
            "isbn",
            "copyright",
            "all rights reserved",
            "table of contents",
            "price",
            "publisher"
        ]
        if any(marker in text for marker in noise_markers):
            return False

        return True

    def filter_documents(self, documents: list) -> list:
        return [doc for doc in documents if self.is_useful(doc)]

Step (v)

Split the cleaned documents into overlapping text chunks while preserving metadata. This improves retrieval accuracy and ensures chunks fit model context limits.

from typing import List, Dict
import math

class TextChunker:
    def __init__(
        self,
        chunk_size: int = 300,
        overlap: int = 50
    ):
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.overlap = overlap

    def _estimate_tokens(self, text: str) -> int:
        return math.ceil(len(text) / 4)

    def _split_into_chunks(self, text: str) -> List[str]:
        words = text.split()
        chunks = []
        start = 0

        while start < len(words):
            end = start + self.chunk_size
            chunk_words = words[start:end]
            chunks.append(" ".join(chunk_words))

            start = end - self.overlap
            if start < 0:
                start = 0

        return chunks

    def chunk_documents(self, documents: List[Dict]) -> List[Dict]:
        chunked_docs = []

        for doc in documents:
            text = doc["text"]
            base_metadata = doc["metadata"]
            chunks = self._split_into_chunks(text)

            for idx, chunk_text in enumerate(chunks):
                chunked_docs.append(
                    {
                        "text": chunk_text,
                        "metadata": {
                            **base_metadata,
                            "chunk_id": f"{base_metadata['source']}_p{base_metadata['page']}_c{idx}",
                            "chunk_index": idx,
                            "chunk_char_count": len(chunk_text),
                            "chunk_token_estimate": self._estimate_tokens(chunk_text)
                        }
                    }
                )

        return chunked_docs
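Wired together, the filtering and chunking steps could look like the sketch below. The variable names are illustrative, and the chunk_size and overlap values shown here are the class defaults rather than the chunk_size = 512 / chunk_overlap = 64 configured in parameters.toml.

# Drop noisy pages, then split the remaining pages into overlapping word-based chunks.
doc_filter = DocumentFilter(min_char_count=200)
useful_documents = doc_filter.filter_documents(normalized_documents)

chunker = TextChunker(chunk_size=300, overlap=50)
chunks = chunker.chunk_documents(useful_documents)

print(f"{len(useful_documents)} pages -> {len(chunks)} chunks")
print(chunks[0]["metadata"]["chunk_id"])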
Step (vi)

Create a hybrid Qdrant collection with separate dense and sparse vector configurations. This step defines the schema for embeddings and prepares the vector store for hybrid retrieval before any data is ingested.

from qdrant_client import QdrantClient, models

class QdrantHybridCollectionManager:
    def __init__(self, client: QdrantClient, config: dict):
        self.client = client
        self.collection_name = config["vector_store"]["collection_name"]
        self.dense_cfg = config["dense_vector"]
        self.sparse_cfg = config["sparse_vector"]

    def recreate_collection(self) -> None:
        if self.client.collection_exists(self.collection_name):
            self.client.delete_collection(self.collection_name)

        self.client.create_collection(
            collection_name=self.collection_name,
            vectors_config={
                self.dense_cfg["name"]: models.VectorParams(
                    size=self._dense_vector_size(),
                    distance=models.Distance[
                        self.dense_cfg["distance"].upper()
                    ],
                )
            },
            sparse_vectors_config={
                self.sparse_cfg["name"]: models.SparseVectorParams()
            },
        )

    def _dense_vector_size(self) -> int:
        return self.client.get_embedding_size(
            self.dense_cfg["model"]
        )

Step (vii)

Convert each text chunk into hybrid Qdrant documents by generating both dense and sparse embeddings using FastEmbed, while attaching the corresponding metadata as payloads for retrieval and filtering.

from typing import List, Dict
from qdrant_client import models

class HybridDocumentBuilder:
    def __init__(self, dense_cfg: dict, sparse_cfg: dict):
        self.dense_name = dense_cfg["name"]
        self.dense_model = dense_cfg["model"]
        self.sparse_name = sparse_cfg["name"]
        self.sparse_model = sparse_cfg["model"]

    def build(self, chunks: List[Dict]):
        documents = []
        payloads = []

        for chunk in chunks:
            text = chunk["text"]

            documents.append(
                {
                    self.dense_name: models.Document(
                        text=text,
                        model=self.dense_model,
                    ),
                    self.sparse_name: models.Document(
                        text=text,
                        model=self.sparse_model,
                    ),
                }
            )

            payloads.append(
                {
                    **chunk["metadata"],
                    "text": text,
                }
            )

        return documents, payloads

Step (viii)

Ingest the prepared hybrid documents into Qdrant by converting them into points with unique IDs, attaching both dense and sparse vectors, and storing the associated metadata as payloads. This makes the data available for dense, sparse, and hybrid retrieval.

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
import uuid

class QdrantIngestor:
    def __init__(self, client: QdrantClient, collection_name: str):
        self.client = client
        self.collection = collection_name

    def ingest(self, documents: list, payloads: list) -> None:
        if len(documents) != len(payloads):
            raise ValueError("Documents and payloads length mismatch")

        points = []
        for i in range(len(documents)):
            vector = documents[i]

            # validation
            if not isinstance(vector, dict):
                raise ValueError(
                    f"Expected hybrid vector dict, got {type(vector)} at index {i}"
                )
            if "dense" not in vector or "sparse" not in vector:
                raise ValueError(
                    f"Hybrid vector must contain 'dense' and 'sparse' keys at index {i}"
                )

            points.append(
                PointStruct(
                    id=str(uuid.uuid4()),
                    vector={
                        "dense": vector["dense"],
                        "sparse": vector["sparse"],
                    },
                    payload=payloads[i],
                )
            )

        self.client.upsert(
            collection_name=self.collection,
            points=points,
        )

Initialize the Qdrant client, build hybrid documents from the prepared text chunks, and ingest them into the configured Qdrant collection. This completes the indexing process and makes the data ready for retrieval. After ingestion, open http://localhost:6333/dashboard to view the newly created collection.
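The final step below wires retrieval and generation together through a searcher component whose implementation lives in the notebook rather than in these snippets. As a rough, hedged sketch of what its hybrid query could look like with Qdrant's Query API (the prefetch limits, fusion choice, and variable names here are my assumptions, not the repository's exact code):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

query = "What is retrieval-augmented generation?"

response = client.query_points(
    collection_name="knowledge_base_chunks",
    prefetch=[
        # Dense (semantic) candidates, embedded locally via FastEmbed.
        models.Prefetch(
            query=models.Document(
                text=query, model="sentence-transformers/all-MiniLM-L6-v2"
            ),
            using="dense",
            limit=20,
        ),
        # Sparse (BM25-style lexical) candidates.
        models.Prefetch(
            query=models.Document(text=query, model="Qdrant/bm25"),
            using="sparse",
            limit=20,
        ),
    ],
    # Merge both candidate lists with reciprocal rank fusion.
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5,
    with_payload=True,
)

for point in response.points:
    print(point.payload["page"], point.payload["text"][:80])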
Step (ix)

So far, we have covered all the major components individually. This final step brings everything together into a complete RAG pipeline. The pipeline takes a user query, performs hybrid retrieval to fetch the most relevant context, builds a prompt from the retrieved chunks, and generates the final answer using the language model. The searcher, prompt builder (RAGPromptBuilder), and answer generator (OpenAIAnswerGenerator) referenced below are constructed in the notebook.

class RAGPipeline:
    """
    Full RAG pipeline: Query → Hybrid Search → OpenAI Answer
    """

    def __init__(
        self,
        searcher,
        prompt_builder: RAGPromptBuilder,
        generator: OpenAIAnswerGenerator,
        top_k: int,
    ):
        self.searcher = searcher
        self.prompt_builder = prompt_builder
        self.generator = generator
        self.top_k = top_k

    def answer(self, query: str) -> str:
        contexts = self.searcher.search(
            query=query,
            top_k=self.top_k,
        )

        messages = self.prompt_builder.build(
            query=query,
            contexts=contexts,
        )

        return self.generator.generate(messages)

rag = RAGPipeline(
    searcher=searcher,
    prompt_builder=prompt_builder,
    generator=generator,
    top_k=config["rag"]["top_k"],
)

In this blog, I showed how to build a hybrid search system in practice and how Qdrant helped me implement it cleanly. By combining semantic vector search with keyword matching, the system gives more reliable and controllable results than semantic search alone.

The main takeaway is simple: real retrieval systems work better when multiple signals are used together. Hybrid search is not extra complexity; it is what actually works in production.

That's it for this post. See you in the next one. Until then, keep learning and keep exploring.

Code repository
GitHub - anjaliikakde/hybrid-search-rag

References
GitHub - qdrant/qdrant: High-performance, massive-scale vector database and vector search engine. Also available in the cloud: https://cloud.qdrant.io/
Home - Qdrant
FastEmbed - Qdrant