Off the back of Retrieval Augmented Generation (RAG), vector databases are getting a lot of attention in the AI world.
Many people say you need tools like Pinecone, Weaviate, Milvus, or Qdrant to build a RAG system and manage your embeddings. If you are working on enterprise applications with hundreds of millions of vectors, then tools like these are essential. They let you perform CRUD operations, filter by metadata, and use disk-based indexing that goes beyond your computer’s memory.
But for most internal tools, documentation bots, or MVP agents, adding a dedicated vector database might be overkill. It adds operational complexity, network latency, and serialisation costs, and gives you one more service to deploy and manage.
The truth is that “Vector Search” (i.e. the Retrieval part of RAG) is just matrix multiplication. And Python already has some of the world’s best tools for that.
In this article, we’ll show how to build a production-ready retrieval component of a RAG pipeline for small-to-medium data volumes using only NumPy and scikit-learn. You’ll see that it’s possible to search millions of text strings in milliseconds, all in memory and without any external services.
Understanding Retrieval as Matrix Math
Typically, RAG involves four main steps:
- Embed: Turn the text of your source data into vectors (lists of floating-point numbers).
- Store: Squirrel those vectors away into a database.
- Retrieve: Find vectors that are mathematically “close” to the query vector.
- Generate: Feed the corresponding text to an LLM and get your final answer.
Steps 1 and 4 rely on large language models. Steps 2 and 3 are the domain of the vector DB. We will concentrate on steps 2 and 3 and show how to avoid using a vector DB entirely.
But when we’re searching our vector database, what does “close” actually mean? Usually, it is cosine similarity. If your two vectors are normalised to have a magnitude of 1, then cosine similarity is just the dot product of the two vectors.
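As a quick sanity check, here is that identity in a few lines of NumPy. The two vectors below are made-up 3-dimensional examples, not real embeddings:

```python
import numpy as np

# Two toy vectors (hypothetical 3-dimensional "embeddings").
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Normalise each vector to unit length (magnitude 1).
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# For unit vectors, cosine similarity is just the dot product.
cos_sim = a_unit @ b_unit
print(cos_sim)  # ≈ 1.0, since b is a scaled copy of a (angle of 0)
```

Normalising once at ingestion time means every later query is a plain dot product, which is why the whole retrieval step reduces to matrix math.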
If you have a query vector Q of shape (1×N) and a matrix of document vectors D of shape (M×N), finding the best matches is not a database query; it is a single matrix multiplication: the dot product of D with the transpose of Q.
Scores = D · Qᵀ
NumPy is designed to perform this kind of operation efficiently, using routines that leverage modern CPU features such as vectorisation.
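To make this concrete, here is a minimal sketch of the scoring step using random unit vectors in place of real embeddings (the sizes are illustrative; 384 matches the output dimension of many small sentence-embedding models):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "database": M = 1000 document vectors of dimension N = 384.
D = rng.normal(size=(1000, 384)).astype(np.float32)
D /= np.linalg.norm(D, axis=1, keepdims=True)  # unit-normalise each row

# One query vector, also unit-normalised.
q = rng.normal(size=384).astype(np.float32)
q /= np.linalg.norm(q)

# Scores = D · Qᵀ — one matrix-vector product gives all M similarities.
scores = D @ q

# Indices of the top-5 most similar documents, best match first.
top_k = np.argsort(scores)[::-1][:5]
print(top_k, scores[top_k])
```

That single `D @ q` line is the entire “retrieve” step: one BLAS-backed operation instead of a round trip to an external service.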
The Implementation
We’ll create a class called SimpleVectorStore to handle ingestion, indexing, and retrieval. Our input data will consist of one or more files containing the text we want to search over. Using Sentence Transformers for local embeddings will make everything work offline.
Prerequisites
Set up a new development environment, install the required libraries, and start a Jupyter notebook.
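Something like the following should get you started. The package names are the usual ones for the libraries mentioned above; pin versions as needed for your environment:

```shell
# Create and activate a fresh virtual environment.
python -m venv .venv
source .venv/bin/activate

# Install the libraries used in this article.
pip install numpy scikit-learn sentence-transformers jupyter

# Launch the notebook server.
jupyter notebook
```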