Off the back of Retrieval Augmented Generation (RAG), vector databases are getting a lot of attention in the AI world.
Many people say you need tools like Pinecone, Weaviate, Milvus, or Qdrant to build a RAG system and manage your embeddings. If you are working on enterprise applications with hundreds of millions of vectors, then tools like these are essential. They let you perform CRUD operations, filter by metadata, and use disk-based indexing that goes beyond your computer’s memory.
But for most internal tools, documentation bots, or MVP agents, adding a dedicated vector database might be overkill. It increases complexity, network delays, adds serialisation costs, and makes things more complicated to manage.
The truth is that “Vector Search” (i.e the Retrieval part of RAG) is just matrix multiplication. And Python already has some of the world’s best tools for that.
In this article, we’ll show how to build a production-ready retrieval component of a RAG pipeline for small-to-medium data volumes using only NumPy and SciKit-Learn. You’ll see that it’s possible to search millions of text strings in milliseconds, all in memory and without any external dependencies.
Understanding Retrieval as Matrix Math
Typically, RAG involves four main steps:
- Embed: Turn the text of your source data into vectors (lists of floating-point numbers)
- Store: Squirrel those vectors away into a database
- Retrieve: Find vectors that are mathematically “close” to the query vector.
- Generate: Feed the corresponding text to an LLM and get your final answer.
Steps 1 and 4 rely on models: an embedding model for step 1 and a large language model for step 4. Steps 2 and 3 are the domain of the vector DB. We will concentrate on steps 2 and 3 and show how to avoid using a vector DB entirely.
But when we’re searching our vector database, what actually is “closeness”? Usually, it is Cosine Similarity. If your two vectors are normalised to have a magnitude of 1, then cosine similarity is just the dot product of the two.
If you have a query vector Q of shape 1×N and a matrix of M document vectors D of shape M×N, finding the best matches is not a database query; it is a matrix multiplication, the dot product of D with the transpose of Q, which yields M scores, one per document.
Scores = D.Q^T
NumPy is designed to perform this kind of operation efficiently, using routines that leverage modern CPU features such as vectorisation.
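To make that concrete, here is a minimal sketch of scoring a query against a document matrix. The vectors are made up for illustration; the query is deliberately chosen to point in the same direction as document 0, so it should score (near) 1.0 against it.

```python
import numpy as np

# Hypothetical toy data: 4 documents, 3-dimensional embeddings
D = np.array([[0.1, 0.9, 0.2],
              [0.8, 0.1, 0.3],
              [0.2, 0.8, 0.1],
              [0.9, 0.2, 0.4]], dtype=np.float32)
Q = np.array([[0.1, 0.9, 0.2]], dtype=np.float32)  # 1 x N query

# Normalise rows to unit length so the dot product equals cosine similarity
D = D / np.linalg.norm(D, axis=1, keepdims=True)
Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)

scores = (D @ Q.T).flatten()      # Scores = D . Q^T, shape (4,)
best = np.argsort(scores)[::-1]   # document indices, best match first
print(best[0], scores[best[0]])
```

Everything that follows in this article is essentially this snippet, wrapped in a class and fed real embeddings.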
The Implementation
We’ll create a class called SimpleVectorStore to handle ingestion, indexing, and retrieval. Our input data will consist of one or more files containing the text we want to search on. Using Sentence Transformers for local embeddings will make everything work offline.
Prerequisites
Set up a new development environment, install the required libraries, and start a Jupyter notebook.
Type the following commands into a command shell. I’m using UV as my package manager; change to suit whatever tool you’re using.
$ uv init ragdb
$ cd ragdb
$ uv venv ragdb
$ source ragdb/bin/activate
$ uv pip install numpy scikit-learn sentence-transformers jupyter
$ jupyter notebook
The In-Memory Vector Store
We don’t need a complicated server. All we need is a function to load our text data from the input files and chunk it into bite-sized pieces, plus a class holding two things: a list of the raw text chunks and the embedding matrix. Here’s the code.
import numpy as np
import os
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any
from pathlib import Path

class SimpleVectorStore:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        print(f"Loading embedding model: {model_name}...")
        self.encoder = SentenceTransformer(model_name)
        self.documents = []    # Stores the raw text and metadata
        self.embeddings = None # Will become a numpy array

    def add_documents(self, docs: List[Dict[str, Any]]):
        """
        Ingests documents.
        docs format: [{'text': '...', 'metadata': {...}}, ...]
        """
        texts = [d['text'] for d in docs]

        # 1. Generate Embeddings
        print(f"Embedding {len(texts)} documents...")
        new_embeddings = self.encoder.encode(texts)

        # 2. Normalize Embeddings
        # (Critical optimisation: makes the dot product equal cosine similarity)
        norm = np.linalg.norm(new_embeddings, axis=1, keepdims=True)
        new_embeddings = new_embeddings / norm

        # 3. Update Storage
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
        self.documents.extend(docs)
        print(f"Store now contains {len(self.documents)} documents.")

    def search(self, query: str, k: int = 5):
        """
        Retrieves the top-k most similar documents.
        """
        if self.embeddings is None or len(self.documents) == 0:
            print("Warning: Vector store is empty. No documents to search.")
            return []

        # 1. Embed and Normalize Query
        query_vec = self.encoder.encode([query])
        norm = np.linalg.norm(query_vec, axis=1, keepdims=True)
        query_vec = query_vec / norm

        # 2. Vectorized Search (Matrix Multiplication)
        # Result shape after flattening: (N_docs,)
        scores = np.dot(self.embeddings, query_vec.T).flatten()

        # 3. Get Top-K Indices
        # argsort sorts ascending, so we take the last k and reverse them.
        # Ensure k doesn't exceed the number of documents.
        k = min(k, len(self.documents))
        top_k_indices = np.argsort(scores)[-k:][::-1]

        results = []
        for idx in top_k_indices:
            results.append({
                "score": float(scores[idx]),
                "text": self.documents[idx]['text'],
                "metadata": self.documents[idx].get('metadata', {})
            })
        return results
def load_from_directory(directory_path: str, chunk_size: int = 1000, overlap: int = 200):
    """
    Reads .txt files and splits them into overlapping chunks.
    """
    docs = []

    # Use pathlib for robust path handling and resolution
    path = Path(directory_path).resolve()
    if not path.exists():
        print(f"Error: Directory '{path}' not found.")
        print(f"Current working directory: {os.getcwd()}")
        return docs

    print(f"Loading documents from: {path}")
    for file_path in path.glob("*.txt"):
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read()

            # Simple sliding window chunking.
            # We iterate through the text with a step size smaller than the
            # chunk size to create overlap (preserving context between chunks).
            step = chunk_size - overlap
            for i in range(0, len(text), step):
                chunk = text[i : i + chunk_size]
                # Skip chunks that are too small (e.g., leftover whitespace)
                if len(chunk) < 50:
                    continue
                docs.append({
                    "text": chunk,
                    "metadata": {
                        "source": file_path.name,
                        "chunk_index": i
                    }
                })
        except Exception as e:
            print(f"Warning: Could not read file {file_path.name}: {e}")

    print(f"Successfully loaded {len(docs)} chunks from {len(list(path.glob('*.txt')))} files.")
    return docs
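As a sanity check on the sliding-window arithmetic: with chunk_size=1000 and overlap=200, the step is 800, so chunks start at offsets 0, 800, 1600, and so on, each sharing 200 characters with its predecessor. A toy version with small, made-up sizes makes the overlap easy to see (note the real function above also discards chunks under 50 characters):

```python
# Toy illustration of the sliding-window chunking used above
text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
chunk_size, overlap = 10, 4
step = chunk_size - overlap           # 6

chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
print(chunks)
# The tail of each chunk reappears at the head of the next one
```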
The embedding model used
The all-MiniLM-L6-v2 model used in the code is from the Sentence Transformers library. It was chosen because:
- It’s fast and lightweight.
- It produces 384-dimensional vectors that use less memory than larger models.
- It performs well on a wide variety of English-language tasks without needing specialised fine-tuning.
This model is just a suggestion. You can use any embedding model you want if you have a particular favourite.
Why Normalise?
You might notice the normalisation steps in the code. We mentioned it before, but to be clear, given two vectors X and Y, cosine similarity is defined as
Similarity = (X · Y) / (||X|| * ||Y||)
Where:
- X · Y is the dot product of vectors X and Y
- ||X|| is the magnitude (length) of vector X
- ||Y|| is the magnitude of vector Y
Since the division costs extra computation, we normalise all our vectors to unit magnitude up front. The denominator then becomes 1 and the formula reduces to the dot product of X and Y, which makes searching faster.
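You can verify the shortcut numerically: after normalisation, a bare dot product matches SciKit-Learn’s cosine_similarity. The vectors below are random stand-ins for real embeddings.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(42)
X = rng.normal(size=(5, 384))   # five stand-in "document embeddings"
q = rng.normal(size=(1, 384))   # one stand-in query

# Full cosine similarity (the division happens inside)
full = cosine_similarity(X, q).flatten()

# Normalise once up front, then a plain dot product gives the same answer
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
qn = q / np.linalg.norm(q, axis=1, keepdims=True)
fast = (Xn @ qn.T).flatten()

print(np.allclose(full, fast))  # → True
```

The win is that normalisation happens once at ingestion time, while the division in the full formula would otherwise be paid on every search.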
Testing the Performance
The first thing we need to do is get some input data to work with. You can use any input text file for this. For previous RAG experiments, I used a book I downloaded from Project Gutenberg. The consistently riveting:
“Diseases of cattle, sheep, goats, and swine by Jno. A. W. Dollar & G. Moussu”
Note that you can view the Project Gutenberg Permissions, Licensing and other Common Requests page using the following link.
https://www.gutenberg.org/policy/permission.html
But to summarise, the vast majority of Project Gutenberg eBooks are in the public domain in the US and other parts of the world, meaning nobody can grant or withhold permission to do with them as you please. “… as you please” includes any commercial use, republishing in any format, and making derivative works or performances.
I downloaded the text of the book from the Project Gutenberg website to my local PC using this link,
https://www.gutenberg.org/ebooks/73019.txt.utf-8
This book contains approximately 36,000 lines of text, and querying it takes only six lines of code. For my sample question, I chose a topic from line 2315 of the book, which discusses a disease called CONDYLOMATA. Here is the excerpt:
INFLAMMATION OF THE INTERDIGITAL SPACE.
(CONDYLOMATA.)
Condylomata result from chronic inflammation of the skin covering the interdigital ligament. Any injury to this region causing even superficial damage may result in chronic inflammation of the skin and hypertrophy of the papillæ, the first stage in the production of condylomata.
Injuries produced by cords slipped into the interdigital space for the purpose of lifting the feet when shoeing working oxen are also fruitful causes.
So that’s what we’ll ask: “What is Condylomata?” Note that we won’t get a proper answer, as we’re not feeding our search result into an LLM, but we should see our search return a text snippet that would give an LLM all the information it needs to formulate an answer had we done so.
%%time
# 1. Initialize
store = SimpleVectorStore()
# 2. Load Documents
real_docs = load_from_directory("/mnt/d/book")
# 3. Add to Store
if real_docs:
store.add_documents(real_docs)
# 4. Search
results = store.search("What is Condylomata?", k=1)
results
And here is the output.
Loading embedding model: all-MiniLM-L6-v2...
Loading documents from: /mnt/d/book
Successfully loaded 2205 chunks from 1 files.
Embedding 2205 documents...
Store now contains 2205 documents.
CPU times: user 3.27 s, sys: 377 ms, total: 3.65 s
Wall time: 3.82 s
[{'score': 0.44883957505226135,
'text': 'two last\nphalanges, the latter operation being easier than
the former, and\nproviding flaps of more regular shape and better adapted
for the\nproduction of a satisfactory stump.\n\n\n
INFLAMMATION OF THE INTERDIGITAL SPACE.\n\n(CONDYLOMATA.)\n\n
Condylomata result from chronic inflammation of the skin covering
the\ninterdigital ligament. Any injury to this region causing
even\nsuperficial damage may result in chronic inflammation of the
skin and\nhypertrophy of the papillæ, the first stage in the production
of\ncondylomata.\n\nInjuries produced by cords slipped into the
interdigital space for the\npurpose of lifting the feet when shoeing
working oxen are also fruitful\ncauses.\n\nInflammation of the
interdigital space is also a common complication of\naphthous eruptions
around the claws and in the space between them.\nContinual contact with
litter, dung and urine favour infection of\nsuperficial or deep wounds,
and by causing exuberant granulation lead to\nhypertrophy of the papillary
layer of ',
'metadata': {'source': 'cattle_disease.txt', 'chunk_index': 122400}}]
Under 4 seconds to read, chunk, store, and correctly query a 36,000-line text document is pretty good going.
SciKit-Learn: The Upgrade Path
NumPy works well for brute-force searches. But what if you have dozens or hundreds of documents and brute force becomes too slow? Before switching to a vector database, you can try SciKit-Learn’s NearestNeighbors. For low-dimensional data it can use tree-based structures like KD-Tree and Ball-Tree to cut search time from O(N) towards O(log N); for high-dimensional embeddings with the cosine metric it falls back to an optimised brute-force search.
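As a minimal sketch of the NearestNeighbors API before we wire it into our store (random stand-in vectors, not real embeddings; note the tree algorithms support Euclidean-style metrics, while cosine requires algorithm='brute'):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 32)).astype(np.float32)  # fake low-dim vectors

# Ball tree works with Euclidean-style metrics; cosine would need 'brute'
nn = NearestNeighbors(n_neighbors=3, algorithm='ball_tree', metric='euclidean')
nn.fit(docs)

# A query sitting almost exactly on top of document 0
query = docs[:1] + 0.01
distances, indices = nn.kneighbors(query)
print(indices[0][0])  # nearest neighbour should be document 0
```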
To test this out, I downloaded a bunch of other books from Gutenberg, including:-
- A Christmas Carol by Charles Dickens
- The Life and Adventures of Santa Claus by L. Frank Baum
- War and Peace by Tolstoy
- A Farewell to Arms by Hemingway
In total, these books contain around 120,000 lines of text. I copied and pasted all five input book files ten times, resulting in fifty files and 1.2 million lines of text. That’s around 12 million words, assuming an average of 10 words per line. To provide some context, this article contains approximately 2800 words, so the data volume we’re testing with is equivalent to over 4000 times the volume of this text.
$ dir
achristmascarol\ -\ Copy\ (2).txt cattle_disease\ -\ Copy\ (9).txt santa\ -\ Copy\ (6).txt
achristmascarol\ -\ Copy\ (3).txt cattle_disease\ -\ Copy.txt santa\ -\ Copy\ (7).txt
achristmascarol\ -\ Copy\ (4).txt cattle_disease.txt santa\ -\ Copy\ (8).txt
achristmascarol\ -\ Copy\ (5).txt farewelltoarms\ -\ Copy\ (2).txt santa\ -\ Copy\ (9).txt
achristmascarol\ -\ Copy\ (6).txt farewelltoarms\ -\ Copy\ (3).txt santa\ -\ Copy.txt
achristmascarol\ -\ Copy\ (7).txt farewelltoarms\ -\ Copy\ (4).txt santa.txt
achristmascarol\ -\ Copy\ (8).txt farewelltoarms\ -\ Copy\ (5).txt warandpeace\ -\ Copy\ (2).txt
achristmascarol\ -\ Copy\ (9).txt farewelltoarms\ -\ Copy\ (6).txt warandpeace\ -\ Copy\ (3).txt
achristmascarol\ -\ Copy.txt farewelltoarms\ -\ Copy\ (7).txt warandpeace\ -\ Copy\ (4).txt
achristmascarol.txt farewelltoarms\ -\ Copy\ (8).txt warandpeace\ -\ Copy\ (5).txt
cattle_disease\ -\ Copy\ (2).txt farewelltoarms\ -\ Copy\ (9).txt warandpeace\ -\ Copy\ (6).txt
cattle_disease\ -\ Copy\ (3).txt farewelltoarms\ -\ Copy.txt warandpeace\ -\ Copy\ (7).txt
cattle_disease\ -\ Copy\ (4).txt farewelltoarms.txt warandpeace\ -\ Copy\ (8).txt
cattle_disease\ -\ Copy\ (5).txt santa\ -\ Copy\ (2).txt warandpeace\ -\ Copy\ (9).txt
cattle_disease\ -\ Copy\ (6).txt santa\ -\ Copy\ (3).txt warandpeace\ -\ Copy.txt
cattle_disease\ -\ Copy\ (7).txt santa\ -\ Copy\ (4).txt warandpeace.txt
cattle_disease\ -\ Copy\ (8).txt santa\ -\ Copy\ (5).txt
Let’s say we were ultimately looking for an answer to the following question:
Who, after the Christmas holidays, did Nicholas tell his mother of his love for?
In case you didn’t know, this comes from the novel War and Peace.
Let’s see how our new search does against this large body of information.
Here is the code using SciKit-Learn.
First off, we have a new class that wraps SciKit-Learn’s NearestNeighbors algorithm.
from sklearn.neighbors import NearestNeighbors

class ScikitVectorStore(SimpleVectorStore):
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        super().__init__(model_name)
        # Brute force is often faster than trees for high-dimensional data,
        # and the cosine metric requires it; 'ball_tree' can help for
        # lower-dimensional data with other metrics.
        self.knn = NearestNeighbors(n_neighbors=5, metric='cosine', algorithm='brute')
        self.is_fit = False

    def build_index(self):
        print("Building Scikit-Learn Index...")
        self.knn.fit(self.embeddings)
        self.is_fit = True

    def search(self, query: str, k: int = 5):
        if not self.is_fit:
            self.build_index()
        query_vec = self.encoder.encode([query])
        # Note: the 'cosine' metric normalises internally, so we don't
        # need to normalise the query vector ourselves here.
        distances, indices = self.knn.kneighbors(query_vec, n_neighbors=k)

        results = []
        for i in range(k):
            idx = indices[0][i]
            # Convert cosine distance back to similarity (similarity = 1 - distance)
            score = 1 - distances[0][i]
            results.append({
                "score": float(score),
                "text": self.documents[idx]['text'],
                "metadata": self.documents[idx].get('metadata', {})
            })
        return results
And our search code is just as simple as for the NumPy version.
%%time
# 1. Initialize
store = ScikitVectorStore()
# 2. Load Documents
real_docs = load_from_directory("/mnt/d/book")
# 3. Add to Store
if real_docs:
store.add_documents(real_docs)
# 4. Search
results = store.search("Who, after the Christmas holidays, did Nicholas tell his mother of his love for", k=1)
results
And our output.
Loading embedding model: all-MiniLM-L6-v2...
Loading documents from: /mnt/d/book
Successfully loaded 73060 chunks from 50 files.
Embedding 73060 documents...
Store now contains 73060 documents.
Building Scikit-Learn Index...
CPU times: user 1min 46s, sys: 18.3 s, total: 2min 4s
Wall time: 1min 13s
[{'score': 0.6972659826278687,
'text': '\nCHAPTER XIII\n\nSoon after the Christmas holidays Nicholas told
his mother of his love\nfor Sónya and of his firm resolve to marry her. The
countess, who\nhad long noticed what was going on between them and was
expecting this\ndeclaration, listened to him in silence and then told her son
that he\nmight marry whom he pleased, but that neither she nor his father
would\ngive their blessing to such a marriage. Nicholas, for the first time,
\nfelt that his mother was displeased with him and that, despite her love\n
for him, she would not give way. Coldly, without looking at her son,\nshe
sent for her husband and, when he came, tried briefly and coldly to\ninform
him of the facts, in her son's presence, but unable to restrain\nherself she
burst into tears of vexation and left the room. The old\ncount began
irresolutely to admonish Nicholas and beg him to abandon his\npurpose.
Nicholas replied that he could not go back on his word, and his\nfather,
sighing and evidently disconcerted, very soon became silent ',
'metadata': {'source': 'warandpeace - Copy (6).txt',
'chunk_index': 1396000}}]
Almost all of the 1m 13s it took to do the above processing was spent loading, chunking, and embedding the input data. The actual search, when I ran it separately, took less than one-tenth of a second!
Not too shabby at all.
Summary
I am not arguing that Vector Databases are not needed. They solve specific problems that NumPy and SciKit-Learn do not handle. You should migrate from something like our SimpleVectorStore or ScikitVectorStore to Weaviate/Pinecone/pgvector, etc, when any of the following conditions apply.
- Persistence: You need data to survive a server restart without rebuilding the index from source files every time (though np.save or pickling covers simple cases).
- RAM is the bottleneck: Your embedding matrix exceeds your server’s memory. Note: 1 million vectors of 384 float32 dimensions is only ~1.5GB of RAM, so you can fit a lot in memory.
- CRUD frequency: You need to constantly update or delete individual vectors while serving reads. NumPy arrays have a fixed size, so appending requires copying the whole array, which is slow.
- Metadata Filtering: You need complex queries like “Find vectors near X where user_id=10 AND date > 2023”. Doing this in NumPy requires boolean masks that can get messy.
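For completeness, here is what that boolean-mask filtering looks like in plain NumPy. The user_id and year arrays are hypothetical metadata aligned row-for-row with the embedding matrix; it works, but it is easy to see how this gets messy as filters multiply.

```python
import numpy as np

# Hypothetical metadata arrays, one entry per stored vector
user_ids = np.array([10, 10, 7, 10, 3])
years    = np.array([2022, 2024, 2024, 2025, 2021])
scores   = np.array([0.91, 0.40, 0.88, 0.75, 0.99])  # pre-computed similarities

# "Find vectors near X where user_id == 10 AND year > 2023"
mask = (user_ids == 10) & (years > 2023)
eligible = np.where(mask)[0]                     # row indices passing the filter
best = eligible[np.argmax(scores[eligible])]     # best-scoring eligible row
print(best)  # → 3
```

A real vector database pushes this filtering into the index itself; here every new condition is another hand-maintained array and another mask term.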
Engineering always involves trade-offs. Using a vector database adds complexity to your setup in exchange for scalability you may not need right now. If you start with a more straightforward RAG setup using NumPy and/or SciKit-Learn for the retrieval process, you get:
- Lower Latency. No network hops.
- Lower Costs. No SaaS subscriptions or extra instances.
- Simplicity. It is just a Python script.
Just as you don’t need a sports car to go to the grocery store, you don’t always need a dedicated vector database to do retrieval. In many cases, NumPy or SciKit-Learn may be all the RAG search you need.