Have you ever wanted to read through a ton of documents super fast or ask questions based on a particular knowledge or domain? That’s where RAG shines.
RAG stands for retrieval-augmented generation, and it lets you combine a knowledge base, such as a PDF or a web page, with a large language model (LLM), such as Gemini or GPT, to get accurate, fast answers. So instead of relying solely on ChatGPT, which does not know your documents and would likely hallucinate (give incorrect answers), you could build and use a RAG solution.
In this tutorial guide, we’ll build a very simple document search tool with Python, LlamaIndex, ChromaDB, and Ollama.
Prerequisites
To follow along with this guide, you need to have the following installed on your laptop or PC:
- Python 3.10+, and
- Ollama.
These are the steps we will cover:
- Downloading models on Ollama,
- Setting up LlamaIndex and a Chroma database,
- Loading our documents into LlamaIndex, and
- Building a query engine.
Step 1: Download a Model on Ollama
In case you are unfamiliar with Ollama: it is an open-source tool that lets you download models and run them locally in your projects. You can download the software here and follow the on-screen instructions to install it.
In your terminal, run this command to download the Llama 3.1 8B model locally. You may choose any model you prefer; you can browse the other available models in the Ollama model library.
ollama run llama3.1:8b
To confirm the model has downloaded, run the same command again; it opens an interactive prompt where you can send the model a message.
Once that’s done, create and activate your virtual environment, then install ChromaDB, LlamaIndex (plus its Chroma, Ollama, Hugging Face, and web-reader integration packages), and python-dotenv to build with them.
pip install chromadb llama-index llama-index-vector-stores-chroma llama-index-llms-ollama llama-index-embeddings-huggingface llama-index-readers-web python-dotenv
Step 2: Set up LlamaIndex and Chroma DB
First, set up your project with LlamaIndex and ChromaDB. But before we get to that, what is LlamaIndex, and how does ChromaDB factor into a RAG application?
LlamaIndex is an open-source RAG orchestrator; it is the brain behind the RAG pipeline. It loads your documents (PDF, TXT, CSV files, or webpages), splits them into chunks (smaller pieces of text), and saves them in a vector database. A vector database is unlike a regular (SQL or NoSQL) database, which stores images as image files and text as text; instead, it saves every piece of data as a vector embedding, a list of numbers that encodes the meaning of the text. A sentence like "cat sitting on a mat" is stored as something like [0.12, -0.87, 0.44, …]. ChromaDB is one such vector database, and these vectors are what get searched during the retrieval step of RAG. The whole beauty of RAG lies in this similarity search.
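If you are curious what such an embedding actually looks like, here is a minimal, optional sketch using the same all-MiniLM-L6-v2 model we will configure in the next step (it assumes the llama-index-embeddings-huggingface package from the pip install above):

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load the small embedding model used throughout this tutorial
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Turn a sentence into a vector embedding (a plain Python list of floats)
vector = embed_model.get_text_embedding("cat sitting on a mat")

print(len(vector))   # 384 dimensions for all-MiniLM-L6-v2
print(vector[:5])    # the first few numbers of the embedding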
Now that you understand this, let’s initialize LlamaIndex and the Chroma vector database in our project. You will need to import the following:
- VectorStoreIndex: This will be used to wrap the vector store and to load and embed our documents.
- Settings: This will be used to set our LLM and an embedding model.
- StorageContext: This will be used in Step 3 to connect the Chroma vector store to the index.
- ChromaVectorStore: This will be used to set up our Chroma vector database.
- HuggingFaceEmbedding: This will be used to set up an embedding model from Hugging Face.
- Ollama: This will be used to connect to the local Llama 3.1 model we downloaded with Ollama.
Next, we need to define our LLM (the Llama 3.1 model we pulled with Ollama) and an embedding model from Hugging Face. We will use the all-MiniLM-L6-v2 embedding model, which is perfect for this simple project.
from llama_index.core import VectorStoreIndex, Settings, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
import chromadb

# Use the local Llama 3.1 model served by Ollama as the LLM
Settings.llm = Ollama(model="llama3.1:8b", request_timeout=200)
# Use a small Hugging Face model to create the vector embeddings
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
Next, let’s define our Chroma vector database. We will need to initialize a Chroma client, and we will use PersistentClient, a client type that stores vectors and metadata (document text, source, etc.) on your machine’s local disk, similar to how SQLite works.
Then, we will define a collection. Think of a collection like a table inside Chroma; it holds your vectors (the numbers) and their metadata so they can be queried during the retrieval process.
After the collection is created, we will need to create a vector store that takes the collection as input.
# Store the vectors on disk in the ./chroma_db folder (created if it doesn't exist)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
# Create the collection, or reuse it if it already exists
chroma_collection = chroma_client.get_or_create_collection("rag_collection")
# Wrap the Chroma collection so LlamaIndex can use it as its vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
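Because PersistentClient writes to disk, the ./chroma_db folder survives between runs. If you ever want a quick sanity check of what is stored, the chromadb collection exposes a count() method; a tiny optional sketch:

# Number of vectors currently stored (0 until we index documents in Step 3)
print(chroma_collection.count())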
Step 3: Load Documents & Build an Index
After setting up LlamaIndex and ChromaDB, we need to feed our documents into LlamaIndex, which will load them, split them into chunks, and store their vectors in the Chroma vector database.
This guide will use a webpage as the raw document, so you will need the llama-index-readers-web package to crawl a webpage (it was included in the pip install earlier; run the command below if you skipped it).
pip install llama-index-readers-web
Once you install it, import the library, crawl the web page, and send it as a document:
from llama_index.readers.web import BeautifulSoupWebReader
url = ["https://en.wikipedia.org/wiki/Artificial_intelligence"]
docs = BeautifulSoupWebReader().load_data(url)
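Before building the index, you can optionally peek at what the reader scraped. This short sketch just prints a preview of the loaded content (LlamaIndex Document objects expose their raw content through a text attribute):

print(len(docs))            # number of Document objects loaded (one per URL)
print(docs[0].text[:300])   # preview the first 300 characters of the scraped page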
Next, we need to build an index; this is where the beauty of LlamaIndex lies. The index is a smart wrapper around the vector store (ChromaDB) that handles the RAG orchestration: chunking, creating vector embeddings, storing those embeddings, and retrieving data.
To build the index, wrap the vector store in a storage context (the StorageContext we imported earlier) and pass it along with the documents:
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
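Because the embeddings are persisted in ./chroma_db, you don’t have to re-crawl and re-embed the page on every run. On later runs, a sketch like the following (assuming the same Settings and vector_store set up above) rebuilds the index directly from the stored vectors:

# Rebuild the index from the vectors already stored in Chroma (no re-embedding needed)
index = VectorStoreIndex.from_vector_store(vector_store)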
Step 4: Ask Questions About Your Documents
Next, let’s build the query engine that takes our questions and gives answers. When you ask a question, the embedding model converts it into a single vector. That vector is used to search against the document’s vector embeddings stored in Chroma DB. The text chunks with the most similar embeddings are retrieved, and the LLM uses those chunks to generate your answer.
Your question → one embedding vector → nearest document vectors → retrieved text chunks → LLM answer (English).
To build the query engine, use the as_query_engine() method on the index and send your query to it.
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the document into five bullet points")
print(response)
Your output should be a short, five-bullet summary of the webpage, generated by the local Llama model.
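By default, the query engine only retrieves a couple of the most similar chunks. If the answers feel thin, you can ask it to pull in more context; similarity_top_k is a standard as_query_engine() parameter, and the question below is just an example:

# Retrieve the 5 most similar chunks instead of the small default
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What are the main risks of artificial intelligence?"))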
Wrap-Up
And that is how you can build a simple RAG tool that crawls a webpage and lets you ask questions about its content. Along the way, you saw how to load and index your documents, store vector embeddings in the Chroma database, and how the LLM answers questions based on your documents.
With a few lines of Python, you can build a basic retrieval-augmented generation (RAG) solution, but it doesn’t stop here. You can extend this project to search across multiple web pages, load large documents, add a simple web UI using either Streamlit or Anvil (a minimal sketch follows below), or even experiment with different models in Ollama.
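For instance, here is a minimal, hypothetical Streamlit sketch; it assumes the index and query engine built above are available in the same script, and simply wires a text box to the query engine:

import streamlit as st

st.title("Ask the webpage")

question = st.text_input("Your question")
if question:
    # Reuse the query engine built earlier in this tutorial
    response = query_engine.query(question)
    st.write(str(response))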
The real fun starts when you integrate this into your own projects, so have fun building and extending it!
If you would like the video version of this guide, you can find it here: