> Build your own Retrieval-Augmented Generation (RAG) system in Google Colab — understand how LLMs retrieve, reason, and respond with real facts.
💡 Why I Built This
A few weeks ago, I realized most explanations of RAG (Retrieval-Augmented Generation) online are either too abstract or too complex.
So I decided to build something anyone could understand — and more importantly, run inside Google Colab.
The result?
A Minimal, Fully Documented RAG Example — where you can drop in your own .txt files and watch an LLM retrieve, reason, and respond with context-aware answers.
🧩 What You’ll Learn
This project walks through every major step of a RAG system:
- Installing the required libraries
- Loading and chunking your text documents
- Converting those chunks into dense embeddings
- Building a FAISS similarity index
- Retrieving the most relevant chunks for a user query
- Building a context-rich prompt
- Generating a deterministic, reproducible answer using GPT-2
⚙️ How It Works
Here’s what happens under the hood:
1️⃣ Document Chunking
Each .txt file in your docs/ folder is split into 50-word chunks with a 10-word overlap.
This ensures smooth semantic continuity — no idea gets lost between boundaries.
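Here is a minimal sketch of that chunking step; the fully documented version appears in the code walkthrough below.

```python
# Minimal sketch of word-based chunking with overlap (full version in the walkthrough below)
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # step forward, keeping `overlap` words in common
    return chunks
```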
2️⃣ Embeddings with SentenceTransformers
Each chunk is transformed into a vector — a numerical summary of its meaning. These embeddings help the system search by meaning, not just keywords.
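A sketch of the embedding step, assuming the same all-MiniLM-L6-v2 model used later in the walkthrough (`chunks` is the list produced by the chunking sketch above):

```python
from sentence_transformers import SentenceTransformer

# Encode every chunk into a dense vector
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, convert_to_numpy=True)  # shape: (num_chunks, 384)
```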
3️⃣ FAISS Indexing
The embeddings are stored in a FAISS (Facebook AI Similarity Search) index, which makes it possible to instantly find the most relevant chunks for any question.
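Roughly, the indexing step looks like this (`embeddings` is the array from the previous sketch):

```python
import faiss

dim = embeddings.shape[1]       # embedding dimension (384 for all-MiniLM-L6-v2)
index = faiss.IndexFlatL2(dim)  # exact L2 (Euclidean) search
index.add(embeddings)           # FAISS expects float32 vectors, which encode() returns
```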
4️⃣ Retrieval
When you ask a question, the system encodes it into the same embedding space, then retrieves the top-3 most similar chunks.
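Sketched out, retrieval reuses the same model and index from the snippets above:

```python
user_query = "What are the essential kits for hiking or trekking?"
query_vec = model.encode([user_query], convert_to_numpy=True)  # same embedding space
distances, indices = index.search(query_vec, 3)                # top-3 nearest chunks
retrieved = [chunks[i] for i in indices[0]]
```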
5️⃣ Prompt Construction
The retrieved chunks are stitched into a structured prompt like this:
```
Context:
<chunk1>

<chunk2>

<chunk3>

Question: <your question>
Answer:
```
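In Python that is a single f-string (a sketch using the variables from the retrieval snippet above):

```python
context = "\n\n".join(retrieved)
prompt = f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:"
```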
6️⃣ Answer Generation
Finally, a GPT-2 language model reads the context and generates an answer — deterministically (same input = same output every time).
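A sketch of deterministic generation; it uses the small gpt2 checkpoint here for speed, while the walkthrough below uses gpt2-large:

```python
from transformers import pipeline, set_seed

set_seed(42)                                           # fixed seed for reproducibility
generator = pipeline("text-generation", model="gpt2")  # small model for a quick demo
output = generator(prompt, max_new_tokens=50, do_sample=False)[0]["generated_text"]
answer = output.split("Answer:")[-1].strip()           # keep only the text after "Answer:"
```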
🧪 Example Output
Question:
What are the essential kits for hiking or trekking?
Generated Answer:
1. Backpack
2. Water bottle / Hydration pack
3. Trekking poles
4. Map / Compass / GPS
5. First-aid kit
6. Rain gear
7. Snacks / Energy bars
Everything you see is retrieved from your own documents, not from the model’s memory — that’s the beauty of RAG.
🧰 How to Use
- Open the Google Colab Notebook.
- Create a folder named docs/ in the Colab root.
- Place any number of .txt files inside it.
- Run all cells (or execute python rag_colab.py).
📜 Full Code Walkthrough
```python
# ---------- 1. Install required libraries ----------
# (Run this in Colab or your environment; in a script omit the '!' when using pip)
!pip install -q sentence-transformers faiss-cpu transformers
```
```python
# ---------- 2. Imports ----------
import os
import faiss
import numpy as np
import torch
from pathlib import Path
from sentence_transformers import SentenceTransformer
from transformers import pipeline, set_seed
```
```python
# ---------- 3. Document loader ----------
def load_documents(doc_dir: str) -> list:
    """
    Load and preprocess all plain-text documents in a directory.

    Parameters
    ----------
    doc_dir : str
        Path to a folder that contains one or more *.txt files.

    Returns
    -------
    docs : list of str
        A flat list where each element is a chunk of 50 words taken from
        the original files, with a 10-word overlap between consecutive chunks.

    Notes
    -----
    • Chunking at a small granularity (50 words) allows the retriever to
      identify highly relevant snippets rather than whole paragraphs.
    • The 10-word overlap ensures that the boundary words of a chunk are
      not lost when we split a document – this improves semantic continuity
      for the embedding model.
    """
    print("Loading documents...")

    # Helper that splits a single string into overlapping chunks
    def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list:
        """
        Split a block of text into overlapping word-based chunks.

        Parameters
        ----------
        text : str
            Raw document text.
        chunk_size : int, optional
            Number of words per chunk (default 50).
        overlap : int, optional
            Number of words that consecutive chunks share (default 10).

        Returns
        -------
        list of str
            List of chunk strings.
        """
        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = start + chunk_size
            chunk = " ".join(words[start:end])
            chunks.append(chunk)
            start += chunk_size - overlap
        return chunks

    docs = []
    for file_path in Path(doc_dir).glob("*.txt"):
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
        # Extend the master list with the new chunks
        docs.extend(chunk_text(content))

    print(f"Loaded: Number of docs: {len(docs)}")
    return docs
```
```python
# ---------- 4. Embeddings ----------
def embed_documents(docs: list, model_name: str = "all-MiniLM-L6-v2") -> tuple:
    """
    Convert text chunks into dense vector representations.

    Parameters
    ----------
    docs : list of str
        The list of document chunks to embed.
    model_name : str, optional
        The sentence-transformer model to use. The default "all-MiniLM-L6-v2"
        is lightweight and works well for quick demos.

    Returns
    -------
    embeddings : np.ndarray
        2-D array of shape (num_chunks, embedding_dim).
    model : SentenceTransformer
        The loaded embedding model – kept for re-encoding queries later.
    """
    print("Embedding documents...")
    model = SentenceTransformer(model_name)
    embeddings = model.encode(docs, convert_to_numpy=True)
    return embeddings, model
```
```python
# ---------- 5. FAISS index ----------
def build_faiss_index(embeddings: np.ndarray) -> faiss.IndexFlatL2:
    """
    Build a FAISS index for fast nearest-neighbour search.

    Parameters
    ----------
    embeddings : np.ndarray
        2-D array of document embeddings (float32, as returned by
        SentenceTransformer.encode).

    Returns
    -------
    faiss.IndexFlatL2
        A FAISS index that can answer distance-based queries.
    """
    print("Building FAISS index...")
    if embeddings.ndim != 2 or embeddings.shape[0] == 0:
        raise ValueError(
            f"Embeddings must be a non-empty 2-D array. Received shape: {embeddings.shape}"
        )
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings)
    return index
```
```python
# ---------- 6. Retrieval ----------
def retrieve(index: faiss.IndexFlatL2, query_embedding: np.ndarray, k: int = 3) -> tuple:
    """
    Find the top-k most similar document chunks to a query vector.

    Parameters
    ----------
    index : faiss.IndexFlatL2
        The pre-built FAISS index.
    query_embedding : np.ndarray
        Embedding of the user question, shape (1, dim).
    k : int, optional
        How many neighbours to return (default 3).

    Returns
    -------
    indices : np.ndarray of int
        1-D array of the top-k document indices.
    distances : np.ndarray of float
        Corresponding L2 distances – useful for debugging.
    """
    print(f"Retrieving top-{k} documents...")
    distances, indices = index.search(query_embedding, k)
    return indices[0], distances[0]
```
```python
# ---------- 7. Prompt construction ----------
def build_prompt(context_docs: list, user_query: str) -> str:
    """
    Assemble the final prompt that will be fed to the language model.

    Parameters
    ----------
    context_docs : list of str
        The text chunks that were retrieved for the query.
    user_query : str
        The original user question.

    Returns
    -------
    prompt : str
        A single string that follows the format required by the generation
        step: "Context:\n<docs>\n\nQuestion: <q>\nAnswer:"
    """
    print("Building prompt...")
    context = "\n\n".join(context_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:"
    return prompt
```
```python
# ---------- 8. Generation ----------
def generate_answer(
    prompt: str,
    model_name: str = "gpt2-large",
    max_new_tokens: int = 50,
    dtype: torch.dtype = None,
) -> str:
    """
    Generate a deterministic answer using a causal language model.

    The generation step is *deterministic* because we fix the random seed
    and set `do_sample=False`. This makes the output reproducible – a crucial
    property for teaching labs.

    Parameters
    ----------
    prompt : str
        Prompt produced by :func:`build_prompt`.
    model_name : str, optional
        Name of the Hugging-Face transformer model to use. `"gpt2-large"`
        is a good trade-off between quality and speed.
    max_new_tokens : int, optional
        Maximum number of tokens to generate beyond the prompt.
    dtype : torch.dtype, optional
        Data type for model tensors – `torch.float16` reduces GPU memory
        usage when a GPU is available.

    Returns
    -------
    answer : str
        The generated text after the last "Answer:" marker.
    """
    # Initialise the generation pipeline (will cache the model on disk)
    generator = pipeline(
        "text-generation",
        model=model_name,
        tokenizer=model_name,
        device=0 if torch.cuda.is_available() else -1,
        truncation=True,
        dtype=dtype,
    )

    # Deterministic behaviour
    set_seed(42)

    # Some models (e.g., GPT-2) do not define a pad token.
    # We fall back to the EOS token to avoid warnings.
    if generator.model.config.pad_token_id is None:
        generator.model.config.pad_token_id = generator.model.config.eos_token_id

    # Generate text without sampling
    output = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        num_return_sequences=1,
        do_sample=False,  # deterministic
    )[0]["generated_text"]

    # Remove the prompt part, keep only the answer text
    answer = output.split("Answer:")[-1].strip()
    return answer
```
```python
# ---------- 9. Main workflow ----------
"""
Full RAG pipeline executed when the script runs.

1. Load and chunk documents from the `docs/` folder.
2. Embed the chunks with Sentence-Transformer.
3. Build a FAISS index for efficient similarity search.
4. Encode a sample user query and retrieve the 3 most relevant context snippets.
5. Build a prompt that combines the retrieved context with the question.
6. Generate a deterministic answer using GPT-2-large.
"""

# 1️⃣ Load documents
docs = load_documents("docs")  # ← put your .txt files in `docs/`

# 2️⃣ Create embeddings
embeddings, embed_model = embed_documents(docs)

# 3️⃣ Build FAISS index
faiss_index = build_faiss_index(embeddings)

# 4️⃣ Example query
user_query = "What are the essential kits for hiking/ trekking?"
query_vec = embed_model.encode([user_query], convert_to_numpy=True)

# 5️⃣ Retrieve top-k contexts
top_k_indices, _ = retrieve(faiss_index, query_vec, k=3)
retrieved_docs = [docs[i] for i in top_k_indices]

# 6️⃣ Build prompt
prompt = build_prompt(retrieved_docs, user_query)

# 7️⃣ Generate answer
answer = generate_answer(
    prompt,
    max_new_tokens=70,    # increase if you want longer answers
    dtype=torch.float16,  # keeps GPU memory down
)

# 8️⃣ Output the results
print(f"\nPrompt:\n<start_of_prompt>\n{prompt}\n<end_of_prompt>")
print(f"\nQuestion:\n<start_of_question>\n{user_query}\n<end_of_question>")
print(f"\nAnswer:\n<start_of_answer>\n{answer}\n<end_of_answer>")
```
```
Loading documents...
Loaded: Number of docs: 52
Embedding documents...
Building FAISS index...
Retrieving top-3 documents...
Building prompt...
Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Prompt:
<start_of_prompt>
Context:
Equipment** ### 🥾 **For Hiking** * Sturdy **hiking boots** * **Backpack** (20–40 L) * **Water bottle / Hydration pack** * **Trekking poles** * **Map / Compass / GPS** * **Weather-appropriate clothing** * **Snacks / Energy bars** * **First-aid kit** * **Rain gear** ### 🧗♀️ **For Mountain Climbing** * **Helmet** **

Basic first aid * Weather awareness ### For Mountaineering: * Rope handling & knots * Ice axe use & self-arrest * Crampon walking * Crevasse rescue * Altitude management * Team coordination --- ## 🎒 **6. Essential Gear & Equipment** ### 🥾 **For Hiking** * Sturdy **hiking boots** *

**Expedition Climbing** | Multi-week climbs of massive peaks (e.g., Mount Everest). | | **Indoor Climbing** | Practicing on artificial climbing walls. | --- ## 🧠 **5. Skills Required** ### For Hiking: * Navigation (map, compass, GPS) * Endurance & pacing * Basic first aid * Weather awareness ### For Mountaineering:

Question: What are the essential kits for hiking/ trekking?
Answer:
<end_of_prompt>

Question:
<start_of_question>
What are the essential kits for hiking/ trekking?
<end_of_question>

Answer:
<start_of_answer>
1. Backpack
2. Water bottle / Hydration pack
3. Trekking poles
4. Map / Compass / GPS
5. First-aid kit
6. Rain gear
7. First-aid kit
8. Snacks / Energy bars
9. First-aid
<end_of_answer>
```
✅That’s it — your mini RAG pipeline in action!
🔒 Notes & Tips
- Use SentenceTransformer('all-MiniLM-L6-v2') for quick demos; swap in a larger model for better retrieval quality.
- IndexFlatL2 is exact L2 search — fine for small corpora. Use HNSW/IVF/PQ for millions of vectors (see the sketch below).
- Deterministic output: set_seed(42) + do_sample=False. For creative outputs, enable sampling & temperature.
- On a Colab GPU, use dtype=torch.float16 to reduce memory.
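For example, swapping the flat index for an approximate HNSW index is only a couple of lines (a sketch; the value 32 is an illustrative graph-neighbour count):

```python
import faiss

dim = embeddings.shape[1]
index = faiss.IndexHNSWFlat(dim, 32)  # 32 = neighbours per node in the HNSW graph
index.add(embeddings)                 # approximate search scales to much larger corpora
```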
💬 Final Thought
RAG is the simplest, most practical way to ground language models in real knowledge. This notebook helps you experiment and learn — not just run — the pipeline. Happy building! ⚡
📂 GitHub Repository
Full code, PDF walkthrough, and examples are available here: 👉 Learn RAG Repository