> Build your own Retrieval-Augmented Generation (RAG) system in Google Colab — understand how LLMs retrieve, reason, and respond with real facts.
💡 Why I Built This
A few weeks ago, I realized most explanations of RAG (Retrieval-Augmented Generation) online are either too abstract or too complex.
So I decided to build something anyone could understand — and more importantly, run inside Google Colab.
The result?
A Minimal, Fully Documented RAG Example — where you can drop in your own .txt files and watch an LLM retrieve, reason, and respond with context-aware answers.
🧩 What You’ll Learn
This project walks through every major step of a RAG system:
- Installing the required libraries
- Loading and chunking your text documents
- Converting those chunks into dense embeddings
- Building a FAISS similarity index
- Retrieving the most relevant chunks for a user query
- Building a context-rich prompt
- Generating a deterministic, reproducible answer using GPT-2
⚙️ How It Works
Here’s what happens under the hood:
1️⃣ Document Chunking
Each .txt file in your docs/ folder is split into 50-word chunks with a 10-word overlap.
This ensures smooth semantic continuity — no idea gets lost between boundaries.
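Here is a minimal sketch of that chunking step; the fully documented version appears in the code walkthrough below.

```python
# Minimal sketch of word-based chunking with overlap (full version in the walkthrough below)
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # step forward, keeping `overlap` words in common
    return chunks
```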
2️⃣ Embeddings with SentenceTransformers
Each chunk is transformed into a vector — a numerical summary of its meaning. These embeddings help the system search by meaning, not just keywords.
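A sketch of the embedding step, assuming the same all-MiniLM-L6-v2 model used later in the walkthrough (`chunks` is the list produced by the chunking sketch above):

```python
from sentence_transformers import SentenceTransformer

# Encode every chunk into a dense vector
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, convert_to_numpy=True)  # shape: (num_chunks, 384)
```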
3️⃣ FAISS Indexing
The embeddings are stored in a FAISS (Facebook AI Similarity Search) index, which makes it possible to instantly find the most relevant chunks for any question.
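Roughly, the indexing step looks like this (`embeddings` is the array from the previous sketch):

```python
import faiss

dim = embeddings.shape[1]       # embedding dimension (384 for all-MiniLM-L6-v2)
index = faiss.IndexFlatL2(dim)  # exact L2 (Euclidean) search
index.add(embeddings)           # FAISS expects float32 vectors, which encode() returns
```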
4️⃣ Retrieval
When you ask a question, the system encodes it into the same embedding space, then retrieves the top-3 most similar chunks.
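Sketched out, retrieval reuses the same model and index from the snippets above:

```python
user_query = "What are the essential kits for hiking or trekking?"
query_vec = model.encode([user_query], convert_to_numpy=True)  # same embedding space
distances, indices = index.search(query_vec, 3)                # top-3 nearest chunks
retrieved = [chunks[i] for i in indices[0]]
```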
5️⃣ Prompt Construction
The retrieved chunks are stitched into a structured prompt like this:
```
Context:
<chunk1>

<chunk2>

<chunk3>

Question: <your question>
Answer:
```
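In Python that is a single f-string (a sketch using the variables from the retrieval snippet above):

```python
context = "\n\n".join(retrieved)
prompt = f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:"
```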
6️⃣ Answer Generation
Finally, a GPT-2 language model reads the context and generates an answer — deterministically (same input = same output every time).
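A sketch of deterministic generation; it uses the small gpt2 checkpoint here for speed, while the walkthrough below uses gpt2-large:

```python
from transformers import pipeline, set_seed

set_seed(42)                                           # fixed seed for reproducibility
generator = pipeline("text-generation", model="gpt2")  # small model for a quick demo
output = generator(prompt, max_new_tokens=50, do_sample=False)[0]["generated_text"]
answer = output.split("Answer:")[-1].strip()           # keep only the text after "Answer:"
```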
🧪 Example Output
Question:
What are the essential kits for hiking or trekking?
Generated Answer:
1. Backpack
2. Water bottle / Hydration pack
3. Trekking poles
4. Map / Compass / GPS
5. First-aid kit
6. Rain gear
7. Snacks / Energy bars
Everything you see is retrieved from your own documents, not from the model’s memory — that’s the beauty of RAG.
🧰 How to Use
- Open the Google Colab Notebook.
- Create a folder named docs/ in the Colab root.
- Place any number of .txt files inside it.
- Run all cells (or execute python rag_colab.py).
📜 Full Code Walkthrough
```python
# ---------- 1. Install required libraries ----------
# (Run this in Colab or your environment; in a script omit the '!' when using pip)
!pip install -q sentence-transformers faiss-cpu transformers
```
```python
# ---------- 2. Imports ----------
import os
import faiss
import numpy as np
import torch
from pathlib import Path
from sentence_transformers import SentenceTransformer
from transformers import pipeline, set_seed
```
```python
# ---------- 3. Document loader ----------
def load_documents(doc_dir: str) -> list:
    """
    Load and preprocess all plain-text documents in a directory.

    Parameters
    ----------
    doc_dir : str
        Path to a folder that contains one or more *.txt files.

    Returns
    -------
    docs : list of str
        A flat list where each element is a chunk of 50 words taken from
        the original files, with a 10-word overlap between consecutive chunks.

    Notes
    -----
    • Chunking at a small granularity (50 words) allows the retriever to
      identify highly relevant snippets rather than whole paragraphs.
    • The 10-word overlap ensures that the boundary words of a chunk are
      not lost when we split a document – this improves semantic continuity
      for the embedding model.
    """
    print("Loading documents...")

    # Helper that splits a single string into overlapping chunks
    def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list:
        """
        Split a block of text into overlapping word-based chunks.

        Parameters
        ----------
        text : str
            Raw document text.
        chunk_size : int, optional
            Number of words per chunk (default 50).
        overlap : int, optional
            Number of words that consecutive chunks share (default 10).

        Returns
        -------
        list of str
            List of chunk strings.
        """
        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = start + chunk_size
            chunk = " ".join(words[start:end])
            chunks.append(chunk)
            start += chunk_size - overlap
        return chunks

    docs = []
    for file_path in Path(doc_dir).glob("*.txt"):
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
        # Extend the master list with the new chunks
        docs.extend(chunk_text(content))

    print(f"Loaded: Number of docs: {len(docs)}")
    return docs
```
```python
# ---------- 4. Embeddings ----------
def embed_documents(docs: list, model_name: str = "all-MiniLM-L6-v2") -> tuple:
    """
    Convert text chunks into dense vector representations.

    Parameters
    ----------
    docs : list of str
        The list of document chunks to embed.
    model_name : str, optional
        The sentence-transformer model to use. The default "all-MiniLM-L6-v2"
        is lightweight and works well for quick demos.

    Returns
    -------
    embeddings : np.ndarray
        2-D array of shape (num_chunks, embedding_dim).
    model : SentenceTransformer
        The loaded embedding model – kept for re-encoding queries later.
    """
    print("Embedding documents...")
    model = SentenceTransformer(model_name)
    embeddings = model.encode(docs, convert_to_numpy=True)
    return embeddings, model
```
```python
# ---------- 5. FAISS index ----------
def build_faiss_index(embeddings: np.ndarray) -> faiss.IndexFlatL2:
    """
    Build a FAISS index for fast nearest-neighbour search.

    Parameters
    ----------
    embeddings : np.ndarray
        2-D array of document embeddings (float32, as returned by
        SentenceTransformer.encode).

    Returns
    -------
    faiss.IndexFlatL2
        A FAISS index that can answer distance-based queries.
    """
    print("Building FAISS index...")
    if embeddings.ndim != 2 or embeddings.shape[0] == 0:
        raise ValueError(
            f"Embeddings must be a non-empty 2-D array. Received shape: {embeddings.shape}"
        )
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings)
    return index
```
```python
# ---------- 6. Retrieval ----------
def retrieve(index: faiss.IndexFlatL2, query_embedding: np.ndarray, k: int = 3) -> tuple:
    """
    Find the top-k most similar document chunks to a query vector.

    Parameters
    ----------
    index : faiss.IndexFlatL2
        The pre-built FAISS index.
    query_embedding : np.ndarray
        Embedding of the user question, shape (1, dim).
    k : int, optional
        How many neighbours to return (default 3).

    Returns
    -------
    indices : np.ndarray of int
        1-D array of the top-k document indices.
    distances : np.ndarray of float
        Corresponding L2 distances – useful for debugging.
    """
    print(f"Retrieving top-{k} documents...")
    distances, indices = index.search(query_embedding, k)
    return indices[0], distances[0]
```
```python
# ---------- 7. Prompt construction ----------
def build_prompt(context_docs: list, user_query: str) -> str:
    """
    Assemble the final prompt that will be fed to the language model.

    Parameters
    ----------
    context_docs : list of str
        The text chunks that were retrieved for the query.
    user_query : str
        The original user question.

    Returns
    -------
    prompt : str
        A single string that follows the format required by the generation
        step: "Context:\n<docs>\n\nQuestion: <q>\nAnswer:"
    """
    print("Building prompt...")
    context = "\n\n".join(context_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:"
    return prompt
```
```python
# ---------- 8. Generation ----------
def generate_answer(
    prompt: str,
    model_name: str = "gpt2-large",
    max_new_tokens: int = 50,
    dtype: torch.dtype = None,
) -> str:
    """
    Generate a deterministic answer using a causal language model.

    The generation step is *deterministic* because we fix the random seed
    and set `do_sample=False`. This makes the output reproducible – a crucial
    property for teaching labs.

    Parameters
    ----------
    prompt : str
        Prompt produced by :func:`build_prompt`.
    model_name : str, optional
        Name of the Hugging-Face transformer model to use. `"gpt2-large"`
        is a good trade-off between quality and speed.
    max_new_tokens : int, optional
        Maximum number of tokens to generate beyond the prompt.
    dtype : torch.dtype, optional
        Data type for model tensors – `torch.float16` reduces GPU memory
        usage when a GPU is available.

    Returns
    -------
    answer : str
        The generated text after the last "Answer:" marker.
    """
    # Initialise the generation pipeline (will cache the model on disk)
    generator = pipeline(
        "text-generation",
        model=model_name,
        tokenizer=model_name,
        device=0 if torch.cuda.is_available() else -1,
        truncation=True,
        dtype=dtype,
    )

    # Deterministic behaviour
    set_seed(42)

    # Some models (e.g., GPT-2) do not define a pad token.
    # We fall back to the EOS token to avoid warnings.
    if generator.model.config.pad_token_id is None:
        generator.model.config.pad_token_id = generator.model.config.eos_token_id

    # Generate text without sampling
    output = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        num_return_sequences=1,
        do_sample=False,  # deterministic
    )[0]["generated_text"]

    # Remove the prompt part, keep only the answer text
    answer = output.split("Answer:")[-1].strip()
    return answer
```
```python
# ---------- 9. Main workflow ----------
"""
Full RAG pipeline executed when the script runs.

1. Load and chunk documents from the `docs/` folder.
2. Embed the chunks with Sentence-Transformer.
3. Build a FAISS index for efficient similarity search.
4. Encode a sample user query and retrieve the 3 most relevant context snippets.
5. Build a prompt that combines the retrieved context with the question.
6. Generate a deterministic answer using GPT-2-large.
"""

# 1️⃣ Load documents
docs = load_documents("docs")  # ← put your .txt files in `docs/`

# 2️⃣ Create embeddings
embeddings, embed_model = embed_documents(docs)

# 3️⃣ Build FAISS index
faiss_index = build_faiss_index(embeddings)

# 4️⃣ Example query
user_query = "What are the essential kits for hiking/ trekking?"
query_vec = embed_model.encode([user_query], convert_to_numpy=True)

# 5️⃣ Retrieve top-k contexts
top_k_indices, _ = retrieve(faiss_index, query_vec, k=3)
retrieved_docs = [docs[i] for i in top_k_indices]

# 6️⃣ Build prompt
prompt = build_prompt(retrieved_docs, user_query)

# 7️⃣ Generate answer
answer = generate_answer(
    prompt,
    max_new_tokens=70,    # increase if you want longer answers
    dtype=torch.float16,  # keeps GPU memory down
)

# 8️⃣ Output the results
print(f"\nPrompt:\n<start_of_prompt>\n{prompt}\n<end_of_prompt>")
print(f"\nQuestion:\n<start_of_question>\n{user_query}\n<end_of_question>")
print(f"\nAnswer:\n<start_of_answer>\n{answer}\n<end_of_answer>")
```
```
Loading documents...
Loaded: Number of docs: 52
Embedding documents...
Building FAISS index...
Retrieving top-3 documents...
Building prompt...
Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Prompt:
<start_of_prompt>
Context:
Equipment** ### 🥾 **For Hiking** * Sturdy **hiking boots** * **Backpack** (20–40 L) * **Water bottle / Hydration pack** * **Trekking poles** * **Map / Compass / GPS** * **Weather-appropriate clothing** * **Snacks / Energy bars** * **First-aid kit** * **Rain gear** ### 🧗♀️ **For Mountain Climbing** * **Helmet** **

Basic first aid * Weather awareness ### For Mountaineering: * Rope handling & knots * Ice axe use & self-arrest * Crampon walking * Crevasse rescue * Altitude management * Team coordination --- ## 🎒 **6. Essential Gear & Equipment** ### 🥾 **For Hiking** * Sturdy **hiking boots** *

**Expedition Climbing** | Multi-week climbs of massive peaks (e.g., Mount Everest). | | **Indoor Climbing** | Practicing on artificial climbing walls. | --- ## 🧠 **5. Skills Required** ### For Hiking: * Navigation (map, compass, GPS) * Endurance & pacing * Basic first aid * Weather awareness ### For Mountaineering:

Question: What are the essential kits for hiking/ trekking?
Answer:
<end_of_prompt>

Question:
<start_of_question>
What are the essential kits for hiking/ trekking?
<end_of_question>

Answer:
<start_of_answer>
1. Backpack
2. Water bottle / Hydration pack
3. Trekking poles
4. Map / Compass / GPS
5. First-aid kit
6. Rain gear
7. First-aid kit
8. Snacks / Energy bars
9. First-aid
<end_of_answer>
```
✅That’s it — your mini RAG pipeline in action!
🔒 Notes & Tips
- Use SentenceTransformer('all-MiniLM-L6-v2') for quick demos; swap in a larger model for better retrieval quality.
- IndexFlatL2 is exact L2 search — fine for small corpora. Use HNSW/IVF/PQ for millions of vectors (see the sketch below).
- Deterministic output: set_seed(42) + do_sample=False. For creative outputs, enable sampling & temperature.
- On a Colab GPU, use dtype=torch.float16 to reduce memory.
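For example, swapping the flat index for an approximate HNSW index is only a couple of lines (a sketch; the value 32 is an illustrative graph-neighbour count):

```python
import faiss

dim = embeddings.shape[1]
index = faiss.IndexHNSWFlat(dim, 32)  # 32 = neighbours per node in the HNSW graph
index.add(embeddings)                 # approximate search scales to much larger corpora
```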
💬 Final Thought
RAG is the simplest, most practical way to ground language models in real knowledge. This notebook helps you experiment and learn — not just run — the pipeline. Happy building! ⚡
📂 GitHub Repository
Full code, PDF walkthrough, and examples are available here: 👉 Learn RAG Repository