I built a fully local Retrieval-Augmented Generation (RAG) system that lets a Llama 3 model answer questions about my own PDFs and Markdown files, no cloud APIs, no external servers, all running on my machine.
It’s powered by:
- Streamlit for the frontend
- FastAPI for the backend
- ChromaDB for vector storage
- Ollama to run Llama 3 locally
The system ingests documents, chunks and embeds them, retrieves relevant parts for a query, and feeds them into Llama 3 to generate grounded answers.
🧠 Introduction – Why Go Local?
It started with a simple frustration: I had a bunch of private PDFs and notes I wanted to query like ChatGPT, but without sending anything to the cloud. LLMs are powerful, but they don’t know your documents. RAG changes that: it gives the model a “working memory” by feeding it relevant chunks of your data each time you ask a question.
So I decided to build my own mini-ChatGPT for personal docs: everything self-hosted, modular, and transparent. The goals were simple:
- Upload docs → Ask questions → Get answers with citations.
- Stay offline and private.
- Learn the moving parts of a RAG pipeline by hand, not just through frameworks like LangChain.
Project Repository: github.com/trickste/raga
🏗 Architecture Overview
The setup runs in two clear phases: ingestion and querying.
The Streamlit UI lets users upload documents and ask questions interactively. The FastAPI backend handles everything else: text extraction, embedding, search, and invoking the LLM.
This separation makes it easy to debug, extend, and even swap components later (e.g., replace Chroma with Pinecone, or Llama 3 with Mistral).
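To keep those pieces swappable, it helps to collect the tunable choices in one place. Here's a minimal sketch of a settings module under that assumption; the names (CHROMA_PATH, OLLAMA_URL, and so on) are my own placeholders, not taken from the repo:
# config.py: a single spot to swap models, storage paths, and endpoints
import os

EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
CHROMA_PATH = os.getenv("CHROMA_PATH", "./chroma_db")
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "langrag_docs")
LLM_MODEL = os.getenv("LLM_MODEL", "llama3:latest")  # e.g. swap in "mistral"
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434/api/chat")
API_URL = os.getenv("API_URL", "http://localhost:8000")  # FastAPI backend, used by Streamlit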
Code Walkthrough
1. Ingestion – Reading and Chunking Documents
The ingestion pipeline starts by reading PDFs or Markdown files and turning them into clean text chunks.
def process_file(file_bytes: bytes, filename: str):
    if filename.endswith(".pdf"):
        text = extract_text_from_pdf(file_bytes)  # via PyMuPDF
    elif filename.endswith((".md", ".markdown")):
        text = extract_text_from_markdown(file_bytes)
    else:
        text = file_bytes.decode("utf-8", errors="ignore")
    chunks = chunk_text(text)  # split into overlapping ~300–500-word chunks
    num_added = add_texts(chunks, source=filename)
    return num_added
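The extraction helpers aren't shown above. With PyMuPDF, the PDF path is only a few lines; a rough sketch (the actual helper in the repo may differ):
import fitz  # PyMuPDF

def extract_text_from_pdf(file_bytes: bytes) -> str:
    # Open the PDF straight from the uploaded bytes, no temp file needed.
    doc = fitz.open(stream=file_bytes, filetype="pdf")
    pages = [page.get_text() for page in doc]
    doc.close()
    return "\n".join(pages)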
Chunking was trickier than expected. Too long, and the model forgets details. Too short, and it loses context. After experimenting, I settled around 300–500 words per chunk with slight overlaps to maintain continuity.
for i in range(0, len(words), CHUNK_SIZE - OVERLAP):
    chunk = " ".join(words[i : i + CHUNK_SIZE])
    chunks.append(chunk)
That overlap (~10–15%) turned out to be a small tweak that made a big difference in retrieval quality.
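Putting that together, the whole chunker fits in a few lines. Here's a sketch of how chunk_text could look with the numbers above; the exact constants are my assumption, not pulled from the repo:
CHUNK_SIZE = 400  # ~300–500 words per chunk worked best for me
OVERLAP = 50      # ~10–15% overlap to keep continuity between chunks

def chunk_text(text: str) -> list[str]:
    words = text.split()
    chunks = []
    for i in range(0, len(words), CHUNK_SIZE - OVERLAP):
        chunk = " ".join(words[i : i + CHUNK_SIZE])
        if chunk:
            chunks.append(chunk)
    return chunks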
2. Vector Storage – Semantic Search with ChromaDB
Each chunk is embedded using SentenceTransformers (all-MiniLM-L6-v2) and stored in a local ChromaDB collection.
from uuid import uuid4

import chromadb
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("langrag_docs")

def add_texts(texts, source):
    embeddings = embedding_model.encode(texts, normalize_embeddings=True)
    collection.add(
        documents=texts,
        metadatas=[{"source": source}] * len(texts),
        ids=[f"{source}_{uuid4().hex}" for _ in texts],
        embeddings=embeddings.tolist(),
    )
    return len(texts)  # process_file reports this as the number of chunks added
Each chunk is now represented as a vector in 384-dimensional space (the embedding size of all-MiniLM-L6-v2). When a query comes in, we embed the question and retrieve the top-k most similar chunks.
results = collection.query(query_texts=[user_query], n_results=3)
retrieved_chunks = results["documents"][0]
This is where the “retrieval” in Retrieval-Augmented Generation happens.
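Chroma can also hand back the distances and metadata alongside the documents, which makes it easy to sanity-check what the retriever actually picked. A minimal sketch using the include parameter:
results = collection.query(
    query_texts=[user_query],
    n_results=3,
    include=["documents", "metadatas", "distances"],
)
for doc, meta, dist in zip(
    results["documents"][0], results["metadatas"][0], results["distances"][0]
):
    # Lower distance means a closer match; metadata tells us which file it came from.
    print(f"{meta['source']} (distance={dist:.3f}): {doc[:80]}...")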
3. Querying – Feeding the Context to the LLM
Once we have the most relevant chunks, we build a grounded prompt and send it to Llama 3 via Ollama’s local API.
def generate_answer(query, context):
    messages = [
        {"role": "system", "content": (
            "You are a document QA assistant. "
            "Answer strictly using only the CONTEXT below."
        )},
        {"role": "user", "content": f"CONTEXT:\n{context}\n\nQUESTION: {query}"}
    ]
    payload = {"model": "llama3:latest", "messages": messages, "stream": True}
    with requests.post("http://localhost:11434/api/chat", json=payload, stream=True) as r:
        ...
This prompt discipline was critical. If you don’t tell the model to stick to the context, it will happily hallucinate. By enforcing “If it’s not in the context, say you don’t know,” we drastically improved trustworthiness.
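Filling in the grounding instruction and the elided streaming loop, the full function might look roughly like this. Ollama streams one JSON object per line, with a final "done" marker; the repo's exact handling may differ:
import json
import requests

def generate_answer(query: str, context: str) -> str:
    messages = [
        {"role": "system", "content": (
            "You are a document QA assistant. "
            "Answer strictly using only the CONTEXT below. "
            "If the answer is not in the context, say you don't know."
        )},
        {"role": "user", "content": f"CONTEXT:\n{context}\n\nQUESTION: {query}"},
    ]
    payload = {"model": "llama3:latest", "messages": messages, "stream": True}
    answer_parts = []
    with requests.post("http://localhost:11434/api/chat", json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            data = json.loads(line)
            # Each line carries a partial assistant message; "done" marks the end.
            answer_parts.append(data.get("message", {}).get("content", ""))
            if data.get("done"):
                break
    return "".join(answer_parts)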
4. The FastAPI Backend – Tying It All Together
FastAPI ties it all together: ingestion, querying, and LLM invocation.
@app.post("/add_docs")
async def add_docs(file: UploadFile = File(...)):
file_bytes = await file.read()
num_chunks = process_file(file_bytes, file.filename)
return {"message": f"Added {num_chunks} chunks from {file.filename}"}
@app.post("/query")
async def query_docs(req: QueryRequest):
query = req.query.strip()
results = collection.query(query_texts=[query], n_results=3)
context = "\n\n".join(results["documents"][0])
answer = generate_answer(query, context)
return {"answer": answer}
This clean separation of endpoints made debugging painless. Every step logs info: embeddings added, chunks retrieved, distances scored, and LLM latency.
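A plain Python logging setup is enough for that. Here's a sketch of the kind of instrumentation meant above; the logger name, message format, and helper are illustrative, not lifted from the repo:
import logging
import time

logger = logging.getLogger("raga")
logging.basicConfig(level=logging.INFO)

def retrieve_with_logging(query: str, n_results: int = 3):
    # Time the vector search and log what came back, so a bad answer can be
    # traced to bad retrieval instead of guessed at.
    start = time.perf_counter()
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=["documents", "distances"],
    )
    logger.info(
        "query=%r retrieved=%d distances=%s in %.3fs",
        query,
        len(results["documents"][0]),
        [round(d, 3) for d in results["distances"][0]],
        time.perf_counter() - start,
    )
    return results["documents"][0]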
5. Streamlit Frontend – A Minimal, Interactive UI
The frontend makes it fun: drag and drop a file, type a question, and watch it respond.
uploaded_file = st.file_uploader("Upload PDF or Markdown", type=["pdf", "md"])
query = st.text_input("Ask a question about your documents:")
if st.button("Get Answer"):
    res = requests.post(f"{API_URL}/query", json={"query": query})
    st.write("**Answer:**", res.json()["answer"])
It’s only ~30 lines of Streamlit, but it transforms the project into a usable app.
Lessons Learned
1. Chunking Is an Art
Small, overlapping chunks worked better than large ones. Breaking text at semantic boundaries (paragraphs or sections) gave cleaner retrievals.
2. Quality of Retrieval Beats Quantity
Feeding too many chunks diluted answers. 3 relevant chunks > 10 vague ones.
3. Prompt Grounding Changes Everything
Explicitly instructing the LLM not to make things up was the single most effective fix for hallucination.
4. Local Models Are Ready for Prime Time
Running Llama 3 via Ollama felt just as smooth as using a hosted API, but faster, cheaper, and private. And yes, no API keys or rate limits. WOOHOO!
5. Observe Everything
Logging every stage (chunk sizes, retrieval scores, final prompts) made debugging feel scientific rather than guesswork.
Conclusion – My Own “ChatGPT for PDFs”
At the end of this build, I had a working Local RAG Assistant, a tiny offline system that could read, index, and reason about my documents. It runs entirely on my laptop, keeps my data private, and helped me deeply understand how modern LLM pipelines actually work under the hood.
There’s plenty of room to grow:
- Add source citations in answers.
- Support more file types (Word, HTML, etc.).
- Experiment with different embedding models.
- Add caching or user authentication for a real-world app.
But most importantly, it taught me how retrieval, embeddings, and prompt engineering combine to make language models truly useful.
If you’ve been thinking about building something similar, just start. It’s incredibly rewarding to see your own files talk back intelligently.
Happy building and may your vectors always find their nearest neighbors.