Building a Private RAG Assistant for Company Knowledge
There’s a quiet thrill in asking an AI a question and getting a sharp, context-rich answer.
It’s even better when the AI knows your own company’s history — the old projects, the niche case studies, the little industry details that only live in your internal docs.
That was the spark behind this small proof of concept (PoC): a local AI that can answer questions about our own documents without sending a single byte to an external server.
Lessons First: Making Sense of the Terminology
Before diving into what we built, it’s worth untangling some of the terminology that can overwhelm anyone stepping into the AI space:
- Fine-tuning: retraining a base model with your data. Powerful, but slow and expensive to update every time documents change.
 - RAG (Retrieval-Augmented Generation): the model stays the same, but you provide it with the most relevant pieces of your documents at query time. Faster, lighter, and ideal for dynamic knowledge bases.
 - Embeddings: a way to convert text into numbers (vectors) so that similar meanings are close together in “vector space.”
 - Vector store: a specialized database that stores those embeddings and can quickly return the most relevant chunks. We used ChromaDB.
 - Model swapping: in Ollama, you can quickly change which AI model answers your questions, balancing speed vs. quality depending on your needs.
 
With these concepts in mind, the decisions we made will make more sense.
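To make the embeddings and vector store ideas concrete, here is a tiny standalone sketch using the same libraries the PoC relies on (sentence-transformers and ChromaDB). The sample sentences and collection name are invented for illustration only.

```python
# Minimal illustration of embeddings + a vector store.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model

docs = [
    "We built a patient portal for a healthcare client in 2021.",
    "Case study: logistics dashboard for a shipping company.",
]

client = chromadb.Client()  # in-memory store, fine for a quick demo
collection = client.create_collection("demo")
collection.add(
    ids=["doc-0", "doc-1"],
    documents=docs,
    embeddings=model.encode(docs).tolist(),
)

# Similar meanings end up close together in vector space, so this query
# should surface the healthcare document first.
results = collection.query(
    query_embeddings=model.encode(["hospital projects"]).tolist(),
    n_results=1,
)
print(results["documents"])
```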
Why We Built It
Our motivation was straightforward: experiment, explore, and see how AI could help us in our workflow.
A practical case came up when preparing for meetings with potential leads. If we could surface old projects in the same industry quickly, we could bring success stories to the table at the right moment.
The problem: our documentation is extensive, and finding the right example in time isn’t always easy.
Hence, the idea of “asking” our documents directly.
Why Local and Why Ollama
We chose to keep the whole setup local. Not because cloud models aren’t good, but because sending internal documentation to an external API wasn’t something we wanted to do — even for a PoC. Local models gave us control, privacy, and independence.
For the runtime, we used Ollama because:
- It’s open source, which means we can run and host it ourselves.
 - It’s widely adopted and actively maintained.
- It makes it trivial to download and switch between different models (phi3, tinyllama, qwen3, etc.).
For a PoC, that combination of simplicity and flexibility made Ollama the right choice.
The Data Flow: Simple by Design
One of the nicest things about this PoC is how little ceremony it requires:
- Export a .zip of your docs (in our case, Markdown files from Outline).
- Drop it into the program.
- Ask questions.
 
No manual tagging, no special formatting — just “export, drop, ask.” The system handles recursive folders, chunking documents into smaller pieces, embedding them, and storing them in ChromaDB. It even keeps a hash of the last indexed file so unchanged docs aren’t reprocessed.
That simplicity was intentional: it makes experimentation comfortable.
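As a concrete example of the "skip unchanged docs" part: one way to do it is to hash the exported archive and compare it with the hash from the previous run. This is a sketch of the idea, not the PoC's exact code; the state file name is an assumption.

```python
# Sketch: skip re-indexing when the exported .zip has not changed.
# The state file name ("last_index.sha256") is assumed for illustration.
import hashlib
from pathlib import Path

def zip_hash(zip_path: str) -> str:
    """Return the SHA-256 hex digest of the exported archive."""
    return hashlib.sha256(Path(zip_path).read_bytes()).hexdigest()

def needs_reindex(zip_path: str, state_file: str = "last_index.sha256") -> bool:
    """Compare the archive's hash with the one stored on the last run."""
    current = zip_hash(zip_path)
    state = Path(state_file)
    if state.exists() and state.read_text().strip() == current:
        return False  # unchanged export, nothing to do
    state.write_text(current)  # remember this export for next time
    return True

if needs_reindex("my_docs.zip"):
    print("Export changed, re-indexing...")
else:
    print("Export unchanged, skipping.")
```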
Why RAG
RAG and fine-tuning aren’t mutually exclusive. They tackle different layers of the problem: fine-tuning reshapes how the model thinks — adapting its reasoning, tone, and domain understanding — while RAG extends what the model knows by dynamically grounding its answers in the most current and relevant data available at query time.
In our case, the document export is about 25 MB and changes often. Fine-tuning would force retraining every time new data appeared — a slow and unnecessary loop for an evolving dataset. RAG, instead, lets us retrieve the latest content at query time, keeping responses accurate without touching the model’s weights.
For this proof-of-concept, that balance made sense: real-time knowledge, minimal overhead, and the freedom to iterate fast — without pretending the model needs to memorize what it can simply look up.
Playing With Models
Ollama made it easy to try different models:
- phi3:latest — best accuracy, but slower on our test hardware (a 2019 Intel i9 MacBook Pro).
- phi3:mini — faster with the --fast flag, a good balance of quality and performance.
- tinyllama:latest — the quickest (--ultra-fast), but weaker answers.
We’re also considering qwen3:4b and starcoder2:3b for further balance tests.
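Swapping models is really just a matter of changing the model name in the request sent to Ollama's local HTTP API. A minimal sketch, assuming Ollama is running on its default port (11434):

```python
# Sketch: ask the same question to different local models via Ollama's REST API.
import requests

def ask(model: str, prompt: str) -> str:
    """Send a single non-streaming generation request to the local Ollama server."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

for model in ["phi3:latest", "phi3:mini", "tinyllama:latest"]:
    print(model, "->", ask(model, "Summarize what RAG is in one sentence.")[:120])
```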
Where We’re Headed
Right now, the system runs in the terminal. The vision is to evolve it into a Slack bot so anyone in the company can query the documentation naturally.
That means tackling bigger questions:
- How do we keep the index fresh daily?
 - What hardware setup do we need for smooth performance?
 - How do we enforce permissions so sensitive docs don’t leak?
 
But those are production challenges. For now, we’ve proven the core idea: a local AI can tap into company knowledge and provide answers without relying on the cloud.
How to Recreate the Setup
Here’s a short guide for replicating this PoC.
1. Install Ollama
Download from: https://ollama.com/download
Start the server:
ollama serve
Pull the models you want to test:
ollama pull phi3:latest
ollama pull phi3:mini
ollama pull tinyllama:latest
2. Prepare Your Documents
This PoC expects:
- A .zip with your .md files.
- Nested folders are supported.
 
Example:
my_docs.zip
├── project1.md
├── industry_case.md
├── subfolder/
│   └── nested_doc.md
└── ...
3. Install Dependencies
pip install chromadb sentence-transformers requests tqdm
4. Index Your Documents
python main.py --zip /path/to/my_docs.zip --model phi3:latest
What happens:
- Extracts the ZIP
- Splits into 512-character chunks
- Embeds with all-MiniLM-L6-v2
- Stores vectors in vector_store/
- Skips reprocessing if unchanged
 
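Under the hood, that indexing pipeline boils down to a few lines. Here is a simplified sketch (the chunk size, store path, and nested-folder handling match the description above, but the collection name and exact code in main.py are assumptions):

```python
# Sketch: extract, chunk, embed, and store — a simplified indexing step.
import zipfile
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="vector_store")  # persisted on disk
collection = client.get_or_create_collection("docs")     # collection name assumed

def chunk(text: str, size: int = 512) -> list[str]:
    """Naive fixed-size chunking into 512-character pieces."""
    return [text[i : i + size] for i in range(0, len(text), size)]

with zipfile.ZipFile("my_docs.zip") as zf:
    zf.extractall("extracted_docs")

for path in Path("extracted_docs").rglob("*.md"):   # walks nested folders
    pieces = chunk(path.read_text(encoding="utf-8", errors="ignore"))
    if not pieces:
        continue
    collection.add(
        ids=[f"{path}:{i}" for i in range(len(pieces))],
        documents=pieces,
        embeddings=embedder.encode(pieces).tolist(),
    )
```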
5. Ask Questions
Question (or 'quit'): What projects have we done in the healthcare industry?
The system:
- Retrieves the top 5 relevant chunks
 - Passes them to the model
 - Returns the answer (or admits it can’t if the context doesn’t match)
 
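The retrieve-and-answer loop is equally small. A sketch of how the top 5 chunks can be stitched into a prompt for Ollama (the prompt wording and collection name are assumptions, not the PoC's exact template):

```python
# Sketch: retrieve the most relevant chunks and ground the model's answer in them.
import requests
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="vector_store").get_or_create_collection("docs")

def answer(question: str, model: str = "phi3:latest") -> str:
    # 1. Retrieve the top 5 relevant chunks from the vector store.
    hits = collection.query(
        query_embeddings=embedder.encode([question]).tolist(),
        n_results=5,
    )
    context = "\n\n".join(hits["documents"][0])

    # 2. Pass them to the model, telling it to admit when the context doesn't match.
    prompt = (
        "Answer using only the context below. If the context is not relevant, "
        "say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    return resp.json()["response"]

print(answer("What projects have we done in the healthcare industry?"))
```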
6. Iterate on Models
python main.py --zip my_docs.zip --fast
python main.py --zip my_docs.zip --ultra-fast
From here, turning it into a Slack bot is the natural next step: wrap the Q&A logic into a Slack app, run it on a server, and let the team start asking away.
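As a teaser for that next step, here is roughly what such a Slack wrapper could look like with the slack_bolt library. This is not part of the current PoC: the tokens, the app_mention event choice, and the answer() placeholder are all assumptions.

```python
# Sketch: a Socket Mode Slack bot that forwards mentions to the RAG pipeline.
# Assumes SLACK_BOT_TOKEN / SLACK_APP_TOKEN environment variables are set.
import os

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

def answer(question: str) -> str:
    # Placeholder: in the real bot this would call the RAG pipeline
    # (retrieve chunks from ChromaDB, then ask the local Ollama model).
    return f"(RAG answer for: {question})"

app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.event("app_mention")
def handle_mention(event, say):
    """Reply in-channel whenever the bot is mentioned with a question."""
    say(answer(event["text"]))

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```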