Building a Local-First RAG Engine for AI Coding Assistants

AI coding assistants have a context problem.

They can generate code, explain algorithms, refactor functions. But ask Claude or Cursor "where is authentication handled in this codebase?" and you’ll get a guess at best.

The assistant doesn’t actually know your code. It sees one file at a time. No persistent memory. No understanding of how components connect.

This is the problem RAG (Retrieval-Augmented Generation) is meant to solve. The AI needs relevant context to give useful answers. Something has to find that context first.

The Current Options

Cloud indexing services upload your codebase to external servers. They build searchable indexes, handle embeddings, serve results via API. Fast and convenient — until you remember that’s proprietary code sitting on infrastructure you don’t c…

IDE-specific solutions like GitHub Copilot work well but lock you into their ecosystem. Switch tools, lose your indexed context.

Self-hosted RAG pipelines usually mean Python, a dozen dependencies, vector database setup, and configuration that breaks between machines. Great for experimentation, painful for daily use.

None of these felt right for how I actually work.

What I’m Building

AmanMCP is a local-first search engine for codebases.

Runs entirely on your machine
Single binary, zero dependencies
Works with any MCP-compatible assistant (Claude Code, Cursor, others)
Your code never leaves your laptop

MCP is the Model Context Protocol — an open standard for connecting AI assistants to external tools and data sources. AmanMCP implements this protocol, so any compatible client can use it without custom integration.
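To make that concrete, here is a rough sketch of what a tool invocation looks like on the wire. MCP messages are JSON-RPC 2.0, and a client calls a tool with the "tools/call" method. The tool name "search" and its "query" argument are hypothetical placeholders, not AmanMCP's final tool definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolCall mirrors the JSON-RPC 2.0 envelope that MCP uses for tool invocations.
type toolCall struct {
	JSONRPC string         `json:"jsonrpc"`
	ID      int            `json:"id"`
	Method  string         `json:"method"`
	Params  toolCallParams `json:"params"`
}

type toolCallParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// BuildToolCall encodes a "tools/call" request for the named tool.
func BuildToolCall(id int, tool string, args map[string]any) ([]byte, error) {
	return json.Marshal(toolCall{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params:  toolCallParams{Name: tool, Arguments: args},
	})
}

func main() {
	// "search" and "query" are hypothetical names used only for illustration.
	msg, err := BuildToolCall(1, "search", map[string]any{"query": "where is authentication handled?"})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(string(msg))
}
```

Any MCP client sends the same envelope regardless of which assistant it belongs to, which is why no per-assistant integration is needed.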

How It Actually Works

Hybrid Search with Query Classification

Most code search tools use either keyword matching (grep-style) or vector similarity (semantic search). Both have tradeoffs.

Keyword search excels at exact matches. Looking for ERR_CONNECTION_REFUSED? Keyword search finds it instantly. But ask "how does error handling work?" and keyword search struggles.

Vector search understands meaning. It knows "authentication" and "login verification" are related concepts. But it can miss exact technical terms, especially uncommon ones.

AmanMCP uses both — and automatically adjusts the balance based on your query.

The classifier examines query structure — presence of error codes, camelCase identifiers, natural language patterns — and sets weights accordingly. No manual tuning required.
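A minimal sketch of what such a classifier could look like. The regular expressions, the question-word list, and the weight values are all assumptions chosen for illustration; the real heuristics will come from the spec.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Weights is the relative influence of each search path.
// The concrete numbers used below are illustrative assumptions.
type Weights struct {
	Keyword float64
	Vector  float64
}

var (
	// Upper-case tokens such as ERR_CONNECTION_REFUSED or E404.
	errorCode = regexp.MustCompile(`\b[A-Z][A-Z0-9]*(_[A-Z0-9]+)+\b|\b[A-Z]+[0-9]+\b`)
	// A lower-case letter directly followed by an upper-case one: handleLogin, UserService.
	camelCase = regexp.MustCompile(`[a-z][A-Z]`)
	// snake_case identifiers such as get_user.
	snakeCase = regexp.MustCompile(`\b[a-z0-9]+(_[a-z0-9]+)+\b`)
)

var questionWords = map[string]bool{
	"how": true, "what": true, "where": true, "why": true, "which": true, "when": true,
}

// ClassifyQuery inspects the shape of the query and picks search weights.
func ClassifyQuery(q string) Weights {
	q = strings.TrimSpace(q)
	// Identifiers and error codes favor exact keyword matching.
	if errorCode.MatchString(q) || camelCase.MatchString(q) || snakeCase.MatchString(q) {
		return Weights{Keyword: 0.8, Vector: 0.2}
	}
	// Longer phrases and questions favor semantic search.
	words := strings.Fields(strings.ToLower(q))
	if len(words) >= 3 || (len(words) > 0 && questionWords[words[0]]) {
		return Weights{Keyword: 0.3, Vector: 0.7}
	}
	// Anything else gets an even split.
	return Weights{Keyword: 0.5, Vector: 0.5}
}

func main() {
	for _, q := range []string{"ERR_CONNECTION_REFUSED", "how does error handling work?", "authentication"} {
		fmt.Printf("%-32q %+v\n", q, ClassifyQuery(q))
	}
}
```

Rule-based classification like this costs microseconds per query, so it does not eat into the latency budget.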

Results from both searches merge using Reciprocal Rank Fusion (RRF), a technique that combines ranked lists without needing comparable scores.
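Here is a small sketch of RRF. Each document scores 1/(k + rank) in every list it appears in, and the scores are summed. Multiplying each list's contribution by a per-path weight is my assumption about how the classifier output could plug in, and k = 60 is the commonly used constant rather than a tuned value.

```go
package main

import (
	"fmt"
	"sort"
)

// RankedList is one search path's results, best first, plus that path's weight.
// IDs are assumed to be unique within a list.
type RankedList struct {
	Weight float64
	IDs    []string
}

// Fused is a document with its combined RRF score.
type Fused struct {
	ID    string
	Score float64
}

// FuseRRF merges ranked lists using weighted Reciprocal Rank Fusion.
// Only ranks are used, so the raw scores of each path never need to be comparable.
func FuseRRF(k float64, lists ...RankedList) []Fused {
	scores := make(map[string]float64)
	for _, l := range lists {
		for i, id := range l.IDs {
			rank := float64(i + 1) // ranks are 1-based
			scores[id] += l.Weight / (k + rank)
		}
	}
	out := make([]Fused, 0, len(scores))
	for id, s := range scores {
		out = append(out, Fused{ID: id, Score: s})
	}
	// Highest score first; break ties by ID so output is deterministic.
	sort.Slice(out, func(a, b int) bool {
		if out[a].Score != out[b].Score {
			return out[a].Score > out[b].Score
		}
		return out[a].ID < out[b].ID
	})
	return out
}

func main() {
	keyword := RankedList{Weight: 1, IDs: []string{"a", "b", "c"}}
	vector := RankedList{Weight: 1, IDs: []string{"b", "d", "a"}}
	for _, f := range FuseRRF(60, keyword, vector) {
		fmt.Printf("%s %.5f\n", f.ID, f.Score)
	}
}
```

Documents that appear in both lists rise to the top even when neither path ranked them first.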

AST-Aware Chunking

RAG systems split documents into chunks before indexing. Most use fixed token counts — every 500 tokens, create a new chunk.

This breaks code in awkward places. A function split mid-way loses meaning. A class definition separated from its methods becomes harder to understand.

AmanMCP uses tree-sitter to parse actual code structure. Chunks align with logical boundaries:

Functions stay whole
Classes keep their methods
Related code stays together

When a function exceeds the chunk limit, it splits at nested boundaries — inner functions, large blocks — rather than arbitrary positions.
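To illustrate the idea without pulling in tree-sitter, this sketch swaps in Go's standard go/parser, so it only handles Go source. It emits one chunk per top-level declaration and keeps the doc comment attached. Splitting oversized functions at nested boundaries is not shown, and the Chunk type is a hypothetical shape, not the final data model.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// Chunk is one logical unit of code with its location in the file.
type Chunk struct {
	Name      string
	StartLine int
	EndLine   int
	Text      string
}

// ChunkGoSource returns one chunk per top-level declaration.
func ChunkGoSource(src string) ([]Chunk, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "input.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var chunks []Chunk
	for _, decl := range file.Decls {
		name := "decl"
		start := decl.Pos()
		switch d := decl.(type) {
		case *ast.FuncDecl:
			name = d.Name.Name
			if d.Doc != nil {
				start = d.Doc.Pos() // keep the doc comment with its function
			}
		case *ast.GenDecl:
			name = d.Tok.String() // "import", "type", "const", or "var"
			if d.Doc != nil {
				start = d.Doc.Pos()
			}
		}
		s, e := fset.Position(start), fset.Position(decl.End())
		chunks = append(chunks, Chunk{
			Name:      name,
			StartLine: s.Line,
			EndLine:   e.Line,
			Text:      src[s.Offset:e.Offset],
		})
	}
	return chunks, nil
}

func main() {
	src := `package demo

// Add returns a + b.
func Add(a, b int) int {
	return a + b
}

type User struct {
	Name string
}
`
	chunks, err := ChunkGoSource(src)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, c := range chunks {
		fmt.Printf("%s: lines %d-%d\n", c.Name, c.StartLine, c.EndLine)
	}
}
```

The same principle carries over to tree-sitter: walk the syntax tree, pick the nodes that represent whole units, and cut there instead of at a token count.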

Markdown files use header-based chunking. Each section becomes a chunk, with header hierarchy preserved as context.
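A sketch of header-based chunking, assuming ATX-style headers (lines starting with one to six # characters). The Section type is hypothetical. Fenced code blocks are tracked so that a shell comment inside a fence is not mistaken for a header.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is one chunk: the header trail leading to it, plus its body text.
type Section struct {
	Path []string // header hierarchy, outermost first
	Body string
}

// parseHeader returns the header level (1-6) and title, or level 0 if the
// line is not an ATX header.
func parseHeader(line string) (int, string) {
	level := 0
	for level < len(line) && line[level] == '#' {
		level++
	}
	if level == 0 || level > 6 || level >= len(line) || line[level] != ' ' {
		return 0, ""
	}
	return level, strings.TrimSpace(line[level:])
}

// ChunkMarkdown splits a document into sections at each header.
func ChunkMarkdown(doc string) []Section {
	fence := strings.Repeat("`", 3) // marker that opens and closes a code fence
	var sections []Section
	var path, body []string
	inFence := false

	flush := func() {
		text := strings.TrimSpace(strings.Join(body, "\n"))
		if text != "" {
			// Copy the path so later header changes do not alter this section.
			sections = append(sections, Section{Path: append([]string(nil), path...), Body: text})
		}
		body = body[:0]
	}

	for _, line := range strings.Split(doc, "\n") {
		if strings.HasPrefix(strings.TrimSpace(line), fence) {
			inFence = !inFence
		}
		level, title := parseHeader(line)
		if inFence || level == 0 {
			body = append(body, line)
			continue
		}
		flush()
		if level-1 < len(path) {
			path = path[:level-1] // drop headers at the same or deeper level
		}
		path = append(path, title)
	}
	flush()
	return sections
}

func main() {
	doc := "# Guide\nintro\n## Install\nrun it\n## Usage\n### CLI\nflags\n"
	for _, s := range ChunkMarkdown(doc) {
		fmt.Printf("%s: %q\n", strings.Join(s.Path, " > "), s.Body)
	}
}
```

Storing the header trail with each chunk means a section titled "CLI" is still searchable as part of "Guide > Usage".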

Local Embeddings

Vector search requires embeddings — numerical representations of text that capture semantic meaning.

Most RAG systems call cloud APIs (OpenAI, Cohere) for embeddings. Every query and every indexed chunk makes a network request. Costs add up. Rate limits apply. Your code travels over the wire.

AmanMCP generates embeddings locally using Ollama with the nomic-embed-text model. Runs on your hardware. No API costs. No external calls.
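For reference, requesting an embedding from a local Ollama instance is a single HTTP call. This sketch assumes Ollama is listening on its default port 11434 and uses its /api/embed endpoint. The function names are mine, not AmanMCP's.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

type embedRequest struct {
	Model string `json:"model"`
	Input string `json:"input"`
}

type embedResponse struct {
	Embeddings [][]float32 `json:"embeddings"`
}

// decodeEmbedding extracts the first vector from an /api/embed response body.
func decodeEmbedding(body []byte) ([]float32, error) {
	var out embedResponse
	if err := json.Unmarshal(body, &out); err != nil {
		return nil, err
	}
	if len(out.Embeddings) == 0 {
		return nil, fmt.Errorf("response contained no embeddings")
	}
	return out.Embeddings[0], nil
}

// Embed asks the Ollama server at baseURL to embed text with the given model.
func Embed(ctx context.Context, baseURL, model, text string) ([]float32, error) {
	payload, err := json.Marshal(embedRequest{Model: model, Input: text})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/api/embed", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("ollama returned %s", resp.Status)
	}
	return decodeEmbedding(body)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	vec, err := Embed(ctx, "http://localhost:11434", "nomic-embed-text", "func Authenticate(user, pass string) error")
	if err != nil {
		// This is the point where a fallback embedder would take over.
		fmt.Println("embedding unavailable:", err)
		return
	}
	fmt.Println("dimensions:", len(vec))
}
```

The request goes to localhost only, so nothing leaves the machine.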

For machines without a GPU, or when Ollama isn’t running, a static embeddings fallback provides basic semantic search using CPU-only word vectors. Quality decreases, but search still works.
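One simple way to build such a fallback is to average pre-computed word vectors and compare the results with cosine similarity. The sketch below assumes that approach and uses a made-up three-dimensional table. A real table would be loaded from disk and have far more words and dimensions.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// toyVectors stands in for a real pre-trained word-vector table.
// The words and values are invented for illustration.
var toyVectors = map[string][]float64{
	"login":    {0.9, 0.1, 0.0},
	"password": {0.8, 0.2, 0.1},
	"auth":     {0.85, 0.15, 0.05},
	"render":   {0.0, 0.9, 0.3},
	"button":   {0.1, 0.8, 0.4},
}

// EmbedStatic averages the vectors of all known words in text.
// Unknown words are skipped; text with no known words yields a zero vector.
func EmbedStatic(text string, table map[string][]float64, dim int) []float64 {
	sum := make([]float64, dim)
	n := 0
	for _, w := range strings.Fields(strings.ToLower(text)) {
		v, ok := table[w]
		if !ok {
			continue
		}
		for i := range sum {
			sum[i] += v[i]
		}
		n++
	}
	if n > 0 {
		for i := range sum {
			sum[i] /= float64(n)
		}
	}
	return sum
}

// Cosine returns the cosine similarity of a and b, or 0 if either is all zeros.
func Cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	query := EmbedStatic("login password", toyVectors, 3)
	for _, doc := range []string{"auth", "render button"} {
		fmt.Printf("%-14s %.2f\n", doc, Cosine(query, EmbedStatic(doc, toyVectors, 3)))
	}
}
```

Averaging ignores word order and context, which is where the quality drop comes from.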

Architecture

Storage layer uses:

USearch for vector similarity (HNSW algorithm)
Custom BM25 inverted index for keyword search
SQLite for metadata and file tracking

All indexes live in .amanmcp/ within your project directory. Portable and inspectable.
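The keyword side is the easiest piece to sketch. Below is a minimal in-memory BM25 inverted index using the common defaults k1 = 1.2 and b = 0.75. The tokenizer, type names, and parameter values are assumptions for illustration, and persistence is left out.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// BM25Index maps each term to the documents containing it.
type BM25Index struct {
	postings map[string]map[int]int // term -> docID -> term frequency
	docLen   []int                  // token count per document
	totalLen int
	k1, b    float64
}

func NewBM25Index() *BM25Index {
	return &BM25Index{postings: map[string]map[int]int{}, k1: 1.2, b: 0.75}
}

// tokenize lower-cases and splits on anything that is not a letter, digit,
// or underscore, so ERR_CONNECTION_REFUSED stays a single token.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !(r >= 'a' && r <= 'z' || r >= '0' && r <= '9' || r == '_')
	})
}

// Add indexes text and returns its document ID.
func (ix *BM25Index) Add(text string) int {
	id := len(ix.docLen)
	terms := tokenize(text)
	for _, t := range terms {
		if ix.postings[t] == nil {
			ix.postings[t] = map[int]int{}
		}
		ix.postings[t][id]++
	}
	ix.docLen = append(ix.docLen, len(terms))
	ix.totalLen += len(terms)
	return id
}

type Hit struct {
	DocID int
	Score float64
}

// Search scores every document that shares a term with the query.
func (ix *BM25Index) Search(query string) []Hit {
	n := float64(len(ix.docLen))
	if n == 0 {
		return nil
	}
	avgLen := float64(ix.totalLen) / n
	scores := map[int]float64{}
	for _, t := range tokenize(query) {
		docs := ix.postings[t]
		df := float64(len(docs))
		if df == 0 {
			continue
		}
		idf := math.Log(1 + (n-df+0.5)/(df+0.5))
		for id, tf := range docs {
			f := float64(tf)
			norm := 1 - ix.b + ix.b*float64(ix.docLen[id])/avgLen
			scores[id] += idf * f * (ix.k1 + 1) / (f + ix.k1*norm)
		}
	}
	hits := make([]Hit, 0, len(scores))
	for id, s := range scores {
		hits = append(hits, Hit{DocID: id, Score: s})
	}
	sort.Slice(hits, func(a, b int) bool {
		if hits[a].Score != hits[b].Score {
			return hits[a].Score > hits[b].Score
		}
		return hits[a].DocID < hits[b].DocID
	})
	return hits
}

func main() {
	ix := NewBM25Index()
	ix.Add("return ERR_CONNECTION_REFUSED when dial fails")
	ix.Add("render the login button")
	ix.Add("login handler validates the password and login token")
	for _, h := range ix.Search("login") {
		fmt.Printf("doc %d  %.3f\n", h.DocID, h.Score)
	}
}
```

Keeping underscores inside tokens matters for code: splitting ERR_CONNECTION_REFUSED into three words would throw away exactly the precision keyword search is there for.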

Performance Targets

The goal is sub-100ms query latency on a 50,000-file codebase running on typical developer hardware (16-32GB RAM).

This requires:

Efficient vector indexing (USearch with HNSW)
In-memory BM25 with smart caching
LRU cache for repeated queries
Parallel search execution (BM25 and vector run concurrently)
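The parallel execution item is simple to express in Go. This sketch runs two search paths concurrently and waits for both. SearchFunc and PathResult are hypothetical types, and cancellation and timeouts are omitted to keep it short.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// SearchFunc is any search path: it takes a query and returns ranked IDs.
type SearchFunc func(query string) ([]string, error)

// PathResult carries one path's output so a failure in one does not hide the other.
type PathResult struct {
	IDs []string
	Err error
}

// SearchBoth runs the keyword and vector paths at the same time.
// Total latency is roughly that of the slower path, not the sum of both.
func SearchBoth(query string, keyword, vector SearchFunc) (kw, vec PathResult) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		kw.IDs, kw.Err = keyword(query)
	}()
	go func() {
		defer wg.Done()
		vec.IDs, vec.Err = vector(query)
	}()
	wg.Wait() // each goroutine writes its own result, so no lock is needed
	return kw, vec
}

func main() {
	// Stand-in search paths that just sleep to simulate work.
	slowKeyword := func(q string) ([]string, error) {
		time.Sleep(50 * time.Millisecond)
		return []string{"a", "b"}, nil
	}
	slowVector := func(q string) ([]string, error) {
		time.Sleep(50 * time.Millisecond)
		return []string{"b", "c"}, nil
	}

	start := time.Now()
	kw, vec := SearchBoth("auth", slowKeyword, slowVector)
	fmt.Println(kw.IDs, vec.IDs, "elapsed:", time.Since(start).Round(10*time.Millisecond))
}
```

Returning each path's error separately leaves room for graceful degradation: if the vector path fails, keyword results can still be served.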

Memory management adapts to available RAM:

16GB system → conservative settings, I8 quantization
24GB system → balanced defaults, F16 quantization
32GB+ → full precision, larger caches
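Those tiers could be expressed as a small lookup. In this sketch the caller supplies total RAM (detecting it is platform-specific and not shown), "full precision" is assumed to mean F32, and the treatment of values between tiers is my assumption.

```go
package main

import "fmt"

// Quantization is the storage precision for vectors in the index.
type Quantization string

const (
	QuantI8  Quantization = "i8"
	QuantF16 Quantization = "f16"
	QuantF32 Quantization = "f32" // assumed meaning of "full precision"
)

// Settings is a hypothetical subset of the memory-related configuration.
type Settings struct {
	Profile      string
	Quantization Quantization
}

// SettingsForRAM picks a profile from total system memory in GB.
// Anything below 24GB falls into the conservative tier.
func SettingsForRAM(totalGB int) Settings {
	switch {
	case totalGB >= 32:
		return Settings{Profile: "full", Quantization: QuantF32}
	case totalGB >= 24:
		return Settings{Profile: "balanced", Quantization: QuantF16}
	default:
		return Settings{Profile: "conservative", Quantization: QuantI8}
	}
}

func main() {
	for _, gb := range []int{16, 24, 32, 64} {
		fmt.Printf("%2dGB -> %+v\n", gb, SettingsForRAM(gb))
	}
}
```

Lower precision shrinks the vector index at some cost in recall, which is the trade the smaller tiers are making.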

Why Go?

AmanMCP is written in Go. Single binary compilation. No runtime dependencies. Cross-platform without configuration.

# That's it. No pip, no venv, no node_modules.
./amanmcp serve

Tree-sitter bindings require CGO, but the distributed binary includes everything. Users don’t need a compiler toolchain.

Go’s concurrency model fits the architecture — parallel search paths, background indexing, file watching — without callback complexity.

What’s Next

Current status: I’m finalizing the technical specification and expect to complete it over the weekend.

The spec covers:

Complete data models (chunks, symbols, projects)
Search algorithms with code examples
Configuration schema with sensible defaults
MCP tool definitions
Error handling and graceful degradation

Next milestone: working v1 with core search functionality.

Get Involved

AmanMCP will be open source. If you’re interested in:

Local-first developer tools
RAG systems and hybrid search
Go-based infrastructure

Watch the repo (link coming soon) or connect with me here.

The AI assistant ecosystem is growing fast. The tooling that feeds context to these assistants matters. I’d rather that tooling respect privacy by default.

#AI #OpenSource #DeveloperTools #BuildInPublic #Golang #RAG #LocalFirst #LLM #DevTools #AIAssistant

Building in public. Questions and feedback welcome.
