I work as an SRE, so I care deeply about control, performance, and safe failures. When I started using AI chatbots, something felt uncomfortable: I didn't know where the answers came from or why the AI sounded so confident, and that uncertainty stayed with me.

So in my free time, I decided to build my own private AI chatbot, not to sell anything, but to understand how these systems actually work. This post explains the idea in simple terms; no AI background is needed.

The problem I wanted to solve

Most AI chatbots:

- Send data to cloud services
- Answer from general knowledge
- Guess when they are not sure

From an SRE point of view, this is risky: there is no clear failure behavior.

I wanted an AI system that behaves like a well-run production service: one that answers only from known documents.

Overall system architecture (high level)

At a high level, the system looks like this:

- Backend → handles logic and security
- Ollama → runs AI models locally
- Local storage → document index

Everything runs on my own hardware.

Running AI on my own system (Ollama)

I used Ollama, which lets you run AI models on your own machine. Ollama turns your laptop or server into your own AI API:

- No dependency on cloud AI APIs
- Easy to restart, monitor, and control

The models I use (simple and clear)

I use two models, each with a specific job.

1️⃣ Embedding model (understanding meaning)

- Converts text into vectors (numbers)
- Helps compare meaning, not words
- This model never writes answers; it only encodes text.

2️⃣ Chat model (writing answers)

- Reads the selected information
- Good balance of quality and speed

Using two models reduces hallucination and improves control.

What is RAG (explained simply)

RAG means Retrieval-Augmented Generation. The AI is not allowed to guess.
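Before walking through the retrieval steps, it helps to see what "comparing meaning, not words" looks like in practice: embeddings are just vectors, and similarity is usually cosine similarity. Here is a minimal sketch using hand-made toy vectors; in the real system these numbers would come from the local embedding model, not be written by hand.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 = same meaning, close to 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models output hundreds of dimensions).
how_it_works = [0.9, 0.1, 0.0]   # "How does feature X work?"
guide_me     = [0.8, 0.2, 0.1]   # "Can you guide me on feature X?"
unrelated    = [0.0, 0.1, 0.9]   # "What's for lunch?"

print(cosine_similarity(how_it_works, guide_me))   # high → same meaning
print(cosine_similarity(how_it_works, unrelated))  # low → different meaning
```

Two differently worded questions about the same feature end up pointing in nearly the same direction, while an unrelated question points somewhere else entirely. That single property powers both the retrieval step and the semantic cache described in this post.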
It must look at documents first; it works like an open-book exam. The pipeline:

- Documents are split into small chunks
- Each chunk is converted into a vector
- The user's question is also converted into a vector
- The most similar chunks are selected
- The AI answers using only those chunks
- If similarity is low → safe fallback

Why Redis is used (semantic caching)

People ask the same question in different ways:

- "How does feature X work?"
- "Can you guide me on feature X?"

The text is different, but the meaning is the same, so a normal exact-match cache does not work well here.

Redis semantic cache (flow):

1. Question → converted into a vector
2. Vector + answer → stored in Redis
3. New question → vector comparison
4. If similarity is high → return the cached answer

Same meaning ≠ repeated cost.

Where everything runs: Mac mini + Cloudflare Tunnel

All of this runs on a Mac mini that I use as a small home server:

- Backend services running locally
- Exposed only via Cloudflare Tunnel
- Access controlled at Cloudflare

I'll share the full Mac mini + Cloudflare Tunnel setup in a separate post if people are interested.

I didn't build this to sell anything. My goals were simple:

- Understand AI systems deeply
- Treat AI like infrastructure
- Learn by building in my free time

This is Part 1: concepts first, code later.

Building this private AI chatbot helped me stop seeing AI as magic and start seeing it as infrastructure. Once I approached it like any other service, familiar SRE questions naturally followed: SLOs, failure modes, confidence thresholds, and observability. That's when everything started to make sense.

This v1 system already has clear behavior, predictable performance, and practical safeguards like semantic caching and safe fallbacks. Instead of hiding uncertainty, it makes it visible, which is exactly how reliable systems should behave. Just as important, it gave me real insight into latency, load, and system behavior: the foundations of operating anything in production.

There's plenty more to build on top of this, and that's the exciting part. This version gave me a solid, working base and a much deeper understanding of how modern AI systems behave under real constraints.
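To make the ideas above concrete, here is a minimal end-to-end sketch of the flow: semantic cache check, retrieval by similarity, and safe fallback. Everything here is my own illustrative stand-in, not the actual code: the embeddings are faked with toy vectors, a plain Python list stands in for Redis, and names like `embed`, `SIMILARITY_THRESHOLD`, and `CACHE_THRESHOLD` are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.80  # below this → safe fallback instead of guessing
CACHE_THRESHOLD = 0.95       # above this → reuse a cached answer

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy embeddings; a real system would call the local embedding model instead.
TOY_VECTORS = {
    "how does feature x work?":       [0.90, 0.10, 0.00],
    "can you guide me on feature x?": [0.85, 0.15, 0.05],
    "what's for lunch?":              [0.00, 0.10, 0.90],
}

def embed(text: str) -> list[float]:
    return TOY_VECTORS[text.lower()]

# Document chunks with precomputed vectors (the "split and vectorize" steps).
CHUNKS = [("Feature X retries failed jobs three times.", [0.90, 0.10, 0.00])]

# Semantic cache: (question_vector, answer) pairs. Redis with a vector index
# plays this role in the real system.
semantic_cache: list[tuple[list[float], str]] = []

def answer(question: str) -> str:
    q_vec = embed(question)

    # 1. Cache lookup by meaning, not by exact text.
    for cached_vec, cached_answer in semantic_cache:
        if cosine(q_vec, cached_vec) >= CACHE_THRESHOLD:
            return cached_answer  # same meaning → no repeated cost

    # 2. Retrieve the most similar chunk.
    best_text, best_vec = max(CHUNKS, key=lambda c: cosine(q_vec, c[1]))

    # 3. Safe fallback: refuse rather than guess when nothing is similar enough.
    if cosine(q_vec, best_vec) < SIMILARITY_THRESHOLD:
        return "I don't know based on my documents."

    # 4. A real system would now ask the chat model, constrained to best_text.
    result = f"Based on my documents: {best_text}"
    semantic_cache.append((q_vec, result))
    return result
```

With these toy numbers, "How does feature X work?" is answered from the chunk, the differently worded "Can you guide me on feature X?" hits the semantic cache, and "What's for lunch?" falls through to the safe fallback instead of a guess.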
For me, that's the real win: learning to operate AI with the same discipline and confidence we apply to any production service.

This is v1: a strong foundation, and the beginning of something I'm genuinely excited to keep improving.

Coming next:

- Part 2: Architecture & Design Decisions (with a GitHub code walkthrough)
- Part 3: Deployment & Mac Mini Server Setup (how I built an in-house server using a Mac mini and Cloudflare Tunnel)