Small and local LLMs are often framed as the cheap alternative to frontier models. That framing is wrong. They are not a degraded version of the same thing. They are a different architectural choice, selected for control, predictability, and survivable failure modes.
I'm as guilty as anyone of pushing the "they're free" narrative, as if that were the only deciding factor. But as with choosing a database or hosting platform for a system, you need to understand what trade-offs you are making.
Using a small model via Ollama, LM Studio, ONNX Runtime, or similar is not (just) about saving money. It is about choosing where non-determinism is allowed to exist.
## The Real Difference: Failure Modes
Large frontier models are broader and more fluent. They encode more of human-expressed knowledge, span more domains, and produce more convincing reasoning traces. That also makes them more dangerous in systems that require guarantees.
Frontier models make sense when breadth is required and outputs are advisory by design - creative drafting, open-ended exploration, or synthesis across unfamiliar domains. But that's not most production systems.
Their failures are semantic rather than structural: they generate valid-looking outputs that are wrong in subtle ways. This is the category error: treating a probabilistic component as if it were a system boundary. Those failures are:
- Expensive to detect
- Expensive to debug
- Often only visible after damage is done
Small models fail differently.
When a small model is confused, it tends to:
- Break schemas
- Emit invalid JSON
- Truncate outputs
- Lose track of structure
These are cheap failures. They are detectable with simple validation. They trigger retries or fallbacks immediately. They do not silently advance state.
This is not a weakness. It is a feature.
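To make that concrete, here is a minimal sketch of cheap-failure handling, assuming Ollama's local REST API on the default port; the `Classification` record, model name, and prompt are illustrative rather than taken from any real project:

```csharp
// A minimal sketch, assuming Ollama's local REST API.
// The Classification record, model name, and prompt are illustrative.
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public record Classification(string Label, double Confidence);

public static class CheapFailures
{
    private static readonly HttpClient Http = new();
    private static readonly JsonSerializerOptions Options =
        new() { PropertyNameCaseInsensitive = true };

    public static async Task<Classification> ClassifyAsync(string text)
    {
        for (var attempt = 1; attempt <= 3; attempt++)
        {
            var request = JsonSerializer.Serialize(new
            {
                model = "llama3.2",   // any small local model
                prompt = $"Classify the text as JSON with \"label\" and \"confidence\": {text}",
                stream = false,
                format = "json"       // ask Ollama to constrain output to JSON
            });

            var response = await Http.PostAsync(
                "http://localhost:11434/api/generate",
                new StringContent(request, Encoding.UTF8, "application/json"));

            var raw = JsonDocument.Parse(await response.Content.ReadAsStringAsync())
                                  .RootElement.GetProperty("response").GetString() ?? "";

            try
            {
                var result = JsonSerializer.Deserialize<Classification>(raw, Options);

                // Structural validation: loud, immediate, and retryable.
                if (result is { Label.Length: > 0, Confidence: >= 0 and <= 1 })
                    return result;
            }
            catch (JsonException)
            {
                // Invalid JSON is exactly the cheap failure we want: retry or fall back.
            }
        }

        throw new InvalidOperationException("Output never validated; use the fallback path.");
    }
}
```

Nothing here is clever. That is the point: the validation is a few lines, the retry is a loop, and a confused model cannot silently advance state.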
## Where This Principle Comes From
This insight isn't abstract theory - it's the foundation of the Ten Commandments of LLM Use. The core principle:
LLMs interpret reality. They must never be allowed to define it.
When you follow this principle, you discover something surprising: you stop needing expensive models. A 7B parameter model running locally can classify, summarise, and generate hypotheses just fine - because the deterministic systems around it handle everything that actually needs to be correct.
Small models are not "weak" - they are often sufficient because the problem has already been reduced by the time it reaches them.
Frontier models are selling you reliability you should be building yourself.
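A minimal sketch of what the principle looks like in code, with hypothetical names: the model may propose a label, but a fixed, deterministic allow-list decides what actually happens.

```csharp
// A sketch of the principle, with hypothetical names: the model proposes a
// label, but only this deterministic allow-list decides what happens next.
using System;
using System.Collections.Generic;

public static class TicketRouter
{
    // Reality is defined here, in code, not in the model.
    private static readonly Dictionary<string, string> Routes =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["billing"] = "queue:billing",
            ["outage"]  = "queue:oncall",
            ["other"]   = "queue:triage"
        };

    public static string Route(string proposedLabel) =>
        Routes.TryGetValue(proposedLabel, out var queue)
            ? queue
            : Routes["other"]; // unknown proposals degrade safely instead of inventing state
}
```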
## The Right Mental Model
Just as DuckDB is not "cheap SQL" and Postgres is not "worse Azure SQL", small LLMs occupy a different point in the design space. You choose them when these concerns dominate:
| Concern | Small Model Advantage |
|---|---|
| Locality | Runs on your hardware, your network, your jurisdiction |
| Auditability | Every inference is logged, reproducible, inspectable |
| Blast radius | Failures are contained, not propagated through API chains |
| Correctness enforcement | Validation happens outside the model |
| Bounded non-determinism | Uncertainty is tightly constrained |
## How I Use This in Practice
This isn't hypothetical. My projects demonstrate this pattern repeatedly:
My GraphRAG implementation offers three modes:
| Mode | LLM Calls | Best For |
|---|---|---|
| Heuristic | 0 per chunk | Pure determinism via IDF + structure |
| Hybrid | 1 per document | Small model validates candidates |
| LLM | 2 per chunk | Maximum quality when needed |
The hybrid mode is the sweet spot: heuristic extraction finds candidates (deterministic), then a small local model validates and enriches them. One LLM call per document, not per chunk.
With Ollama running locally, the cost is $0. But that's not why I use it - cost savings are a side-effect of correct abstraction, not the goal. I use it because the failures are cheap and obvious.
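Reduced to a sketch (illustrative names and stand-in methods, not the actual GraphRAG code), the hybrid mode looks like this:

```csharp
// The shape of the hybrid mode, reduced to a sketch. Names are illustrative,
// not the actual GraphRAG code; the two private methods are stand-ins.
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public record Candidate(string Entity, double Score);

public class HybridExtractor
{
    public async Task<IReadOnlyList<Candidate>> ExtractAsync(IEnumerable<string> chunks)
    {
        // Deterministic pass: zero LLM calls, fully reproducible.
        var candidates = chunks
            .SelectMany(ProposeByHeuristics)
            .GroupBy(c => c.Entity)
            .Select(g => g.OrderByDescending(c => c.Score).First())
            .ToList();

        // One LLM call per document, not per chunk: the model only
        // confirms or enriches what the heuristics already found.
        return await ValidateWithSmallModelAsync(candidates);
    }

    private static IEnumerable<Candidate> ProposeByHeuristics(string chunk)
    {
        // Stand-in for IDF scoring plus structural rules (capitalisation, headings).
        yield break;
    }

    private static Task<IReadOnlyList<Candidate>> ValidateWithSmallModelAsync(
        IReadOnlyList<Candidate> candidates) =>
        // Stand-in for a single Ollama call that filters and enriches candidates.
        Task.FromResult(candidates);
}
```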
### ONNX Embeddings: No LLM Required
Semantic search with ONNX and Qdrant shows another pattern: some tasks don't need an LLM at all. BERT embeddings via ONNX Runtime give you:
- CPU-friendly inference - no GPU required
- Deterministic outputs - same input always produces same embedding
- Local execution - no API calls, no latency, no rate limits
- ~90MB model - runs anywhere
For hybrid search, I combine these embeddings with BM25 scoring. The LLM only appears at synthesis time - and even then, a small local model works fine because it's explaining structure that deterministic systems have already validated.
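Here is roughly what the embedding step looks like with the Microsoft.ML.OnnxRuntime package. The model file and tensor names are assumptions to check against your own export, and tokenisation is elided for brevity:

```csharp
// A sketch of the deterministic embedding step with Microsoft.ML.OnnxRuntime.
// The model file and tensor names are assumptions - inspect your exported
// model to confirm them - and tokenisation is elided for brevity.
using System;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public sealed class BertEmbedder : IDisposable
{
    private readonly InferenceSession _session = new("all-MiniLM-L6-v2.onnx");

    public float[] Embed(long[] inputIds, long[] attentionMask)
    {
        var shape = new[] { 1, inputIds.Length };
        var inputs = new[]
        {
            NamedOnnxValue.CreateFromTensor("input_ids",
                new DenseTensor<long>(inputIds, shape)),
            NamedOnnxValue.CreateFromTensor("attention_mask",
                new DenseTensor<long>(attentionMask, shape))
        };

        using var results = _session.Run(inputs);

        // No sampling, no temperature: the same input always yields the same vector.
        return results.First().AsEnumerable<float>().ToArray();
    }

    public void Dispose() => _session.Dispose();
}
```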
### DocSummarizer: Structure First, LLM Second
DocSummarizer embodies this philosophy:
1. Parse documents with deterministic libraries (OpenXML, Markdig)
2. Chunk content using structural rules (headings, paragraphs, code blocks)
3. Embed chunks with ONNX BERT
4. Retrieve relevant chunks via vector search
5. Synthesise with Ollama - the only probabilistic step
The LLM is the last step, working on pre-validated, pre-structured content. It can fail - and when it does, the failure is obvious because the structure is already correct.
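As a skeleton (the private methods are stand-ins, not the actual DocSummarizer internals), the pipeline reduces to:

```csharp
// A skeleton of the pipeline, with stand-ins rather than the real
// DocSummarizer internals: four deterministic steps, then one probabilistic one.
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class StructureFirst
{
    public static async Task<string> SummariseAsync(string path, string question)
    {
        var text     = Parse(path);                   // 1. deterministic parsing (OpenXML, Markdig)
        var chunks   = Chunk(text);                   // 2. structural rules: headings, paragraphs
        var vectors  = chunks.Select(Embed).ToList(); // 3. ONNX BERT: same chunk, same vector
        var relevant = Retrieve(Embed(question), vectors, chunks, topK: 8); // 4. vector search

        // 5. The only probabilistic step, operating on pre-validated structure.
        return await SynthesiseWithOllamaAsync(question, relevant);
    }

    private static string Parse(string path) => File.ReadAllText(path);      // stand-in
    private static IReadOnlyList<string> Chunk(string text) =>
        text.Split("\n\n").Where(p => p.Length > 0).ToList();                // stand-in
    private static float[] Embed(string chunk) => new float[384];            // stand-in (MiniLM dim)
    private static IReadOnlyList<string> Retrieve(float[] query,
        IReadOnlyList<float[]> vectors, IReadOnlyList<string> chunks, int topK) =>
        chunks.Take(topK).ToList();                                          // stand-in for Qdrant
    private static Task<string> SynthesiseWithOllamaAsync(
        string question, IReadOnlyList<string> context) =>
        Task.FromResult("");                                                 // stand-in for Ollama
}
```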
## The Three Questions
Frontier models are powerful tools when used deliberately. But they increase expressive power faster than they reduce risk. Small models, when embedded inside deterministic systems, give you just enough uncertainty to explore - without obscuring truth or responsibility.
The right question is not "which model is best?"
It is:
- Where does probability belong?
- Where must determinism be absolute?
- What failures can this system survive?
If the answer involves state, side effects, money, policy, or guarantees - the model should never be in charge. And if the model is only there to classify, summarise, rank, or propose hypotheses, a small local model is often the correct choice, not merely the economical one.
## The Pattern: Boring Machinery + Small Model
This is the architecture that works:
```
┌───────────────────────────────────────────────────────┐
│                  DETERMINISTIC LAYER                  │
│   State machines, queues, validation, storage         │
│   (DuckDB, Postgres, Redis, file systems)             │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│                    INTERFACE LAYER                    │
│   Schema validation, retries, fallbacks               │
│   (Polly, FluentValidation, custom guards)            │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│                  PROBABILISTIC LAYER                  │
│   Classification, summarisation, hypothesis gen       │
│   (Ollama, ONNX, small local models)                  │
└───────────────────────────────────────────────────────┘
```
The LLM is at the bottom, not the top. It proposes; the deterministic layers dispose.
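A sketch of the interface layer using the libraries named in the diagram: Polly retries the loud failures, FluentValidation rejects proposals that break the schema. The `Proposal` record and its rules are hypothetical.

```csharp
// A sketch of the interface layer: Polly retries loud failures,
// FluentValidation rejects anything that breaks the schema.
// The Proposal record and its rules are hypothetical.
using System;
using System.Text.Json;
using System.Threading.Tasks;
using FluentValidation;
using Polly;

public record Proposal(string Action, string Target);

public class ProposalValidator : AbstractValidator<Proposal>
{
    public ProposalValidator()
    {
        RuleFor(p => p.Action).NotEmpty().Must(a => a is "tag" or "summarise");
        RuleFor(p => p.Target).NotEmpty().MaximumLength(256);
    }
}

public static class InterfaceLayer
{
    private static readonly ProposalValidator Validator = new();
    private static readonly JsonSerializerOptions Options =
        new() { PropertyNameCaseInsensitive = true };

    public static Task<Proposal> GetValidatedProposalAsync(Func<Task<string>> callModel) =>
        Policy<Proposal>
            .Handle<JsonException>()     // invalid JSON: retry
            .Or<ValidationException>()   // schema violation: retry
            .RetryAsync(3)
            .ExecuteAsync(async () =>
            {
                var raw = await callModel();
                var proposal = JsonSerializer.Deserialize<Proposal>(raw, Options)
                               ?? throw new JsonException("empty payload");
                Validator.ValidateAndThrow(proposal); // dispose of bad proposals here
                return proposal;
            });
}
```

If the model emits garbage three times in a row, the call fails loudly and the deterministic layer above decides what happens next.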
## Reliability Is Not About Avoiding Failure
All three perspectives - the questions, the pattern, and this final principle - reduce to the same rule:
Reliability is about choosing failures you can survive.
With LLMs, that means managing non-determinism through deterministic practices:
- Commandment I: State lives outside the model
- Commandment VII: Make failure loud and boring
- Commandment IX: Build the boring machinery first
Small models make this easier because their failures are loud. Invalid JSON. Truncated output. Schema violations. These are gifts - they tell you immediately that something went wrong.
Frontier model failures are quiet. Plausible-sounding nonsense. Confident hallucinations. Semantic drift that only becomes visible when a customer complains or an audit fails.
I'll take loud failures every time.
## The Philosophy
- Ten Commandments of LLM Use - The principles behind this approach
- Why I Don't Use LangChain - Framework complexity vs. clarity
- Why Commercial AI Projects Are Dumb - The case for local-first AI
## The Implementation
- GraphRAG: Minimum Viable Implementation - Three extraction modes in practice
- Semantic Search with ONNX and Qdrant - CPU-friendly embeddings
- DocSummarizer Tool - Structure first, LLM second
- Hybrid Search and Auto-Indexing - Production-ready search
## The Architecture
- DiSE: Treating LLMs as Untrustworthy - The "untrustworthy gods" pattern
- Bot Detection with LLM Advisors - LLM as advisor, not controller
- Zero-PII Customer Intelligence - Semantic understanding with boundaries
## External Resources
- Ollama - Run LLMs locally with one command
- ONNX Runtime - Cross-platform ML inference
- LM Studio - Desktop app for local LLMs
- llama.cpp - Efficient C++ inference