From KV caching to multi‑agent workflows, these are the blueprints used in real client projects to scale LLM products without losing quality
I work with large language models (LLMs) every day and keep running into the same failure modes in production. Over the last few years, I’ve distilled 23 design patterns that **consistently fix latency, hallucinations, brittle prompts, and mysterious outages in real systems**. If you’re building or operating LLM‑driven products, these patterns will save you weeks of trial‑and‑error and make your models faster, cheaper, and more reliable in production.
1. KV Cache Optimization
Imagine regenerating every previous token’s attention over and over — painfully slow. KV Cache solves this by caching keys and values from earlier tokens to avoid redundant computation.
It’s like remembering previous conversations rather than starting fresh each time. This dramatically speeds up token generation, especially for long contexts.
- Scenario: ED physician uses a “chart summarizer” that streams a 600–900 token discharge summary from a 6,000-token visit context.
- Concrete setup: vLLM or llama.cpp server with KV cache enabled, continuous batching on, and max context set to 8k. Cache stays on GPU, paged if supported.
- Example request: “Summarize this visit and produce discharge instructions,” with the same patient context reused across 5 follow-up prompts (meds, allergies, follow-up plan, work note, coding).
- What changes: Only the new tokens compute attention; prior tokens reuse cached K/V, so the 2nd–5th prompts run much faster.
- Metric: Time-to-first-token and tokens/sec for follow-ups improve materially because the shared prefix does not get recomputed.
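To make this concrete, here is a minimal sketch assuming a local vLLM deployment; the model name, file path, and prompts are illustrative, not a prescription:

```python
# Sketch: reuse the KV cache for a shared patient-context prefix (vLLM, illustrative).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,                # reuse KV blocks for shared prefixes
    max_model_len=8192,
)
params = SamplingParams(temperature=0.2, max_tokens=900)

visit_context = open("visit_6000_tokens.txt").read()  # shared ~6k-token visit context
follow_ups = [
    "Summarize this visit and produce discharge instructions.",
    "List all medications with doses.",
    "List allergies and contraindications.",
    "Draft the follow-up plan.",
    "Draft a work note.",
]

# Because every prompt starts with the same prefix, only the new suffix tokens
# are computed after the first request; the prefix K/V blocks come from cache.
outputs = llm.generate([f"{visit_context}\n\n{q}" for q in follow_ups], params)
for out in outputs:
    print(out.outputs[0].text[:200])
```

The follow-up prompts only pay for their new suffix tokens, which is where the time-to-first-token win comes from.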
2. Retrieval-Augmented Generation (RAG)
LLMs can hallucinate facts; RAG plugs in reliable external knowledge during generation by retrieving relevant documents on the fly.
Think of it as giving the model a well-stocked library to check instead of relying on memory alone.
- Scenario: Insurance call center bot answers: “Does plan HMO-483 cover physiotherapy after ACL surgery?”
- Concrete setup: Index policy PDFs and benefit tables into a vector store plus raw text store. Retrieve top 8 chunks, pass them as citations to the LLM.
- Example flow: 1. Retrieve: “HMO-483 Benefits 2025,” “Physiotherapy limits,” “Prior authorization rules.” 2. Generate answer constrained to retrieved text, include clause IDs and page numbers.
- Metric: Hallucination rate drops because the model must quote and cite the policy, not “remember” it.
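A minimal retrieve-then-generate sketch; `embed()` and `call_llm()` are placeholders for whatever embedding model and LLM client you use, and the prompt wording is illustrative:

```python
# Sketch: retrieve top-k policy chunks, then generate an answer constrained to them.
# embed() and call_llm() are placeholders for your embedding model and LLM client.
import numpy as np

def retrieve(question: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 8) -> list[str]:
    q = embed(question)                      # -> 1D unit vector (placeholder)
    scores = chunk_vecs @ q                  # cosine similarity if rows are unit vectors
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def answer(question: str, chunks: list[str], chunk_vecs: np.ndarray) -> str:
    context = retrieve(question, chunks, chunk_vecs)
    prompt = (
        "Answer ONLY from the policy excerpts below. Quote clause IDs and page numbers.\n"
        "If the excerpts do not contain the answer, say so.\n\n"
        + "\n---\n".join(context)
        + f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)

# answer("Does plan HMO-483 cover physiotherapy after ACL surgery?", chunks, chunk_vecs)
```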
3. LLM Failover and Redundancy
When your go-to model hiccups or rate limits hit, automatic failover to backup models keeps your app alive.
Just like a power backup switches on when the main supply fails, failover strategies enhance reliability.
- Scenario: Trading support chatbot spikes at market open; primary model gets rate-limited.
- Concrete setup: Circuit breaker: if primary returns 429 or latency > 4s for 3 consecutive calls, route to backup model. Keep a third “degraded mode” for FAQs only.
- Example policy: Primary = high-quality model for complex reasoning; Backup = cheaper model for basic Q&A and drafting; Degraded = retrieval-only canned answers + escalation (see the sketch after this list).
- Metric: Availability stays within SLO (for example 99.9%) even during provider issues.
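Here is one way the circuit breaker could look in code; `call_primary`, `call_backup`, and `canned_answer` are placeholder clients, and the thresholds mirror the example policy above:

```python
# Sketch: circuit-breaker failover across primary, backup, and degraded modes.
# call_primary / call_backup / canned_answer are placeholder client functions.
import time

FAILURE_THRESHOLD = 3      # trip after 3 consecutive failures
LATENCY_LIMIT_S = 4.0      # treat slow responses as failures
consecutive_failures = 0

class RateLimited(Exception):
    pass

def ask(prompt: str) -> str:
    global consecutive_failures
    if consecutive_failures < FAILURE_THRESHOLD:
        start = time.monotonic()
        try:
            reply = call_primary(prompt)          # may raise RateLimited (HTTP 429)
            if time.monotonic() - start <= LATENCY_LIMIT_S:
                consecutive_failures = 0
                return reply
            consecutive_failures += 1             # too slow counts as a failure
        except RateLimited:
            consecutive_failures += 1
    try:
        return call_backup(prompt)                # cheaper model for basic Q&A
    except Exception:
        return canned_answer(prompt)              # degraded mode: retrieval-only FAQs + escalation
```

A production breaker would also add a cooldown (half-open state) so the primary gets probed again once it recovers.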
4. Direct Prompting
Crafting precise prompts is an art. Putting clear instructions upfront and separating context helps the model understand what you want.
Think of it as giving a clear brief instead of vague hints.
- Scenario: Bank wants consistent “transaction dispute” replies that follow compliance language.
- Concrete prompt template (system + user): System: “You are a bank dispute assistant. Follow the bank style guide. Never promise refunds. Always request required fields.” User: “Customer message: … Required fields: transaction_id, last4, date, amount.”
- Output schema: JSON with keys `missing_fields`, `next_questions`, `draft_reply`.
- Metric: Fewer prompt regressions because instructions, context, and output format are clearly separated.
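A minimal sketch of that separation, with `call_llm(system=..., user=...)` standing in for your chat client and the field names taken from the schema above:

```python
# Sketch: keep instructions, context, and output schema in clearly separated blocks.
# call_llm(system=..., user=...) is a placeholder for your chat-completion client.
import json

SYSTEM = (
    "You are a bank dispute assistant. Follow the bank style guide. "
    "Never promise refunds. Always request required fields. "
    'Respond as JSON with keys: "missing_fields", "next_questions", "draft_reply".'
)

def draft_dispute_reply(customer_message: str, provided_fields: dict) -> dict:
    user = (
        f"Customer message: {customer_message}\n"
        f"Provided fields: {json.dumps(provided_fields)}\n"
        "Required fields: transaction_id, last4, date, amount"
    )
    raw = call_llm(system=SYSTEM, user=user)
    return json.loads(raw)   # fails loudly if the model breaks the schema
```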
5. Chain-of-Thought Prompting
For complex problems, encouraging the model to reason step-by-step (like explaining your thinking) helps it arrive at better answers.
It’s reasoning in text form.
- Scenario: Benefits adjudication assistant computes patient out-of-pocket estimate.
- Concrete instruction: “Reason privately. Output only: final estimate, itemized table, assumptions list.”
- Inputs: CPT codes, allowed amounts, deductible remaining, coinsurance, out-of-pocket max.
- Output example (structure): a table with line items, allowed amount, deductible applied, coinsurance, and patient pays per line; plus a final estimate of total patient responsibility.
- Metric: Fewer arithmetic mistakes because the model follows a structured reasoning workflow while only exposing auditable artifacts.
6. Self-Consistency Sampling
Generate multiple reasoning paths and aggregate the best answer to reduce randomness. It’s like cross-checking your conclusions before acting.
- Scenario: Auto-coding diagnoses from a discharge summary (ICD-10 suggestions).
- Concrete setup: Generate 7 candidate code sets with temperature 0.7, then majority-vote by code frequency, breaking ties with a clinical coding rule set.
- Example: If 5 of 7 candidates include “I10” (hypertension) and 2 omit it, keep “I10.”
- Metric: Higher stability: the same chart yields consistent codes despite stochastic decoding.
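A sketch of the voting step; `suggest_codes()` is a placeholder that returns one sampled code set (temperature around 0.7), and `tie_break()` stands in for the clinical coding rule set:

```python
# Sketch: sample several candidate ICD-10 code sets and keep codes by majority vote.
from collections import Counter

def self_consistent_codes(chart_text: str, n_samples: int = 7) -> set[str]:
    candidates = [suggest_codes(chart_text) for _ in range(n_samples)]   # list of code sets
    counts = Counter(code for cand in candidates for code in set(cand))
    majority = n_samples // 2 + 1                       # e.g. 4 of 7
    kept = {code for code, c in counts.items() if c >= majority}
    borderline = {code for code, c in counts.items() if c == n_samples // 2}
    return kept | tie_break(borderline, chart_text)     # coding-rule tie-break (placeholder)
```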
7. LLM Arbitration
Different models excel at different tasks. Arbitration dynamically chooses which LLM to use based on task, quality, or cost.
Like choosing specialists in a team.
- Scenario: Enterprise HR copilot handles both simple policy lookups and complex “write a performance plan” drafting.
- Concrete router: if the task is classification, extraction, or short answers → small fast model; if the task is multi-paragraph drafting with nuance → large model; if the task mentions “legal risk” or “termination” → large model + mandatory citation mode (see the router sketch below).
- Example: “What is the vacation carryover limit?” uses small model + RAG. “Draft a PIP” uses large model.
- Metric: Cost drops without sacrificing quality where it matters.
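One possible shape for the router; the model names, task labels, and risk terms are illustrative:

```python
# Sketch: route requests to models by task type and risk keywords.
SMALL_MODEL = "small-fast-model"       # placeholder names
LARGE_MODEL = "large-reasoning-model"

RISK_TERMS = ("legal risk", "termination")

def route(task_type: str, text: str) -> dict:
    if any(term in text.lower() for term in RISK_TERMS):
        return {"model": LARGE_MODEL, "require_citations": True}
    if task_type in {"classification", "extraction", "short_answer"}:
        return {"model": SMALL_MODEL, "require_citations": False}
    return {"model": LARGE_MODEL, "require_citations": False}

# route("short_answer", "What is the vacation carryover limit?")
#   -> {"model": "small-fast-model", "require_citations": False}
```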
8. Hybrid Retriever Architectures
Combine semantic and keyword search to improve document retrieval quality, lifting the relevance and accuracy of RAG systems.
- Scenario: Internal legal team searches: “Most favored nation clause termination notice period.”
- Concrete setup: Retrieve with BM25 (keyword) + embeddings (semantic), union results, then rerank.
- Why it matters concretely: “MFN” might appear as “most-favored nation,” “pricing parity,” or an acronym. BM25 catches exact; embeddings catch paraphrases.
- Metric: Top-5 recall improves versus either retriever alone.
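A sketch of the union-then-rerank flow, using rank_bm25 as one possible BM25 implementation; `embed()` and `rerank()` are placeholders:

```python
# Sketch: union BM25 (keyword) and embedding (semantic) candidates, then rerank.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_retrieve(query: str, docs: list[str], doc_vecs: np.ndarray, k: int = 20) -> list[str]:
    # Keyword side: catches exact terms like "MFN" or "most-favored nation".
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    kw_scores = bm25.get_scores(query.lower().split())
    kw_top = set(np.argsort(kw_scores)[::-1][:k])

    # Semantic side: catches paraphrases like "pricing parity".
    sem_scores = doc_vecs @ embed(query)
    sem_top = set(np.argsort(sem_scores)[::-1][:k])

    candidates = [docs[i] for i in kw_top | sem_top]    # union, deduplicated by index
    return rerank(query, candidates)[:5]                # cross-encoder reranker (pattern 10)
```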
9. Query Rewriting and Refinement
Automatically tweak your queries based on partial outputs or errors to get better, more relevant responses.
It’s iterative improvement for questions.
- Scenario: Clinician asks: “What are the rules for insulin pump coverage?”
- Concrete rewrite step: the LLM converts it into a structured search query: Plan = patient’s plan ID; Topic = “DME insulin pump”; Filters = “prior auth,” “medical necessity,” “age limits,” “renewal interval.”
- Then retrieval runs on the rewritten query, not the vague original.
- Metric: Retrieval relevance improves, fewer “wrong document” answers.
10. Reranking with Auxiliary Models
Use a smaller, specialized model to reorder multiple candidate answers, picking the highest-quality response.
This boosts accuracy without heavy GPU use.
- Scenario: Procurement assistant retrieves 30 vendor policy snippets for “SOC 2 incident notification timeline.”
- Concrete setup: 1. Retriever returns top 30 chunks. 2. Lightweight cross-encoder reranker scores each chunk for direct answerability. 3. Pass only top 5 to the big LLM for synthesis.
- Metric: Better answers at lower cost because the expensive model sees less junk context.
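A sketch using a sentence-transformers cross-encoder; the checkpoint name is just a commonly used example, not a recommendation:

```python
# Sketch: score candidate chunks with a lightweight cross-encoder, keep the top 5.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

def top_chunks(query: str, chunks: list[str], keep: int = 5) -> list[str]:
    scores = reranker.predict([(query, c) for c in chunks])  # one score per (query, chunk) pair
    ranked = sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:keep]]

# Only these few chunks go to the expensive LLM for synthesis.
```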
11. Continuous Evaluation and Monitoring
A healthy LLM deployment needs automated tests to catch drift, bias changes, or latency spikes early.
Continuous guardrails keep systems reliable.
- Scenario: A patient portal chatbot must not give unsafe medical advice and must keep latency under 2.5s p95.
- Concrete controls: a nightly regression suite of 300 “golden” prompts with expected citations; drift checks on refusal rate, escalation rate, and citation coverage; production metrics for p50/p95 latency, tool error rates, and hallucination heuristics.
- Alert example: “citation coverage < 80% for plan-coverage intents” triggers rollback.
- Metric: You detect degradation in hours, not after weeks of complaints.
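One way the nightly check could be wired up; `run_bot()`, `has_valid_citation()`, and `trigger_rollback_alert()` are placeholders for your pipeline, citation checker, and alerting hook:

```python
# Sketch: nightly regression check over "golden" prompts; thresholds mirror the alert above.
import statistics
import time

def nightly_eval(golden_prompts: list[dict]) -> dict:
    latencies, covered = [], 0
    for case in golden_prompts:                    # e.g. 300 cases with expected citations
        start = time.monotonic()
        reply = run_bot(case["prompt"])
        latencies.append(time.monotonic() - start)
        covered += has_valid_citation(reply, case["expected_sources"])
    report = {
        "citation_coverage": covered / len(golden_prompts),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[18],  # 95th percentile
    }
    if report["citation_coverage"] < 0.80 or report["p95_latency_s"] > 2.5:
        trigger_rollback_alert(report)             # placeholder alerting hook
    return report
```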
12. Fine-Tuning and Instruction Tuning
Though expensive, fine-tuning on task-specific datasets turns a generalist into a specialist, improving output relevance.
Instruction tuning enhances prompt-following via training on lots of prompt-response pairs.
- Scenario: Radiology department wants “Findings/Impression” written in their house style, with strict phrasing rules.
- Concrete approach: LoRA fine-tune on 20,000 de-identified past reports + instruction tuning on “convert free text into house-style report.”
- Example training pairs: Input: messy dictation + metadata Output: standardized report with mandated headings and phrasing
- Metric: Less prompt complexity, fewer formatting errors, higher clinician acceptance.
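A sketch of the adapter setup with Hugging Face PEFT; the base model and hyperparameters are illustrative, and the training loop itself is left to your usual SFT tooling:

```python
# Sketch: LoRA adapter setup; only the adapter weights are trained.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"          # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],           # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Training pairs: {"input": messy dictation + metadata, "output": house-style report};
# train on the ~20k de-identified examples with your standard SFT loop.
```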
13. Retrieval-Augmented Fine-Tuning (RAFT)
Take RAG further by fine-tuning LLMs alongside their retrievers, enabling seamless and richer knowledge integration.
- Scenario: Pharma SOP assistant must answer only from controlled documents and cite exact section IDs.
- Concrete setup: During training, each question is paired with retrieved SOP chunks; the model learns to use retrieved context and to refuse when context is missing.
- Also tune retriever embeddings on “SOP question ↔ correct section” pairs.
- Metric: Higher groundedness than plain RAG because the model is trained to treat retrieval as first-class input.
14. Multi-Agent Workflow Orchestration
Divide complex tasks into subtasks delegated to different LLM-powered agents that talk and collaborate.
This mirrors human teamwork for agility.
- Scenario: Hospital discharge planning assistant coordinates tasks: meds, follow-up appointments, transport, home care eligibility.
- Concrete agent split: a planner agent produces the task list; a retriever agent pulls policies and patient-specific constraints; a scheduler agent calls the clinic scheduling API; a compliance agent checks contraindications and mandatory disclosures.
- Output: a consolidated discharge packet plus an audit log of which agent did what.
- Metric: Fewer dropped steps compared to one monolithic prompt.
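A bare-bones orchestration sketch; each `*_agent` function is a placeholder wrapping its own prompt and tools:

```python
# Sketch: a planner produces tasks, specialist agents execute them, every step is logged.
def discharge_workflow(patient_record: dict) -> dict:
    agents = {
        "meds": meds_agent,               # reconciles medications
        "appointments": scheduler_agent,  # calls the clinic scheduling API
        "transport": transport_agent,
        "home_care": home_care_agent,
    }
    tasks = planner_agent(patient_record)            # e.g. [{"agent": "meds", "goal": "..."}]
    packet, audit_log = {}, []
    for task in tasks:
        result = agents[task["agent"]](patient_record, task["goal"])
        issues = compliance_agent(result)            # contraindications, mandatory disclosures
        packet[task["agent"]] = result
        audit_log.append({"task": task, "issues": issues})
    return {"packet": packet, "audit_log": audit_log}
```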
15. Planning Patterns
Have LLMs break workflows into ordered or branching steps, enabling complex multi-step problem solving.
Planning helps avoid dead ends.
- Scenario: Finance “month-end close copilot” that reconciles accounts, flags anomalies, drafts journal entries.
- Concrete pattern: 1. LLM produces a numbered plan with dependencies (reconcile cash → AR → AP → accruals). 2. Executes steps with tool calls (SQL, ledger API). 3. Stops if a step fails and proposes the smallest next action.
- Metric: Reduced “wandering” outputs and fewer dead-end tool calls.
16. Memory-Augmented LLMs
Give LLMs access to working and long-term memory stores to extend context beyond token limits.
Memory helps remember history and learn over time.
- Scenario: Enterprise sales engineer copilot that supports the same account over 6 months.
- Concrete memory: working memory holds the last 20 turns; a long-term memory store holds key facts (stack, decision-makers, constraints, prior objections) written as small structured records.
- Example: After a call, it stores: “Customer forbids SaaS, requires on-prem, uses Okta, wants audit trails.”
- Metric: The copilot stops re-asking the same questions and produces more coherent follow-ups.
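One way the memory records could look; the field names and the `save_record()`/`search_records()` helpers are assumptions about your store:

```python
# Sketch: long-term memory as small structured records, retrieved before each reply.
import datetime as dt

def remember(account_id: str, fact: str, source_call_id: str) -> None:
    save_record({
        "account_id": account_id,
        "fact": fact,                                # e.g. "requires on-prem, forbids SaaS"
        "source": source_call_id,
        "written_at": dt.datetime.utcnow().isoformat(),
    })

def build_context(account_id: str, question: str, recent_turns: list[str]) -> str:
    facts = search_records(account_id, question, k=10)      # long-term memory
    return (
        "Known account facts:\n- " + "\n- ".join(f["fact"] for f in facts)
        + "\n\nRecent conversation:\n" + "\n".join(recent_turns[-20:])  # working memory
        + f"\n\nUser: {question}"
    )
```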
17. Tool Use and API Integration
Enable LLMs to call external software or APIs to fetch data, perform calculations or actions.
Like having superpowers plugged in.
- Scenario: Clinical assistant answers: “What were my last HbA1c results and trend?”
- Concrete tools: a FHIR API to fetch Observation resources for HbA1c, and a charting function to compute the trend and latest value.
- Flow: The LLM calls the tool `get_hba1c(patient_id)`, then summarizes with dates and values and warns if data is missing.
- Metric: Factual accuracy improves because numbers come from systems of record, not generation.
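A generic dispatch sketch of that flow; `call_llm()`, `parse_tool_request()`, and `get_hba1c()` are placeholders rather than any specific provider’s function-calling API:

```python
# Sketch: the model requests a tool call, the app executes it, the result is fed back.
TOOLS = {"get_hba1c": lambda args: get_hba1c(args["patient_id"])}   # FHIR-backed tool (placeholder)

def answer_with_tools(question: str, patient_id: str) -> str:
    request = call_llm(
        system='You may request tools as JSON: {"tool": name, "args": {...}}.',
        user=f"patient_id={patient_id}\n{question}",
    )
    tool_call = parse_tool_request(request)          # placeholder JSON parser / validator
    if tool_call:
        observations = TOOLS[tool_call["tool"]](tool_call["args"])   # real lab values, with dates
        return call_llm(
            system="Summarize the observations. Warn if data is missing.",
            user=f"{question}\n\nObservations: {observations}",
        )
    return request
```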
18. Self-Reflective Reasoning (Meta-Reasoning)
LLMs iteratively critique and revise their own answers, akin to proofreading.
It improves quality and reliability.
- Scenario: Vendor risk assistant drafts a SOC 2 exception response that must be accurate and non-committal.
- Concrete pattern: 1. Draft response 2. Critique pass with a checklist: “Any promises, any unverifiable claims, any missing citations, any risky language” 3. Revise response
- Metric: Lower compliance violations and fewer embarrassing overclaims.
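A sketch of the draft-critique-revise loop, with `call_llm()` as a placeholder client and the checklist taken from the pattern above:

```python
# Sketch: draft, critique against a checklist, then revise; stop early if the critique passes.
CHECKLIST = (
    "Flag any promises, unverifiable claims, missing citations, or risky language. "
    "Return a bullet list of problems, or 'OK' if none."
)

def reflective_draft(request: str, max_rounds: int = 2) -> str:
    draft = call_llm(system="Draft a SOC 2 exception response.", user=request)
    for _ in range(max_rounds):
        critique = call_llm(system=CHECKLIST, user=draft)
        if critique.strip() == "OK":
            break
        draft = call_llm(
            system="Revise the draft to resolve every problem listed. Keep it non-committal.",
            user=f"Draft:\n{draft}\n\nProblems:\n{critique}",
        )
    return draft
```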
19. Best-of-N Sampling
Generate multiple outputs, select the one scoring highest by confidence or external criteria.
Boosts reliability at the cost of compute.
- Scenario: Collections department wants the best SMS copy that maximizes payment link clicks but stays within tone rules.
- Concrete setup: Generate 12 variants, score each with a small rubric model for: clarity, compliance phrases present, reading level, length, tone. Pick top 1.
- Example constraints: “No threats, no false urgency, include opt-out language.”
- Metric: Better engagement without manual rewriting.
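A sketch of the selection step; `generate_variant()` and `rubric_score()` are placeholders for the drafting model and the small rubric model:

```python
# Sketch: generate N variants, score each with a rubric model, keep the best compliant one.
RUBRIC = ["clarity", "compliance phrases present", "reading level", "length", "tone"]

def best_of_n_sms(brief: str, n: int = 12) -> str:
    variants = [generate_variant(brief) for _ in range(n)]
    scored = sorted(
        ((rubric_score(v, RUBRIC), v) for v in variants),
        key=lambda x: x[0], reverse=True,
    )
    compliant = [(s, v) for s, v in scored if "opt out" in v.lower()]  # hard constraint example
    return (compliant or scored)[0][1]
```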
20. Monte Carlo Tree Search (MCTS) for LLMs
MCTS guides multi-step decisions by exploring options selectively, balancing exploration and exploitation.
Great for complex planning tasks.
- Scenario: IT incident commander assistant for a Kubernetes outage chooses among recovery actions.
- Concrete model: state = current symptoms and telemetry; actions = roll back the deployment, scale replicas, disable a feature flag, drain a node pool, fail over to another region; simulation = the LLM predicts likely outcomes while a scorer penalizes risk and downtime.
- MCTS explores action sequences, not just a single guess, then recommends the highest expected utility path.
- Metric: Fewer risky “first idea” actions, more reliable recovery playbooks.
21. Mixture of Experts (MoE)
Route different inputs to specialized parts of the network to improve efficiency and specialization.
Makes huge models more scalable.
- Scenario: A single assistant must handle HR, finance, and clinical ops with high accuracy and low cost.
- Concrete MoE use: route inputs to specialized experts (or adapters) based on intent classification: Expert A for HR policy tone and templates; Expert B for finance reconciliation language and controls; Expert C for clinical admin workflows.
- Gating: a small router model chooses expert(s); only those experts run.
- Metric: Better domain correctness at lower compute than “one giant always-on model.”
22. Context Window Expansion
Use techniques like chunking, sliding windows, or retrieval to handle inputs larger than the model’s maximum context length.
Overcomes hard context limits.
- Scenario: Contract review of a 180-page MSA plus 40-page DPA exceeds context limits.
- Concrete technique:
- Chunk by section headers with overlap (for example 800–1,200 tokens, 150-token overlap)
- Retrieve only relevant chunks per question (“limitation of liability,” “breach notification”)
- Optional hierarchical summaries: section summaries → master summary → targeted Q&A
- Metric: You can answer precise questions without stuffing the entire contract into one prompt.
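A minimal chunker along those lines, using word counts as a rough stand-in for token counts; splitting the contract into sections by header is assumed to happen upstream:

```python
# Sketch: split long contract sections into overlapping chunks (sizes follow the numbers above).
def chunk_sections(sections: list[tuple[str, str]], max_tokens: int = 1000, overlap: int = 150):
    chunks = []
    for header, text in sections:                    # (header, section text) pairs
        words = text.split()                         # word count as a rough token proxy
        step = max(max_tokens - overlap, 1)
        for start in range(0, len(words), step):
            piece = " ".join(words[start:start + max_tokens])
            chunks.append({"header": header, "text": piece})
    return chunks

# Per question ("limitation of liability", "breach notification"), retrieve only the
# relevant chunks, and optionally build section summaries -> master summary -> targeted Q&A.
```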
23. Session-Aware and Prefix Caching
Across users or queries with shared context, reuse cached hidden states for massive efficiency gains.
This pattern reduces redundant computation in multi-user environments.
- Scenario: 2,000 employees use the same internal policy copilot daily, all sharing a long static “Company Policy Manual 2025” plus the same system instructions.
- Concrete setup: Precompute and cache the prefix (system prompt + policy preamble + common glossary). Store prefix KV states keyed by `(model_id, policy_version_hash, system_prompt_hash)`.
- Runtime: Each user query attaches after the cached prefix, so the model starts from an already-computed internal state.
- Metric: Big throughput gains in multi-user settings because the shared context is not recomputed for every session.
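A sketch of how that cache key could be derived, so cached states are reused only when the model, policy manual, and system prompt all match:

```python
# Sketch: derive a prefix-cache key; it must change whenever the model,
# the policy manual, or the system prompt changes.
import hashlib

def prefix_cache_key(model_id: str, policy_text: str, system_prompt: str) -> str:
    policy_version_hash = hashlib.sha256(policy_text.encode()).hexdigest()[:16]
    system_prompt_hash = hashlib.sha256(system_prompt.encode()).hexdigest()[:16]
    return f"{model_id}:{policy_version_hash}:{system_prompt_hash}"

# At runtime: look up the precomputed prefix KV states under this key and append
# each user query after the cached prefix instead of recomputing the shared context.
```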
I’ve found these patterns indispensable for fixing the real problems that sink LLM systems in practice: latency, hallucinations, flaky reliability, and runaway complexity. Use them as building blocks to turn fragile demos into durable, production‑ready AI products.
Click ➡️ subscribe! If you found this article helpful, don’t forget to hit that Clap button, leave a comment, or follow for more interesting content. 🌟🧑‍💻🚀
Thank You For Reading 🤝