When I first added Redis caching to my RAG API, the motivation was simple: latency was creeping up, costs were rising and many questions looked repetitive. Caching felt like the obvious win. But once I went beyond the happy path, I realized caching in RAG isn’t about Redis at all. It’s about what you choose to cache and how safely you decide two queries are “the same”.
This post walks through:
- why Redis caching works for RAG
- what a normalized query really means
- why semantic caching is tempting but dangerous
- and how a proper normalization layer keeps correctness intact
Why Redis Caching Makes Sense in RAG
RAG pipelines are expensive because they repeatedly do the same things:
- embedding generation
- vector retrieval
- context assembly
- LLM inference
For many user questions, especially in internal tools, the answer doesn’t change between requests.
Redis gives you:
- sub-millisecond reads
- TTL-based eviction
- simple operational model
- predictable cost
So the first version of my cache looked like this:
cache_key = hash(user_query)
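A minimal sketch of that first version, assuming redis-py and a hypothetical answer_with_rag() that runs the full pipeline:

import hashlib
import redis

r = redis.Redis()

def cached_answer(user_query: str) -> str:
    # Naive key: a hash of the raw query text.
    key = "rag:" + hashlib.sha256(user_query.encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return cached.decode()
    answer = answer_with_rag(user_query)  # hypothetical: embed, retrieve, generate
    r.set(key, answer, ex=3600)  # TTL-based eviction
    return answer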
You probably already know why this doesn’t work.
Text Equality Is Not Intent Equality
These queries are clearly the same:
- "Explain docker networking"
- "Can you explain Docker networking?"
- "docker networking explained"
But Redis treats them as different keys. That’s when the idea of a normalized query enters the picture.
What Is a Normalized Query (Really)?
A normalized query is about stripping away presentation noise while preserving intent.
The goal:
- improve cache hit rate
- without returning wrong answers
Safe normalizations:
- lowercasing
- trimming whitespace
- removing punctuation
- collapsing filler phrases
Dangerous normalizations:
- removing numbers
- collapsing versions
- replacing domain terms
- synonym substitution
- semantic guessing
In RAG, wrong cache hits are worse than cache misses.
An Example Normalization Function
import re

# Conservative filler phrases; anything more aggressive risks changing intent.
FILLER_PHRASES = ["can you", "please", "tell me", "explain"]

def normalize_query(query: str) -> str:
    q = query.lower().strip()
    # Strip fillers only as whole words, so "explained" isn't mangled into "ed".
    for phrase in FILLER_PHRASES:
        q = re.sub(rf"\b{re.escape(phrase)}\b", "", q)
    q = re.sub(r"[^\w\s]", "", q)  # drop punctuation
    q = re.sub(r"\s+", " ", q)     # collapse whitespace
    return q.strip()
This intentionally avoids:
- NLP stopword lists
- embeddings
- synonym expansion
Boring. Predictable. Correct.
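Traced by hand against the earlier examples, the first two variants now collapse to one key; the third ("docker networking explained") stays distinct, a deliberate miss rather than a risky hit:

normalize_query("Explain docker networking")           # -> "docker networking"
normalize_query("Can you explain Docker networking?")  # -> "docker networking"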
A Better Cache Key
Text alone is still not enough. A correct cache key must capture how the answer was produced, not just the question.
cache_key = hash(
    model_name +
    normalized_query +
    retrieval_config
)
This prevents:
- reusing answers across models
- mixing retrieval strategies
- silent correctness bugs
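A sketch of that key in code; here retrieval_config is assumed to be a pre-serialized string (top-k, index name, chunking settings, and so on):

import hashlib

def cache_key(model_name: str, normalized_query: str, retrieval_config: str) -> str:
    # e.g. retrieval_config = "top_k=5|index=docs-v2|chunk=512"
    raw = "|".join([model_name, normalized_query, retrieval_config])
    return "rag:answer:" + hashlib.sha256(raw.encode()).hexdigest()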
Where Semantic Caching Tempted Me (& Why It’s Risky)
At some point, I considered: "What if I reuse answers for similar questions?" This is semantic caching. Example:
"How does Redis caching work in RAG?"
"Explain caching strategy for RAG systems"
They feel similar. But semantic similarity is probabilistic, not deterministic.
The risks:
- incorrect reuse
- subtle hallucinations
- hard-to-debug failures
- broken trust
For production RAG, that’s dangerous.
Where Semantic Caching Can Work (Carefully)
Semantic caching is acceptable when:
- questions are FAQs
- answers are generic
- correctness tolerance is high
- fallback to exact cache exists
The safe pattern is two-tier caching:
- Exact cache (normalized query)
- Semantic cache (optional, guarded)
- Retrieval fallback
Never semantic-cache authoritative answers.
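A rough sketch of that flow, reusing the Redis client and normalize_query from above; semantic_lookup() and run_rag_pipeline() are hypothetical stand-ins, and the similarity threshold is deliberately strict:

def answer(query: str) -> str:
    normalized = normalize_query(query)
    # Tier 1: exact cache on the normalized query.
    exact = r.get("rag:exact:" + normalized)
    if exact is not None:
        return exact.decode()
    # Tier 2: optional, guarded semantic cache (never for authoritative answers).
    similar = semantic_lookup(query, min_similarity=0.97)  # hypothetical
    if similar is not None:
        return similar
    # Fallback: full retrieval + generation, then populate the exact cache.
    result = run_rag_pipeline(query)  # hypothetical
    r.set("rag:exact:" + normalized, result, ex=3600)
    return result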
The Normalization Layer (The Missing Piece)
The biggest realization for me was this: Normalization is not a function; it’s a layer.
Especially when RAG involves:
- SQL / Athena
- APIs
- logs
- metrics
In those cases, the “query” isn’t text anymore. It’s intent + constraints. Instead of caching raw SQL, normalize the logical query shape:
{
    "source": "athena",
    "table": "deployments",
    "metrics": ["count"],
    "filters": {
        "status": "FAILED",
        "time_range": "LAST_7_DAYS"
    }
}
Then hash a canonical form.
This makes caching:
- deterministic
- debuggable
- correct
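A minimal sketch of “hash a canonical form”, assuming the query shape is a plain dict like the one above:

import hashlib
import json

def structured_cache_key(query_shape: dict) -> str:
    # sort_keys + fixed separators give one canonical byte string per logical query.
    canonical = json.dumps(query_shape, sort_keys=True, separators=(",", ":"))
    return "rag:athena:" + hashlib.sha256(canonical.encode()).hexdigest()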
What Actually Worked in Practice
My final setup looked like this:
- Redis for fast cache
- conservative text normalization
- intent-level normalization for structured queries
- no semantic caching for critical paths
- TTL aligned with data freshness
Results:
- ~40% cost reduction
- lower latency
- zero correctness regressions
- predictable behavior
Most importantly, I trusted my system again.
Takeaways
- Redis caching is easy — correct caching is not
- Normalize form, not meaning
- Over-normalization silently breaks RAG
- Semantic caching should be optional, not default
- Structured queries need intent-level normalization
- Determinism beats cleverness
Final Thoughts
Caching in RAG isn’t about saving tokens. It’s about engineering discipline.
If we get normalization right, Redis becomes a superpower. If we don’t, caching becomes a liability.
Thanks for reading. Mahak
p.s. This is a deceptively hard problem, and there’s no one-size-fits-all solution. Different RAG setups demand different normalization strategies depending on how context is retrieved, structured & validated. In my own project, this exact approach didn’t work out of the box; the real implementation was far more constrained & nuanced. What I’ve shared here is the idea and way of thinking that helped me reason about the problem, not a drop-in solution. Production-grade systems inevitably require careful, system-specific trade-offs.