A deep dive into semantic deduplication for LLM context windows
If you're building with RAG (Retrieval-Augmented Generation), you've probably noticed something frustrating: your LLM keeps getting the same information from different sources. The same answer appears in your documentation, your tool outputs, your memory system, just worded slightly differently.
This isn't a minor inefficiency. In production RAG systems, 30-40% of retrieved context is semantically redundant. That's wasted tokens, higher API costs, and confused model outputs.
I built GoVectorSync to fix this. Here's the technical deep-dive on the problem and solution.
The Problem: Semantic Redundancy in Multi-Source RAG
Modern AI agents pull context from multiple sources:
┌─────────────────────────────────────────┐
│ User Query                              │
│ "How do I reset my password?"           │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│ Context Sources                         │
├─────────────────────────────────────────┤
│ RAG (Documentation)                     │
│ MCP Tools (API responses)               │
│ Memory (Past conversations)             │
│ Skills (Procedural knowledge)           │
└─────────────────────────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────────────────────────────┐
│ Retrieved Chunks                                             │
├──────────────────────────────────────────────────────────────┤
│ [RAG]    "To reset your password, click 'Forgot Password'    │
│           on the login page..."                              │
│ [RAG]    "Password reset: Navigate to login, select          │
│           'Forgot Password'..."                              │
│ [MCP]    "User password can be reset via the forgot          │
│           password flow in the auth system"                  │
│ [Memory] "Last time you asked, I explained the password      │
│           reset uses the forgot password link..."            │
│ [RAG]    "Account deletion is available in Settings..."      │
│ [MCP]    "Delete account: Settings > Account > Delete"       │
│ [RAG]    "Set up 2FA for extra security..."                  │
│ [Skills] "Contact support at support@example.com"            │
└──────────────────────────────────────────────────────────────┘
Look at those first four results. They're all saying the same thing: use the forgot password flow. But because they come from different sources with different wording, naive top-k retrieval treats them as distinct.
The Math of Waste
If you retrieve 8 chunks and 5 are duplicates:
- 62% of your context window is wasted
- You're paying for tokens that add no information
- The LLM sees repetition, which can bias its response
- You're missing diverse information that could help
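To make that arithmetic concrete, here's a quick sketch. The chunk size, token price, and traffic numbers are assumptions, purely for illustration:

package main

import "fmt"

func main() {
    const (
        retrieved      = 8
        duplicates     = 5
        tokensPerChunk = 400     // assumed average chunk size
        usdPerMTok     = 3.0     // assumed input price per 1M tokens
        queriesPerDay  = 100_000 // assumed traffic
    )
    wastedFrac := float64(duplicates) / float64(retrieved)
    wastedTokens := float64(duplicates*tokensPerChunk) * queriesPerDay
    fmt.Printf("wasted context: %.1f%%\n", wastedFrac*100)              // 62.5%
    fmt.Printf("wasted spend/day: $%.0f\n", wastedTokens/1e6*usdPerMTok) // $600
}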
Why Cosine Similarity Isn't Enough
You might think: "Just dedupe by cosine similarity threshold."
The problem is choosing that threshold:
Similarity between chunks about password reset:
"To reset your password, click 'Forgot Password'..."
vs
"Password reset: Navigate to login, select 'Forgot'..."
Cosine Similarity: 0.82
Is 0.82 a duplicate? What about 0.75? 0.68?
A fixed threshold fails because:
- Domain variance: Technical docs cluster tighter than conversational text
- Length effects: Longer chunks have different similarity distributions
- Embedding model quirks: Different models have different similarity ranges
You need something smarter.
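For concreteness, the naive approach looks roughly like this (a minimal sketch reusing the types.Chunk and math.CosineDistance definitions that appear later in this post). The hardcoded threshold is exactly the knob that fails for the reasons above:

// Naive threshold dedup (sketch): keep a chunk only if it is not too
// similar to anything already kept. One fixed threshold for all domains,
// lengths, and embedding models - which is the problem.
func naiveDedupe(chunks []types.Chunk, threshold float64) []types.Chunk {
    var kept []types.Chunk
    for _, c := range chunks {
        isDup := false
        for _, k := range kept {
            // similarity = 1 - cosine distance
            if 1.0-math.CosineDistance(c.Embedding, k.Embedding) >= threshold {
                isDup = true
                break
            }
        }
        if !isDup {
            kept = append(kept, c)
        }
    }
    return kept
}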
The Solution: Cluster → Select → Diversify
GoVectorSync uses a three-stage pipeline:
┌──────────────────────────────────────────────────────────────┐
│                     GoVectorSync Pipeline                    │
└──────────────────────────────────────────────────────────────┘
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   STAGE 1    │      │   STAGE 2    │      │   STAGE 3    │
│  Over-fetch  │─────▶│   Cluster    │─────▶│   Select +   │
│   (3-5x K)   │      │  (Semantic)  │      │     MMR      │
└──────────────┘      └──────────────┘      └──────────────┘
        │                     │                     │
        ▼                     ▼                     ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  50 chunks   │      │ 12 clusters  │      │   8 chunks   │
│from VectorDB │      │  by meaning  │      │   diverse    │
└──────────────┘      └──────────────┘      └──────────────┘
Stage 1: Over-fetch
Instead of retrieving exactly K chunks, retrieve 3-5x more:
// Request 50 chunks when you need 8
req.TopK = cfg.OverFetchK // 50
This gives us a pool to deduplicate from. The extra retrieval cost is negligible compared to the LLM inference cost you'll save.
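If you'd rather derive the pool size from K than hardcode it, a minimal sketch (the function name and the cap are illustrative; the cap just keeps the O(N²) clustering stage cheap):

// Derive the over-fetch pool size from the target K (sketch; maxPool is
// an assumed cap that bounds the downstream distance-matrix work).
func overFetchK(k, ratio, maxPool int) int {
    n := k * ratio
    if n > maxPool {
        return maxPool
    }
    return n
}

// e.g. req.TopK = overFetchK(8, 5, 100) // 40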
Stage 2: Agglomerative Clustering
We group semantically similar chunks using hierarchical clustering:
┌──────────────────────────────────────────────────────────────┐
│                   Agglomerative Clustering                   │
└──────────────────────────────────────────────────────────────┘
Initial state (each chunk is its own cluster):
[C1] [C2] [C3] [C4] [C5] [C6] [C7] [C8]
Step 1: Merge closest pair (C1, C2) - both about password reset
[C1,C2] [C3] [C4] [C5] [C6] [C7] [C8]
Step 2: Merge (C1,C2) with C3 - also password reset
[C1,C2,C3] [C4] [C5] [C6] [C7] [C8]
Step 3: Merge C4 into password cluster
[C1,C2,C3,C4] [C5] [C6] [C7] [C8]
Step 4: Merge C5,C6 - both about account deletion
[C1,C2,C3,C4] [C5,C6] [C7] [C8]
Stop when distance > threshold (0.15)
Final clusters:
┌─────────────────┐ ┌─────────────────┐ ┌─────────┐ ┌─────────┐
│ Password Reset  │ │ Account Delete  │ │   2FA   │ │ Support │
│ C1, C2, C3, C4  │ │     C5, C6      │ │   C7    │ │   C8    │
└─────────────────┘ └─────────────────┘ └─────────┘ └─────────┘
The algorithm:
func (c *Clusterer) Cluster(chunks []types.Chunk) *types.ClusterResult {
    // Initialize: each chunk is its own cluster
    n := len(chunks)
    nodes := make([]*clusterNode, n)
    for i := range chunks {
        nodes[i] = &clusterNode{
            members:  []int{i},
            centroid: chunks[i].Embedding,
            active:   true,
        }
    }
    // Compute pairwise distance matrix (O(N^2), but N is small)
    distMatrix := c.computeDistanceMatrix(chunks)
    // Agglomerative merging: repeatedly fuse the closest pair
    activeCount := n
    for activeCount > 1 {
        // Find the closest pair of active clusters under the configured linkage
        minDist, minI, minJ := c.findClosestPair(nodes, distMatrix)
        // Stop if distance exceeds threshold
        if minDist > c.cfg.Threshold {
            break
        }
        // Merge clusters: absorb J into I and recompute I's centroid
        c.mergeClusters(nodes[minI], nodes[minJ], chunks)
        nodes[minJ].active = false
        activeCount--
    }
    return buildResult(nodes, chunks)
}
Key insight: We use average linkage by default, which computes cluster distance as the mean of all pairwise distances. This is more robust than single-linkage (chaining problem) or complete-linkage (too conservative).
func (c *Clusterer) clusterDistance(a, b *clusterNode, distMatrix [][]float64) float64 {
    switch c.cfg.Linkage {
    case "single":
        // Min distance - can cause chaining
        return minPairwiseDistance(a, b, distMatrix)
    case "complete":
        // Max distance - very conservative
        return maxPairwiseDistance(a, b, distMatrix)
    default: // "average"
        // Mean distance - balanced
        return avgPairwiseDistance(a, b, distMatrix)
    }
}
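The linkage helpers are straightforward reads over the precomputed distance matrix. Here's a sketch of the average-linkage one (the single/complete variants swap the mean for a min/max):

// Average linkage (sketch): mean of all pairwise distances between the
// members of two clusters, looked up in the precomputed matrix.
func avgPairwiseDistance(a, b *clusterNode, distMatrix [][]float64) float64 {
    sum := 0.0
    for _, i := range a.members {
        for _, j := range b.members {
            sum += distMatrix[i][j]
        }
    }
    return sum / float64(len(a.members)*len(b.members))
}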
Stage 3: Representative Selection
From each cluster, we pick one representative. Multiple strategies:
┌──────────────────────────────────────────────────────────────┐
│                     Selection Strategies                     │
└──────────────────────────────────────────────────────────────┘
Strategy: "score" (default)
────────────────────────────
Pick the chunk with highest retrieval score.
Best for: Preserving relevance ranking
Cluster: [C1: 0.92, C2: 0.89, C3: 0.85, C4: 0.78]
Selected: C1 (score 0.92)

Strategy: "centroid"
────────────────────────────
Pick the chunk closest to cluster centroid.
Best for: Finding the most "typical" chunk
Cluster centroid: [0.12, -0.34, 0.56, ...]
C1 distance to centroid: 0.08
C2 distance to centroid: 0.12
C3 distance to centroid: 0.05 ← Selected
C4 distance to centroid: 0.15

Strategy: "hybrid"
────────────────────────────
Weighted combination of score + centroid proximity.
Best for: Balancing relevance and typicality
hybrid_score = 0.7 * normalized_score + 0.3 * centroid_proximity
func (s *Selector) SelectFromCluster(cluster *types.Cluster) *types.Chunk {
    switch s.cfg.Strategy {
    case SelectByCentroid:
        return s.selectByCentroid(cluster) // Closest to centroid
    case SelectByHybrid:
        return s.selectByHybrid(cluster) // Weighted combination
    default:
        return s.selectByScore(cluster) // Highest retrieval score ("score" is the default)
    }
}
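The hybrid strategy is the only one with real logic in it. Here's a sketch under the 0.7/0.3 weights shown above; the cluster fields and the normalizeScore helper are assumptions for illustration, not the actual API:

// Hybrid selection (sketch): blend normalized retrieval score with
// proximity to the cluster centroid. Field and helper names illustrative.
func (s *Selector) selectByHybrid(cluster *types.Cluster) *types.Chunk {
    var best *types.Chunk
    bestScore := -1.0
    for i := range cluster.Chunks {
        ch := &cluster.Chunks[i]
        normScore := normalizeScore(ch.Score, cluster) // score scaled to [0,1] within the cluster
        // Inside a tight cluster, cosine distance stays small, so this proximity is ~[0,1]
        proximity := 1.0 - math.CosineDistance(ch.Embedding, cluster.Centroid)
        if h := 0.7*normScore + 0.3*proximity; h > bestScore {
            bestScore, best = h, ch
        }
    }
    return best
}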
Stage 4: MMR Re-ranking (Optional)
After selecting representatives, we may still have more chunks than needed. MMR (Maximal Marginal Relevance) ensures the final set is diverse:
┌──────────────────────────────────────────────────────────────┐
│               Maximal Marginal Relevance (MMR)               │
└──────────────────────────────────────────────────────────────┘
Formula:
MMR(chunk) = λ × relevance(chunk)
           - (1-λ) × max_similarity(chunk, already_selected)
Where:
λ = 0.5 (balanced)
λ = 1.0 (pure relevance, no diversity)
λ = 0.0 (pure diversity, ignore relevance)

Example with λ = 0.5:
────────────────────────────
Candidates: [A, B, C, D, E] (already selected: none)
Round 1:
MMR(A) = 0.5 × 0.95 - 0 = 0.475 ← Selected (highest relevance)
Round 2: (A already selected)
MMR(B) = 0.5 × 0.90 - 0.5 × sim(B,A)
       = 0.45 - 0.5 × 0.85 = 0.025
MMR(C) = 0.5 × 0.85 - 0.5 × sim(C,A)
       = 0.425 - 0.5 × 0.20 = 0.325 ← Selected (diverse!)
Round 3: (A, C already selected)
MMR(B) = 0.5 × 0.90 - 0.5 × max(sim(B,A), sim(B,C))
...
The key insight: MMR penalizes chunks that are similar to already-selected chunks, naturally promoting diversity.
func (m *MMR) computeMMRScore(candidateIdx int, selected []int,
scores []float64, simMatrix [][]float64) float64 {
relevance := scores[candidateIdx]
if len(selected) == 0 {
return m.cfg.Lambda * relevance
}
// Find max similarity to any selected chunk
maxSim := 0.0
for _, selIdx := range selected {
sim := simMatrix[candidateIdx][selIdx]
if sim > maxSim {
maxSim = sim
}
}
// MMR formula
return m.cfg.Lambda*relevance - (1-m.cfg.Lambda)*maxSim
}
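computeMMRScore is driven by the standard greedy MMR loop. A sketch (the method name is illustrative): each round, score every remaining candidate against the current selection and take the best one.

// Greedy MMR selection (sketch): repeatedly pick the candidate with the
// highest MMR score until k chunks are chosen.
func (m *MMR) selectGreedy(k int, scores []float64, simMatrix [][]float64) []int {
    selected := make([]int, 0, k)
    picked := make([]bool, len(scores))
    for len(selected) < k && len(selected) < len(scores) {
        bestIdx, bestScore := -1, -1e18 // effectively -infinity
        for idx := range scores {
            if picked[idx] {
                continue
            }
            if s := m.computeMMRScore(idx, selected, scores, simMatrix); s > bestScore {
                bestIdx, bestScore = idx, s
            }
        }
        picked[bestIdx] = true
        selected = append(selected, bestIdx)
    }
    return selected
}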
The Full Pipeline
Here's how it all fits together:
┌──────────────────────────────────────────────────────────────┐
│                      GoVectorSync Broker                     │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ 1. EMBED QUERY                                               │
│    "How do I reset my password?" → [0.12, -0.34, ...]        │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ 2. OVER-FETCH FROM VECTOR DB                                 │
│    Query Pinecone/Qdrant for top 50 chunks                   │
│    Include embeddings for clustering                         │
│                                                              │
│    Latency: ~15ms                                            │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ 3. AGGLOMERATIVE CLUSTERING                                  │
│    50 chunks → 15 clusters                                   │
│    Threshold: 0.15 cosine distance                           │
│                                                              │
│    Latency: ~8ms                                             │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ 4. REPRESENTATIVE SELECTION                                  │
│    15 clusters → 15 representatives                          │
│    Strategy: highest score per cluster                       │
│                                                              │
│    Latency: ~1ms                                             │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ 5. MMR RE-RANKING                                            │
│    15 representatives → 8 diverse chunks                     │
│    Lambda: 0.5 (balanced relevance/diversity)                │
│                                                              │
│    Latency: ~3ms                                             │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ RESULT                                                       │
│ 8 chunks, each covering a distinct topic:                    │
│ • Password reset                                             │
│ • Account deletion                                           │
│ • 2FA setup                                                  │
│ • Support contact                                            │
│ • Billing                                                    │
│ • API keys                                                   │
│ • Data export                                                │
│ • Team management                                            │
│                                                              │
│ Total added latency: ~12ms                                   │
└──────────────────────────────────────────────────────────────┘
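In code, the broker composes those five steps roughly like this. A sketch only: the embedder and vector-DB client interfaces, field names, and Rerank signature are illustrative assumptions, not the actual API:

// End-to-end retrieval (sketch). Interface and field names are assumed.
func (b *Broker) Retrieve(ctx context.Context, query string, k int) ([]types.Chunk, error) {
    // 1. Embed the query
    qvec, err := b.embedder.Embed(ctx, query)
    if err != nil {
        return nil, err
    }
    // 2. Over-fetch from the vector DB, embeddings included
    pool, err := b.vectorDB.Query(ctx, qvec, b.cfg.OverFetchK)
    if err != nil {
        return nil, err
    }
    // 3. Cluster the pool by meaning
    result := b.clusterer.Cluster(pool)
    // 4. Pick one representative per cluster
    reps := make([]types.Chunk, 0, len(result.Clusters))
    for i := range result.Clusters {
        reps = append(reps, *b.selector.SelectFromCluster(&result.Clusters[i]))
    }
    // 5. MMR re-rank down to k diverse chunks
    return b.mmr.Rerank(reps, qvec, k), nil
}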
Implementation Details
Distance Matrix Computation
For N chunks, we compute an N×N distance matrix. This is O(N²), but N is small (50-100 chunks):
func (c *Clusterer) computeDistanceMatrix(chunks []types.Chunk) [][]float64 {
n := len(chunks)
matrix := make([][]float64, n)
for i := 0; i < n; i++ {
matrix[i] = make([]float64, n)
for j := i + 1; j < n; j++ {
dist := math.CosineDistance(chunks[i].Embedding, chunks[j].Embedding)
matrix[i][j] = dist
matrix[j][i] = dist // Symmetric
}
}
return matrix
}
Cosine Distance
func CosineDistance(a, b []float32) float64 {
    var dot, normA, normB float64
    for i := range a {
        // Convert before multiplying so we accumulate in float64 precision
        dot += float64(a[i]) * float64(b[i])
        normA += float64(a[i]) * float64(a[i])
        normB += float64(b[i]) * float64(b[i])
    }
    if normA == 0 || normB == 0 {
        return 1.0 // Max distance for zero vectors
    }
    similarity := dot / (math.Sqrt(normA) * math.Sqrt(normB))
    return 1.0 - similarity // Convert to distance
}
Centroid Update on Merge
When merging clusters, we recompute the centroid as the mean of all member embeddings:
func (c *Clusterer) mergeClusters(a, b *clusterNode, chunks []types.Chunk) {
a.members = append(a.members, b.members...)
// Recompute centroid
dim := len(chunks[0].Embedding)
newCentroid := make([]float32, dim)
for _, idx := range a.members {
for d := 0; d < dim; d++ {
newCentroid[d] += chunks[idx].Embedding[d]
}
}
invN := float32(1.0 / float64(len(a.members)))
for d := 0; d < dim; d++ {
newCentroid[d] *= invN
}
a.centroid = newCentroid
}
Configuration Tuning
Cluster Threshold
The threshold controls how aggressively we merge:
| Threshold | Effect | Use Case |
|---|---|---|
| 0.10 | Conservative, more clusters | High-precision domains |
| 0.15 | Balanced (default) | General purpose |
| 0.20 | Aggressive, fewer clusters | Noisy/redundant data |
| 0.25+ | Very aggressive | Heavy deduplication |
MMR Lambda
| Lambda | Effect | Use Case |
|---|---|---|
| 0.3 | Diversity-focused | Exploratory queries |
| 0.5 | Balanced (default) | General purpose |
| 0.7 | Relevance-focused | Precise queries |
| 1.0 | Pure relevance | No diversity needed |
Over-fetch Ratio
| Ratio | Chunks Retrieved | Trade-off |
|---|---|---|
| 2x | 16 for K=8 | Minimal overhead, less dedup potential |
| 3x | 24 for K=8 | Good balance |
| 5x | 40 for K=8 | Maximum dedup, higher retrieval cost |
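Pulled together, the knobs from these three tables map onto a single config. A sketch; the struct and field names are illustrative, with defaults as documented above:

// Deduplication config (sketch). Defaults mirror the tables above.
type Config struct {
    ClusterThreshold float64 // 0.15: cosine-distance cutoff for merging
    Linkage          string  // "average" (default), "single", or "complete"
    SelectStrategy   string  // "score" (default), "centroid", or "hybrid"
    MMRLambda        float64 // 0.5: relevance vs. diversity trade-off
    OverFetchRatio   int     // 3: retrieve ratio*K chunks before dedup
    TopK             int     // 8: final number of chunks returned
}

func DefaultConfig() Config {
    return Config{
        ClusterThreshold: 0.15,
        Linkage:          "average",
        SelectStrategy:   "score",
        MMRLambda:        0.5,
        OverFetchRatio:   3,
        TopK:             8,
    }
}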
Performance
Benchmarks on a typical workload (50 chunks, 1536-dim embeddings):
| Stage | Latency |
|---|---|
| Distance matrix | 2ms |
| Clustering | 6ms |
| Selection | <1ms |
| MMR | 3ms |
| Total overhead | ~12ms |
This is negligible compared to:
- Vector DB query: 15-50ms
- LLM inference: 500-2000ms
But the savings are significant:
- 35% fewer tokens per query
- 2x more diverse context
- Better LLM answers (less confusion from repetition)
Usage
As a Proxy
GoVectorSync sits between your app and vector DB:
┌─────────┐      ┌──────────────┐      ┌────────────┐
│  Your   │─────▶│ GoVectorSync │─────▶│  Pinecone  │
│  App    │◀─────│    Proxy     │◀─────│   Qdrant   │
└─────────┘      └──────────────┘      └────────────┘
Direct Integration
Request access to the Go SDK & demo.
What's Next
I'm building GoVectorSync as an open-source tool for the RAG community. Current focus:
- More vector DB integrations (Weaviate, Milvus, Chroma)
- Ingestion-time dedup (deduplicate before storing)
- Adaptive thresholds (learn optimal settings per namespace)
- Streaming support (deduplicate as chunks arrive)
Try It
Join the waitlist for early access:
waitlist.siddhantkhare.com/?project=GoVectorSync
Found this useful? I write about AI infrastructure, security, and the engineering challenges of building production AI systems. Connect with me on LinkedIn or Twitter/X.
*Built by Siddhant Khare*