A deep dive into semantic deduplication for LLM context windows
If you're building with RAG (Retrieval-Augmented Generation), you've probably noticed something frustrating: your LLM keeps getting the same information from different sources. The same answer appears in your documentation, your tool outputs, your memory system, just worded slightly differently.
This isn't a minor inefficiency. In production RAG systems, 30-40% of retrieved context is semantically redundant. That's wasted tokens, higher API costs, and confused model outputs.
I built GoVectorSync to fix this. Here's the technical deep-dive on the problem and solution.
The Problem: Semantic Redundancy in Multi-Source RAG
Modern AI agents pull context from multiple sources:
┌─────────────────────────────────────────┐
│ User Query                              │
│ "How do I reset my password?"           │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│ Context Sources                         │
├─────────────────────────────────────────┤
│ RAG (Documentation)                     │
│ MCP Tools (API responses)               │
│ Memory (Past conversations)             │
│ Skills (Procedural knowledge)           │
└─────────────────────────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────────────────────────────┐
│ Retrieved Chunks                                             │
├──────────────────────────────────────────────────────────────┤
│ [RAG]    "To reset your password, click 'Forgot Password'    │
│           on the login page..."                              │
│ [RAG]    "Password reset: Navigate to login, select          │
│           'Forgot Password'..."                              │
│ [MCP]    "User password can be reset via the forgot          │
│           password flow in the auth system"                  │
│ [Memory] "Last time you asked, I explained the password      │
│           reset uses the forgot password link..."            │
│ [RAG]    "Account deletion is available in Settings..."      │
│ [MCP]    "Delete account: Settings > Account > Delete"       │
│ [RAG]    "Set up 2FA for extra security..."                  │
│ [Skills] "Contact support at support@example.com"            │
└──────────────────────────────────────────────────────────────┘
Look at those first four results. They're all saying the same thing: use the forgot password flow. But because they come from different sources with different wording, naive top-k retrieval treats them as distinct.
The Math of Waste
If you retrieve 8 chunks and 5 are duplicates:
- 62% of your context window is wasted
- You're paying for tokens that add no information
- The LLM sees repetition, which can bias its response
- You're missing diverse information that could help
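To make that arithmetic concrete, here's a quick sketch. The chunk size, token price, and traffic numbers are assumptions, purely for illustration:

package main

import "fmt"

func main() {
    const (
        retrieved      = 8
        duplicates     = 5
        tokensPerChunk = 400     // assumed average chunk size
        usdPerMTok     = 3.0     // assumed input price per 1M tokens
        queriesPerDay  = 100_000 // assumed traffic
    )
    wastedFrac := float64(duplicates) / float64(retrieved)
    wastedTokens := float64(duplicates*tokensPerChunk) * queriesPerDay
    fmt.Printf("wasted context: %.1f%%\n", wastedFrac*100)              // 62.5%
    fmt.Printf("wasted spend/day: $%.0f\n", wastedTokens/1e6*usdPerMTok) // $600
}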
Why Cosine Similarity Isn't Enough
You might think: "Just dedupe by cosine similarity threshold."
The problem is choosing that threshold:
Similarity between chunks about password reset:
"To reset your password, click 'Forgot Password'..."
vs
"Password reset: Navigate to login, select 'Forgot'..."
Cosine Similarity: 0.82
Is 0.82 a duplicate? What about 0.75? 0.68?
A fixed threshold fails because:
- Domain variance: Technical docs cluster tighter than conversational text
- Length effects: Longer chunks have different similarity distributions
- Embedding model quirks: Different models have different similarity ranges
You need something smarter.
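For concreteness, the naive approach looks roughly like this (a minimal sketch reusing the types.Chunk and math.CosineDistance definitions that appear later in this post). The hardcoded threshold is exactly the knob that fails for the reasons above:

// Naive threshold dedup (sketch): keep a chunk only if it is not too
// similar to anything already kept. One fixed threshold for all domains,
// lengths, and embedding models - which is the problem.
func naiveDedupe(chunks []types.Chunk, threshold float64) []types.Chunk {
    var kept []types.Chunk
    for _, c := range chunks {
        isDup := false
        for _, k := range kept {
            // similarity = 1 - cosine distance
            if 1.0-math.CosineDistance(c.Embedding, k.Embedding) >= threshold {
                isDup = true
                break
            }
        }
        if !isDup {
            kept = append(kept, c)
        }
    }
    return kept
}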
The Solution: Cluster → Select → Diversify
GoVectorSync uses a three-stage pipeline:
┌──────────────────────────────────────────────────────────────┐
│                     GoVectorSync Pipeline                    │
└──────────────────────────────────────────────────────────────┘
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   STAGE 1    │      │   STAGE 2    │      │   STAGE 3    │
│  Over-fetch  │─────▶│   Cluster    │─────▶│   Select +   │
│   (3-5x K)   │      │  (Semantic)  │      │     MMR      │
└──────────────┘      └──────────────┘      └──────────────┘
        │                     │                     │
        ▼                     ▼                     ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  50 chunks   │      │ 12 clusters  │      │   8 chunks   │
│from VectorDB │      │  by meaning  │      │   diverse    │
└──────────────┘      └──────────────┘      └──────────────┘
Stage 1: Over-fetch
Instead of retrieving exactly K chunks, retrieve 3-5x more:
// Request 50 chunks when you need 8
req.TopK = cfg.OverFetchK // 50
This gives us a pool to deduplicate from. The extra retrieval cost is negligible compared to the LLM inference cost you'll save.
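If you'd rather derive the pool size from K than hardcode it, a minimal sketch (the function name and the cap are illustrative; the cap just keeps the O(N²) clustering stage cheap):

// Derive the over-fetch pool size from the target K (sketch; maxPool is
// an assumed cap that bounds the downstream distance-matrix work).
func overFetchK(k, ratio, maxPool int) int {
    n := k * ratio
    if n > maxPool {
        return maxPool
    }
    return n
}

// e.g. req.TopK = overFetchK(8, 5, 100) // 40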
Stage 2: Agglomerative Clustering
We group semantically similar chunks using hierarchical clustering:
┌──────────────────────────────────────────────────────────────┐
│                   Agglomerative Clustering                   │
└──────────────────────────────────────────────────────────────┘
Initial state (each chunk is its own cluster):
[C1] [C2] [C3] [C4] [C5] [C6] [C7] [C8]
Step 1: Merge closest pair (C1, C2) - both about password reset
[C1,C2] [C3] [C4] [C5] [C6] [C7] [C8]
Step 2: Merge (C1,C2) with C3 - also password reset
[C1,C2,C3] [C4] [C5] [C6] [C7] [C8]
Step 3: Merge C4 into password cluster
[C1,C2,C3,C4] [C5] [C6] [C7] [C8]
Step 4: Merge C5,C6 - both about account deletion
[C1,C2,C3,C4] [C5,C6] [C7] [C8]
Stop when distance > threshold (0.15)
Final clusters:
┌─────────────────┐ ┌─────────────────┐ ┌─────────┐ ┌─────────┐
│ Password Reset  │ │ Account Delete  │ │   2FA   │ │ Support │
│ C1, C2, C3, C4  │ │     C5, C6      │ │   C7    │ │   C8    │
└─────────────────┘ └─────────────────┘ └─────────┘ └─────────┘
The algorithm:
func (c *Clusterer) Cluster(chunks []types.Chunk) *types.ClusterResult {
    // Initialize: each chunk is its own cluster
    n := len(chunks)
    nodes := make([]*clusterNode, n)
    for i := range chunks {
        nodes[i] = &clusterNode{
            members:  []int{i},
            centroid: chunks[i].Embedding,
            active:   true,
        }
    }
    // Compute pairwise distance matrix (O(N^2), but N is small)
    distMatrix := c.computeDistanceMatrix(chunks)
    // Agglomerative merging: repeatedly fuse the closest pair
    activeCount := n
    for activeCount > 1 {
        // Find the closest pair of active clusters under the configured linkage
        minDist, minI, minJ := c.findClosestPair(nodes, distMatrix)
        // Stop if distance exceeds threshold
        if minDist > c.cfg.Threshold {
            break
        }
        // Merge clusters: absorb J into I and recompute I's centroid
        c.mergeClusters(nodes[minI], nodes[minJ], chunks)
        nodes[minJ].active = false
        activeCount--
    }
    return buildResult(nodes, chunks)
}
Key insight: We use average linkage by default, which computes cluster distance as the mean of all pairwise distances. This is more robust than single-linkage (chaining problem) or complete-linkage (too conservative).
func (c *Clusterer) clusterDistance(a, b *clusterNode, distMatrix [][]float64) float64 {
    switch c.cfg.Linkage {
    case "single":
        // Min distance - can cause chaining
        return minPairwiseDistance(a, b, distMatrix)
    case "complete":
        // Max distance - very conservative
        return maxPairwiseDistance(a, b, distMatrix)
    default: // "average"
        // Mean distance - balanced
        return avgPairwiseDistance(a, b, distMatrix)
    }
}
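The linkage helpers are straightforward reads over the precomputed distance matrix. Here's a sketch of the average-linkage one (the single/complete variants swap the mean for a min/max):

// Average linkage (sketch): mean of all pairwise distances between the
// members of two clusters, looked up in the precomputed matrix.
func avgPairwiseDistance(a, b *clusterNode, distMatrix [][]float64) float64 {
    sum := 0.0
    for _, i := range a.members {
        for _, j := range b.members {
            sum += distMatrix[i][j]
        }
    }
    return sum / float64(len(a.members)*len(b.members))
}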
Stage 3: Representative Selection
From each cluster, we pick one representative. Multiple strategies:
┌──────────────────────────────────────────────────────────────┐
│                     Selection Strategies                     │
└──────────────────────────────────────────────────────────────┘
Strategy: "score" (default)
────────────────────────────
Pick the chunk with highest retrieval score.
Best for: Preserving relevance ranking
Cluster: [C1: 0.92, C2: 0.89, C3: 0.85, C4: 0.78]
Selected: C1 (score 0.92)

Strategy: "centroid"
────────────────────────────
Pick the chunk closest to cluster centroid.
Best for: Finding the most "typical" chunk
Cluster centroid: [0.12, -0.34, 0.56, ...]
C1 distance to centroid: 0.08
C2 distance to centroid: 0.12
C3 distance to centroid: 0.05 ← Selected
C4 distance to centroid: 0.15

Strategy: "hybrid"
────────────────────────────
Weighted combination of score + centroid proximity.
Best for: Balancing relevance and typicality
hybrid_score = 0.7 * normalized_score + 0.3 * centroid_proximity
func (s *Selector) SelectFromCluster(cluster *types.Cluster) *types.Chunk {
    switch s.cfg.Strategy {
    case SelectByCentroid:
        return s.selectByCentroid(cluster) // Closest to centroid
    case SelectByHybrid:
        return s.selectByHybrid(cluster) // Weighted combination
    default:
        return s.selectByScore(cluster) // Highest retrieval score ("score" is the default)
    }
}
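The hybrid strategy is the only one with real logic in it. Here's a sketch under the 0.7/0.3 weights shown above; the cluster fields and the normalizeScore helper are assumptions for illustration, not the actual API:

// Hybrid selection (sketch): blend normalized retrieval score with
// proximity to the cluster centroid. Field and helper names illustrative.
func (s *Selector) selectByHybrid(cluster *types.Cluster) *types.Chunk {
    var best *types.Chunk
    bestScore := -1.0
    for i := range cluster.Chunks {
        ch := &cluster.Chunks[i]
        normScore := normalizeScore(ch.Score, cluster) // score scaled to [0,1] within the cluster
        // Inside a tight cluster, cosine distance stays small, so this proximity is ~[0,1]
        proximity := 1.0 - math.CosineDistance(ch.Embedding, cluster.Centroid)
        if h := 0.7*normScore + 0.3*proximity; h > bestScore {
            bestScore, best = h, ch
        }
    }
    return best
}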
Stage 4: MMR Re-ranking (Optional)
After selecting representatives, we may still have more chunks than needed. MMR (Maximal Marginal Relevance) ensures the final set is diverse:
┌──────────────────────────────────────────────────────────────┐
│               Maximal Marginal Relevance (MMR)               │
└──────────────────────────────────────────────────────────────┘
Formula:
MMR(chunk) = λ × relevance(chunk)
           - (1-λ) × max_similarity(chunk, already_selected)
Where:
λ = 0.5 (balanced)
λ = 1.0 (pure relevance, no diversity)
λ = 0.0 (pure diversity, ignore relevance)

Example with λ = 0.5:
────────────────────────────
Candidates: [A, B, C, D, E] (already selected: none)
Round 1:
MMR(A) = 0.5 × 0.95 - 0 = 0.475 ← Selected (highest relevance)
Round 2: (A already selected)
MMR(B) = 0.5 × 0.90 - 0.5 × sim(B,A)
       = 0.45 - 0.5 × 0.85 = 0.025
MMR(C) = 0.5 × 0.85 - 0.5 × sim(C,A)
       = 0.425 - 0.5 × 0.20 = 0.325 ← Selected (diverse!)
Round 3: (A, C already selected)
MMR(B) = 0.5 × 0.90 - 0.5 × max(sim(B,A), sim(B,C))
...
The key insight: MMR penalizes chunks that are similar to already-selected chunks, naturally promoting diversity.
func (m *MMR) computeMMRScore(candidateIdx int, selected []int,
scores []float64, simMatrix [][]float64) float64 {
relevance := scores[candidateIdx]
if len(selected) == 0 {
return m.cfg.Lambda * relevance
}
// Find max similarity to any selected chunk
maxSim := 0.0
for _, selIdx := range selected {
sim := simMatrix[candidateIdx][selIdx]
if sim > maxSim {
maxSim = sim
}
}
// MMR formula
return m.cfg.Lambda*relevance - (1-m.cfg.Lambda)*maxSim
}
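computeMMRScore is driven by the standard greedy MMR loop. A sketch (the method name is illustrative): each round, score every remaining candidate against the current selection and take the best one.

// Greedy MMR selection (sketch): repeatedly pick the candidate with the
// highest MMR score until k chunks are chosen.
func (m *MMR) selectGreedy(k int, scores []float64, simMatrix [][]float64) []int {
    selected := make([]int, 0, k)
    picked := make([]bool, len(scores))
    for len(selected) < k && len(selected) < len(scores) {
        bestIdx, bestScore := -1, -1e18 // effectively -infinity
        for idx := range scores {
            if picked[idx] {
                continue
            }
            if s := m.computeMMRScore(idx, selected, scores, simMatrix); s > bestScore {
                bestIdx, bestScore = idx, s
            }
        }
        picked[bestIdx] = true
        selected = append(selected, bestIdx)
    }
    return selected
}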
The Full Pipeline
Here's how it all fits together:
┌──────────────────────────────────────────────────────────────┐
│                      GoVectorSync Broker                     │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ 1. EMBED QUERY                                               │
│    "How do I reset my password?" → [0.12, -0.34, ...]        │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ 2. OVER-FETCH FROM VECTOR DB                                 │
│    Query Pinecone/Qdrant for top 50 chunks                   │
│    Include embeddings for clustering                         │
│                                                              │
│    Latency: ~15ms                                            │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ 3. AGGLOMERATIVE CLUSTERING                                  │
│    50 chunks → 15 clusters                                   │
│    Threshold: 0.15 cosine distance                           │
│                                                              │
│    Latency: ~8ms                                             │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ 4. REPRESENTATIVE SELECTION                                  │
│    15 clusters → 15 representatives                          │
│    Strategy: highest score per cluster                       │
│                                                              │
│    Latency: ~1ms                                             │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ 5. MMR RE-RANKING                                            │
│    15 representatives → 8 diverse chunks                     │
│    Lambda: 0.5 (balanced relevance/diversity)                │
│                                                              │
│    Latency: ~3ms                                             │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ RESULT                                                       │
│ 8 chunks, each covering a distinct topic:                    │
│ • Password reset                                             │
│ • Account deletion                                           │
│ • 2FA setup                                                  │
│ • Support contact                                            │
│ • Billing                                                    │
│ • API keys                                                   │
│ • Data export                                                │
│ • Team management                                            │
│                                                              │
│ Total added latency: ~12ms                                   │
└──────────────────────────────────────────────────────────────┘
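In code, the broker composes those five steps roughly like this. A sketch only: the embedder and vector-DB client interfaces, field names, and Rerank signature are illustrative assumptions, not the actual API:

// End-to-end retrieval (sketch). Interface and field names are assumed.
func (b *Broker) Retrieve(ctx context.Context, query string, k int) ([]types.Chunk, error) {
    // 1. Embed the query
    qvec, err := b.embedder.Embed(ctx, query)
    if err != nil {
        return nil, err
    }
    // 2. Over-fetch from the vector DB, embeddings included
    pool, err := b.vectorDB.Query(ctx, qvec, b.cfg.OverFetchK)
    if err != nil {
        return nil, err
    }
    // 3. Cluster the pool by meaning
    result := b.clusterer.Cluster(pool)
    // 4. Pick one representative per cluster
    reps := make([]types.Chunk, 0, len(result.Clusters))
    for i := range result.Clusters {
        reps = append(reps, *b.selector.SelectFromCluster(&result.Clusters[i]))
    }
    // 5. MMR re-rank down to k diverse chunks
    return b.mmr.Rerank(reps, qvec, k), nil
}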
Implementation Details
Distance Matrix Computation
For N chunks, we compute an N×N distance matrix. This is O(N²), but N is small (50-100 chunks):
func (c *Clusterer) computeDistanceMatrix(chunks []types.Chunk) [][]float64 {
n := len(chunks)
matrix := make([][]float64, n)
for i := 0; i < n; i++ {
matrix[i] = make([]float64, n)
for j := i + 1; j < n; j++ {
dist := math.CosineDistance(chunks[i].Embedding, chunks[j].Embedding)
matrix[i][j] = dist
matrix[j][i] = dist // Symmetric
}
}
return matrix
}
Cosine Distance
func CosineDistance(a, b []float32) float64 {
    var dot, normA, normB float64
    for i := range a {
        // Convert before multiplying so we accumulate in float64 precision
        dot += float64(a[i]) * float64(b[i])
        normA += float64(a[i]) * float64(a[i])
        normB += float64(b[i]) * float64(b[i])
    }
    if normA == 0 || normB == 0 {
        return 1.0 // Max distance for zero vectors
    }
    similarity := dot / (math.Sqrt(normA) * math.Sqrt(normB))
    return 1.0 - similarity // Convert to distance
}
Centroid Update on Merge
When merging clusters, we recompute the centroid as the mean of all member embeddings:
func (c *Clusterer) mergeClusters(a, b *clusterNode, chunks []types.Chunk) {
a.members = append(a.members, b.members...)
// Recompute centroid
dim := len(chunks[0].Embedding)
newCentroid := make([]float32, dim)
for _, idx := range a.members {
for d := 0; d < dim; d++ {
newCentroid[d] += chunks[idx].Embedding[d]
}
}
invN := float32(1.0 / float64(len(a.members)))
for d := 0; d < dim; d++ {
newCentroid[d] *= invN
}
a.centroid = newCentroid
}
Configuration Tuning
Cluster Threshold
The threshold controls how aggressively we merge:
| Threshold | Effect | Use Case |
|---|---|---|
| 0.10 | Conservative, more clusters | High-precision domains |
| 0.15 | Balanced (default) | General purpose |
| 0.20 | Aggressive, fewer clusters | Noisy/redundant data |
| 0.25+ | Very aggressive | Heavy deduplication |
MMR Lambda
| Lambda | Effect | Use Case |
|---|---|---|
| 0.3 | Diversity-focused | Exploratory queries |
| 0.5 | Balanced (default) | General purpose |
| 0.7 | Relevance-focused | Precise queries |
| 1.0 | Pure relevance | No diversity needed |
Over-fetch Ratio
| Ratio | Chunks Retrieved | Trade-off |
|---|---|---|
| 2x | 16 for K=8 | Minimal overhead, less dedup potential |
| 3x | 24 for K=8 | Good balance |
| 5x | 40 for K=8 | Maximum dedup, higher retrieval cost |
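Pulled together, the knobs from these three tables map onto a single config. A sketch; the struct and field names are illustrative, with defaults as documented above:

// Deduplication config (sketch). Defaults mirror the tables above.
type Config struct {
    ClusterThreshold float64 // 0.15: cosine-distance cutoff for merging
    Linkage          string  // "average" (default), "single", or "complete"
    SelectStrategy   string  // "score" (default), "centroid", or "hybrid"
    MMRLambda        float64 // 0.5: relevance vs. diversity trade-off
    OverFetchRatio   int     // 3: retrieve ratio*K chunks before dedup
    TopK             int     // 8: final number of chunks returned
}

func DefaultConfig() Config {
    return Config{
        ClusterThreshold: 0.15,
        Linkage:          "average",
        SelectStrategy:   "score",
        MMRLambda:        0.5,
        OverFetchRatio:   3,
        TopK:             8,
    }
}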
Performance
Benchmarks on a typical workload (50 chunks, 1536-dim embeddings):
| Stage | Latency |
|---|---|
| Distance matrix | 2ms |
| Clustering | 6ms |
| Selection | <1ms |
| MMR | 3ms |
| Total overhead | ~12ms |
This is negligible compared to:
- Vector DB query: 15-50ms
- LLM inference: 500-2000ms
But the savings are significant:
- 35% fewer tokens per query
- 2x more diverse context
- Better LLM answers (less confusion from repetition)
Usage
As a Proxy
GoVectorSync sits between your app and vector DB:
┌─────────┐      ┌──────────────┐      ┌────────────┐
│  Your   │─────▶│ GoVectorSync │─────▶│  Pinecone  │
│  App    │◀─────│    Proxy     │◀─────│   Qdrant   │
└─────────┘      └──────────────┘      └────────────┘
Direct Integration
Request access to the Go SDK & demo.
What's Next
I'm building GoVectorSync as an open-source tool for the RAG community. Current focus:
- More vector DB integrations (Weaviate, Milvus, Chroma)
- Ingestion-time dedup (deduplicate before storing)
- Adaptive thresholds (learn optimal settings per namespace)
- Streaming support (deduplicate as chunks arrive)
Try It
Join the waitlist for early access:
waitlist.siddhantkhare.com/?project=GoVectorSync
Found this useful? I write about AI infrastructure, security, and the engineering challenges of building production AI systems. Connect with me on LinkedIn or Twitter/X.
*Built by Siddhant Khare*