OpenAI’s text-embedding-3-large model, when truncated to just 256 dimensions, outperforms their previous text-embedding-ada-002 at 1,536 dimensions on the MTEB benchmark. Read that again. A 6x smaller embedding beats the full-size version of the previous generation.
This is Matryoshka Representation Learning (MRL) in action. It’s a training technique that lets you slice embeddings to any size and still get useful representations. I recently implemented this for a semantic search system and cut our vector search latency by 80% while barely touching accuracy.
If you’re running vector search at scale, you’re probably storing full-dimension embeddings when you could be using a fraction of the space. Here’s how MRL works and how to use it.
IMPORTANT: before you use MRL anywhere, read the "MUST DO: renormalize after manual truncation" section below.
Traditional embedding models produce a single fixed-size vector. A 768-dimension embedding uses all 768 numbers to represent meaning. If you naively chop off the last 500 dimensions, you get garbage.
Matryoshka Representation Learning, introduced by Kusupati et al. at NeurIPS 2022, trains models differently. The key insight: force the model to store the most important information in the earliest dimensions.
This is standard embedding training:

1. Generate embeddings for a training batch.
2. Compute the loss on the full-size embeddings.
3. Backprop and update weights.
This is MRL training:

1. Generate embeddings for a training batch.
2. Compute the loss at multiple truncation points (768, 512, 256, 128, 64 dims).
3. Sum all the losses together.
4. Backprop and update weights.
The optimizer has to minimize loss at every truncation point simultaneously. The only way to do that is to frontload important information.
The result: you can slice an MRL embedding at any of the trained dimensions and get a working representation. The first 64 dimensions capture coarse semantics. Adding dimensions refines the representation progressively.
Training overhead? Negligible. You’re just computing extra loss terms, not running multiple forward passes.
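Here's a minimal sketch of that summed loss in PyTorch. The contrastive loss function and the dimension schedule are placeholders, not a specific library's API; Sentence Transformers ships a MatryoshkaLoss wrapper that applies the same idea to its existing losses.

```python
import torch
import torch.nn.functional as F

# Placeholder dimension schedule; real models pick their own truncation points.
MATRYOSHKA_DIMS = [768, 512, 256, 128, 64]

def matryoshka_loss(anchor_emb, positive_emb, contrastive_loss):
    """Sum the training loss over several prefix truncations of the embeddings."""
    total = 0.0
    for dim in MATRYOSHKA_DIMS:
        # Slice the first `dim` dimensions and renormalize before scoring
        a = F.normalize(anchor_emb[:, :dim], p=2, dim=1)
        p = F.normalize(positive_emb[:, :dim], p=2, dim=1)
        total = total + contrastive_loss(a, p)
    return total

# One forward pass per batch; only the loss terms are computed multiple times:
# loss = matryoshka_loss(model(anchors), model(positives), contrastive_loss)
# loss.backward()
```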
Does truncating embeddings actually work? The benchmarks say yes.
On the STSBenchmark test set, researchers compared a Matryoshka-trained model against a standard model at various truncation points:
- At 8.3% of the original size (64 of 768 dims), the MRL model preserved 98.37% of its full performance
- The standard model at the same size preserved only 96.46%
- MRL performance degrades gracefully; standard model performance falls off a cliff
The original paper from Kusupati et al. reports even more dramatic results on ImageNet:
- Up to 14x smaller embedding size at the same classification accuracy
- Up to 14x real-world speed-ups for large-scale retrieval
OpenAI’s own benchmarks show the same pattern: their 3,072-dimension `text-embedding-3-large`, when truncated to 256 dimensions, still outperforms the older 1,536-dimension `text-embedding-ada-002`. You can use 1/6th the storage and get better results.
The important caveat: these numbers only hold for models trained with MRL. You can’t just truncate any embedding model and expect this behavior.
The problem with vector search at scale: you have millions (or billions) of embeddings. Searching the full-dimension space is expensive. Memory bandwidth and compute costs scale with dimension count.
The solution: two-stage coarse-to-fine retrieval.
- Stage 1 (Fast, Broad): Search using truncated embeddings (say, 256 dims) to find 1,000 candidates quickly.
- Stage 2 (Accurate, Narrow): Rerank those 1,000 candidates using full-dimension embeddings to get the final top 10.
You’ve traded one expensive search over millions of vectors for one cheap search plus one expensive search over a thousand vectors. The math works out massively in your favor.
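Here's a minimal NumPy sketch of the pattern, assuming unit-normalized MRL embeddings held in memory. A production system would use an ANN index for stage 1; the names and sizes are illustrative.

```python
import numpy as np

def two_stage_search(query_full, docs_full, short_dims=256, n_candidates=1000, top_k=10):
    """Stage 1: scan truncated embeddings. Stage 2: rerank candidates at full size.

    `query_full` is a unit-normalized (d,) vector, `docs_full` a unit-normalized (N, d) matrix.
    """
    # Stage 1: cheap scan over truncated, renormalized vectors
    docs_short = docs_full[:, :short_dims]
    docs_short = docs_short / np.linalg.norm(docs_short, axis=1, keepdims=True)
    query_short = query_full[:short_dims]
    query_short = query_short / np.linalg.norm(query_short)
    candidates = np.argsort(docs_short @ query_short)[::-1][:n_candidates]

    # Stage 2: exact rerank of the small candidate set at full dimension
    scores = docs_full[candidates] @ query_full
    return candidates[np.argsort(scores)[::-1][:top_k]]
```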
Supabase documented this pattern with impressive results:
- Single-pass search at 1,536 dims: 89.2% accuracy, 670 queries/second
- Two-stage (512-dim first pass + 3,072-dim rerank): 99% accuracy, 580 queries/second
They lost a bit of throughput but jumped from 89% to 99% accuracy. Or flip it around: they could have traded that accuracy gain for even more speed.
You can take this further with funnel search. Milvus describes the pattern:
1. Search with first 1/32 of dimensions → 10,000 candidates
2. Rerank with first 1/16 of dimensions → 2,500 candidates
3. Rerank with first 1/8 of dimensions → 500 candidates
4. Rerank with first 1/4 of dimensions → 100 candidates
5. Rerank with full dimensions → 10 final results
Each stage prunes the candidate set before moving to more expensive comparisons. The compute saved at each stage compounds.
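The same idea as a loop, sketched with brute-force NumPy scoring. The stage schedule mirrors the Milvus example above, but the cutoffs are illustrative, not a prescription.

```python
import numpy as np

def funnel_search(query, docs, full_dim, top_k=10):
    """Progressively shrink the candidate set while growing the dimension prefix."""
    # (fraction of dimensions used, candidates kept) at each stage
    stages = [(1 / 32, 10_000), (1 / 16, 2_500), (1 / 8, 500), (1 / 4, 100), (1, top_k)]
    candidates = np.arange(len(docs))
    for frac, keep in stages:
        dims = max(1, int(full_dim * frac))
        d = docs[candidates, :dims]
        d = d / np.linalg.norm(d, axis=1, keepdims=True)
        q = query[:dims] / np.linalg.norm(query[:dims])
        order = np.argsort(d @ q)[::-1][:keep]
        candidates = candidates[order]
    return candidates
```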
Several embedding providers now support Matryoshka natively.
The `text-embedding-3-small` and `text-embedding-3-large` models accept a `dimensions` parameter:
```python
from openai import OpenAI

client = OpenAI()

# Full 3072-dim embedding
full_embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input="Your text here"
).data[0].embedding

# Truncated 256-dim embedding
short_embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input="Your text here",
    dimensions=256  # API handles truncation and normalization
).data[0].embedding

print(len(full_embedding))   # 3072
print(len(short_embedding))  # 256
```
When you use the `dimensions` parameter, OpenAI handles normalization automatically. The returned vector is ready to use.
For Sentence Transformers models like Nomic’s `nomic-embed-text-v1.5`:
```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",
    trust_remote_code=True
)

# Generate full embedding
embeddings = model.encode(
    ["search_query: What is vector search?"],
    convert_to_tensor=True
)

# Nomic uses post-normalization, so apply layer norm before truncating
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
short_embeddings = embeddings[..., :256]

# Renormalize the truncated vectors before computing similarities (see the next section)
short_embeddings = F.normalize(short_embeddings, p=2, dim=1)
```
If you truncate embeddings manually (not via an API parameter), you must renormalize. Cutting dimensions changes the vector’s magnitude, and if your pipeline assumes unit-length vectors (as most do, e.g. when using dot product as a stand-in for cosine similarity), your similarity scores will be wrong.
```python
import numpy as np

def truncate_and_normalize(embedding, target_dims):
    # Accept plain lists (e.g. from the OpenAI API) as well as arrays
    truncated = np.asarray(embedding, dtype=np.float32)[:target_dims]
    norm = np.linalg.norm(truncated)
    return truncated / norm
```
This is the most common mistake I see. OpenAI (with `dimensions`) and Sentence Transformers (`truncate_dim`) do it automatically. But if you’re slicing arrays yourself, renormalize.
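To make the failure concrete, here's a small self-contained demo with random data standing in for real embeddings. It shows how a naively truncated vector loses its unit norm and deflates dot-product scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# A full-size, unit-length embedding (3,072 dims, as for text-embedding-3-large)
full = rng.normal(size=3072)
full /= np.linalg.norm(full)

naive = full[:256]                      # truncated, NOT renormalized
fixed = naive / np.linalg.norm(naive)   # what truncate_and_normalize() does

print(np.linalg.norm(naive))  # ~0.29 -- no longer unit length
print(np.linalg.norm(fixed))  # 1.0

# Treating dot product as cosine similarity: an identical document should score 1.0
print(naive @ naive)  # ~0.08 -- badly deflated
print(fixed @ fixed)  # 1.0
```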
Matryoshka reduces dimensions. Binary quantization reduces bits per dimension. Combine them for multiplicative gains.
Vespa’s team documented this approach and found you can achieve up to 32x memory savings from binary quantization alone, plus another 8x from MRL truncation.
The tradeoff: binary quantization loses more accuracy than MRL truncation. A reasonable production setup might be:
- In-memory index: Binary-quantized MRL embeddings (fast, compact)
- Reranking: Full float32 embeddings from disk or a slower tier (accurate)
This tiered approach balances latency, memory, and accuracy.
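Here's a rough sketch of that tiered setup, combining MRL truncation with sign-based binary quantization and a Hamming-distance first pass. It's illustrative only; in practice you'd lean on your vector database's binary index rather than NumPy.

```python
import numpy as np

SHORT_DIMS = 512  # MRL truncation for the in-memory tier

def binarize(vectors):
    """Sign-quantize MRL-truncated embeddings and pack them into bits (1 bit/dim)."""
    return np.packbits(vectors[:, :SHORT_DIMS] > 0, axis=1)

def hamming_search(query_bits, doc_bits, n_candidates):
    """Smaller Hamming distance == more similar under sign quantization."""
    dists = np.unpackbits(query_bits ^ doc_bits, axis=1).sum(axis=1)
    return np.argsort(dists)[:n_candidates]

def tiered_search(query_full, doc_bits, docs_full, top_k=10, n_candidates=1000):
    # Tier 1: compact binary + truncated index held in memory
    query_bits = binarize(query_full[None, :])
    candidates = hamming_search(query_bits, doc_bits, n_candidates)
    # Tier 2: rerank candidates with full float32 embeddings (disk or slower tier)
    scores = docs_full[candidates] @ query_full
    return candidates[np.argsort(scores)[::-1][:top_k]]
```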
MRL isn’t magic. Some constraints:
- The model must be trained with MRL. You can’t truncate a standard embedding model and expect graceful degradation. Check if your model explicitly supports it.
- Small datasets may not benefit. If you have fewer than 100,000 vectors, full-dimension brute-force search typically completes in single-digit milliseconds on modern hardware. The complexity of two-stage retrieval adds engineering overhead that may not pay off.
- Different MRL models aren’t compatible. `text-embedding-3-large` truncated to 256 dims and `text-embedding-3-small` at 256 dims produce different embeddings. You can’t mix them in the same index.
- Precision-critical applications need testing. That 98% performance retention is an average. Your specific use case might be in the tail. Measure before deploying.
If you’re running vector search at any meaningful scale, here’s a concrete experiment:
1. Check if your embedding model supports MRL (OpenAI’s new models, Nomic, Voyage AI, and others do)
2. Take a representative sample of your queries
3. Run search at full dimensions and at 1/4 dimensions
4. Measure accuracy (recall@k) and latency
5. Calculate storage savings
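Here's a minimal harness for that experiment, using exact NumPy search and synthetic stand-in data. Swap in your real embeddings and your actual index; the shapes and numbers below are placeholders.

```python
import time
import numpy as np

def topk(queries, docs, k):
    """Exact top-k by cosine similarity (queries and docs are unit-normalized)."""
    return np.argsort(queries @ docs.T, axis=1)[:, ::-1][:, :k]

def evaluate(queries, docs, dims, k=10):
    """Recall@k and latency of truncated search, measured against full-dim results."""
    truth = topk(queries, docs, k)

    q = queries[:, :dims]
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = docs[:, :dims]
    d = d / np.linalg.norm(d, axis=1, keepdims=True)

    start = time.perf_counter()
    approx = topk(q, d, k)
    latency = (time.perf_counter() - start) / len(queries)

    recall = np.mean([len(set(t) & set(a)) / k for t, a in zip(truth, approx)])
    return recall, latency

# Synthetic stand-in corpus: 50k unit-normalized 1,024-dim vectors
rng = np.random.default_rng(0)
docs = rng.normal(size=(50_000, 1024)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
queries = docs[:100] + 0.1 * rng.normal(size=(100, 1024)).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

for dims in (1024, 256):
    recall, latency = evaluate(queries, docs, dims)
    print(f"{dims} dims: recall@10={recall:.3f}, {latency * 1e3:.2f} ms/query")
```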
In my experience, latency reductions in the 50-80% range with minimal accuracy loss are achievable, though your results will depend on dataset characteristics and dimension ratio. The implementation for two-stage search is straightforward: your first-stage index uses short embeddings, and you store full embeddings separately for reranking.
The days of storing 1,536 or 3,072-dimension vectors everywhere are over. Matryoshka embeddings give you the flexibility to choose exactly how many dimensions you need, per query, per tier, per use case. That’s a superpower for systems operating at scale.