TL;DR – We’re excited to introduce the beta version of the Retrieval Embedding Benchmark (RTEB), a new benchmark designed to reliably evaluate the retrieval accuracy of embedding models for real-world applications. Existing benchmarks struggle to measure true generalization, while RTEB addresses this with a hybrid strategy of open and private datasets. Its goal is simple: to create a fair, transparent, and application-focused standard for measuring how models perform on data they haven’t seen before.
The performance of many AI applications, from RAG and agents to recommendation systems, is fundamentally limited by the quality of search and retrieval. As such, accurately measuring the retrieval quality of embedding models is a common pain point for developers. How do you really know how well a model will perform in the wild?
This is where things get tricky. The current standard for evaluation often relies on a model’s “zero-shot” performance on public benchmarks. However, this is, at best, an approximation of a model’s true generalization capabilities. When models are repeatedly evaluated against the same public datasets, a gap emerges between their reported scores and their actual performance on new, unseen data.
Performance Discrepancy Between Public and Closed Datasets
To address these challenges, we developed RTEB, a benchmark built to provide a reliable standard for evaluating retrieval models.
Why Existing Benchmarks Fall Short
While the underlying evaluation methodology and metrics (such as NDCG@10) are well-known and robust, the integrity of existing benchmarks is often set back by the following issues:
The Generalization Gap. The current benchmark ecosystem inadvertently encourages “teaching to the test.” When training data sources overlap with evaluation datasets, a model’s score can become inflated, undermining a benchmark’s integrity. This practice, whether intentional or not, is evident in the training datasets of several models. This creates a feedback loop where models are rewarded for memorizing test data rather than developing robust, generalizable capabilities.
Because of the above, models with a lower zero-shot score[1] may perform very well on the benchmark, without generalizing to new problems. For this reason, models with slightly lower benchmark performance and a higher zero-shot score are often recommended instead.