Token-Count-Based Batching: Faster, Cheaper Embedding Inference for Queries
mongodb.com
Performance Engineering

Motivation

Embedding model inference is often inefficient when serving large volumes of short requests, a common pattern in search, retrieval, and recommendation systems. At Voyage AI by MongoDB, we call these short requests queries and all other requests documents. Queries must usually be served with very low latency (typically 100–300 ms).

Queries are typically short, and their token-length distribution is highly skewed. As a result, query inference tends to be memory-bound rather than compute-bound. Query traffic is also spiky, and autoscaling reacts too slowly to absorb the bursts. In sum, serving many short requests sequentially is highly inefficient.
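To make the effect of a skewed length distribution concrete, here is a minimal sketch, not Voyage AI's implementation. It compares the tokens processed when a batch is padded to its longest sequence with the real token count, then groups requests greedily under a token budget. The function names, the sample lengths, and the 128-token budget are all hypothetical and chosen only for illustration.

```python
from typing import List


def padded_tokens(lengths: List[int]) -> int:
    """Tokens processed if every sequence is padded to the longest one."""
    return max(lengths) * len(lengths) if lengths else 0


def batch_by_token_count(lengths: List[int], max_tokens: int) -> List[List[int]]:
    """Greedily group request indices so each batch's total real token
    count stays within max_tokens (hypothetical budget parameter)."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, n in enumerate(lengths):
        # Close the current batch if adding this request would exceed the budget.
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        # A single request larger than the budget still gets its own batch.
        current.append(idx)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


# Hypothetical, skewed query lengths: mostly short, with one long outlier.
query_lengths = [4, 6, 5, 120, 7, 3, 9, 5]

real = sum(query_lengths)              # tokens that carry actual content
padded = padded_tokens(query_lengths)  # tokens processed with naive padding
batches = batch_by_token_count(query_lengths, max_tokens=128)

print(f"real tokens: {real}, padded tokens: {padded}")
print(f"batches (request indices): {batches}")
```

In this made-up sample, one long outlier forces every short query to be padded to its length, so most of the processed tokens are padding. Sizing batches by total token count instead of by request count keeps the work per batch predictable regardless of how the lengths are distributed.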

In this blog post, we explore how batching can be used to serve queries more efficiently. We first discuss padding removal in…
