Inside Mixedbread: How We Built Multimodal Late-Interaction at Billion Scale

Most semantic search issues don’t show up as obvious failures. They show up as results that look reasonable, read well, and are still wrong. In our experience, this is a structural limitation of single-vector retrieval on dense and unfamiliar inputs: the representation collapses detail, and the retriever confidently returns "close enough" content that doesn’t actually answer the query.

Building a reliable retriever is also harder than it looks. You’re stitching together parsing, chunking, embedding, metadata extraction, and ANN search, and each stage introduces its own brittleness. When quality drops, it’s rarely clear whether the problem is upstream ingestion, representation, indexing, or scoring.

We built a multimodal late-interaction retrieval system to make those failure modes rarer and easier to reason about. The system uses multi-vector representations across text, images, audio, and video, and it’s deployed at billion-document scale: 1B+ documents indexed, 500+ QPS per store, and ~50ms search latency end-to-end. The rest of this post walks through the three pieces we had to build and jointly tune to get there: ingestion, encoding, and retrieval.
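The post doesn't spell out the scoring function, but multi-vector late interaction is typically scored ColBERT-style with MaxSim: every query token vector is matched against its best-matching document token vector, and those maxima are summed. A minimal pure-Python sketch (names and shapes are illustrative, not our production code):

```python
from math import sqrt

def maxsim(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score: for each query token vector,
    take its maximum cosine similarity over all document token
    vectors, then sum across query tokens."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sqrt(sum(x * x for x in a))
        nb = sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return sum(max(cos(q, d) for d in doc_vecs) for q in query_vecs)
```

Because each query token is matched independently, a document that nails one specific detail can outscore a document that is merely "close enough" overall, which is exactly the single-vector failure mode described above.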

Multimodal Ingestion

For the vast majority of inputs, our goal is a true end-to-end representation pipeline: information is retrieved from exactly where it lives in the embedding space, surrounded by the context that gives it meaning.

This means audio files are first pre-processed to maximize quality before being passed to the model, which dynamically splits them into meaningful units on its own. Textual inputs go through a series of pre-processing steps that break the data into manageable blocks ("chunks") while retaining all necessary context. Code is treated as its own input type: we parse the AST to determine logical cutoff points. Images are processed natively by the model as pixels.

As for document formats like PDFs and PowerPoint, every page is individually exported as a screenshot, so visual and layout information such as tables and graphs is preserved and represented as an individual semantic unit, with context such as headings carried across pages.
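The AST-based cutoff idea for code can be sketched as follows, using Python's standard `ast` module: split at top-level function and class boundaries so each logical unit becomes one chunk. This is an illustrative assumption about the approach, not our actual ingestion code:

```python
import ast

def chunk_code(source: str):
    """Split Python source into chunks at top-level function/class
    boundaries, keeping any surrounding module-level code as its
    own chunk."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    prev_end = 0  # last line (1-indexed) consumed so far
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Module-level code between definitions becomes a chunk.
            if prev_end < node.lineno - 1:
                chunks.append("\n".join(lines[prev_end:node.lineno - 1]))
            # The definition itself (end_lineno requires Python 3.8+).
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
            prev_end = node.end_lineno
    if prev_end < len(lines):
        chunks.append("\n".join(lines[prev_end:]))
    return [c for c in chunks if c.strip()]
```

The point of cutting at AST boundaries rather than fixed character counts is that a chunk never ends mid-function, so the embedding always sees a complete logical unit.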

Unlike most other systems, where a model's training data is largely decoupled from expected real-world inputs, our model is trained specifically on the output of these pre-processing steps. It is therefore optimized to retrieve documents exactly as they will appear in production, which yields more accurate search results.
