Today we are releasing **pplx-embed-v1** and **pplx-embed-context-v1**, two state-of-the-art text embedding models built for real-world, web-scale retrieval. **pplx-embed-v1** is optimized for standard dense text retrieval, while **pplx-embed-context-v1** embeds passages with respect to surrounding document-level context.
Both **pplx-embed-v1** and **pplx-embed-context-v1** are available at 0.6B and 4B parameter scales. The 0.6B models target lightweight, low-latency embedding generation, while the 4B models maximize retrieval quality. In our evaluations, the **pplx-embed** family leads on a range of public benchmarks, including MTEB(Multilingual, v2), BERGEN, ToolRet, and ConTEB. Moreover, **pplx-embed-v1** delivers best-in-class results on our internal web-scale benchmarks, PPLXQuery2Query and PPLXQuery2Doc.
The models produce INT8 and binary embeddings, reducing storage requirements by 4x and 32x, respectively, compared to FP32. This compactness makes web-scale embedding storage and retrieval significantly more practical.
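To make the storage math concrete, here is a minimal sketch of one common way to produce INT8 and binary vectors from an FP32 embedding (symmetric scaling and sign binarization); the exact quantization scheme used by **pplx-embed** may differ.

```python
# Illustrative quantization sketch; the exact scheme used by pplx-embed may differ.
import numpy as np

dim = 1024
emb_fp32 = np.random.default_rng(0).standard_normal(dim).astype(np.float32)

# INT8: symmetric scaling into [-127, 127], one byte per dimension.
scale = np.abs(emb_fp32).max() / 127.0
emb_int8 = np.clip(np.round(emb_fp32 / scale), -127, 127).astype(np.int8)

# Binary: keep only the sign of each dimension, packed 8 dimensions per byte.
emb_binary = np.packbits(emb_fp32 > 0)

print(emb_fp32.nbytes, emb_int8.nbytes, emb_binary.nbytes)
# 4096 bytes (FP32), 1024 bytes (INT8, 4x smaller), 128 bytes (binary, 32x smaller)
```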
| Model | Dimensions | Context | MRL | Quantization | Instruction | Pooling |
|---|---|---|---|---|---|---|
| pplx-embed-v1-0.6B | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-v1-4B | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-context-v1-0.6B | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-context-v1-4B | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
Check out our technical report, Hugging Face models, and API docs.
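For a sense of what usage could look like, the sketch below assumes the Hugging Face checkpoints expose the standard `AutoModel` interface with mean pooling (per the table above). The repository id and pooling details here are placeholders, so defer to the model cards and API docs for the exact usage.

```python
# Hypothetical usage sketch; repository id and usage details are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "perplexity-ai/pplx-embed-v1-0.6B"  # illustrative repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

texts = ["how do dense embeddings work?", "Dense embeddings map text to vectors."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, dim)

# Mean pooling over non-padding tokens, then L2 normalization for cosine similarity.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
emb = F.normalize(emb, p=2, dim=1)
print(emb @ emb.T)                                      # pairwise cosine similarities
```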
## Motivation
Dense text embeddings map queries and documents into a shared semantic space where retrieval reduces to approximate nearest neighbor search. At Perplexity’s scale, embeddings are the first stage of our retrieval pipeline, determining which documents from billions of web pages get considered by downstream rankers and language models.
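As a rough illustration of this first stage, the sketch below scores a query against unit-normalized document embeddings with a brute-force dot product; at billions of documents this is replaced by an approximate nearest neighbor index, but the geometry is the same. The corpus here is random data, purely for shape.

```python
# Retrieval as nearest-neighbor search in a shared embedding space (toy corpus).
import numpy as np

rng = np.random.default_rng(0)
doc_embs = rng.standard_normal((10_000, 1024)).astype(np.float32)
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

query = rng.standard_normal(1024).astype(np.float32)
query /= np.linalg.norm(query)

scores = doc_embs @ query              # cosine similarity on unit vectors
top_k = np.argsort(-scores)[:10]       # candidates passed to downstream rankers
print(top_k, scores[top_k])
```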
Most leading embedding models today are built on decoder-only LLMs with causal attention masking, where each token can only attend to preceding tokens. For retrieval, this is a fundamental limitation: understanding a passage often requires bidirectional context. We address this through diffusion-based continued pretraining, which converts a causal LLM into a bidirectional encoder. Full bidirectional attention enables mean pooling over all token representations and supports late chunking for contextual embeddings, where each chunk’s embedding is informed by the full document it appears in.
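The sketch below illustrates the late-chunking idea under stated assumptions: given token representations produced by a bidirectional encoder over the full document, each chunk embedding is a mean pool over that chunk's token span, so it reflects document-wide context. The `late_chunk` helper and the chunk spans are hypothetical, not part of the released API.

```python
# Illustrative late chunking over bidirectional token representations.
import torch
import torch.nn.functional as F

def late_chunk(hidden_states: torch.Tensor, spans: list[tuple[int, int]]) -> torch.Tensor:
    """hidden_states: (seq_len, dim) token representations from a bidirectional encoder
    run over the full document. spans: (start, end) token offsets per chunk.
    Returns (num_chunks, dim) chunk embeddings, each contextualized by the whole document."""
    chunks = [hidden_states[start:end].mean(dim=0) for start, end in spans]
    return F.normalize(torch.stack(chunks), p=2, dim=1)

# Toy example: 12 tokens with dim 8, split into two chunks of 6 tokens each.
hidden = torch.randn(12, 8)
print(late_chunk(hidden, [(0, 6), (6, 12)]).shape)  # torch.Size([2, 8])
```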