Today we are releasing **pplx-embed-v1** and **pplx-embed-context-v1**, two state-of-the-art text embedding models built for real-world, web-scale retrieval. **pplx-embed-v1** is optimized for standard dense text retrieval, while **pplx-embed-context-v1** embeds passages with respect to surrounding document-level context.
Both **pplx-embed-v1** and **pplx-embed-context-v1** are available at 0.6B and 4B parameter scales. The 0.6B models target lightweight, low-latency embedding generation, while the 4B models maximize retrieval quality. In our evaluations, the **pplx-embed** family leads on a range of public benchmarks, including MTEB(Multilingual, v2), BERGEN, ToolRet, and ConTEB. Moreover, **pplx-embed-v1** delivers best-in-class results on our internal web-scale benchmarks, PPLXQuery2Query and PPLXQuery2Doc.
The models produce INT8 and binary embeddings, reducing storage requirements by 4x and 32x, respectively, compared to FP32. This compactness makes web-scale embedding storage and retrieval significantly more practical.
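To make the storage math concrete, here is a minimal sketch of one common way to produce INT8 and binary vectors from an FP32 embedding (symmetric scaling and sign binarization); the exact quantization scheme used by **pplx-embed** may differ.

```python
# Illustrative quantization sketch; the exact scheme used by pplx-embed may differ.
import numpy as np

dim = 1024
emb_fp32 = np.random.default_rng(0).standard_normal(dim).astype(np.float32)

# INT8: symmetric scaling into [-127, 127], one byte per dimension.
scale = np.abs(emb_fp32).max() / 127.0
emb_int8 = np.clip(np.round(emb_fp32 / scale), -127, 127).astype(np.int8)

# Binary: keep only the sign of each dimension, packed 8 dimensions per byte.
emb_binary = np.packbits(emb_fp32 > 0)

print(emb_fp32.nbytes, emb_int8.nbytes, emb_binary.nbytes)
# 4096 bytes (FP32), 1024 bytes (INT8, 4x smaller), 128 bytes (binary, 32x smaller)
```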
| Model | Dimensions | Context | MRL | Quantization | Instruction | Pooling |
|---|---|---|---|---|---|---|
| pplx-embed-v1-0.6B | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-v1-4B | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-context-v1-0.6B | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-context-v1-4B | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
Check out our technical report, Hugging Face models, and API docs.
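For a sense of what usage could look like, the sketch below assumes the Hugging Face checkpoints expose the standard `AutoModel` interface with mean pooling (per the table above). The repository id and pooling details here are placeholders, so defer to the model cards and API docs for the exact usage.

```python
# Hypothetical usage sketch; repository id and usage details are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "perplexity-ai/pplx-embed-v1-0.6B"  # illustrative repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

texts = ["how do dense embeddings work?", "Dense embeddings map text to vectors."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, dim)

# Mean pooling over non-padding tokens, then L2 normalization for cosine similarity.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
emb = F.normalize(emb, p=2, dim=1)
print(emb @ emb.T)                                      # pairwise cosine similarities
```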
## Motivation
Dense text embeddings map queries and documents into a shared semantic space where retrieval reduces to approximate nearest neighbor search. At Perplexity’s scale, embeddings are the first stage of our retrieval pipeline, determining which documents from billions of web pages get considered by downstream rankers and language models.
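As a rough illustration of this first stage, the sketch below scores a query against unit-normalized document embeddings with a brute-force dot product; at billions of documents this is replaced by an approximate nearest neighbor index, but the geometry is the same. The corpus here is random data, purely for shape.

```python
# Retrieval as nearest-neighbor search in a shared embedding space (toy corpus).
import numpy as np

rng = np.random.default_rng(0)
doc_embs = rng.standard_normal((10_000, 1024)).astype(np.float32)
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

query = rng.standard_normal(1024).astype(np.float32)
query /= np.linalg.norm(query)

scores = doc_embs @ query              # cosine similarity on unit vectors
top_k = np.argsort(-scores)[:10]       # candidates passed to downstream rankers
print(top_k, scores[top_k])
```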
Most leading embedding models today are built on decoder-only LLMs with causal attention masking, where each token can only attend to preceding tokens. For retrieval, this is a fundamental limitation: understanding a passage often requires bidirectional context. We address this through diffusion-based continued pretraining, which converts a causal LLM into a bidirectional encoder. Full bidirectional attention enables mean pooling over all token representations and supports late chunking for contextual embeddings, where each chunk’s embedding is informed by the full document it appears in.
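The sketch below illustrates the late-chunking idea under stated assumptions: given token representations produced by a bidirectional encoder over the full document, each chunk embedding is a mean pool over that chunk's token span, so it reflects document-wide context. The `late_chunk` helper and the chunk spans are hypothetical, not part of the released API.

```python
# Illustrative late chunking over bidirectional token representations.
import torch
import torch.nn.functional as F

def late_chunk(hidden_states: torch.Tensor, spans: list[tuple[int, int]]) -> torch.Tensor:
    """hidden_states: (seq_len, dim) token representations from a bidirectional encoder
    run over the full document. spans: (start, end) token offsets per chunk.
    Returns (num_chunks, dim) chunk embeddings, each contextualized by the whole document."""
    chunks = [hidden_states[start:end].mean(dim=0) for start, end in spans]
    return F.normalize(torch.stack(chunks), p=2, dim=1)

# Toy example: 12 tokens with dim 8, split into two chunks of 6 tokens each.
hidden = torch.randn(12, 8)
print(late_chunk(hidden, [(0, 6), (6, 12)]).shape)  # torch.Size([2, 8])
```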