pplx-embed: State-of-the-Art Embedding Models for Web-Scale Retrieval

Today we are releasing **pplx-embed-v1** and **pplx-embed-context-v1**, two state-of-the-art text embedding models built for real-world, web-scale retrieval. **pplx-embed-v1** is optimized for standard dense text retrieval, while **pplx-embed-context-v1** embeds passages with respect to surrounding document-level context.

Both **pplx-embed-v1** and **pplx-embed-context-v1** are available at 0.6B and 4B parameter scales. The 0.6B models target lightweight, low-latency embedding generation, while the 4B models maximize retrieval quality. In our evaluations, the **pplx-embed** family leads a range of public benchmarks including MTEB (Multilingual, v2), BERGEN, ToolRet, and ConTEB. Moreover, **pplx-embed-v1** delivers best-in-class results on our internal web-scale benchmarks, PPLXQuery2Query and PPLXQuery2Doc.

The models produce INT8 and binary embeddings, reducing storage requirements by 4x and 32x, respectively, compared to FP32. This compactness makes web-scale embedding storage and retrieval significantly more practical.
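The storage math behind those figures can be sketched with a toy quantizer. This is an illustrative example only; the exact quantization scheme the models use is not specified here, and `to_int8`/`to_binary` are hypothetical helper names.

```python
import numpy as np

def to_int8(emb: np.ndarray) -> np.ndarray:
    """Scale values into [-127, 127] by the max absolute value (toy scheme)."""
    scale = 127.0 / np.max(np.abs(emb))
    return np.round(emb * scale).astype(np.int8)

def to_binary(emb: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension, packed 8 signs per byte."""
    return np.packbits(emb > 0)

emb = np.random.randn(1024).astype(np.float32)
print(emb.nbytes)             # 4096 bytes in FP32
print(to_int8(emb).nbytes)    # 1024 bytes -> 4x smaller
print(to_binary(emb).nbytes)  # 128 bytes  -> 32x smaller
```

At web scale this difference compounds: a billion 1024-dimensional vectors drop from roughly 4 TB in FP32 to about 128 GB in binary form.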

| Model | Dimensions | Context | MRL | Quantization | Instruction | Pooling |
|---|---|---|---|---|---|---|
| pplx-embed-v1-0.6B | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-v1-4B | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-context-v1-0.6B | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-context-v1-4B | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
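Since the table lists MRL (Matryoshka Representation Learning) support, here is a minimal sketch of how MRL-trained embeddings are typically consumed: truncate to a prefix of the dimensions and re-normalize. The function name and the choice of 512 dimensions are illustrative, not part of the models' documented API.

```python
import numpy as np

def truncate_mrl(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and re-normalize (typical MRL usage)."""
    sub = emb[:dim]
    return sub / np.linalg.norm(sub)

# Full 4B-model embedding (2560 dims), unit-normalized.
emb = np.random.randn(2560).astype(np.float32)
emb /= np.linalg.norm(emb)

small = truncate_mrl(emb, 512)
print(small.shape)  # (512,) -- 5x less storage at some quality cost
```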

Check out our technical report, Hugging Face models, and API docs.

Motivation

Dense text embeddings map queries and documents into a shared semantic space where retrieval reduces to approximate nearest neighbor search. At Perplexity鈥檚 scale, embeddings are the first stage of our retrieval pipeline, determining which documents from billions of web pages get considered by downstream rankers and language models.
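The first-stage retrieval described above can be sketched in a few lines: with unit-normalized embeddings, cosine similarity is a dot product, and retrieval is a top-k nearest-neighbor search. This brute-force version stands in for the approximate indexes used in production; all names here are illustrative.

```python
import numpy as np

def top_k(query: np.ndarray, docs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k most similar documents (exact, brute force)."""
    scores = docs @ query           # cosine similarity for unit vectors
    return np.argsort(-scores)[:k]  # best documents first

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 256)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of document 42.
query = docs[42] + 0.01 * rng.standard_normal(256).astype(np.float32)
query /= np.linalg.norm(query)
print(top_k(query, docs, 3)[0])  # 42 -- the source document ranks first
```

At billions of documents, the exact `argsort` is replaced by an approximate nearest-neighbor index, trading a small amount of recall for orders-of-magnitude lower latency.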

Most leading embedding models today are built on decoder-only LLMs with causal attention masking, where each token can only attend to preceding tokens. For retrieval, this is a fundamental limitation: understanding a passage often requires bidirectional context. We address this through diffusion-based continued pretraining, which converts a causal LLM into a bidirectional encoder. Full bidirectional attention enables mean pooling over all token representations and supports late chunking for contextual embeddings, where each chunk鈥檚 embedding is informed by the full document it appears in.
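The late-chunking idea above can be sketched with placeholder token states: encode the whole document once with bidirectional attention, then mean-pool each chunk's token span, so every chunk embedding reflects full-document context. The random `token_embs` array stands in for the encoder's contextualized token representations; the function name is illustrative.

```python
import numpy as np

def late_chunk(token_embs: np.ndarray, spans: list) -> np.ndarray:
    """Mean-pool each (start, end) token span of a fully encoded document."""
    return np.stack([token_embs[s:e].mean(axis=0) for s, e in spans])

# Placeholder for the bidirectional encoder's output: 100 tokens x 1024 dims.
token_embs = np.random.randn(100, 1024).astype(np.float32)
spans = [(0, 40), (40, 80), (80, 100)]  # chunk boundaries in token offsets
chunks = late_chunk(token_embs, spans)
print(chunks.shape)  # (3, 1024): one contextual embedding per chunk
```

The contrast with naive chunking is that here the token representations were computed while attending to the entire document, not to each chunk in isolation.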
