Google Releases DiffusionGemma: Parallel Block Decoding (opens in new tab)

Covers DiffusionGemma: 4x Faster Text GenerationDiscussed on DEV

What: Google released DiffusionGemma, an open-weight model whose headline trick is parallel block decoding — it writes text by refining a whole block of tokens at once through iterative denoising, instead of predicting one next token at a time. Why: Decoding is the slow, sequential part of running an LLM: emitting N tokens normally costs N forward passes that each wait on the last. Laying down a block in parallel is why DiffusionGemma reports up to 4x faster decode and 1000+ tokens/sec on an ...

Read the original article