DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics
TL;DR: Google released DiffusionGemma, an open Apache 2.0 diffusion-based LLM that generates text up to 4x faster than autoregressive models, hitting 1,000+ tokens/sec on a single H100 and fitting in 18 GB VRAM. It trades some accuracy for speed. Here is what that means in practice.

What DiffusionGemma Actually Is

Google DeepMind released DiffusionGemma, the first production-grade open-weight model that applies discrete diffusion to text generation. The same family of techniques behind image ...