Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding

Discussed on Hacker News and r/LocalLLaMA

Researchers at UCSD have integrated DFlash, a block-diffusion speculative decoding framework, into the vLLM TPU ecosystem. By replacing sequential $O(K)$ drafting with a parallel $O(1)$ "block-painting" approach, the team reports an average 3.13x speedup on TPU v5p, with math and coding tasks seeing gains of nearly 6x. This post explores the technical innovations behind the "dual-cache" solution for attention, the discovery of "K-Flat" h...
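To make the $O(K)$ versus $O(1)$ contrast concrete, here is a minimal sketch that counts drafter passes for a block of K tokens. It assumes the complexity figures refer to the number of sequential drafter passes per block. The toy drafters and every function name below are hypothetical illustrations, not DFlash or vLLM APIs.

```python
from typing import Callable, List, Tuple


def draft_sequential(
    next_token: Callable[[List[int]], int], prefix: List[int], k: int
) -> Tuple[List[int], int]:
    """Draft k tokens one at a time; each token needs its own drafter pass."""
    context = list(prefix)
    passes = 0
    for _ in range(k):
        context.append(next_token(context))  # depends on the previous draft token
        passes += 1
    return context[len(prefix):], passes


def draft_block(
    fill_block: Callable[[List[int], int], List[int]], prefix: List[int], k: int
) -> Tuple[List[int], int]:
    """Draft all k positions in one drafter pass (the "block-painting" idea)."""
    return fill_block(list(prefix), k), 1


# Hypothetical toy drafters standing in for real models (assumption for illustration).
def toy_next_token(context: List[int]) -> int:
    return (context[-1] + 1) % 100


def toy_fill_block(context: List[int], k: int) -> List[int]:
    return [(context[-1] + i + 1) % 100 for i in range(k)]


prefix = [7, 8, 9]
seq_tokens, seq_passes = draft_sequential(toy_next_token, prefix, k=4)
blk_tokens, blk_passes = draft_block(toy_fill_block, prefix, k=4)
print(seq_tokens, seq_passes)
print(blk_tokens, blk_passes)
```

In this toy setup both drafters propose the same tokens, but the sequential one needs K passes while the block one needs a single pass. In either case the target model would still have to verify the drafted block, a step this sketch leaves out.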
