Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding
Researchers at UCSD have achieved a breakthrough in AI serving efficiency by integrating DFlash, a block-diffusion speculative decoding framework, into the vLLM TPU ecosystem. By shifting from sequential $O(K)$ drafting to a parallel $O(1)$ "block-painting" approach, the team unlocked an average 3.13x speedup on TPU v5p, with math and coding tasks seeing gains of nearly 6x. This post explores the technical innovations behind the "dual-cache" solution for attention, the discovery of "K-Flat" h...
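To make the $O(K)$ versus $O(1)$ drafting contrast concrete, here is a minimal toy sketch in plain Python. It is not DFlash or vLLM code: `target_next`, `draft_sequential`, `draft_block`, and `verify` are hypothetical names invented for illustration, and the "models" are a deterministic arithmetic rule rather than neural networks. The only point is to count drafter forward passes per block of K tokens and to show the usual accept-longest-prefix verification step.

```python
from typing import List, Tuple

Token = int


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the target model: a deterministic next-token rule."""
    return (sum(context) + len(context)) % 7


def draft_sequential(context: List[Token], k: int) -> Tuple[List[Token], int]:
    """Autoregressive drafting: each drafted token costs one drafter pass, so O(K)."""
    tokens: List[Token] = []
    passes = 0
    for _ in range(k):
        # The toy drafter happens to agree with the target (an assumption).
        tokens.append(target_next(context + tokens))
        passes += 1
    return tokens, passes


def draft_block(context: List[Token], k: int) -> Tuple[List[Token], int]:
    """Block drafting: all K positions are proposed in what we count as ONE pass.

    A real block-diffusion drafter would fill the K positions in parallel inside
    a single forward pass; this toy only models the pass count, not the mechanism.
    """
    tokens: List[Token] = []
    for _ in range(k):
        tokens.append(target_next(context + tokens))
    return tokens, 1


def verify(context: List[Token], draft: List[Token]) -> List[Token]:
    """Accept the longest draft prefix the target agrees with.

    On the first mismatch, emit the target's own token and stop. If the whole
    draft is accepted, the target contributes one extra token. A real system
    scores all K positions in a single target pass; the loop here simulates that.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))
    return accepted


if __name__ == "__main__":
    ctx, k = [1, 2, 3], 4
    seq_tokens, seq_passes = draft_sequential(ctx, k)
    blk_tokens, blk_passes = draft_block(ctx, k)
    print("sequential drafter passes:", seq_passes)
    print("block drafter passes:", blk_passes)
    print("tokens emitted after verification:", len(verify(ctx, blk_tokens)))
```

In this toy, both drafters propose the same tokens, so the verification step behaves identically; what changes is that the drafting cost no longer grows with K. Real speedups also depend on how often the target accepts the drafted tokens, which this sketch does not model.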