Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding
While speculative decoding improves inference throughput for multi-batch long-context Large Language Models (LLMs), its efficiency is often limited by a verification bottleneck in which Key-Value (KV) cache loading dominates latency. Existing compression methods fail in this regime: static eviction incurs accuracy loss due to saliency shift, while dynamic selection introduces prohibitive computational overhead on the verification path. We propo...
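To make the verification bottleneck concrete, the following back-of-envelope sketch estimates how many bytes one verification pass has to read from memory. It is an illustration only, not the method of the paper: the model shape, batch size, context length, and the `keep_fraction` parameter are hypothetical values chosen for the example, and the estimate assumes a memory-bound pass that reads the weights once plus the retained share of the KV cache.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes held by cached keys and values across all layers for a whole batch."""
    # Factor 2 accounts for storing both K and V.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def verify_step_bytes(weight_bytes, kv_bytes, keep_fraction=1.0):
    """Rough memory traffic of one verification pass.

    Assumption: weights are read once and only `keep_fraction` of the
    KV cache is loaded (1.0 means dense verification).
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    return weight_bytes + keep_fraction * kv_bytes


GIB = 2**30

# Hypothetical configuration: 32 layers, 8 KV heads of dimension 128,
# 32,768-token context, batch of 8, fp16 storage (2 bytes per element).
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32_768, batch=8)

# Hypothetical 8-billion-parameter model stored in fp16.
weights = 8_000_000_000 * 2

print(f"KV cache: {kv / GIB:.1f} GiB, weights: {weights / GIB:.1f} GiB")
for fraction in (1.0, 0.25):
    total = verify_step_bytes(weights, kv, keep_fraction=fraction)
    print(f"keep_fraction={fraction}: about {total / GIB:.1f} GiB read per pass")
```

Under these assumed numbers the KV cache is larger than the weights, so reading it is the larger share of each verification pass, and loading only part of it shrinks the traffic roughly in proportion. How the retained entries would be chosen without losing accuracy is the question the abstract raises and is not modeled here.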