Adaptive speculative decoding: picking draft lengths at runtime

Discussed on Hacker News

As a follow-on to the economics of speculative decoding, we run the inference lab simulator on MTP and DFlash drafters with real acceptance data to find out whether adaptively choosing the draft length is worth it.
