Adaptive speculative decoding: picking draft lengths at runtime
In this follow-on to the economics of speculative decoding, we run the inference lab simulator on MTP and DFlash drafters with real acceptance data to find out whether adaptively choosing the draft length is worth it.