Speculative decoding shifted our output distribution and evals missed it

Covers [2211.17192] Fast Inference from Transformers via Speculative Decoding. Discussed on DEV.

TL;DR: We turned on speculative decoding in vLLM to cut latency on a fine-tuned 8B model and got a 1.9x throughput win. Three weeks later, a customer flagged that the agent's tool-call arguments had subtly changed. In practice, greedy decoding with a draft model is not bit-identical to greedy decoding without one, and our offline evals never caught the drift because they ran on a different serving path. I lead the eval team at Nexus Labs. We do enterprise agent automation, Series B, about 14 people in engineering. ...
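One way to catch this kind of drift is to run the same prompts through both serving paths and diff the generated token IDs. The sketch below is a minimal illustration, not the check described in the article: `first_divergence` and `compare_paths` are hypothetical helper names, and the token ID lists are made-up stand-ins for outputs you would collect yourself from the baseline path and the speculative path.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Divergence:
    prompt_index: int               # which prompt diverged
    token_index: int                # first position where the outputs differ
    baseline_token: Optional[int]   # None if the baseline output ended first
    candidate_token: Optional[int]  # None if the candidate output ended first


def first_divergence(baseline: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Return the first index where two token sequences differ, or None if identical."""
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    # Same shared prefix but different lengths: they diverge where the shorter one ends.
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return None


def compare_paths(
    baseline_outputs: Sequence[Sequence[int]],
    candidate_outputs: Sequence[Sequence[int]],
) -> List[Divergence]:
    """Compare per-prompt token ID outputs from two serving paths."""
    if len(baseline_outputs) != len(candidate_outputs):
        raise ValueError("both paths must be run on the same prompt set")
    diffs = []
    for p, (base, cand) in enumerate(zip(baseline_outputs, candidate_outputs)):
        idx = first_divergence(base, cand)
        if idx is not None:
            diffs.append(
                Divergence(
                    prompt_index=p,
                    token_index=idx,
                    baseline_token=base[idx] if idx < len(base) else None,
                    candidate_token=cand[idx] if idx < len(cand) else None,
                )
            )
    return diffs


# Illustrative data only: token IDs from greedy decoding without and with a draft model.
baseline = [[1, 2, 3, 4], [5, 6, 7]]
speculative = [[1, 2, 9, 4], [5, 6, 7]]

diffs = compare_paths(baseline, speculative)
mismatch_rate = len(diffs) / len(baseline)
```

Comparing token IDs instead of decoded text avoids hiding differences behind detokenization, and recording the first divergent position makes it easier to tell whether a change lands inside something sensitive such as a tool-call argument.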
