[2211.17192] Fast Inference from Transformers via Speculative Decoding

Covered by 8 sources including DEV Community, ByteByteGo Newsletter

Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using spe...
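The excerpt above is cut off before it describes how outputs stay unchanged, so the following is only a minimal sketch of the accept/reject rule commonly described for speculative sampling, not the authors' implementation. It assumes two toy next-token distributions: `p` (the large target model) and `q` (the cheaper draft model). All function names are hypothetical. A drafted token is accepted with probability min(1, p/q); on rejection, a token is resampled from the normalized positive part of p - q.

```python
import random


def residual(p, q):
    """Normalized max(0, p - q): the distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p and q are identical, rejection never happens; fall back to p.
    if total == 0.0:
        return list(p)
    return [d / total for d in diff]


def speculative_step(p, q, rng):
    """Return (token, accepted) for one drafted position.

    p, q: toy next-token distributions (assumed inputs, not real models).
    """
    draft = rng.choices(range(len(q)), weights=q)[0]
    # Accept the draft token with probability min(1, p/q).
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    # Otherwise resample from the residual distribution.
    fallback = rng.choices(range(len(p)), weights=residual(p, q))[0]
    return fallback, False


def output_distribution(p, q):
    """Exact distribution of the token returned by speculative_step."""
    accept = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    res = residual(p, q)
    return [a + reject_mass * r for a, r in zip(accept, res)]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # hypothetical target-model probabilities
    q = [0.2, 0.5, 0.3]  # hypothetical draft-model probabilities
    print(output_distribution(p, q))
    print(speculative_step(p, q, random.Random(0)))
```

`output_distribution` works out the result of the accept/reject step in closed form. For any valid `p` and `q` it should reproduce `p` up to floating-point error, which is the sense in which sampling can be sped up "without any changes to the outputs". In a full decoder the draft model would propose several tokens and the target model would score them in one parallel pass; this sketch covers a single position only.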

Covered in 8 articles

Speculative decoding shifted our output distribution and evals missed it

A Guide to AI Inference Engineering

Gemma 4 Multi-Token Prediction Delivers Up to ~3x Faster Token Generation