Have you ever asked an AI agent to make a simple change to a large piece of code, only to find yourself sitting idly by while the LLM regurgitates pages and pages of code you’ve already written, with just a few small changes made? Did you wonder ‘WHY does it have to regenerate all of this code token by token? Can’t it just regenerate the pieces that have changed?’
The answer is YES, with Predicted Outputs. Predicted Outputs is a technique in LLM generation that uses a prediction of the model’s output so the LLM can skip sections it already ‘knows about’ and generate only the new tokens. The prediction only needs to match partially: if a little matches, generation speeds up a little; if a lot matches, the speedup is dramatic.
Predicted Outputs is not a common feature of most LLM platforms. One of the only implementations is the original in OpenAI’s API, but as we will see below it is certainly not optimal, often making generation SLOWER rather than faster. Perhaps this is why it hasn’t seen more uptake, but we believe this technique can dramatically speed up many LLM applications, not the least of which is coding agents.
Cascade Technologies is proud to release a new implementation of Predicted Outputs for the excellent vLLM platform. Our implementation scales nearly linearly with prediction accuracy, meaning that a 50% accurate prediction will cut generation time roughly in half, and a 100% accurate prediction will make generation nearly instantaneous. Truly, the LLM only needs to generate the new content in an output.
Try a demo here: http://app.cascadetech.ai
Comparison with OpenAI
This is a test of Predicted Outputs on the code for a ~800-token Python snake game. “Verbatim” repeats the code with a perfect prediction (93-97% acceptance rate). “Multiplayer” uses the original code as the prediction with the modification prompt “make it multiplayer” (26-40% acceptance).
How does it work?
LLM prompts are generally processed in two phases:
- First, the input tokens are processed in parallel, very quickly: thousands of tokens may all be processed in hundreds of milliseconds. They can be processed so quickly because they are processed in parallel, and they can be processed in parallel because, unlike output tokens, each input token already knows all the tokens that precede it.
- Next, the output tokens are generated one at a time, often at 10 or more milliseconds per token, 100x or more slower than the input tokens, even though the actual math and weights that process them are identical. The difference is sketched below.
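To make the two phases concrete, here is a rough illustrative sketch (not vLLM code; `tokenize`, `model_forward`, and `sample` are hypothetical stand-ins for a real inference stack, and KV caching is omitted):

```python
# Illustrative sketch only -- `tokenize`, `model_forward`, and `sample` are
# hypothetical stand-ins for a real inference stack; KV caching is omitted.

prompt_tokens = tokenize(prompt_text)

# Phase 1 (prefill): every prompt token goes through the model in one batched,
# parallel forward pass -- thousands of tokens in hundreds of milliseconds.
logits = model_forward(prompt_tokens)

# Phase 2 (decode): each output token depends on the previous one, so tokens
# are generated sequentially, one forward pass per token.
output_tokens = []
for _ in range(max_new_tokens):
    next_token = sample(logits[-1])
    output_tokens.append(next_token)
    logits = model_forward(prompt_tokens + output_tokens)
```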
tldr: Predicted Outputs uses a prediction about the contents of the model’s output to aid in the generation of output tokens. When the prediction matches, output tokens can be processed in parallel, effectively as if they were input tokens. This speedup is achieved with NO reduction in accuracy: the output of the LLM is identical.
- The user provides a prediction for the output of their LLM request. This can be many things, but an easy example to understand is code modification, where you would provide the original code as the prediction.
- As long as the prediction is aligned with the output, we are able to process the LLM generation in parallel, which can make it orders of magnitude faster.
- When the prediction diverges from the output, we use standard diff algorithms to realign the prediction and continue processing the output in parallel.
- Predicted Outputs is already supported in the standard OpenAI API, so you can drop it in and use it with very few changes to your client code; a minimal request is sketched below.
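For instance, a request to an OpenAI-compatible endpoint can look roughly like this (a sketch assuming a recent openai Python SDK; the base URL, model name, and file are placeholders):

```python
from openai import OpenAI

# Placeholders: point the standard client at whatever OpenAI-compatible
# endpoint is serving the model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("snake.py") as f:
    original_code = f.read()  # the code we expect to largely survive the edit

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": f"Make this game multiplayer:\n\n{original_code}"},
    ],
    # The prediction: the original code, since most of it should reappear verbatim.
    prediction={"type": "content", "content": original_code},
)

print(response.choices[0].message.content)
```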
Example
Prediction

```
Predicted Outputs
Predicted outputs are great because they
allow you to use knowledge about the likely
output to speed up generation.
```

Model output with Predicted Outputs

```
Predicted Outputs:
Predicted outputs are great because they
allow you to use knowledge about the likely
output to speed up generation.
```
Notice the colon on the first line of the second block! As you can see, Predicted Outputs needs one identical line to realign, but then carries on with the prediction for the rest of the generation. The matched tokens are generated nearly “for free”.
Because running LLM forward passes is so slow, and computing text diffs is so fast, matching just a single line can often provide a speed benefit over the no-prediction base case, and there are many use cases where you can frequently match much more:
- Coding agents modifying sections of code. This is the easiest win for Predicted Outputs, given the ease of prediction and frequency of prediction matching.
- Implementing true Structured Outputs can often be very complicated, but passing an example of your output as a prediction can often get you many of the speed benefits with much less difficulty (see the sketch after this list).
- Modifying documents.
- Updating agent state dicts / memory.
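For the structured-output and state-update cases, the previous JSON state (or a representative example of the expected shape) works well as the prediction. A sketch, with a made-up agent-memory update and the same placeholder endpoint as above:

```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The agent's current memory. Most fields survive an update unchanged, so the
# old state itself is a strong prediction of the new state. (Field names here
# are made up for illustration.)
current_state = {
    "goal": "refactor the billing module",
    "completed_steps": ["read billing.py", "list public functions"],
    "notes": "uses the legacy tax API",
}
state_json = json.dumps(current_state, indent=2)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{
        "role": "user",
        "content": "Record that the step 'write unit tests' was just completed. "
                   "Return only the updated JSON.\n\n" + state_json,
    }],
    prediction={"type": "content", "content": state_json},
)

# Assumes the model returns bare JSON, as requested.
new_state = json.loads(response.choices[0].message.content)
print(new_state)
```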
Implementation
Cascade Technologies’ implementation works as a form of speculative decoding, but instead of using a draft model to generate proposals, we use the static text ‘prediction’ sent via the API as the basis for our speculative proposals. When the prediction and the output diverge, we use standard diff algorithms to realign the prediction.
- The prediction handling runs entirely on the CPU, with no additional GPU resources required.
- The alignment algorithm’s execution can be hidden behind a one-frame delay, eliminating any latency overhead.
- Only exact prediction matches are accepted, so there is no loss in accuracy.
vLLM provides an easy mechanism to integrate Predicted Outputs through its speculative decoding system. Instead of a draft model, we simply use a static text prediction and the diff algorithm mentioned above. We keep a cursor to our current position in the static text, and as long as the prediction matches, we propose chunks of text to the system. When the generation diverges from the prediction, we use the Myers diff algorithm to attempt to realign. A simplified sketch of this loop is shown below.
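Here is a simplified, self-contained sketch of that loop. It is not the actual vLLM integration: it works on whitespace-split words rather than tokens, and it uses Python's difflib (a Ratcliff/Obershelp matcher) as a stand-in for the Myers diff:

```python
import difflib


class StaticTextProposer:
    """Proposes chunks of a static prediction and realigns after divergence.

    Simplified sketch: real speculative decoding operates on token IDs, and the
    verifier is a full model forward pass over the proposed chunk.
    """

    def __init__(self, prediction: str, chunk_size: int = 8):
        self.pred = prediction.split()
        self.cursor = 0          # current position in the prediction
        self.chunk_size = chunk_size

    def propose(self) -> list[str]:
        # While aligned, speculatively propose the next chunk of the prediction.
        return self.pred[self.cursor:self.cursor + self.chunk_size]

    def accept(self, num_accepted: int) -> None:
        # The verifier accepted this many proposed tokens; advance the cursor.
        self.cursor += num_accepted

    def realign(self, generated_so_far: str) -> None:
        # On divergence, diff a recent window of the generation against the
        # prediction and resume proposing just past the best matching region.
        window = generated_so_far.split()[-30:]
        matcher = difflib.SequenceMatcher(a=window, b=self.pred, autojunk=False)
        match = matcher.find_longest_match(0, len(window), 0, len(self.pred))
        if match.size > 0:
            self.cursor = match.b + match.size
        else:
            self.cursor = len(self.pred)  # no realignment found; stop proposing
```

In the real integration the proposed chunk is verified by a single model pass and only the longest exactly-matching prefix is accepted, which is why the output is identical to the no-prediction case.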
Users provide predictions via the standard OpenAI API, and the rest is automatic.
You can use Predicted Outputs yourself today via Cascade Technologies’ vLLM fork, or you can try it on our servers here: http://app.cascadetech.ai