Variable Length Computation and Continuous Batching
Traditional models, such as CNNs used for image classification, operate under three key assumptions: fixed input size, a static compute graph, and predictable latency per request.
In contrast, large language models (LLMs) fundamentally break these assumptions. A user’s prompt may range from just 10 tokens to over 10,000, and the model’s response could be as brief as “Yes” or as extensive as a 500-word essay. As a result, the total computational cost of any given request remains unknown until generation is complete.
Source: Image by AnyScale
This variability renders traditional static batching inefficient. If the system waits for the slowest request in a batch to finish, it wastes valuable GPU cycles on idle cores.
This is where continuous batching comes to the rescue. It dynamically adds new requests and removes completed ones from the batch as tokens finish generating. Think of it like a taxi dispatch system: as soon as one passenger reaches their destination, the car picks up a new rider.
Note: Static batches starve GPUs. Continuous batching feeds them constantly.
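A minimal sketch of this scheduling loop, assuming a hypothetical `model` object that exposes `generate_next_token` and `eos_token` (not any particular engine’s API):

```python
from collections import deque

class Request:
    def __init__(self, prompt_tokens, max_new_tokens):
        self.tokens = list(prompt_tokens)   # prompt + generated tokens so far
        self.remaining = max_new_tokens     # budget of new tokens
        self.done = False

def step(batch, model):
    """Run one decode step for every active request in the batch."""
    for req in batch:
        next_tok = model.generate_next_token(req.tokens)  # hypothetical call
        req.tokens.append(next_tok)
        req.remaining -= 1
        if next_tok == model.eos_token or req.remaining == 0:
            req.done = True

def serve(model, incoming: deque, max_batch_size: int = 32):
    batch = []
    while incoming or batch:
        # Continuous batching: evict finished requests and admit new ones
        # on every iteration, instead of waiting for the whole batch to drain.
        batch = [r for r in batch if not r.done]
        while incoming and len(batch) < max_batch_size:
            batch.append(incoming.popleft())
        if batch:
            step(batch, model)
```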
Split Prefill and Decode: They Are Fundamentally Different Workloads
LLM inference happens in two distinct stages:
1. Prompt Processing (Prefill): When you send a request, such as a question or instruction, the model first reads and encodes your entire input all at once. This step is called _prefill_. It’s like the model taking a deep breath to fully understand everything you’ve said before it starts answering.
2. Response Generation (Decode): After understanding your prompt, the model generates its reply one piece (called a token) at a time. Each new token depends on everything generated so far, so this step must happen sequentially, like speaking sentence by sentence rather than all at once.
Opposite Demands, Same Hardware? Not Ideal
1. Prefill is compute-bound: lots of math, almost all of it parallelizable. It involves heavy matrix math across the whole input, but because all tokens are known upfront, the work can be done in parallel. This makes it ideal for GPUs with lots of raw computing power.
2. Decode is memory-bandwidth-bound: each step is a small amount of computation but requires constant fetching of past keys and values. Each new token needs only a little calculation, yet the model must repeatedly read its “memory” of previous tokens, stored as key-value (KV) caches.
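A toy sketch of the two phases, assuming a model callable that accepts a `past_kv` cache and returns logits plus the updated cache (a hypothetical signature, not a specific framework’s API):

```python
import torch

def prefill(model, prompt_ids: torch.Tensor):
    # One big, parallel forward pass over all prompt tokens:
    # compute-bound, dominated by large matrix multiplications.
    logits, kv_cache = model(prompt_ids, past_kv=None)  # hypothetical signature
    return logits[:, -1], kv_cache

def decode(model, last_logits, kv_cache, max_new_tokens: int):
    # Sequential loop: each step does little math but must re-read
    # the ever-growing KV cache, so it is memory-bandwidth-bound.
    out = []
    next_id = last_logits.argmax(dim=-1, keepdim=True)
    for _ in range(max_new_tokens):
        out.append(next_id)
        logits, kv_cache = model(next_id, past_kv=kv_cache)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
    return torch.cat(out, dim=-1)
```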
Example in RAG (Retrieval-Augmented Generation):
In RAG systems, users often feed long retrieved documents as context. Prefill becomes extremely heavy (processing thousands of tokens), while decode remains short — but latency-sensitive. Mixing both phases on one GPU can delay user responses unnecessarily.
Why mixing them on one GPU causes problems:
When prefill and decode share the same GPU, a large incoming prompt can saturate the memory bus with data transfers. This starves decode tasks of memory bandwidth, causing latency spikes that are especially harmful in chatbots or assistants where users expect instant replies.
Prefill-Decode Disaggregation
To avoid this interference, high-performance systems split the work:
1. One set of GPUs handles prompt processing, optimized for massive parallel computation.
2. Another set handles response generation, tuned for fast memory access and low-latency output.
This separation ensures both high throughput and consistent, responsive performance, something traditional models (like CNNs) never needed because their inputs and outputs are fixed and predictable.
Several open-source frameworks now support or explore this approach:
- vLLM
- SGLang
- Dynamo (by NVIDIA)
- llm-d
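Conceptually, a disaggregated deployment routes every request through two pools and ships the KV cache between them. A rough sketch in which the pool and transport objects (`prefill_pool`, `decode_pool`, `kv_transport`) are hypothetical stand-ins, not any framework’s actual API:

```python
def handle_request(prompt, prefill_pool, decode_pool, kv_transport):
    # 1) Prefill on a compute-optimized worker: one parallel pass over the prompt.
    prefill_worker = prefill_pool.pick_least_loaded()
    kv_cache, first_token = prefill_worker.prefill(prompt)

    # 2) Ship the KV cache to a decode worker over a fast interconnect
    #    (e.g., NVLink or RDMA). If this link is slow, the gains vanish.
    decode_worker = decode_pool.pick_least_loaded()
    kv_handle = kv_transport.send(kv_cache, dst=decode_worker)

    # 3) Stream the rest of the answer from the bandwidth-optimized decode worker.
    return decode_worker.decode(first_token, kv_handle)
```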
But Disaggregation Isn’t Always Better
While powerful, prefill-decode disaggregation isn’t a universal win:
1. Thresholds matter: For small workloads or lightly loaded systems, the overhead can hurt performance — tests show up to **20–30% slowdown** if not properly tuned.
2. Local prefill can be faster: With short prompts or high cache reuse (e.g., repeated system instructions), running prefill directly on the decode GPU avoids transfer costs and simplifies scheduling.
3. Data transfer cost is real: Moving KV caches between prefill and decode workers requires fast, low-latency communication. If the network or interconnect is slow, gains vanish.
Rule of Thumb: Disaggregation shines in high-concurrency, long-context scenarios (like RAG or agentic workflows) but adds complexity where simplicity suffices.
References
1. Mastering LLM Techniques: Inference Optimization
2. Prefill-Decode Disaggregation
Cache Keys and Values (KV Caching)
Prompt: “Alice went to the market with her sister. She bought apples.”
Question: Who bought apples?
During decoding, the model uses attention to figure out that “She” refers to Alice, not her sister. To do this accurately, it relies on the key-value (KV) cache of previously generated or input tokens. So, it doesn’t lose track of who did what, even several words later.
Source: Image by Not Lain on HuggingFace
Without reusing this cached context, the model might forget or misassign pronouns, leading to confusing or incorrect answers.
KV Caching and Paged KV Caches
The idea: store the keys and values from past tokens in GPU memory and reuse them in every subsequent decode step. This cuts redundant work and speeds up generation.
As outputs get longer, the KV cache grows, sometimes to gigabytes. Storing it in one big block causes **memory fragmentation**: small gaps of unused memory that can’t be reused efficiently.
Paged KV caching splits the cache into fixed-size “pages” (like pages in virtual memory). These pages can live anywhere in memory and are linked logically, avoiding wasted space.
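A simplified sketch of the bookkeeping behind paged KV caching: a pool of fixed-size pages plus a per-sequence block table, in the spirit of vLLM’s PagedAttention (the structure below is illustrative, not vLLM’s actual implementation):

```python
PAGE_SIZE = 16  # tokens per page

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # global pool of physical pages
        self.block_tables = {}                    # seq_id -> list of page ids

    def append_token(self, seq_id: int, position: int):
        """Reserve space for one new token's keys/values."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % PAGE_SIZE == 0:            # first token of a new page
            table.append(self.free_pages.pop())  # grab any free page, wherever it lives
        page = table[position // PAGE_SIZE]
        return page, position % PAGE_SIZE        # physical slot for this token

    def release(self, seq_id: int):
        """Return all pages of a finished sequence to the pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
```

Because pages are allocated on demand and returned as soon as a sequence finishes, memory is reused at page granularity instead of fragmenting into unusable gaps.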
Paged KV caching is implemented in vLLM, an LLM inference engine that delivers high throughput and supports long sequences without running out of usable memory.
Note: KV caching turns each decode step from O(n²) work into O(n). Skip it, and you’re not just slow, you’re wasteful.
Route Requests Intelligently with Prefix-Aware Scheduling
Not all prompts are created equal. Consider these two requests:
1. “Summarize this 10-page PDF”: a long input with heavy prefill.
2. “Hi!”: a tiny input with a near-instant response.
If both are placed in the same decoder queue, the short request gets stuck behind the long one, a problem called head-of-line blocking. This hurts interactivity and wastes GPU cycles.
Why Prefix-aware Routing?
Prefix-aware routing inspects prompt length and content before scheduling, then intelligently routes short prompts to low-latency decode queues and long prompts to high-throughput pools.
Source: Image by BentoML
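A minimal routing sketch, assuming two queues and a hand-picked token threshold (both are illustrative choices, not defaults from any framework):

```python
SHORT_PROMPT_TOKENS = 256  # illustrative threshold

def route(request, tokenizer, interactive_queue, bulk_queue):
    n_tokens = len(tokenizer.encode(request.prompt))
    if n_tokens <= SHORT_PROMPT_TOKENS:
        # Small prompts go to a low-latency pool so they are never
        # stuck behind a 10-page document (head-of-line blocking).
        interactive_queue.put(request)
    else:
        # Heavy prefill work goes to a throughput-optimized pool.
        bulk_queue.put(request)
```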
Why It Matters: Shared Prefixes
Many AI applications reuse common prefixes:
- Chatbots: identical system prompts (“You are a helpful assistant…”)
- RAG: same retrieved document fed to multiple queries
- Agents: repeated tool-use templates
When requests share a prefix, their KV cache for that prefix is identical. If scheduled on the same GPU, the system can:
- Compute the prefix once
- Reuse the cached keys/values across multiple requests, which in turn saves compute, reduces TTFT (Time To First Token), and boosts throughput
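One simple way to exploit this is to fingerprint the shared prefix and send matching requests to the same worker, so the prefix’s keys and values are computed once and reused. A sketch, where the prefix length and worker-selection scheme are assumptions:

```python
import hashlib

def prefix_key(prompt: str, prefix_len_chars: int = 2048) -> str:
    """Fingerprint the leading portion of the prompt (system prompt, shared doc, ...)."""
    return hashlib.sha256(prompt[:prefix_len_chars].encode()).hexdigest()

def pick_worker(prompt: str, workers: list):
    # Consistent mapping: identical prefixes land on the same worker,
    # so that worker's cached keys/values for the prefix can be reused.
    idx = int(prefix_key(prompt), 16) % len(workers)
    return workers[idx]
```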
Note: Treat every prompt like a package: fragile, urgent, or bulky. Route accordingly.
Go Further: Combine with Output-Length Awareness
Pair prefix-aware routing with output-length estimation (e.g., based on prompt patterns or user history). This enables smarter batching and prevents short-generation requests from being delayed by long-output ones.
Sharding and Mixture of Experts (MoE)
As models grow, fitting them on one GPU becomes impossible.
To serve them efficiently, we use two complementary strategies: Sharding (splitting the model across hardware) and Mixture of Experts (MoE) (activating only part of the model per token).
Model Sharding: Distribute What You Can’t Fit
When a model is too large for one GPU, we split it using parallelism techniques:
- Tensor Parallelism: Splits individual layers (e.g., attention heads or MLP weights) across GPUs. Used heavily in Megatron-LM.
- Pipeline Parallelism: Assigns different model layers to different devices; data flows like an assembly line.
- FSDP (Fully Sharded Data Parallelism) and DeepSpeed ZeRO: Shard optimizer states, gradients, and parameters across devices to minimize per-GPU memory.
(Note: pure data parallelism, which copies the full model to every GPU, is common in training but inefficient for inference.)
These methods are now standard in serving frameworks like vLLM, TensorRT-LLM, and DeepSpeed-Inference.
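To give a flavor of tensor parallelism, the sketch below splits one linear layer’s weight matrix column-wise across two devices and concatenates the partial results (a toy illustration; real implementations such as Megatron-LM replace the final concatenation with communication collectives):

```python
import torch

def column_parallel_linear(x, weight, devices):
    """Split a [d_in, d_out] weight matrix column-wise across devices,
    run the partial matmuls on each device, then gather the outputs."""
    shards = torch.chunk(weight, len(devices), dim=1)  # one column shard per GPU
    partial = [x.to(dev) @ shard.to(dev) for dev, shard in zip(devices, shards)]
    # In a real system an all-gather collective replaces this concatenation.
    return torch.cat([p.to(devices[0]) for p in partial], dim=-1)

# Usage (assumes at least two visible GPUs):
# x = torch.randn(4, 1024); w = torch.randn(1024, 4096)
# y = column_parallel_linear(x, w, ["cuda:0", "cuda:1"])
```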
Mixture of Experts (MoE): More Capacity, Less Compute
MoE replaces dense feed-forward networks (FFNs) with multiple “expert” subnetworks. For each input token, a lightweight router selects only a few experts (e.g., 2 out of 8 or 16) to activate.
Source: Image from the “Outrageously Large Neural Networks” paper (via HuggingFace)
This enables:
- Conditional computation: Only a fraction of parameters run per token
- Higher model capacity: More total parameters improve reasoning and knowledge without linear cost increases.
- Sparsity by design: Most weights stay idle during inference, ideal for memory-bound decode phases.
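A minimal top-k MoE layer in this spirit (a sketch, not any specific model’s implementation), where `router` is a small linear layer and `experts` is a list of feed-forward modules:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, top_k=2):
    """x: [num_tokens, d_model]; router: nn.Linear(d_model, num_experts);
    experts: list of FFN modules. Only top_k experts run per token."""
    scores = F.softmax(router(x), dim=-1)                  # [tokens, num_experts]
    weights, idx = scores.topk(top_k, dim=-1)              # best experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        rows, slots = (idx == e).nonzero(as_tuple=True)    # tokens routed to expert e
        if rows.numel() == 0:
            continue                                       # this expert stays idle
        out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
    return out
```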
Real-World Examples
- Mixtral 8x7B (by Mistral AI): 8 expert FFNs of ~7B each, but only 2 active per token.
- Qwen-MoE (by Alibaba): Uses fine-grained routing and shared attention layers.
- DeepSeek-MoE, Nemotron-MoE, and others follow similar patterns.
Sharding gets giant models onto hardware. MoE makes them smarter without making them slower. Used together, especially in RAG, agentic workflows, or multilingual settings, they enable high-capacity, cost-effective inference that dense models alone cannot match.
Note: Sharding spreads the weight. MoE makes the model ‘choose its brain’ per token. Together, they keep giant models affordable.
Inference is Orchestration
LLM inference isn’t simply about running a model; it’s about intelligently managing uncertainty. Unlike traditional AI workloads with fixed inputs and predictable costs, LLMs operate in a world of variable prompts, open-ended responses, and hidden memory demands.