Variable Length Computation and Continuous Batching
Traditional models, such as CNNs used for image classification, operate under three key assumptions: fixed input size, a static compute graph, and predictable latency per request.
In contrast, large language models (LLMs) fundamentally break these assumptions. A user’s prompt may range from just 10 tokens to over 10,000, and the model’s response could be as brief as “Yes” or as extensive as a 500-word essay. As a result, the total computational cost of any given request remains unknown until generation is complete.
Source: Image by AnyScale
This variability renders traditional static batching inefficient. If the system waits for the slowest request in a batch to finish, it wastes valuable GPU cycles on idle cores.
This is where continuous batching comes to the rescue. It dynamically adds new requests and removes completed ones from the batch as tokens finish generating. Think of it like a taxi dispatch system: as soon as one passenger reaches their destination, the car picks up a new rider.
Note: Static batches starve GPUs. Continuous batching feeds them constantly.
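A minimal sketch of this scheduling loop, assuming a hypothetical `model` object that exposes `generate_next_token` and `eos_token` (not any particular engine’s API):

```python
from collections import deque

class Request:
    def __init__(self, prompt_tokens, max_new_tokens):
        self.tokens = list(prompt_tokens)   # prompt + generated tokens so far
        self.remaining = max_new_tokens     # budget of new tokens
        self.done = False

def step(batch, model):
    """Run one decode step for every active request in the batch."""
    for req in batch:
        next_tok = model.generate_next_token(req.tokens)  # hypothetical call
        req.tokens.append(next_tok)
        req.remaining -= 1
        if next_tok == model.eos_token or req.remaining == 0:
            req.done = True

def serve(model, incoming: deque, max_batch_size: int = 32):
    batch = []
    while incoming or batch:
        # Continuous batching: evict finished requests and admit new ones
        # on every iteration, instead of waiting for the whole batch to drain.
        batch = [r for r in batch if not r.done]
        while incoming and len(batch) < max_batch_size:
            batch.append(incoming.popleft())
        if batch:
            step(batch, model)
```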
Split Prefill and Decode: They Are Fundamentally Different Workloads
LLM inference happens in two distinct stages:
1. Prompt Processing (Prefill): When you send a request, such as a question or instruction, the model first reads and encodes your entire input all at once. This step is called _prefill_. It’s like the model taking a deep breath to fully understand everything you’ve said before it starts answering.
2. Response Generation (Decode): After understanding your prompt, the model generates its reply one piece (called a token) at a time. Each new token depends on everything generated so far, so this step must happen sequentially, like speaking sentence by sentence rather than all at once.
Opposite Demands, Same Hardware? Not Ideal
1. Prefill is compute-bound: lots of math, almost all of it parallelizable. It involves heavy matrix math across the whole input, but because all tokens are known upfront, the work can be done in parallel. This makes it ideal for GPUs with lots of raw computing power.
2. Decode is memory-bandwidth-bound: each step is a small amount of computation but requires constant fetching of past keys and values. Each new token needs only a little calculation, yet the model must repeatedly read its “memory” of previous tokens, stored as key-value (KV) caches.
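A toy sketch of the two phases, assuming a model callable that accepts a `past_kv` cache and returns logits plus the updated cache (a hypothetical signature, not a specific framework’s API):

```python
import torch

def prefill(model, prompt_ids: torch.Tensor):
    # One big, parallel forward pass over all prompt tokens:
    # compute-bound, dominated by large matrix multiplications.
    logits, kv_cache = model(prompt_ids, past_kv=None)  # hypothetical signature
    return logits[:, -1], kv_cache

def decode(model, last_logits, kv_cache, max_new_tokens: int):
    # Sequential loop: each step does little math but must re-read
    # the ever-growing KV cache, so it is memory-bandwidth-bound.
    out = []
    next_id = last_logits.argmax(dim=-1, keepdim=True)
    for _ in range(max_new_tokens):
        out.append(next_id)
        logits, kv_cache = model(next_id, past_kv=kv_cache)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
    return torch.cat(out, dim=-1)
```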
Example in RAG (Retrieval-Augmented Generation):
In RAG systems, users often feed long retrieved documents as context. Prefill becomes extremely heavy (processing thousands of tokens), while decode remains short — but latency-sensitive. Mixing both phases on one GPU can delay user responses unnecessarily.
Why mixing them on one GPU causes problems:
When prefill and decode share the same GPU, a large incoming prompt can saturate the memory bus with data transfers. This starves decode tasks of memory bandwidth, causing latency spikes that are especially harmful in chatbots or assistants where users expect instant replies.
Prefill-Decode Disaggregation
To avoid this interference, high-performance systems split the work:
1. One set of GPUs handles prompt processing, optimized for massive parallel computation.
2. Another set handles response generation, tuned for fast memory access and low-latency output.
This separation ensures both high throughput and consistent, responsive performance, something traditional models (like CNNs) never needed because their inputs and outputs are fixed and predictable.
Several open-source frameworks now support or explore this approach:
- vLLM
- SGLang
- Dynamo (by NVIDIA)
- llm-d
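Conceptually, a disaggregated deployment routes every request through two pools and ships the KV cache between them. A rough sketch in which the pool and transport objects (`prefill_pool`, `decode_pool`, `kv_transport`) are hypothetical stand-ins, not any framework’s actual API:

```python
def handle_request(prompt, prefill_pool, decode_pool, kv_transport):
    # 1) Prefill on a compute-optimized worker: one parallel pass over the prompt.
    prefill_worker = prefill_pool.pick_least_loaded()
    kv_cache, first_token = prefill_worker.prefill(prompt)

    # 2) Ship the KV cache to a decode worker over a fast interconnect
    #    (e.g., NVLink or RDMA). If this link is slow, the gains vanish.
    decode_worker = decode_pool.pick_least_loaded()
    kv_handle = kv_transport.send(kv_cache, dst=decode_worker)

    # 3) Stream the rest of the answer from the bandwidth-optimized decode worker.
    return decode_worker.decode(first_token, kv_handle)
```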
But Disaggregation Isn’t Always Better
While powerful, prefill-decode disaggregation isn’t a universal win:
1. Thresholds matter: For small workloads or lightly loaded systems, the overhead can hurt performance — tests show up to **20–30% slowdown** if not properly tuned.
2. Local prefill can be faster: With short prompts or high cache reuse (e.g., repeated system instructions), running prefill directly on the decode GPU avoids transfer costs and simplifies scheduling.
3. Data transfer cost is real: Moving KV caches between prefill and decode workers requires fast, low-latency communication. If the network or interconnect is slow, gains vanish.
Rule of Thumb: Disaggregation shines in high-concurrency, long-context scenarios (like RAG or agentic workflows) but adds complexity where simplicity suffices.
References
1. Mastering LLM Techniques: Inference Optimization
2. Prefill-Decode Disaggregation
Cache Keys and Values (KV Caching)
Prompt: “Alice went to the market with her sister. She bought apples.”
Question: Who bought apples?
During decoding, the model uses attention to figure out that “She” refers to Alice, not her sister. To do this accurately, it relies on the key-value (KV) cache of previously generated or input tokens. So, it doesn’t lose track of who did what, even several words later.
Source: Image by Not Lain on HuggingFace
Without reusing this cached context, the model might forget or misassign pronouns, leading to confusing or incorrect answers.
KV Caching and Paged KV Caches
The idea: store the keys and values from past tokens in GPU memory and reuse them in every subsequent decode step. This cuts redundant work and speeds up generation.
As outputs get longer, the KV cache grows, sometimes to gigabytes. Storing it in one big block causes **memory fragmentation**: small gaps of unused memory that can’t be reused efficiently.
Paged KV caching splits the cache into fixed-size “pages” (like pages in virtual memory). These pages can live anywhere in memory and are linked logically, avoiding wasted space.
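A simplified sketch of the bookkeeping behind paged KV caching: a pool of fixed-size pages plus a per-sequence block table, in the spirit of vLLM’s PagedAttention (the structure below is illustrative, not vLLM’s actual implementation):

```python
PAGE_SIZE = 16  # tokens per page

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # global pool of physical pages
        self.block_tables = {}                    # seq_id -> list of page ids

    def append_token(self, seq_id: int, position: int):
        """Reserve space for one new token's keys/values."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % PAGE_SIZE == 0:            # first token of a new page
            table.append(self.free_pages.pop())  # grab any free page, wherever it lives
        page = table[position // PAGE_SIZE]
        return page, position % PAGE_SIZE        # physical slot for this token

    def release(self, seq_id: int):
        """Return all pages of a finished sequence to the pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
```

Because pages are allocated on demand and returned as soon as a sequence finishes, memory is reused at page granularity instead of fragmenting into unusable gaps.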
Paged KV caching is implemented in vLLM, an LLM inference engine that delivers high throughput and supports long sequences without running out of usable memory.
Note: KV caching turns each decode step from O(n²) work into O(n). Skip it, and you’re not just slow, you’re wasteful.
Route Requests Intelligently with Prefix-Aware Scheduling
Not all prompts are created equal. Consider these two requests:
1. “Summarize this 10-page PDF”: a long input with heavy prefill.
2. “Hi!”: a tiny input with a near-instant response.
If both are placed in the same decoder queue, the short request gets stuck behind the long one, a problem called head-of-line blocking. This hurts interactivity and wastes GPU cycles.
Why Prefix-aware Routing?
Prefix-aware routing inspects prompt length and content before scheduling, then intelligently routes short prompts to low-latency decode queues and long prompts to high-throughput pools.
Source: Image by BentoML
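A minimal routing sketch, assuming two queues and a hand-picked token threshold (both are illustrative choices, not defaults from any framework):

```python
SHORT_PROMPT_TOKENS = 256  # illustrative threshold

def route(request, tokenizer, interactive_queue, bulk_queue):
    n_tokens = len(tokenizer.encode(request.prompt))
    if n_tokens <= SHORT_PROMPT_TOKENS:
        # Small prompts go to a low-latency pool so they are never
        # stuck behind a 10-page document (head-of-line blocking).
        interactive_queue.put(request)
    else:
        # Heavy prefill work goes to a throughput-optimized pool.
        bulk_queue.put(request)
```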
Why It Matters: Shared Prefixes
Many AI applications reuse common prefixes:
- Chatbots: identical system prompts (“You are a helpful assistant…”)
- RAG: same retrieved document fed to multiple queries
- Agents: repeated tool-use templates
When requests share a prefix, their KV cache for that prefix is identical. If scheduled on the same GPU, the system can:
- Compute the prefix once
- Reuse the cached keys/values across multiple requests, which in turn saves compute, reduces TTFT (Time To First Token), and boosts throughput
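One simple way to exploit this is to fingerprint the shared prefix and send matching requests to the same worker, so the prefix’s keys and values are computed once and reused. A sketch, where the prefix length and worker-selection scheme are assumptions:

```python
import hashlib

def prefix_key(prompt: str, prefix_len_chars: int = 2048) -> str:
    """Fingerprint the leading portion of the prompt (system prompt, shared doc, ...)."""
    return hashlib.sha256(prompt[:prefix_len_chars].encode()).hexdigest()

def pick_worker(prompt: str, workers: list):
    # Consistent mapping: identical prefixes land on the same worker,
    # so that worker's cached keys/values for the prefix can be reused.
    idx = int(prefix_key(prompt), 16) % len(workers)
    return workers[idx]
```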
Note: Treat every prompt like a package: fragile, urgent, or bulky. Route accordingly.
Go Further: Combine with Output-Length Awareness
Pair prefix-aware routing with output-length estimation (e.g., based on prompt patterns or user history). This enables smarter batching and prevents short-generation requests from being delayed by long-output ones.
Sharding and Mixture of Experts (MoE)
As models grow, fitting them on one GPU becomes impossible.
To serve them efficiently, we use two complementary strategies: Sharding (splitting the model across hardware) and Mixture of Experts (MoE) (activating only part of the model per token).
Model Sharding: Distribute What You Can’t Fit
When a model is too large for one GPU, we split it using parallelism techniques:
- Tensor Parallelism: Splits individual layers (e.g., attention heads or MLP weights) across GPUs. Used heavily in Megatron-LM.
- Pipeline Parallelism: Assigns different model layers to different devices; data flows like an assembly line.
- FSDP (Fully Sharded Data Parallelism) and DeepSpeed ZeRO: Shard optimizer states, gradients, and parameters across devices to minimize per-GPU memory.
(Note: pure data parallelism, which copies the full model to every GPU, is common in training but inefficient for inference.)
These methods are now standard in serving frameworks like vLLM, TensorRT-LLM, and DeepSpeed-Inference.
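To give a flavor of tensor parallelism, the sketch below splits one linear layer’s weight matrix column-wise across two devices and concatenates the partial results (a toy illustration; real implementations such as Megatron-LM replace the final concatenation with communication collectives):

```python
import torch

def column_parallel_linear(x, weight, devices):
    """Split a [d_in, d_out] weight matrix column-wise across devices,
    run the partial matmuls on each device, then gather the outputs."""
    shards = torch.chunk(weight, len(devices), dim=1)  # one column shard per GPU
    partial = [x.to(dev) @ shard.to(dev) for dev, shard in zip(devices, shards)]
    # In a real system an all-gather collective replaces this concatenation.
    return torch.cat([p.to(devices[0]) for p in partial], dim=-1)

# Usage (assumes at least two visible GPUs):
# x = torch.randn(4, 1024); w = torch.randn(1024, 4096)
# y = column_parallel_linear(x, w, ["cuda:0", "cuda:1"])
```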
Mixture of Experts (MoE): More Capacity, Less Compute
MoE replaces dense feed-forward networks (FFNs) with multiple “expert” subnetworks. For each input token, a lightweight router selects only a few experts (e.g., 2 out of 8 or 16) to activate.
Source: Image from the “Outrageously Large Neural Networks” paper (via HuggingFace)
This enables:
- Conditional computation: Only a fraction of parameters run per token
- Higher model capacity: More total parameters improve reasoning and knowledge without linear cost increases.
- Sparsity by design: Most weights stay idle during inference, ideal for memory-bound decode phases.
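A minimal top-k MoE layer in this spirit (a sketch, not any specific model’s implementation), where `router` is a small linear layer and `experts` is a list of feed-forward modules:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, top_k=2):
    """x: [num_tokens, d_model]; router: nn.Linear(d_model, num_experts);
    experts: list of FFN modules. Only top_k experts run per token."""
    scores = F.softmax(router(x), dim=-1)                  # [tokens, num_experts]
    weights, idx = scores.topk(top_k, dim=-1)              # best experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        rows, slots = (idx == e).nonzero(as_tuple=True)    # tokens routed to expert e
        if rows.numel() == 0:
            continue                                       # this expert stays idle
        out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
    return out
```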
Real-World Examples
- Mixtral 8x7B (by Mistral AI): 8 expert FFNs of ~7B each, but only 2 active per token.
- Qwen-MoE (by Alibaba): Uses fine-grained routing and shared attention layers.
- DeepSeek-MoE, Nemotron-MoE, and others follow similar patterns.
Sharding gets giant models onto hardware. MoE makes them smarter without making them slower. Used together, especially in RAG, agentic workflows, or multilingual settings, they enable high-capacity, cost-effective inference that dense models alone cannot match.
Note: Sharding spreads the weight. MoE makes the model ‘choose its brain’ per token. Together, they keep giant models affordable.
Inference is Orchestration
LLM inference isn’t simply about running a model; it’s about intelligently managing uncertainty. Unlike traditional AI workloads with fixed inputs and predictable costs, LLMs operate in a world of variable prompts, open-ended responses, and hidden memory demands.