vLLM: An Efficient Inference Engine for Large Language Models
Woosuk Kwon
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2025-192
December 15, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-192.pdf
Large language models (LLMs) have emerged as a transformative technology capable of human-level or superhuman performance across diverse tasks, from writing complex software systems to discovering novel algorithms and processing multimodal data. Despite these remarkable capabilities, deploying LLMs at scale presents significant challenges due to their enormous computational and memory requirements. State-of-the-art models contain trillions of parameters and perform tens of thousands of generation steps, executed across large GPU clusters, often under strict latency constraints. These challenges are further compounded by rapidly evolving model architectures and the growing diversity of hardware accelerators.
To address these challenges, this thesis presents the design and implementation of vLLM, an efficient and flexible open-source LLM inference engine. We first introduce PagedAttention, vLLM’s core memory management algorithm that enables high-throughput LLM inference. We then examine vLLM’s system design in detail, highlighting its scheduling mechanisms, extensible architecture, and key performance optimizations that enable it to meet a wide range of deployment requirements.
Together, these contributions establish vLLM as a comprehensive solution to LLM inference, delivering high performance, architectural flexibility, and the strength of a rapidly growing open-source ecosystem. Through vLLM, this thesis illustrates how principled systems design can effectively bridge the widening gap between the accelerating evolution of modern LLMs and the demanding practical constraints of large-scale, real-world deployment.
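As a rough illustration of the idea behind PagedAttention-style memory management (not taken from the thesis, and not vLLM's actual API; all class and function names below are hypothetical), the Python sketch below maps each request's KV cache to fixed-size physical blocks through a per-sequence block table, so memory is allocated on demand block by block rather than reserved contiguously for the maximum possible output length.

# Hypothetical sketch of block-based KV-cache management in the spirit of
# PagedAttention. Names (BlockAllocator, SequenceBlockTable, BLOCK_SIZE) are
# illustrative only and do not correspond to vLLM's real classes or APIs.

BLOCK_SIZE = 16  # number of tokens whose KV entries fit in one physical block


class BlockAllocator:
    """Tracks a fixed pool of physical KV-cache blocks on the accelerator."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise RuntimeError("Out of KV-cache blocks; request must wait or be preempted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class SequenceBlockTable:
    """Maps a sequence's logical token positions to physical blocks.

    Blocks are claimed on demand as the sequence grows, so memory is
    reserved one block at a time instead of for the full maximum length.
    """

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_ids: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the previous one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_ids.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_pos: int) -> tuple[int, int]:
        # Return (physical block id, offset within the block) for a token.
        return self.block_ids[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

    def release(self) -> None:
        # Return all blocks to the shared pool when the request finishes.
        for block_id in self.block_ids:
            self.allocator.free(block_id)
        self.block_ids.clear()
        self.num_tokens = 0


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    seq = SequenceBlockTable(allocator)
    for _ in range(40):          # generate 40 tokens
        seq.append_token()       # only ceil(40 / 16) = 3 blocks are in use
    print(seq.block_ids, seq.physical_slot(39))
    seq.release()                # freed blocks become available to other requests

Because blocks are uniform and returned to a shared pool, many concurrent requests can be packed into GPU memory with little fragmentation, which is the source of the throughput gains the thesis attributes to PagedAttention.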
Advisor: Ion Stoica
BibTeX citation:
@phdthesis{Kwon:EECS-2025-192,
  author = {Kwon, Woosuk},
  title = {vLLM: An Efficient Inference Engine for Large Language Models},
  school = {EECS Department, University of California, Berkeley},
  year = {2025},
  month = {Dec},
  url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-192.html},
  number = {UCB/EECS-2025-192},
  abstract = {Large language models (LLMs) have emerged as a transformative technology capable of human-level or superhuman performance across diverse tasks, from writing complex software systems to discovering novel algorithms and processing multimodal data. Despite these remarkable capabilities, deploying LLMs at scale presents significant challenges due to their enormous computational and memory requirements. State-of-the-art models contain trillions of parameters and perform tens of thousands of generation steps, executed across large GPU clusters, often under strict latency constraints. These challenges are further compounded by rapidly evolving model architectures and the growing diversity of hardware accelerators.
To address these challenges, this thesis presents the design and implementation of vLLM, an efficient and flexible open-source LLM inference engine. We first introduce PagedAttention, vLLM’s core memory management algorithm that enables high-throughput LLM inference. We then examine vLLM’s system design in detail, highlighting its scheduling mechanisms, extensible architecture, and key performance optimizations that enable it to meet a wide range of deployment requirements.
Together, these contributions establish vLLM as a comprehensive solution to LLM inference, delivering high performance, architectural flexibility, and the strength of a rapidly growing open-source ecosystem. Through vLLM, this thesis illustrates how principled systems design can effectively bridge the widening gap between the accelerating evolution of modern LLMs and the demanding practical constraints of large-scale, real-world deployment.},
}
EndNote citation:
%0 Thesis
%A Kwon, Woosuk
%T vLLM: An Efficient Inference Engine for Large Language Models
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 15
%@ UCB/EECS-2025-192
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-192.html
%F Kwon:EECS-2025-192