As large language models continue to scale, they routinely exceed the memory and compute limits of any single GPU. Tensor parallelism addresses the capacity problem by splitting each layer's weights across multiple GPUs, and often across multiple servers, but it introduces a new challenge: how do we synchronize shards, route requests, and share KV-cache efficiently enough that the whole cluster behaves like a single cohesive accelerator?
This orchestration gap is exactly what NVIDIA Dynamo is designed to solve.
What Is NVIDIA Dynamo?
NVIDIA Dynamo is a distributed orchestration layer that enhances LLM inference by intelligently coordinating multi-GPU and multi-node workloads. It is inference-engine-agnostic and plugs seamlessly into frameworks such as TRT-LLM, vLLM, SGLang, and others.
Dynamo introduces several LLM-specific capabilities that dramatically improve system performance:
Key Capabilities
- Disaggregated prefill & decode inference: Maximizes GPU utilization and enables fine-grained latency/throughput trade-offs.
- Dynamic GPU scheduling: Adapts resource allocation based on real-time workload demand.
- LLM-aware request routing: Eliminates redundant KV-cache recomputation for faster inference (see the routing sketch below).
- Accelerated data transfer (NIXL): Reduces inter-GPU communication overhead and improves response times.
- KV-cache offloading: Leverages multi-tier memory hierarchies (HBM, DRAM, SSD) for higher throughput at lower cost.
Altogether, Dynamo provides the distributed intelligence required to make large-scale LLM inference behave as though all hardware resources were a single unified accelerator.
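To make the KV-cache-aware routing idea concrete, here is a minimal, hypothetical sketch of prefix-hash routing: requests whose prompts share an already-cached prefix are steered to the worker that holds those KV blocks. This is an illustration of the concept only, not Dynamo's actual router; the class names, block size, and scoring rule are invented for the example.

```python
import hashlib

BLOCK_SIZE = 64  # tokens per KV block; illustrative value only


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash the prompt block by block so shared prefixes yield shared hashes."""
    hashes, running = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        running.update(str(tokens[i:i + BLOCK_SIZE]).encode("utf-8"))
        hashes.append(running.hexdigest())
    return hashes


class PrefixAwareRouter:
    """Toy router: prefer the worker caching the longest prefix, break ties by load."""

    def __init__(self, workers: list[str]):
        self.cached = {w: set() for w in workers}  # worker -> known KV block hashes
        self.load = {w: 0 for w in workers}        # worker -> in-flight requests

    def route(self, tokens: list[int]) -> str:
        hashes = block_hashes(tokens)

        def score(worker: str) -> tuple[int, int]:
            overlap = sum(1 for h in hashes if h in self.cached[worker])
            return (-overlap, self.load[worker])   # most overlap first, then lowest load

        worker = min(self.cached, key=score)
        self.cached[worker].update(hashes)         # that worker now holds these blocks
        self.load[worker] += 1
        return worker


router = PrefixAwareRouter(["worker-0", "worker-1"])
print(router.route(list(range(256))))              # long prefix lands on one worker
print(router.route(list(range(256)) + [999]))      # same prefix -> same worker, no recompute
```

The key point the sketch captures is that routing decisions are made from the prompt's content (its block hashes) rather than round-robin, which is what allows a shared prefix to be reused instead of recomputed.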
Installation & Setup Guide
1. Clone the Dynamo Repository
git clone --branch v0.4.1 --depth 1 https://github.com/ai-dynamo/dynamo.git
cd dynamo
2. Build the Docker Image
docker compose -f deploy/docker-compose.yml up -d    # bring up Dynamo's supporting services
./container/build.sh --framework VLLM                # build the vLLM-enabled container image
3. Create and Run the Container
./container/run.sh -it --framework VLLM [--mount-workspace]
Or attach to an existing one:
docker exec -it <container_name> bash
Running Dynamo on a Single Node
Inside the container, launch Dynamo with a specified model:
python -m dynamo.vllm --model <path_to_model>
If HBM capacity is limited, cap the maximum sequence length (and with it the KV-cache footprint) via:
--max-model-len <size>
Then start the backend services:
cd components/backends/vllm
bash launch/agg.sh
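Once the aggregated deployment is up, you can exercise it with a quick client call. The snippet below is a minimal sketch that assumes the frontend exposes an OpenAI-compatible HTTP endpoint on localhost port 8000; adjust the URL, port, and model identifier to match your deployment.

```python
import requests

# Assumed endpoint: change host/port/path to match your frontend configuration.
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "<path_to_model>",  # same model identifier passed to dynamo.vllm
    "messages": [{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
    "max_tokens": 64,
}

resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```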
Running Dynamo with LMCache Integration
To enable LMCache and configure CPU offload size:
LMCACHE_MAX_LOCAL_CPU_SIZE=500 \
python -m dynamo.vllm --model <path_to_model>
Launch the LMCache-enabled backend:
cd components/backends/vllm
bash launch/agg_lmcache.sh
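A simple way to sanity-check that cache reuse is working is to send two requests that share a long prompt prefix and compare their latencies. This is an illustrative sketch under the same endpoint assumptions as above (OpenAI-compatible frontend on localhost:8000); the second request should generally complete faster once the shared prefix is served from cache rather than recomputed, though exact timings depend on your hardware and configuration.

```python
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"        # assumed frontend endpoint
SHARED_PREFIX = "You are a meticulous assistant. " * 200  # long prefix to populate the KV cache


def timed_request(question: str) -> float:
    """Send one chat request with the shared prefix and return its wall-clock latency."""
    payload = {
        "model": "<path_to_model>",
        "messages": [{"role": "user", "content": SHARED_PREFIX + question}],
        "max_tokens": 32,
    }
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=300).raise_for_status()
    return time.perf_counter() - start


print(f"first request (cold prefix):  {timed_request('What is KV-cache offloading?'):.2f}s")
print(f"second request (warm prefix): {timed_request('Why offload KV-cache to CPU memory?'):.2f}s")
```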