Streamline Complex AI Inference on Kubernetes with NVIDIA Grove

Over the past few years, AI inference has evolved from single-model, single-pod deployments into complex, multicomponent systems. A model deployment may now consist of several distinct components—prefill, decode, vision encoders, key value (KV) routers, and more. In addition, entire agentic pipelines are emerging, where multiple such model instances collaborate to perform reasoning, retrieval, or multimodal tasks.

This shift has changed the scaling and orchestration problem from “run N replicas of a pod” to “coordinate a group of components as one logical system.” Managing such a system requires scaling and scheduling the right pods together, understanding that each component has distinct configuration and resource needs, starting them in a deliberate order, and placing them in the cluster with network topology in mind. Ultimately, the goal is to orchestrate a system and scale components with awareness of their dependencies as a whole, rather than one pod at a time.

To address these challenges, today we are announcing that NVIDIA Grove, a Kubernetes API for running modern ML inference workloads on Kubernetes clusters, is now available within NVIDIA Dynamo as a modular component. Grove is fully open source and available on the ai-dynamo/grove GitHub repo.

How NVIDIA Grove orchestrates inference as a whole

Grove enables you to scale your multinode inference deployment from a single replica to data center scale, supporting tens of thousands of GPUs. With Grove, you can describe your whole inference serving system in Kubernetes (for example, prefill, decode, routing, or any other component) as a single Custom Resource (CR).

From that one spec, the platform coordinates hierarchical gang scheduling, topology‑aware placement, multilevel autoscaling, and explicit startup ordering. You get precise control of how the system behaves without stitching together scripts, YAML files, or custom controllers.

Originally motivated by the challenges of orchestrating multinode, disaggregated inference systems, Grove is flexible enough to map naturally to any real-world inference architecture—from traditional single-node aggregated inference to agentic pipelines with multiple models. Grove enables developers to define complex AI stacks in a concise, declarative, and framework-agnostic manner.

Prerequisites for the multinode disaggregated serving are detailed below.

Multilevel autoscaling for interdependent components

Modern inference systems need autoscaling at multiple levels: individual components (prefill workers for traffic spikes), related component groups (prefill leaders with their workers), and entire service replicas for overall capacity. These levels affect one another: scaling prefill workers may require more decode capacity, and new service replicas need proper component ratios. Traditional pod-level autoscaling can’t handle these interdependencies.

System-level lifecycle management with recovery and rolling updates

Recovery and updates must operate on complete service instances, not individual Kubernetes pods. A failed prefill worker needs to properly reconnect to its leader after a restart, and rolling updates must preserve network topology to maintain low latency. The platform must treat multicomponent systems as single operational units optimized for both performance and availability.

Flexible hierarchical gang scheduling

The AI workload scheduler should support flexible gang scheduling that goes beyond traditional all-or-nothing placement. Disaggregated serving creates a new challenge: the inference system needs to guarantee essential component combinations (at least one prefill and decode worker, for example) while allowing independent scaling of each component type. The challenge is that prefill and decode components should scale at different ratios based on workload patterns.

Traditional gang scheduling prevents this independent scaling by forcing everything into groups that must scale together. The system needs policies that enforce minimum viable component combinations while enabling flexible scaling.

Topology-aware scheduling

Component placement affects performance. On systems like NVIDIA GB200 NVL72, scheduling the related prefill and decode pods on the same NVIDIA NVLink domain optimizes KV-cache transfer latency. The scheduler must understand physical network topology, placing related components near each other while spreading replicas for availability.

Role‑aware orchestration and explicit startup ordering

Components have different jobs, configurations, and startup requirements. For example, prefill and decode leaders execute specialized startup logic than workers, and workers can’t start before leaders are ready. The platform needs role-specific configuration and dependency enforcement for reliable initialization.

Put together, this is the bigger picture: inference teams need an easy and declarative way to describe their system as it is actually operated (multiple roles, multiple nodes, clear multilevel dependencies) and have the system schedule, scale, heal, and update to that description.

Grove primitives

High-performance inference frameworks use Grove hierarchical APIs to express role-specific logic and multilevel scaling, enabling consistent, optimized deployment across diverse cluster environments. Grove achieves this by orchestrating multicomponent AI workloads using three hierarchical custom resources in its Workload API.

For the example shown in Figure 1, PodClique A represents a frontend component, B and C represent prefill-leader and prefill-worker, and D and E represent decode-leader and decode-worker.

PodClique specifies its own replica and minimum availability counts (for example, PodClique C with three replicas and two MinAvailable). Figure 1.** **Key components of NVIDIA Grove include PodClique, ScalingGroup, and PodCliqueSet, and how they work together

PodCliques represent groups of Kubernetes pods with specific roles, such as prefill leader or worker, decode leader or worker, or a frontend service, each with independent configuration and scaling logic.
PodCliqueScalingGroups bundle tightly coupled PodCliques that must scale together, such as the prefill leader and prefill workers that together represent one model instance.
PodCliqueSets define the entire multicomponent workload, specifying startup ordering, scaling policies, and gang-scheduling constraints that ensure all components start together or fail together. When scaling for additional capacity, Grove creates complete replicas of the entire PodGangSet and defines spread constraints that distribute these replicas across the cluster for high availability, while keeping each replica’s components network-packed for optimal performance. Figure 2. Grove workflow

A Grove-enabled Kubernetes cluster brings two key components together: the Grove operator and a scheduler capable of understanding PodGang resources, such as the KAI Scheduler, an open source subcomponent of the NVIDIA Run:ai platform.

When a PodCliqueSet resource is created, the Grove operator validates the specification and automatically generates the underlying Kubernetes objects required to realize it. This includes the constituent PodCliques, PodCliqueScalingGroups, and the associated pods, services, secrets, and autoscaling policies. As part of this process, Grove also creates PodGang resources, which is a part of the Scheduler API, that translate workload definitions into concrete scheduling constraints for the cluster’s scheduler.

Each PodGang encapsulates detailed requirements for its workload, including minimum replica guarantees, network topology preferences to optimize inter-component bandwidth, and spread constraints to maintain availability. Together, these ensure topology-aware placement and efficient resource utilization across the cluster.

The scheduler continuously watches for PodGang resources and applies gang scheduling logic, ensuring that all required components are scheduled together or not at all until resources are available. Placement decisions are made with GPU topology awareness and cluster locality in mind.

The result is a coordinated deployment of multicomponent AI systems, where prefill services, decode workers, and routing components start in the correct order, are located closely for performance in the network, and recover cohesively as a group. This prevents resource fragmentation, avoids partial deployments, and enables stable, efficient operation of complex model-serving pipelines at scale.

How to get started with Grove using Dynamo

This section walks you through how to deploy a disaggregated serving architecture with a KV-routing deployer using Dynamo and Grove. The setup uses the Qwen3 0.6B model and demonstrates the ability of Grove to manage distributed inference workloads with separate prefill and decode workers.

Note: This is a foundational example designed to help you understand the core concepts. For more complicated deployments, refer to the ai-dynamo/grove GitHub repo.

Prerequisites

First, ensure that you have the following components ready in your Kubernetes cluster:

Kubernetes cluster with GPU support
kubectl configured to access your cluster
Helm CLI installed
Hugging Face token secret (referenced as hf-token-secret), which can be created with the following command:

kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=<insert_huggingface_token>

Note: In the code, replace <insert_huggingface_token> with your actual Hugging Face token. Keep this token secure and never commit it to source control.

Step 1: Create a namespace

kubectl create namespace vllm-v1-disagg-router

Step 2: Install Dynamo CRDs and Dynamo Operator with Grove

# 1. Set environment
export NAMESPACE=vllm-v1-disagg-router
export RELEASE_VERSION=0.5.1

# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Dynamo Operator + Grove
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace --set "grove.enabled=true"

Step 3: Verify Grove installation

kubectl get crd | grep grove

Expected output:

podcliques.grove.io
podcliquescalinggroups.grove.io
podcliquesets.grove.io
podgangs.scheduler.grove.io
podgangsets.grove.io

Step 4: Create the DynamoGraphDeployment configuration

Create a DynamoGraphDeployment manifest that defines a disaggregated serving architecture with one frontend, two decode workers, and one prefill worker:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: dynamo-grove
spec:
services:
Frontend:
dynamoNamespace: vllm-v1-disagg-router
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
envs:
- name: DYN_ROUTER_MODE
value: kv
VllmDecodeWorker:
dynamoNamespace: vllm-v1-disagg-router
envFromSecret: hf-token-secret
componentType: worker
replicas: 2
resources:
limits:
gpu: "1"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
workingDir: /workspace/components/backends/vllm
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- Qwen/Qwen3-0.6B
VllmPrefillWorker:
dynamoNamespace: vllm-v1-disagg-router
envFromSecret: hf-token-secret
componentType: worker
replicas: 1
resources:
limits:
gpu: "1"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
workingDir: /workspace/components/backends/vllm
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- Qwen/Qwen3-0.6B
- --is-prefill-worker

Step 5: Deploy the configuration

kubectl apply -f dynamo-grove.yaml

Step 6: Verify the deployment

Verify that operator and Grove pods were created:

kubectl get pods -n ${NAMESPACE}

Expected output:

NAME                                                              READY   STATUS    RESTARTS      AGE
dynamo-grove-0-frontend-w2xxl                                     1/1     Running     0           10m
dynamo-grove-0-vllmdecodeworker-57ghl                             1/1     Running     0           10m
dynamo-grove-0-vllmdecodeworker-drgv4                             1/1     Running     0           10m
dynamo-grove-0-vllmprefillworker-27hhn                            1/1     Running     0           10m
dynamo-platform-dynamo-operator-controller-manager-7774744kckrr   2/2     Running     0           10m
dynamo-platform-etcd-0                                            1/1     Running     0           10m
dynamo-platform-nats-0                                            2/2     Running     0           10m

Step 7: Test the deployment

First, port-forward the frontend:

kubectl port-forward svc/dynamo-grove-frontend 8000:8000 -n ${NAMESPACE}

Then test the endpoint:

curl http://localhost:8000/v1/models

Optionally, you can inspect the PodClique resource to see how Grove groups pods together including replica counts:

kubectl get podclique dynamo-grove-0-vllmdecodeworker -n vllm-v1-disagg-router -o yaml

Ready for more?

NVIDIA Grove is fully open source and available on the ai-dynamo/grove GitHub repo. We invite you to try Grove in your own Kubernetes environments—with Dynamo, as a standalone component, or along high-performance AI inference engines.

Explore the Grove Deployment Guide and ask questions on GitHub or Discord. To see Grove in action, visit the NVIDIA Booth #753 at KubeCon 2025 in Atlanta. We welcome contributions, pull requests, and feedback from the community.

To learn more, check out these additional resources:

Acknowledgments

The NVIDIA Grove project acknowledges the valuable contributions of all open source developers, testers, and community members who have participated in its evolution, with special thanks to SAP (Madhav Bhargava, Saketh Kalaga, Frank Heine) for their exceptional contributions and support. Open source thrives on collaboration—thank you for being part of Grove.