- CNCF Standardizes AI Infrastructure with New Kubernetes Program
- Technical Standards versus Process Frameworks
- Efficient Management of Accelerators in AI Training
- Extended Ingress Functions for AI Inference
- Scheduling and Orchestration of AI Workloads
- Monitoring and Metrics
- Security and Resource Separation
- Support for Operators
- An International, Community-Driven Standard with European Participation
For many companies, the central question is no longer whether to use artificial intelligence, but how to integrate it responsibly and sustainably. So far, adoption is still hindered by fragmented, non-standardized point solutions and mostly expensive proprietary AI stacks. Especially for organizations that depend on data sovereignty, compliance, and long-term financial predictability, uncoordinated AI infrastructure poses a significant risk – in hybrid cloud environments as well as on-premises.
With version 1.0 of the “Kubernetes AI Conformance” program, officially released at KubeCon + CloudNativeCon North America 2025, the Cloud Native Computing Foundation (CNCF) now aims to bring order to the fragmented AI landscape. The program goes beyond certification: it is designed as a globally supported open-source initiative to create a common technical standard for AI infrastructures. “Especially for European companies, it provides the framework for using AI safely and scalably,” explains Mario Fahlandt, who serves in the CNCF, among other roles, as co-chair of the Technical Advisory Group (TAG) Operational Resilience and of the Kubernetes Special Interest Group (SIG) Contributor Experience. “The initiative defines a clear, future-proof roadmap that ensures workload portability, technical consistency, and digital sovereignty.”
Technical Standards versus Process Frameworks
While the AI market is characterized by a multitude of certifications, decision-makers must clearly distinguish between technical and organizational standards. Some providers focus on management and governance frameworks such as ISO 42001. This international standard sets requirements for establishing an AI Management System (AIMS). It helps companies manage risks, ethical issues, data protection, and regulatory requirements. It also assesses whether internal processes ensure responsible development and deployment of AI.
The new CNCF program “Kubernetes AI Conformance” differs fundamentally from such governance standards. It is primarily a technical implementation standard: it specifies which capabilities, APIs, and configurations a Kubernetes cluster needs to run AI and ML workloads reliably and efficiently. CNCF conformance thus aims at guaranteed technical portability, which also reduces dependence on individual vendors. It ensures that companies can run their AI applications on any compliant platform in the future – in the public cloud, in their own data center, or at edge locations. This portability forms the basis of digital, and thus also data-driven, sovereignty.
The development of the standard is being driven within the Kubernetes project by a newly formed working group, supported by the Special Interest Groups Architecture and Testing. Since KubeCon Europe in spring 2025, the group has initially defined central technical pillars that consider the specific requirements of AI workloads. “Based on this, a binding catalog of requirements was created, which every platform must meet to be considered Kubernetes AI compliant,” explains Fahlandt.
Efficient Management of Accelerators in AI Training
AI training jobs require extensive hardware resources and usually need expensive, and often scarce, GPUs. In non-standardized environments, this leads to two core problems:
- Resource fragmentation: Valuable GPU memory remains unused.
- Topology blindness: Scheduling is not optimized for multi-GPU workloads.
Both aspects contribute to over-provisioning and increasing costs.
A CNCF-compliant platform must therefore support the Kubernetes API for Dynamic Resource Allocation (DRA). DRA has been considered stable since Kubernetes version 1.34 and allows complex hardware resources to be requested and shared flexibly. Similar to the PersistentVolumeClaim model for storage, users can specifically request resources from defined device classes; Kubernetes then automatically handles the scheduling and placement of the workloads.
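As an illustration, here is a minimal sketch of such a request, modeled on the PersistentVolumeClaim pattern. The device class name and container image are assumptions; real device classes are published by the respective accelerator driver:

```yaml
# Requests one device from an assumed device class via DRA
# (the resource.k8s.io API is stable as of Kubernetes 1.34).
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com  # assumed; defined by the installed driver
---
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest  # hypothetical training image
    resources:
      claims:
      - name: gpu  # references the claim declared below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```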
Extended Ingress Functions for AI Inference
AI inference workloads – i.e., AI models in operation – differ significantly from typical stateless web applications. They usually run longer, consume far more resources, and maintain state. Standard load balancers are poorly suited to distributing their traffic. The CNCF conformance program therefore requires support for the Kubernetes Gateway API and its extensions for model-aware routing.
The Gateway API Inference Extension, an official Kubernetes project, extends standard gateways to specialized inference gateways. This allows routing and load balancing to be specifically optimized for AI workloads. Supported functions include weighted traffic splitting and header-based routing, which is relevant for OpenAI protocol headers, for example.
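The model-aware capabilities themselves come from the extension’s own resources, but the two basic mechanisms mentioned above can be sketched with the standard Gateway API alone. The gateway name, services, header, and ports here are all assumptions:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-routing
spec:
  parentRefs:
  - name: inference-gateway      # assumed Gateway name
  rules:
  # Header-based routing: requests that name a specific model
  - matches:
    - headers:
      - name: X-Model-Name       # hypothetical header
        value: llama-large
    backendRefs:
    - name: llama-large          # assumed Service fronting that model
      port: 8000
  # Weighted traffic splitting, e.g. to canary a new model version
  - backendRefs:
    - name: model-v1
      port: 8000
      weight: 90
    - name: model-v2
      port: 8000
      weight: 10
```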
Scheduling and Orchestration of AI Workloads
Distributed AI training jobs consist of multiple components that must start simultaneously. If the scheduler places pods individually, deadlocks can occur: a job gets stuck because some of its pods cannot find resources while others are already blocking them. A Kubernetes platform must therefore support at least one all-or-nothing (gang) scheduling solution, such as Kueue or Volcano, so that distributed AI workloads only start if all associated pods can be placed at the same time.
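A minimal sketch of how this looks with Kueue, assuming Kueue is installed and a LocalQueue named team-a-queue exists: the Job is created in a suspended state, and Kueue unsuspends it only once all four pods can be admitted together.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue  # assumed LocalQueue
spec:
  parallelism: 4     # all four workers must be placed together
  completions: 4
  suspend: true      # Kueue unsuspends the Job only on admission
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest  # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1  # one GPU per worker; the resource name varies by vendor
```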
If a cluster autoscaler is active, it should automatically scale node groups with specific accelerator types up or down as needed. Similarly, the HorizontalPodAutoscaler must correctly scale accelerator pods, considering custom metrics relevant to AI and ML workloads.
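For the autoscaling requirement, a sketch of a HorizontalPodAutoscaler that scales an inference deployment on a custom metric; the Deployment and metric names are hypothetical, and the metric would have to be supplied by a metrics adapter running in the cluster:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server        # assumed Deployment serving the model
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth  # hypothetical custom metric
      target:
        type: AverageValue
        averageValue: "10"           # scale out above 10 queued requests per pod
```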
Monitoring and Metrics
Modern AI workloads and specialized hardware create new gaps in monitoring. A unified standard for capturing accelerator metrics is still missing – many teams therefore lack suitable tools to quickly analyze infrastructure problems.
Every CNCF-compliant platform must therefore allow installing a component that exposes performance metrics for all supported accelerator types – such as utilization or memory usage – via a standardized endpoint. In addition, a monitoring system is required that automatically collects and processes these metrics when workloads provide them in a standard format (e.g., the Prometheus exposition format).
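What such collection can look like in practice, sketched here with the Prometheus Operator (an assumption; conformance only requires that some monitoring system picks up the standard format). The label is likewise an assumption and would match whatever exporter publishes the accelerator metrics:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: accelerator-metrics
spec:
  selector:
    matchLabels:
      app: gpu-exporter   # assumed label of the accelerator metrics exporter
  endpoints:
  - port: metrics         # named port serving the Prometheus exposition format
    interval: 30s
```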
Security and Resource Separation
Accelerators like GPUs are shared resources. If strict isolation at the kernel and API level is missing, container workloads can access each other’s data or processes, thus causing security risks in multi-tenant environments. A CNCF-compliant platform must therefore clearly separate access to accelerators and control it via frameworks like Dynamic Resource Allocation (DRA) or device plugins. Only in this way can workloads be isolated and unauthorized access or interference be prevented.
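Isolation at the driver and API level is typically complemented by per-tenant limits. One simple building block is a ResourceQuota that caps accelerator requests per namespace; the resource name nvidia.com/gpu is just one example of a vendor-specific extended resource:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a               # assumed tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # at most four GPUs requested in this namespace
```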
Support for Operators
AI frameworks like Ray or Kubeflow are distributed systems that run on Kubernetes as operators. A platform needs a stable foundation for them: unstable webhooks, broken management of Custom Resource Definitions (CRDs), or an unreliable API server can cause operators to fail and bring the entire AI platform to a standstill.
A CNCF-compliant environment must be able to install and run at least one complex AI operator (e.g., Ray or Kubeflow). It must demonstrate that operator pods, webhooks, and the reconciliation of custom resources function stably and completely.
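From the user side, running such an operator might look like the following sketch for KubeRay, assuming the operator is installed (image tag and replica counts are placeholders): the operator watches RayCluster resources and reconciles the head and worker pods.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0  # placeholder version
  workerGroupSpecs:
  - groupName: workers
    replicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0  # placeholder version
```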
An International, Community-Driven Standard with European Participation
The CNCF’s Kubernetes AI Conformance program creates a stable, open, and future-proof standard for AI infrastructures based on the pillars defined by the AI Conformance working group (WG AI Conformance). Platforms based on open upstream APIs offer European companies in particular the opportunity to implement their AI strategies portably and sovereignly – from the public cloud to secure on-premises data centers. “Various vendor platforms are already ‘Kubernetes AI Conformant’ for Kubernetes versions 1.33 and 1.34,” says Fahlandt. These also include platforms from European providers such as Gardener, Giant Swarm, Kubermatic, and SUSE.
Further requirements are continuously being developed and discussed in the community process. The CNCF invites all interested parties to actively participate in the open standard. Further information about the program can be found in the official announcement.
(map)
This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.