Graphics processing units (GPUs) have become the default upgrade for companies building AI systems, particularly for inferencing — the process by which trained models generate outputs from new data. But relying on GPUs alone can limit performance and drive up costs, according to semiconductor firm AMD.
In an interview with Newsbytes.PH, AMD Asia Pacific general manager Alexey Navolokin said AI workloads increasingly require tighter coordination among CPUs, GPUs, memory, and networking, especially as models grow larger and agentic AI systems move toward real-world deployment.
“Today’s large models run across clusters of GPUs that must operate in parallel and exchange data constantly,” Navolokin said. “Overall performance depends not just on GPU speed, but on how efficiently the system moves data and coordinates computation across the entire stack.”
CPU role in AI inferencing
Navolokin said a common misconception is treating GPUs as the sole engine of AI inferencing. Modern AI models, he noted, no longer fit on a single device and depend heavily on host CPUs to manage data movement, synchronization, and latency-sensitive tasks.
“A fast CPU keeps the GPU fully utilized, reduces overhead in the inference pipeline, and cuts end-to-end latency,” he said. “Even small reductions in CPU-side delays can significantly improve application responsiveness.”
He added that tokenization — the step where inputs are converted into numerical units — relies heavily on CPU-GPU interaction.
“Inference runs token by token, and tasks such as tokenization, batching, and synchronization sit directly on the critical path,” Navolokin said. “Delays on the host CPU can slow the entire response.”
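To see why these host-side steps gate responsiveness, consider a minimal Python sketch of a token-by-token generation loop. The tokenizer and decode step below are hypothetical stand-ins (a simulated accelerator call, not any specific framework's API), meant only to show where CPU work sits on the critical path.

```python
# Minimal sketch of a token-by-token inference loop, illustrating how
# CPU-side steps (tokenization, batching, bookkeeping) sit on the
# critical path. All functions here are hypothetical stand-ins.
import time

def tokenize(text: str) -> list[int]:
    # CPU-bound: convert raw text into numerical token IDs.
    return [ord(ch) % 256 for ch in text]

def gpu_decode_step(tokens: list[int]) -> int:
    # Placeholder for a GPU forward pass that produces the next token.
    time.sleep(0.002)  # simulated accelerator latency
    return (sum(tokens) + len(tokens)) % 256

def generate(prompt: str, max_new_tokens: int = 16) -> list[int]:
    tokens = tokenize(prompt)  # host CPU work before the GPU sees anything
    for _ in range(max_new_tokens):
        next_token = gpu_decode_step(tokens)  # each step waits on the last
        tokens.append(next_token)  # host-side bookkeeping between steps
    return tokens

if __name__ == "__main__":
    start = time.perf_counter()
    generate("Why do CPUs matter for inference?")
    print(f"end-to-end latency: {time.perf_counter() - start:.3f}s")
```

In this toy loop, any extra delay in tokenize() or in the per-token bookkeeping adds directly to the measured end-to-end latency, which is the effect Navolokin describes: the accelerator can only run as fast as the host keeps it fed.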
Cost and infrastructure impact
Beyond performance, Navolokin said improved CPU-GPU balance can reduce infrastructure costs by increasing GPU utilization and lowering hardware requirements.
“Higher efficiency allows teams to meet demand with fewer CPU cores or GPU instances,” he said.
He cited South Korean IT firm Kakao Enterprise, which AMD said cut total cost of ownership by 50% and reduced server count by 60% while improving AI and cloud performance by 30% after deploying EPYC processors.
AMD’s fifth-generation EPYC processors, Navolokin said, can deliver comparable integer performance to legacy systems using up to 86% fewer racks, lowering power consumption and software licensing needs.
Agentic AI increases CPU demand
Navolokin said agentic AI systems — designed to plan, reason, and act autonomously — further increase reliance on CPUs.
“These systems generate significantly more CPU-side work than traditional inference,” he said. “Tasks such as retrieval, prompt preparation, multi-model routing, and synchronization are CPU-driven.”
In such environments, the CPU acts as the control node across distributed resources spanning data centers, cloud platforms, and edge systems.
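As a rough illustration of that control-node role, the following Python sketch uses hypothetical retrieval, prompt-building, and routing helpers to show how much of a single agentic step happens on the host before any accelerator is invoked.

```python
# Illustrative sketch of CPU-side orchestration in an agentic pipeline:
# retrieval, prompt assembly, and model routing all run on the host
# before any GPU or edge device is involved. All names are hypothetical.
def retrieve_context(query: str, documents: list[str]) -> list[str]:
    # CPU-bound retrieval: naive keyword match standing in for a vector search.
    words = query.lower().split()
    return [doc for doc in documents if any(w in doc.lower() for w in words)]

def build_prompt(query: str, context: list[str]) -> str:
    # CPU-bound prompt preparation.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

def route_model(prompt: str) -> str:
    # CPU-side routing between hypothetical endpoints based on prompt size.
    return "large-model-gpu-cluster" if len(prompt) > 500 else "small-model-edge"

def run_agent_step(query: str, documents: list[str]) -> tuple[str, str]:
    context = retrieve_context(query, documents)
    prompt = build_prompt(query, context)
    target = route_model(prompt)
    # Only at this point would the prepared prompt be dispatched to the
    # selected accelerator endpoint for generation.
    return target, prompt

if __name__ == "__main__":
    docs = ["GPU clusters handle large models.", "Edge devices suit real-time inference."]
    target, prompt = run_agent_step("Where should real-time inference run?", docs)
    print(target)
```

In a production system the final dispatch would be a call to a GPU cluster, cloud endpoint, or edge device, so the speed of the host-side work above determines how quickly that call can even be issued.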
AMD is positioning its EPYC processors as host CPUs for these workloads. The latest EPYC 9005 Series offers up to 192 cores, expanded AVX-512 execution, DDR5-6400 memory support, and PCIe Gen 5 I/O, features AMD says are designed to handle large-scale inferencing and GPU-accelerated systems.
Navolokin said the latest generation delivers a 37% improvement in instructions per cycle for machine learning and high-performance computing workloads compared with earlier EPYC processors.
He also cited Malaysian reinsurance firm Labuan Re, which expects to reduce insurance assessment turnaround times from weeks to less than a day after migrating to an EPYC-powered AI platform.
Designing for future AI workloads
As AI deployments expand beyond centralized data centers, Navolokin said organizations need to rethink infrastructure design.
“The priority should not be the performance of a single compute resource, but the ability to deploy AI consistently across heterogeneous environments,” he said.
He pointed to open platforms and distributed compute strategies as key considerations, noting that real-time inference often runs more efficiently on edge devices or AI PCs closer to data sources.
“Success in inferencing is no longer defined solely by raw compute power,” Navolokin said. “It depends on latency, efficiency, and the ability to operate across data center, cloud, and edge environments.”