Every few years, a new solution pops up promising the same dream:
- keep your CUDA codebase
- target AMD (and maybe other accelerators)
- no source rewrite
- no HIP porting
- “native performance”
On paper, that sounds perfect. Take your existing CUDA applications, swap out the toolchain, and suddenly you’re “portable.”
And to be fair: if you’re running research code or trying to get an internal tool to compile on a non-NVIDIA box, that can absolutely be useful.
But if you care about actual performance on AMD, the kind that:
- reduces latency,
- wins benchmarks,
- squeezes every TFLOP from MI***-class accelerators,
- and doesn’t send people “back to NVIDIA” after one bad experiment,
…then adopting a universal CUDA compatibility layer is the wrong long-term strategy.
Not because the engineers behind these toolchains aren’t smart (they are), but because CUDA-first compilation will always be playing catch-up with what AMD exposes natively through ROCm, HIP, and vendor-tuned libraries.
Why This Approach Is So Attractive
These pitches are extremely compelling:
“Develop your application using CUDA once and deploy it across various GPU platforms.”
Concretely, these toolchains usually do something like this:
- Provide an nvcc-compatible compiler that accepts existing CUDA code, sometimes including inline PTX.
- Target AMD GPUs via LLVM backends instead of NVIDIA’s drivers.
- Implement the CUDA runtime, driver, and math APIs on top of AMD’s ROCm stack.
- Ship wrapper libraries that map CUDA-X APIs (e.g., cuBLAS/cuSOLVER) onto rocBLAS/rocSOLVER and friends.
- Maintain validation sets showing well-known CUDA projects compiling and running on AMD hardware.
From a developer’s point of view, it feels magical:
# On NVIDIA
nvcc my_app.cu -o my_app_nvidia
# On “everything”
nvcc my_app.cu -o my_app_other_gpu
For legacy CUDA-heavy HPC where a HIP/SYCL/ROCm rewrite would be painful, this is honestly a nice option.
But that use case is very different from:
“We want state-of-the-art LLM inference and training performance on AMD, on par with or better than NVIDIA.”
Those are not the same problem. At all.
CUDA Semantics ≠ AMD Semantics
CUDA was designed around NVIDIA’s hardware model (warps, memory hierarchy, intrinsics, PTX, CUDA-X libraries). NVIDIA’s warp size is 32 threads.
AMD’s world is different: on CDNA/GCN (Instinct class) the wavefront is 64 work-items (AMD ROCm Documentation). RDNA (consumer) introduced a native wave32 mode, but CDNA (datacenter) remains wave64 (Wikipedia).
If AMD had decided from day one to be “CUDA-native hardware,” life would look very different. But that’s not reality: AMD chose its own architecture, software stack, and optimization paths.
Modern AMD guidance explicitly talks about MI***-specific optimization: kernel shapes, memory tiling, GEMM tuning, and precision choices aligned to CDNA strengths.
If you start from CUDA as the “source of truth,” you’re asking a translation layer to:
- Parse CUDA (often including inline PTX).
- Convert it to an IR like LLVM.
- Map that onto AMD’s ISA and ROCm stack.
- Somehow emit code that competes with AMD-first kernels and libraries crafted for ROCm.
That’s a very high bar.
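To make step one concrete: CUDA codebases routinely carry inline PTX, and PTX simply does not exist on AMD. Below is a minimal, self-contained sketch (our illustration, not code from any of these toolchains) of the kind of input a translator has to recognize and re-express; the %clock64 read is just a common NVIDIA-specific timing idiom.

#include <cstdio>
#include <cuda_runtime.h>

// Kernel with inline PTX: reads the NVIDIA-specific %clock64 special register.
// There is no PTX on AMD hardware, so a CUDA-first translator must detect this
// idiom and map it to a GCN/CDNA counterpart, or fall back to something slower.
__global__ void read_clock(unsigned long long* out) {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(t));
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *out = t;
    }
}

int main() {
    unsigned long long h = 0, *d = nullptr;
    cudaMalloc(&d, sizeof(h));
    read_clock<<<1, 32>>>(d);
    cudaMemcpy(&h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("clock64 = %llu\n", h);
    cudaFree(d);
    return 0;
}

Every such idiom the layer misses, or maps to a slow emulation, becomes a correctness or performance cliff on AMD.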
Concrete example #1: 32-wide warps vs. 64-wide wavefronts
- NVIDIA: warp = 32 threads.
- AMD CDNA/GCN: wavefront = 64 work-items.
If a kernel’s control flow, synchronization, or memory coalescing is implicitly tuned for warp32, running it unchanged on wave64 can strand lanes or force the compiler/runtime to add masking/shuffles, often cutting effective throughput. In other words: the “same” kernel can be ~2× off the peak simply from the execution-width mismatch, before we even talk about cache behavior or matrix-math usage. AMD dev material underscores how wave64 has different resource/occupancy characteristics than wave32 (gpuopen.com).
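As a minimal sketch of that mismatch (our illustration of the common grid-stride reduction idiom, not code from any specific toolchain), here is a block-sum kernel written the usual CUDA way, with the 32-lane width hardcoded into the shuffle loop and the lane check:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Block-level sum reduction written the classic CUDA way: the 32-lane warp
// width is baked into the shuffle loop and the lane check. On CDNA (wave64)
// the natural unit is 64 work-items, so a 1:1 translation of this pattern
// either leaves half of each wavefront's width unused or needs extra masking.
__global__ void sum_warp32(const float* in, float* out, int n) {
    float val = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
        val += in[i];
    }
    for (int offset = 16; offset > 0; offset >>= 1) {   // warp32 assumption
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    if ((threadIdx.x & 31) == 0) {                       // one partial sum per 32 lanes
        atomicAdd(out, val);
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_in(n, 1.0f);                    // expected sum: n
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));
    sum_warp32<<<256, 256>>>(d_in, d_out, n);
    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", h_out, n);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

HIP exposes warpSize (64 on CDNA) precisely so kernels can be written width-agnostically; code that bakes in 32, like the loop above, forfeits that and leaves it to the translation layer to patch up.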
Performance Isn’t “It Compiled”
On paper, these toolchains often say:
- “Ahead-of-time compilation, not emulation.”
- “No inherent overhead vs native paths.”
For some classes of applications (classic CUDA-tuned HPC without mature ROCm ports), that can be reasonably true.
But for modern AI on AMD, performance comes from things way beyond “the CUDA syntax compiled”:
- Fused kernels designed for AMD’s wavefront, caches, and memory system.
- GEMM tuning across BF16/FP16/FP8 with AMD’s library paths and MFMA shapes.
- FP8 enablement: quantizing weights/activations to AMD’s FP8 (E4M3/E5M2), FP8 KV-cache, scaling policies, and packing, wired up in frameworks like vLLM.
- Tensor parallelism & communication tuned for MI***-class topology and ROCm collectives.
These are AMD-first engineering choices. A generic “CUDA front-end → AMD back-end” doesn’t conjure them out of thin air.
Concrete example #2: FP8 isn’t plug-and-play
On AMD CDNA3 (MI300-class), FP8 uses the E4M3/E5M2 formats and associated scaling/packing (AMD ROCm Documentation). To run LLMs efficiently in FP8, you don’t just “compile” your CUDA. You typically pre-process or quantize weights and enable FP8 KV-cache/activations via AMD-aware flows; for example, the Quark quantization tutorials and vLLM FP8 guides show explicit steps and configuration to hit the fast paths (AMD ROCm Documentation, rocm.blogs.amd.com, vLLM Docs).
If a translation stack has to emulate missing FP8 behavior or falls back to non-optimal packing/scaling, the “portable” path quickly becomes measurably slower than AMD-native FP8 enablement.
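To give a feel for what “scaling policies” means in practice, here is a minimal, self-contained sketch of per-tensor FP8 scale selection (our illustration of the general idea; the format constants follow the OCP FP8 formats, but the function and variable names are ours, not Quark’s or vLLM’s APIs):

#include <cmath>
#include <cstdio>
#include <vector>

// Largest finite values of the OCP FP8 formats used on CDNA3.
constexpr float E4M3_MAX = 448.0f;
constexpr float E5M2_MAX = 57344.0f;

// Per-tensor scale: map the tensor's absolute maximum onto the top of the FP8
// range. The scale is stored alongside the weights and re-applied at runtime.
float fp8_scale(const std::vector<float>& w, float fp8_max) {
    float amax = 0.0f;
    for (float x : w) amax = std::fmax(amax, std::fabs(x));
    return amax > 0.0f ? fp8_max / amax : 1.0f;
}

int main() {
    std::vector<float> weights = {0.02f, -1.7f, 3.4f, -0.003f};
    printf("E4M3 scale: %.3f\n", fp8_scale(weights, E4M3_MAX));
    printf("E5M2 scale: %.3f\n", fp8_scale(weights, E5M2_MAX));
    // A flow that skips this preparation, or emulates FP8 in a wider type,
    // pays that cost on every GEMM instead of once at load time.
    return 0;
}

Real flows add per-channel or per-block scales, calibration, and FP8 KV-cache handling on top; the point is that this preparation happens outside the kernel, so “just compiling the CUDA” never triggers it.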
How We Approach It (Paiton)
This is exactly why, at Eliovp, we took the opposite approach with Paiton.
Paiton was and is being built for AMD first, not as a side effect of a CUDA translator. We:
- work directly on ROCm/HIP,
- integrate with vLLM/SGLang rather than replace them,
- write and tune custom kernels for MI***-class hardware,
- fuse ops, tune GEMMs, and optimize FP8 data paths,
- optimize tensor parallelism and communication for AMD interconnect and topology.
In our public work, MI300X + Paiton beats newer-gen NVIDIA setups in real LLM inference while you keep your existing engine stack. That’s the point: a compatibility layer tries to make AMD behave like “CUDA-compatible hardware”; Paiton leans into what AMD actually is and extracts more.
Could a translation stack match that across FP8, MoE, TP, and real traffic? Only by re-implementing the same AMD-specific effort inside the translator, perpetually.
The Ecosystem Risk: Bad First Impressions
Typical pattern:
- Team with a CUDA codebase tries a “drop-in” tool for AMD.
- Runs LLM/vision workloads; sees sub-optimal numbers vs NVIDIA baseline.
- Internal conclusion: “We tried AMD. It’s slower.”
They rarely check whether:
- the layer used AMD FP8 fast paths or proper quantization,
- ROCm-first kernels would have done better,
- framework guidance suggests different AMD-specific flags.
They just see a dashboard. The result: “proof” AMD is slower, even when AMD-first stacks show the opposite.
The Maintenance Treadmill
To keep pace with AMD-native tooling, a “CUDA-everywhere” layer must constantly:
- track ROCm releases (new FP8 paths, GEMM lifts, library upgrades) (AMD ROCm Documentation),
- follow new Instinct GPUs and their tuning knobs (AMD ROCm Documentation),
- mirror techniques landing in vLLM for AMD (FP8 KV-cache, attention backends, etc.) (vLLM Docs),
- and stay compatible with evolving CUDA semantics.
That’s a high-burn, reactive position, one step removed from where AMD and partners ship optimizations first.
Where These Tools Do Make Sense (and a fair note on consumer GPUs)
- Large legacy CUDA/HPC codebases where you need “it runs!” quickly.
- Functionality validation while planning a proper ROCm/HIP path.
- Consumer/gaming GPUs: RDNA introduced native wave32, aligning with CUDA’s warp width. For desktop users, where ROCm setup can be trickier or officially limited, these layers can be a pragmatic bridge for experimentation.
This is a reasonable on-ramp. It’s just not how you showcase what AMD Instinct hardware can really do on LLMs.
Bottom Line
We’re not saying “never use CUDA-on-AMD compilers or CUDA-to-HIP translators”. We’re saying don’t judge AMD based on them.
If you want to see what AMD GPUs can actually do for AI:
- use AMD-first kernels and libraries,
- run ROCm-native tuning,
- configure frameworks for AMD FP8 and attention backends,
- choose parallelism strategies that fit MI***-class topology,
- and work with teams who optimize for AMD by design, not by translation.
- or simply use Paiton, which comes with all the necessary optimizations built in.
Otherwise, you’re adding a compatibility tax and blaming the hardware for the bill.
AMD didn’t choose to be a CUDA clone, and that’s fine. Treat it as its own platform with its own strengths, and it will surprise you. That’s the thesis behind Paiton, and why we keep beating expectations with AMD-first engineering.