🚀 The feature, motivation and pitch
Disclosure:
Yes: I used an LLM to prepare this post, obviously, who has time to type all this up… Yes: I used an LLM to find and patch in what was necessary. Yes: It works and provides significant performance uplift. Yes: The work is incomplete to provide full FP8 across all models, but this is a start. Yes: I’m bitter AF about having to take the time to do this. No: I will not jump through all the hoops to upstream this as a PR myself, I’ll continue to patch locally as required, I don’t have time to deal with PR/CR process.
BLUF: LLM go faster with small code changes, benchmark results:
Native FP8 WMMA Support for AMD RDNA4 (RX 9070 XT / R9700) in vLLM
Summary
Successfully enabled native FP8 WMMA operations on AMD RDNA4 GPUs in vLLM, achieving significant performance improvements by utilizing the hardware’s 128 AI accelerators instead of dequantizing FP8 weights to FP32. This has been in production deployment for 4 days with zero observed aberrations as a result.
Performance Results
Testing with FP8-quantized Qwen3 models on AMD Radeon R9700:
| Model | Before (TPS) | After (TPS) | Improvement |
|---|---|---|---|
| Qwen3-0.6B | ~160 decode | ~200 decode | 25% faster |
| Qwen3-30B-2507 | ~52 decode | ~85 decode | 63% faster |
Additionally, prefill performance nearly doubled in certain scenarios with these rough kernel configs at prompt lengths up to 10,000 tokens, with gains tapering as the token count grows due to memory pressure. Significant additional performance remains untapped pending further kernel tuning.
With further improvements this should scale well: raising the memory clock from 1258 to 1375 yields a further ~5% uplift, indicating there is room for memory-transfer optimization to make more efficient use of the existing bandwidth.
System Environment
- Hardware: AMD Radeon AI Pro R9700 (RDNA4 gfx1201)
  - 128 AI Accelerators
  - Native FP8 E4M3FN support
  - 16x16 WMMA instruction tile size
- Software Stack:
  - vLLM Version: 0.11.1rc6.dev223+g404d7a9d1
  - ROCm Version: 7.0.0
  - AMD Driver: 6.16.6
  - Base Image: rocm/vllm-dev:nightly
The Problem
By default, vLLM on RDNA4 was dequantizing FP8 weights to FP32 for all operations, completely wasting the hardware’s 128 AI accelerators. The execution path fell back to torch_channelwise_w8a8_scaled_mm() which explicitly upcasts to FP32:
# Default broken path
output = torch._scaled_mm(
qinput, weight,
scale_a=TORCH_DEVICE_IDENTITY,
scale_b=TORCH_DEVICE_IDENTITY,
out_dtype=torch.float32 # ← Upcast to FP32!
)
This meant zero performance benefit from FP8 quantization on RDNA4. In fact, it ran like trash, absolutely horrible.
The Solution
Follow the MI350X Triton kernel path by:
- Adding RDNA4 (gfx1201) to platform detection
- Patching AITER’s architecture mapping to recognize gfx1201
- Adding RDNA4-specific matrix sizes to kernel tuning
- Providing optimized kernel configurations
Important Note: AITER’s C++/ASM kernels do not work on RDNA4 and must be disabled (VLLM_ROCM_USE_AITER=0). However, vLLM’s FP8 code path imports AITER’s Triton kernels (aiter.ops.triton.gemm_a8w8_blockscale). These Triton kernels check AITER’s architecture mapping and will crash with a KeyError if gfx1201 is not recognized. Therefore, we must patch AITER’s architecture detection before vLLM starts, then Triton automatically compiles these kernels down to native WMMA instructions when it detects FP8 data types on gfx1201 hardware.
This routes FP8 operations through native WMMA instructions instead of dequantization.
Implementation Details
1. vLLM Code Modifications
Two files need to be modified in the vLLM source code:
File: vllm/platforms/rocm.py
Original:
def on_mi3xx() -> bool:
GPU_ARCH = torch.cuda.get_device_properties("cuda").gcnArchName
return any(arch in GPU_ARCH for arch in ["gfx942", "gfx950"])
Modified:
def on_mi3xx() -> bool:
GPU_ARCH = torch.cuda.get_device_properties("cuda").gcnArchName
# Added gfx1201 (RDNA4) to enable FP8 Triton kernel path
return any(arch in GPU_ARCH for arch in ["gfx942", "gfx950", "gfx1201"])
Change: Added "gfx1201" to enable RDNA4 to use the MI350X Triton kernel code path.
File: vllm/model_executor/layers/quantization/utils/fp8_utils.py
Original:
def is_aiter_triton_kernel_tuned(n, k):
return (n, k) in [
(1024, 8192),
(2112, 7168),
(3072, 1536),
(32768, 8192),
(4096, 7168),
(4608, 7168),
(512, 7168),
(7168, 2048),
(7168, 256),
(8192, 1024),
(8192, 32768),
]
Modified:
def is_aiter_triton_kernel_tuned(n, k):
# MI350 tuned sizes
mi350_sizes = [
(1024, 8192),
(2112, 7168),
(3072, 1536),
(32768, 8192),
(4096, 7168),
(4608, 7168),
(512, 7168),
(7168, 2048),
(7168, 256),
(8192, 1024),
(8192, 32768),
]
# RDNA4 (gfx1201) specific sizes verified to work
rdna4_sizes = [
(1024, 1024), # K, V projections
(2048, 1024), # Q projection
(3072, 1024), # Gate, Up projections
(1024, 3072), # Down projection
(1024, 2048), # O projection (transposed)
(512, 512), # Small models
(1024, 512), # Asymmetric
(512, 1024), # Asymmetric reverse
(2048, 2048), # Medium models
(4096, 4096), # 7B class models
(8192, 8192), # 70B class models
]
# Check architecture to include RDNA4 sizes
import torch
arch_name = torch.cuda.get_device_properties(0).gcnArchName
if "gfx12" in arch_name:
# For RDNA4, include both RDNA4 and MI350 sizes
return (n, k) in (rdna4_sizes + mi350_sizes)
else:
# Other architectures use original MI350 sizes only
return (n, k) in mi350_sizes
Changes:
- Refactored the original list into the mi350_sizes variable
- Added 11 RDNA4-specific matrix dimensions in rdna4_sizes
- Added architecture detection that returns the combined list for RDNA4
- The sizes used were tested and found to work correctly
2. AITER Architecture Patch (Required)
Why This Is Needed:
While AITER’s C++/ASM kernels don’t work on RDNA4 (hence VLLM_ROCM_USE_AITER=0), vLLM still imports AITER’s Triton kernels:
from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale
When these Triton kernels execute, they internally call arch_info.get_device() which looks up the GPU architecture in AITER’s _ARCH_TO_DEVICE dictionary. Since gfx1201 is not in this dictionary by default, it throws a KeyError and crashes.
The solution: Patch AITER’s architecture mapping before vLLM starts.
Click to expand: AITER Patch Wrapper Script
rdna4_aiter_wrapper.sh
#!/bin/bash
# RDNA4 FP8 vLLM Wrapper
# Patches AITER architecture detection and disables AITER C++/ASM kernels
echo "=========================================="
echo "RDNA4 FP8 Startup"
echo "=========================================="
# Disable AITER's C++/ASM implementations (they don't work on RDNA4)
export VLLM_ROCM_USE_AITER=0
echo "Environment: VLLM_ROCM_USE_AITER=0"
# Patch AITER's architecture mapping BEFORE vLLM imports it
echo ""
echo "Patching AITER architecture detection for gfx1201..."
python3 -c "
import aiter.ops.triton.utils.arch_info as arch_info
if 'gfx1201' not in arch_info._ARCH_TO_DEVICE:
arch_info._ARCH_TO_DEVICE['gfx1201'] = 'MI350X'
print('[AITER Patch] ✓ Added gfx1201 -> MI350X mapping')
else:
print('[AITER Patch] gfx1201 already mapped')
"
if [ $? -eq 0 ]; then
echo "✓ AITER patch applied"
else
echo "✗ AITER patch failed - vLLM will crash"
exit 1
fi
echo ""
echo "Launching vLLM..."
exec vllm serve "$@"
What This Does:
- Disables AITER C++/ASM kernels via VLLM_ROCM_USE_AITER=0
- Patches AITER's Triton code by adding the gfx1201 -> MI350X architecture mapping
- Launches vLLM, which can now successfully import and use AITER's Triton kernels
Docker Integration:
Set this script as your Docker ENTRYPOINT:
COPY rdna4_aiter_wrapper.sh /workspace/
RUN chmod +x /workspace/rdna4_aiter_wrapper.sh
ENTRYPOINT ["/workspace/rdna4_aiter_wrapper.sh"]
Or mount the script and override the container entrypoint at run time (e.g., from a systemd-managed docker run):
-v /path/to/rdna4_aiter_wrapper.sh:/workspace/wrapper.sh \
--entrypoint /workspace/wrapper.sh \
Alternative: Inline Patch
If you don’t want a wrapper script, add this to your Docker ENTRYPOINT or startup command:
python3 -c "import aiter.ops.triton.utils.arch_info as arch_info; arch_info._ARCH_TO_DEVICE['gfx1201'] = 'MI350X'" && \
export VLLM_ROCM_USE_AITER=0 && \
vllm serve "$@"
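If you prefer to apply the patch inside the same Python process that ends up serving the model (rather than in a separate python3 -c invocation), a small launcher along the following lines should also work. This is a sketch, not the wrapper used above; it assumes the vllm console script maps to vllm.entrypoints.cli.main:main, which may differ between vLLM versions.
#!/usr/bin/env python3
# rdna4_vllm_launcher.py -- hedged sketch, not the shipped wrapper above.
# Applies the gfx1201 -> MI350X mapping in the same interpreter that runs vLLM.
import os
import sys

# Keep AITER's C++/ASM kernels disabled, matching the shell wrapper.
os.environ.setdefault("VLLM_ROCM_USE_AITER", "0")

# Same patch as the wrapper: teach AITER's Triton utilities about gfx1201.
import aiter.ops.triton.utils.arch_info as arch_info  # noqa: E402
arch_info._ARCH_TO_DEVICE.setdefault("gfx1201", "MI350X")

# Assumption: the `vllm` CLI entry point is vllm.entrypoints.cli.main:main.
from vllm.entrypoints.cli.main import main  # noqa: E402

if __name__ == "__main__":
    # Usage: python3 rdna4_vllm_launcher.py serve <model> [args...]
    sys.argv[0] = "vllm"
    main()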
3. Kernel Configuration Files
16 JSON configuration files are required to optimize FP8 operations for RDNA4. These files tell vLLM’s Triton compiler how to tile and execute FP8 matrix multiplications. The values are an educated guess, but they still deliver a large performance uplift; there is significant room to improve throughput by properly tuning them.
File Locations:
- Linear layers: vllm/model_executor/layers/quantization/utils/configs/
- MoE layers: vllm/model_executor/layers/fused_moe/configs/
File Naming: N={n},K={k},device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
- device_name=0x7551 = AMD Radeon RX 9070 XT device ID
- dtype=fp8_w8a8 = FP8 weights and activations
- block_shape=[128,128] = Quantization block size
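To make the JSON files below visible to an installed vLLM (for example inside the container), they can be copied into the two directories above. The helper below is a hedged sketch: the local rdna4_configs/ layout is illustrative, not something produced by this work; only the two target paths come from vLLM itself.
# Hedged helper: copy the RDNA4 (0x7551) kernel configs into the installed vLLM tree.
import os
import shutil

import vllm

vllm_root = os.path.dirname(vllm.__file__)
targets = {
    "linear": os.path.join(vllm_root, "model_executor/layers/quantization/utils/configs"),
    "moe": os.path.join(vllm_root, "model_executor/layers/fused_moe/configs"),
}

for kind, target_dir in targets.items():
    src_dir = os.path.join("rdna4_configs", kind)  # illustrative local layout
    for name in os.listdir(src_dir):
        if name.endswith(".json") and "0x7551" in name:
            shutil.copy(os.path.join(src_dir, name), target_dir)
            print(f"Installed {name} -> {target_dir}")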
Kernel Configuration Files
Linear Layer Configs (15 files)
All linear configs use this structure with batch size keys (“16”, “32”, “64”):
Click to expand all 15 linear config files
N=1024,K=1024,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=1024,K=1536,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=1024,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=1024,K=3072,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=2048,K=1024,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=2048,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=2048,K=4096,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=2048,K=768,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=2560,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=256,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=3072,K=1024,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=384,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=4096,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=512,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
N=768,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"32": {
"BLOCK_SIZE_K": 32,
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"GROUP_SIZE_M": 16,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
},
"64": {
"BLOCK_SIZE_K": 64,
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"GROUP_SIZE_M": 8,
"kpack": 1,
"matrix_instr_nonkdim": 16,
"num_warps": 4
}
}
MoE Layer Config (1 file)
MoE configs use num_stages instead of kpack and matrix_instr_nonkdim:
E=128,N=384,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json
{
"16": {
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 32,
"GROUP_SIZE_M": 8,
"num_warps": 4,
"num_stages": 2
},
"32": {
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 32,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 2
},
"64": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 8,
"num_warps": 4,
"num_stages": 2
}
}
Configuration Parameters Explained
| Parameter | Description | RDNA4 Values |
|---|---|---|
| BLOCK_SIZE_M/N/K | Tile size for GEMM operations | 32, 64 (multiples of 16) |
| GROUP_SIZE_M | Batch grouping for processing | 8, 16 |
| num_warps | Warps per thread block | 4 (aligns with CU SIMDs) |
| matrix_instr_nonkdim | WMMA instruction size | 16 (fixed for RDNA4) |
| kpack | K-dimension packing factor | 1 (conservative) |
| num_stages | Pipeline stages (MoE only) | 2 |
Notes:
- All dimensions must be multiples of 16 (WMMA tile size)
- Configs are conservative; significant tuning potential remains
- num_warps=4 aligns well with RDNA4’s compute unit structure
- Larger block sizes (128x128) could be explored for batch operations, but they caused performance instability/inconsistency during prefill and were removed in favor of the default values
Key Technical Details
Triton automatically detects and uses WMMA instructions when all these conditions are met:
- GPU Architecture: gfx1201 (RDNA4) is detected
- Data Types: FP8 (float8_e4m3fn) input tensors
- Operation: Matrix multiplication via tl.dot()
- Tile Sizes: Dimensions are multiples of 16 (WMMA tile size)
When vLLM imports aiter.ops.triton.gemm_a8w8_blockscale, it’s just importing Triton code. The Triton compiler examines the target architecture, sees FP8 operations on gfx1201, and automatically generates assembly code using RDNA4’s 16x16x16 WMMA instructions (v_wmma_f32_16x16x16_fp8_fp8).
No runtime patching or AITER library support is needed beyond the gfx1201 entry inserted into AITER’s architecture mapping (described above) - it’s pure Triton compilation doing the heavy lifting.
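To confirm that Triton really emits WMMA for FP8 on a given build, a standalone probe along these lines can be used. This is a sketch, not vLLM code: it assumes Triton’s ROCm backend exposes the generated ISA via the launched kernel handle’s asm["amdgcn"] entry and that torch.float8_e4m3fn inputs map onto Triton’s FP8 type on this stack.
# Hedged probe: compile a tiny FP8 GEMM with Triton and grep the ISA for WMMA.
import torch
import triton
import triton.language as tl


@triton.jit
def fp8_gemm_probe(a_ptr, b_ptr, c_ptr, M, N, K,
                   stride_am, stride_ak, stride_bk, stride_bn,
                   stride_cm, stride_cn,
                   BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)  # FP8 E4M3FN tile
        b = tl.load(b_ptrs)
        acc = tl.dot(a, b, acc)  # expected to lower to v_wmma_f32_16x16x16_fp8_fp8 on gfx1201
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)


M = N = K = 128
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(K, N, device="cuda").to(torch.float8_e4m3fn)
c = torch.empty(M, N, device="cuda", dtype=torch.float32)
grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
handle = fp8_gemm_probe[grid](a, b, c, M, N, K,
                              a.stride(0), a.stride(1),
                              b.stride(0), b.stride(1),
                              c.stride(0), c.stride(1),
                              BLOCK_M=64, BLOCK_N=64, BLOCK_K=64)
print("WMMA used:", "v_wmma" in handle.asm["amdgcn"])
If this prints True, the compiler is emitting WMMA for FP8 tl.dot on this stack, which is the same lowering the AITER Triton blockscale GEMM relies on.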
Execution Path
FP8 Model Load
↓
on_mi3xx() returns True (gfx1201 recognized)
↓
rocm_aiter_gemm_w8a8_blockscale_impl() [fp8_utils.py:71]
↓
is_aiter_triton_kernel_tuned(n, k) returns True
↓
Imports aiter.ops.triton.gemm_a8w8_blockscale (Triton code)
↓
Triton compiler detects gfx1201 + FP8 types
↓
Auto-generates WMMA instructions (v_wmma_f32_16x16x16_fp8_fp8)
↓
✓ Native FP8 WMMA execution!
Verification
Check Platform Detection
from vllm.platforms import rocm
import torch
print(f"GPU: {torch.cuda.get_device_properties(0).gcnArchName}") # Should be gfx1201
print(f"Detected as MI3xx: {rocm.on_mi3xx()}") # Should be True
Check AITER is Disabled
echo $VLLM_ROCM_USE_AITER # Should output: 0
Look for Log Messages
✅ GOOD: "Using configuration from .../N=2048,K=1024,device_name=0x7551,dtype=fp8_w8a8..."
❌ BAD: "Using default W8A8 Block FP8 kernel config"
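You can also confirm that the 0x7551 config files landed in the vLLM installation the server actually imports; a quick hedged check (the config directory path is the linear-layer location listed earlier):
# Sanity check: list the RDNA4 linear-layer configs visible to the installed vLLM.
import os

import vllm

cfg_dir = os.path.join(os.path.dirname(vllm.__file__),
                       "model_executor/layers/quantization/utils/configs")
rdna4_cfgs = sorted(f for f in os.listdir(cfg_dir) if "0x7551" in f)
print(f"{len(rdna4_cfgs)} RDNA4 linear-layer configs found in {cfg_dir}")
for name in rdna4_cfgs:
    print(" ", name)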
Known Limitations
- Minimum Batch Size: RDNA4 requires M ≥ 16 for FP8 operations. Single-token generation may need padding; it showed failures in testing.
- AITER Runtime Patch Required: AITER’s C++/ASM kernels don’t work on RDNA4, but we still need AITER’s Triton kernels. This requires patching AITER’s architecture mapping at runtime before vLLM starts, otherwise you’ll get a KeyError: 'gfx1201' crash.
- Kernel Configs: Current configs are conservative guesses. Significant performance improvements are possible with tuning.
- Model Coverage: Configs cover common LLM dimensions. New models may need additional configs.
Future Optimization Opportunities
Kernel Tuning:
- Larger block sizes (128x128) for batch operations
- Experiment with kpack=2,4 for better memory bandwidth
- Profile optimal GROUP_SIZE_M values per matrix size
Additional Matrix Sizes:
- Cover more model architectures (Llama, Mistral, etc.)
- Add configs for larger models (70B+, 405B)
- Support for different TP splits (TP=4, TP=8)
Batch Size Handling:
- Implement automatic padding for M < 16 (see the sketch after this list)
- Optimize for common batch sizes (1, 4, 8, 16, 32)
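As a rough illustration of the padding idea above, and not code that exists in vLLM today, the activation matrix could be zero-padded up to the 16-row WMMA minimum before the GEMM and the result sliced back afterwards. gemm_with_m_padding and fp8_gemm below are illustrative names.
# Hedged sketch: pad activations to the 16-row WMMA minimum before an FP8 GEMM.
# `fp8_gemm` is a placeholder for whatever blockscale GEMM callable is in use;
# it is not an existing vLLM function.
import torch

WMMA_MIN_M = 16  # minimum M for RDNA4 FP8 WMMA (see Known Limitations)


def gemm_with_m_padding(fp8_gemm, x: torch.Tensor, *args, **kwargs) -> torch.Tensor:
    m = x.shape[0]
    if m >= WMMA_MIN_M:
        return fp8_gemm(x, *args, **kwargs)
    # Padded rows only produce extra output rows, which are sliced off below.
    pad = torch.zeros(WMMA_MIN_M - m, x.shape[1], dtype=x.dtype, device=x.device)
    out = fp8_gemm(torch.cat([x, pad], dim=0), *args, **kwargs)
    return out[:m]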
Contributing
This work enables native FP8 support on RDNA4 for the first time in vLLM. If you use these modifications:
- Test other models: Try different FP8 models and report results
- Tune configs: Experiment with kernel parameters and share improvements
- Add matrix sizes: Contribute configs for models not yet covered
- Upstream: Help prepare patches for vLLM mainline
Acknowledgments
This implementation follows the MI350X Triton kernel path and builds upon AMD’s AITER library and vLLM’s FP8 quantization framework.
Key insight: RDNA4 uses the same standard FP8 E4M3FN format as MI350X (not FNUZ like MI300), making it compatible with MI350X’s Triton kernels after proper platform detection.
Status: Production-ready with ongoing optimization opportunities.
Last Updated: 2025-11-13
Click to expand: TunableOp results that further improved performance
/vllm-tunableop/2025-11-08$ cat tunableop_results0.csv
Validator,PT_VERSION,2.9.0
Validator,ROCM_VERSION,7.0.0.0-38-9428210
Validator,HIPBLASLT_VERSION,100000-976b9c4a87
Validator,GCN_ARCH_NAME,gfx1201
Validator,ROCBLAS_VERSION,5.0.0.976b9c4a87
GemmTunableOp_BFloat16_TN,tn_128_48_2048_ld_2048_2048_128,Gemm_Rocblas_48090,0.0158741
GemmTunableOp_BFloat16_TN,tn_128_24_2048_ld_2048_2048_128,Gemm_Hipblaslt_47905,0.0160022
GemmTunableOp_BFloat16_TN,tn_128_192_2048_ld_2048_2048_128,Gemm_Hipblaslt_48249,0.0161586
GemmTunableOp_BFloat16_TN,tn_75968_4_2048_ld_2048_2048_75968,Gemm_Hipblaslt_48359,0.614931
GemmTunableOp_BFloat16_TN,tn_75968_1_2048_ld_2048_2048_75968,Gemm_Rocblas_48360,0.617321
GemmTunableOp_BFloat16_TN,tn_128_8192_2048_ld_2048_2048_128,Gemm_Hipblaslt_47847,0.0357815
GemmTunableOp_BFloat16_TN,tn_128_384_2048_ld_2048_2048_128,Gemm_Hipblaslt_47758,0.0159846
GemmTunableOp_BFloat16_TN,tn_75968_8_2048_ld_2048_2048_75968,Gemm_Rocblas_48360,0.612155
GemmTunableOp_BFloat16_TN,tn_128_4096_2048_ld_2048_2048_128,Gemm_Hipblaslt_47652,0.0236782
GemmTunableOp_BFloat16_TN,tn_128_96_2048_ld_2048_2048_128,Gemm_Hipblaslt_48089,0.0157361
GemmTunableOp_BFloat16_TN,tn_128_2048_2048_ld_2048_2048_128,Gemm_Rocblas_47754,0.020133
GemmTunableOp_BFloat16_TN,tn_128_1_2048_ld_2048_2048_128,Gemm_Rocblas_-9,0.0129801
GemmTunableOp_BFloat16_TN,tn_128_2_2048_ld_2048_2048_128,Gemm_Hipblaslt_47852,0.0159458
GemmTunableOp_BFloat16_TN,tn_128_4_2048_ld_2048_2048_128,Gemm_Hipblaslt_48089,0.015883
GemmTunableOp_BFloat16_TN,tn_128_3072_2048_ld_2048_2048_128,Default,0.0215866
GemmTunableOp_BFloat16_TN,tn_128_8_2048_ld_2048_2048_128,Gemm_Rocblas_48019,0.0165829
GemmTunableOp_BFloat16_TN,tn_128_256_2048_ld_2048_2048_128,Gemm_Rocblas_48275,0.0158254
GemmTunableOp_BFloat16_TN,tn_128_16_2048_ld_2048_2048_128,Gemm_Hipblaslt_48093,0.0155757
GemmTunableOp_BFloat16_TN,tn_75968_16_2048_ld_2048_2048_75968,Gemm_Rocblas_48360,0.616733
GemmTunableOp_BFloat16_TN,tn_128_32_2048_ld_2048_2048_128,Gemm_Hipblaslt_47762,0.0152749
GemmTunableOp_BFloat16_TN,tn_128_128_2048_ld_2048_2048_128,Gemm_Hipblaslt_48093,0.0156526
GemmTunableOp_BFloat16_TN,tn_128_64_2048_ld_2048_2048_128,Gemm_Hipblaslt_47762,0.0155398
GemmTunableOp_BFloat16_TN,tn_128_512_2048_ld_2048_2048_128,Gemm_Hipblaslt_47760,0.0158461
GemmTunableOp_BFloat16_TN,tn_128_1024_2048_ld_2048_2048_128,Gemm_Rocblas_48360,0.0173334
GemmTunableOp_BFloat16_TN,tn_128_767_2048_ld_2048_2048_128,Gemm_Hipblaslt_48260,0.0172693
GemmTunableOp_BFloat16_TN,tn_128_1536_2048_ld_2048_2048_128,Gemm_Hipblaslt_47753,0.019419
GemmTunableOp_BFloat16_TN,tn_128_1013_2048_ld_2048_2048_128,Gemm_Hipblaslt_48164,0.0179382
GemmTunableOp_float_TN,tn_128_48_2048_ld_2048_2048_128,Gemm_Hipblaslt_61377,0.0465673
GemmTunableOp_float_TN,tn_128_24_2048_ld_2048_2048_128,Gemm_Rocblas_61378,0.0472677
GemmTunableOp_float_TN,tn_128_192_2048_ld_2048_2048_128,Gemm_Rocblas_61376,0.0508242
GemmTunableOp_float_TN,tn_75968_4_2048_ld_2048_2048_75968,Gemm_Hipblaslt_61377,1.32029
GemmTunableOp_float_TN,tn_75968_1_2048_ld_2048_2048_75968,Gemm_Rocblas_-9,1.05431
GemmTunableOp_float_TN,tn_128_8192_2048_ld_2048_2048_128,Gemm_Hipblaslt_61382,0.348349
GemmTunableOp_float_TN,tn_128_384_2048_ld_2048_2048_128,Gemm_Hipblaslt_61376,0.0598218
GemmTunableOp_float_TN,tn_75968_8_2048_ld_2048_2048_75968,Gemm_Hipblaslt_61377,1.34172
GemmTunableOp_float_TN,tn_128_4096_2048_ld_2048_2048_128,Gemm_Rocblas_61383,0.167474
GemmTunableOp_float_TN,tn_128_96_2048_ld_2048_2048_128,Default,0.0472857
GemmTunableOp_float_TN,tn_128_2048_2048_ld_2048_2048_128,Gemm_Hipblaslt_61382,0.117973
GemmTunableOp_float_TN,tn_128_1_2048_ld_2048_2048_128,Gemm_Rocblas_-9,0.00946049
GemmTunableOp_float_TN,tn_128_2_2048_ld_2048_2048_128,Gemm_Rocblas_61378,0.0464592
GemmTunableOp_float_TN,tn_128_4_2048_ld_2048_2048_128,Gemm_Hipblaslt_61377,0.0456692
GemmTunableOp_float_TN,tn_128_3072_2048_ld_2048_2048_128,Gemm_Hipblaslt_61382,0.123511
GemmTunableOp_float_TN,tn_128_8_2048_ld_2048_2048_128,Gemm_Hipblaslt_61377,0.0447937
GemmTunableOp_float_TN,tn_128_256_2048_ld_2048_2048_128,Default,0.057404
GemmTunableOp_float_TN,tn_128_16_2048_ld_2048_2048_128,Gemm_Rocblas_61378,0.0444132
GemmTunableOp_float_TN,tn_75968_16_2048_ld_2048_2048_75968,Gemm_Rocblas_61383,1.37875
GemmTunableOp_float_TN,tn_128_32_2048_ld_2048_2048_128,Gemm_Rocblas_61378,0.0454013
GemmTunableOp_float_TN,tn_128_128_2048_ld_2048_2048_128,Default,0.0487681
GemmTunableOp_float_TN,tn_128_64_2048_ld_2048_2048_128,Gemm_Hipblaslt_61377,0.0460332
GemmTunableOp_float_TN,tn_128_512_2048_ld_2048_2048_128,Gemm_Rocblas_61383,0.0819472
GemmTunableOp_float_TN,tn_128_1024_2048_ld_2048_2048_128,Gemm_Rocblas_61383,0.0825464
GemmTunableOp_float_TN,tn_128_767_2048_ld_2048_2048_128,Gemm_Rocblas_61377,0.104956
GemmTunableOp_float_TN,tn_128_1536_2048_ld_2048_2048_128,Gemm_Hipblaslt_61382,0.129388
GemmTunableOp_float_TN,tn_128_1013_2048_ld_2048_2048_128,Gemm_Rocblas_61377,0.124408
GemmTunableOp_Half_TN,tn_128_48_2048_ld_2048_2048_128,Gemm_Rocblas_58331,0.0138765
GemmTunableOp_Half_TN,tn_128_24_2048_ld_2048_2048_128,Gemm_Rocblas_58329,0.0139045
GemmTunableOp_Half_TN,tn_128_192_2048_ld_2048_2048_128,Gemm_Rocblas_58334,0.0139913
GemmTunableOp_Half_TN,tn_75968_4_2048_ld_2048_2048_75968,Gemm_Rocblas_58329,0.609519
GemmTunableOp_Half_TN,tn_75968_1_2048_ld_2048_2048_75968,Gemm_Rocblas_58329,0.608598
GemmTunableOp_Half_TN,tn_128_8192_2048_ld_2048_2048_128,Gemm_Rocblas_58149,0.0350798
GemmTunableOp_Half_TN,tn_128_384_2048_ld_2048_2048_128,Gemm_Hipblaslt_58325,0.0146849
GemmTunableOp_Half_TN,tn_75968_8_2048_ld_2048_2048_75968,Gemm_Rocblas_58331,0.602502
GemmTunableOp_Half_TN,tn_128_4096_2048_ld_2048_2048_128,Gemm_Hipblaslt_57525,0.0239766
GemmTunableOp_Half_TN,tn_128_96_2048_ld_2048_2048_128,Gemm_Hipblaslt_58323,0.0139522
GemmTunableOp_Half_TN,tn_128_2048_2048_ld_2048_2048_128,Gemm_Hipblaslt_57524,0.0194073
GemmTunableOp_Half_TN,tn_128_3072_2048_ld_2048_2048_128,Default,0.021287
GemmTunableOp_Half_TN,tn_128_8_2048_ld_2048_2048_128,Gemm_Hipblaslt_58325,0.0141572
GemmTunableOp_Half_TN,tn_128_4_2048_ld_2048_2048_128,Gemm_Rocblas_58325,0.0143933
GemmTunableOp_Half_TN,tn_128_2_2048_ld_2048_2048_128,Gemm_Rocblas_58324,0.0139865
GemmTunableOp_Half_TN,tn_128_1_2048_ld_2048_2048_128,Gemm_Rocblas_-9,0.00908485
GemmTunableOp_Half_TN,tn_128_16_2048_ld_2048_2048_128,Gemm_Hipblaslt_58322,0.0141397
GemmTunableOp_Half_TN,tn_128_256_2048_ld_2048_2048_128,Gemm_Rocblas_58324,0.0140341
GemmTunableOp_Half_TN,tn_75968_16_2048_ld_2048_2048_75968,Gemm_Rocblas_58329,0.606477
GemmTunableOp_Half_TN,tn_128_32_2048_ld_2048_2048_128,Gemm_Hipblaslt_58328,0.0134837
GemmTunableOp_Half_TN,tn_128_128_2048_ld_2048_2048_128,Gemm_Hipblaslt_58324,0.0138293
GemmTunableOp_Half_TN,tn_128_64_2048_ld_2048_2048_128,Gemm_Hipblaslt_58330,0.0137585
GemmTunableOp_Half_TN,tn_128_512_2048_ld_2048_2048_128,Gemm_Hipblaslt_58333,0.0150417
GemmTunableOp_Half_TN,tn_128_1024_2048_ld_2048_2048_128,Gemm_Rocblas_58328,0.0173309
GemmTunableOp_Half_TN,tn_128_767_2048_ld_2048_2048_128,Gemm_Rocblas_57801,0.0174229
GemmTunableOp_Half_TN,tn_128_1536_2048_ld_2048_2048_128,Gemm_Rocblas_57496,0.0187534
GemmTunableOp_Half_TN,tn_128_1013_2048_ld_2048_2048_128,Gemm_Hipblaslt_57891,0.0178402