Compiler optimizations for 5.8ms GPT-OSS-120B inference (not on GPUs)
furiosa.ai

At the grand opening of OpenAI’s South Korean office last month, we demonstrated gpt-oss-120b running on two RNGD chips. Just a few weeks after OpenAI released the model, we achieved strong performance (5.8 ms per output token) with excellent power efficiency (well under 180 W per card). This article outlines the key optimizations behind that result and how our team delivered them in such a short time.
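For context, 5.8 ms per output token works out to roughly 172 tokens per second for a single stream, and treating 180 W per card across two cards as an upper bound gives an energy cost of about 2.1 J per token. A quick back-of-the-envelope check (plain arithmetic on the figures quoted above):

```python
# Convert the reported per-token latency and power bound into
# throughput and an upper bound on energy per token.
latency_ms_per_token = 5.8            # reported time per output token
power_bound_w = 180.0 * 2             # "well under 180 W per card", two cards

tokens_per_second = 1000.0 / latency_ms_per_token
joules_per_token_bound = power_bound_w / tokens_per_second

print(f"{tokens_per_second:.1f} output tokens/s per stream")   # ~172.4
print(f"< {joules_per_token_bound:.2f} J per output token")     # ~2.09
```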

RNGD, our flagship AI accelerator for LLM and agentic AI inference, is powered by our unique Tensor Contraction Processor (TCP) architecture, which makes it much easier to achieve high utilization…
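To make "tensor contraction" concrete: a contraction multiplies two tensors and sums over their shared indices, with matrix multiplication as the simplest case. The NumPy sketch below is purely illustrative of the mathematical primitive; it says nothing about RNGD's actual instruction set or the TCP hardware design:

```python
import numpy as np

# A tensor contraction sums products over shared (contracted) indices.
# Matrix multiply C[m,n] = sum_k A[m,k] * B[k,n] is the 2-D special case.
A = np.random.rand(4, 8)
B = np.random.rand(8, 5)
C = np.einsum("mk,kn->mn", A, B)      # contract over k
assert np.allclose(C, A @ B)

# The same primitive generalizes to higher-rank operands, e.g. an
# attention-style score computation contracted over d for each head h:
Q = np.random.rand(2, 4, 16)          # (heads, queries, d)
K = np.random.rand(2, 16, 4)          # (heads, d, keys)
S = np.einsum("hqd,hdk->hqk", Q, K)   # contract over d per head
```

Because LLM workloads are dominated by exactly these contraction patterns, an architecture built around the contraction primitive itself, rather than fixed-size matrix tiles, has more freedom to keep its compute units busy across varying tensor shapes.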
