Compiler optimizations for 5.8ms GPT-OSS-120B inference (not on GPUs)
furiosa.ai

At the grand opening of OpenAI’s South Korean office last month, we demonstrated gpt-oss-120b running on two RNGD chips. Just a few weeks after OpenAI released the model, we achieved strong performance (5.8 ms per output token) with excellent power efficiency (well under 180 W per card). This article outlines the key optimizations behind that result and how our team delivered them in such a short time.
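For context, 5.8 ms per output token works out to roughly 172 tokens per second for a single stream, and treating 180 W per card across two cards as an upper bound gives an energy cost of about 2.1 J per token. A quick back-of-the-envelope check (plain arithmetic on the figures quoted above):

```python
# Convert the reported per-token latency and power bound into
# throughput and an upper bound on energy per token.
latency_ms_per_token = 5.8            # reported time per output token
power_bound_w = 180.0 * 2             # "well under 180 W per card", two cards

tokens_per_second = 1000.0 / latency_ms_per_token
joules_per_token_bound = power_bound_w / tokens_per_second

print(f"{tokens_per_second:.1f} output tokens/s per stream")   # ~172.4
print(f"< {joules_per_token_bound:.2f} J per output token")     # ~2.09
```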

RNGD, our flagship AI accelerator for LLM and agentic AI inference, is powered by our unique Tensor Contraction Processor (TCP) architecture, which makes it much easier to achieve high utilization…
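To make "tensor contraction" concrete: a contraction multiplies two tensors and sums over their shared indices, with matrix multiplication as the simplest case. The NumPy sketch below is purely illustrative of the mathematical primitive; it says nothing about RNGD's actual instruction set or the TCP hardware design:

```python
import numpy as np

# A tensor contraction sums products over shared (contracted) indices.
# Matrix multiply C[m,n] = sum_k A[m,k] * B[k,n] is the 2-D special case.
A = np.random.rand(4, 8)
B = np.random.rand(8, 5)
C = np.einsum("mk,kn->mn", A, B)      # contract over k
assert np.allclose(C, A @ B)

# The same primitive generalizes to higher-rank operands, e.g. an
# attention-style score computation contracted over d for each head h:
Q = np.random.rand(2, 4, 16)          # (heads, queries, d)
K = np.random.rand(2, 16, 4)          # (heads, d, keys)
S = np.einsum("hqd,hdk->hqk", Q, K)   # contract over d per head
```

Because LLM workloads are dominated by exactly these contraction patterns, an architecture built around the contraction primitive itself, rather than fixed-size matrix tiles, has more freedom to keep its compute units busy across varying tensor shapes.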
