The Cerebras partnership, the “very fast Codex” promise, and why chip architecture matters.
Inference just got a multiplier bump with the titan WSE-3 chip.
If you’ve used OpenAI’s Codex, you know the drill. You fire off a request, watch the progress indicator spin, and then you wait. And wait. Sometimes five minutes. Sometimes fifteen. Long enough to make coffee. Long enough to forget what you were doing.
Developers have a phrase for this: “brilliant but slow.”
It’s not a backhanded compliment. Codex genuinely produces excellent code. Surgical precision. Deep codebase understanding. Production-ready output that rarely needs cleanup. But the latency kills the workflow. By the time Codex finishes, you’ve context-switched twice and lost your train of thought.
The speed problem isn’t a bug. It’s a fundamental limitation of how AI inference works on current hardware.
In January 2026, OpenAI announced a partnership that suggests they’re done accepting that limitation. The partner is Cerebras Systems. The commitment is 750 megawatts of compute capacity. Sam Altman’s message was blunt: a “very fast Codex” is coming.
Here’s what that means, why it matters, and how a chip the size of a dinner plate might finally solve the problem.
The Partnership: What OpenAI Just Committed To
On 14 January 2026, OpenAI and Cerebras announced a multi-year agreement to deploy Cerebras wafer-scale systems for inference workloads. The scale is substantial: 750 megawatts of dedicated low-latency compute, rolling out in tranches through 2028.
This isn’t about training models. It’s about running them fast.
“OpenAI’s compute strategy is to build a resilient portfolio that matches the right systems to the right workloads,” said Sachin Katti of OpenAI. “Cerebras adds a dedicated low-latency inference solution to our platform. That means faster responses, more natural interactions, and a stronger foundation to scale real-time AI to many more people.”
Andrew Feldman, Cerebras CEO, framed it more broadly: “Just as broadband transformed the internet, real-time inference will transform AI, enabling entirely new ways to build and interact with AI models.”
The implication is clear. OpenAI isn’t just optimising existing products. They’re betting that speed unlocks use cases that don’t exist yet.
Why Speed Matters More for Code Than for Chat
For a chatbot answering general questions, a two-second delay is tolerable. For a coding assistant, it’s a workflow killer.
Software development is iterative. You write something, test it, see what breaks, fix it, repeat. Each cycle needs to be fast enough that you stay in flow. When your AI assistant takes ten minutes per response, you can’t iterate. You batch. You context-switch. You lose momentum.
Developers on Reddit describe the problem in vivid terms:
“I lose my train of thought waiting for Codex. By the time it’s done, I’ve forgotten what I was doing.”
“Modern development is iterative. Test, fail, fix, repeat. Codex makes each cycle 10x longer.”
“It’s like having a genius colleague who takes a coffee break before every answer.”
The comparison to Claude is stark. Developers consistently report Claude responding 5–10x faster for similar tasks. The accuracy difference is marginal. The speed difference is not.
“I’d rather have 80% accuracy at 2x speed than 95% accuracy with 5-second delays.”
This isn’t developers being impatient. Latency compounds. A five-step refactor with 10-minute cycles becomes a 50-minute ordeal. At that point, you might as well do it manually.
The Memory Wall
To understand why Cerebras matters, you need to understand why GPUs struggle with inference.
Large language models like those powering Codex have billions of parameters. During inference, those parameters need to be read from memory for every single token generated. The model doesn’t just think once; it thinks once per word, loading the same massive weight matrices over and over.
The bottleneck isn’t compute. It’s memory bandwidth.
An NVIDIA H100, the current workhorse of AI infrastructure, has about 3.35 terabytes per second of memory bandwidth. That sounds like a lot until you realise you’re trying to move hundreds of gigabytes of weights for every token. The GPU spends most of its time waiting for data, not computing.
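To put a number on it, here’s a back-of-the-envelope sketch of the ceiling that memory bandwidth alone puts on token generation. The model sizes and FP16 precision below are illustrative assumptions, not Codex’s actual configuration.

```python
# Back-of-the-envelope ceiling on token generation when decoding is
# purely memory-bandwidth-bound (the full weight set is read per token).
# Model sizes and FP16 precision are illustrative assumptions, not
# OpenAI's actual Codex configuration.

H100_BANDWIDTH_TB_S = 3.35  # HBM bandwidth, terabytes per second

def max_tokens_per_second(params_billions: float,
                          bytes_per_param: int = 2,  # FP16
                          bandwidth_tb_s: float = H100_BANDWIDTH_TB_S) -> float:
    """Upper bound on tokens/s if every token streams all weights from memory."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return (bandwidth_tb_s * 1e12) / weight_bytes

for size in (70, 175, 400):  # hypothetical dense model sizes, in billions of parameters
    print(f"{size}B params: ~{max_tokens_per_second(size):.0f} tokens/s ceiling on one H100")
```

Under those assumptions, a single H100 tops out at somewhere between a handful and a few dozen tokens per second for large dense models, before compute, batching, or KV-cache traffic even enter the picture.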
Throwing more GPUs at the problem helps, but not linearly. When you split a model across multiple chips, you introduce a new bottleneck: inter-chip communication. Every time the chips need to synchronise, you pay a latency tax. The more chips, the more overhead.
This is the memory wall. It’s why doubling your GPU count doesn’t halve your inference time. It’s why OpenAI’s existing infrastructure, no matter how large, can’t brute-force its way to instant Codex responses.
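As a rough illustration of that tax, the toy model below splits the per-token work across chips but charges a synchronisation cost for every extra chip it has to keep in lockstep. The latency and overhead figures are placeholders chosen to show the shape of the curve, not measurements.

```python
# Toy model of per-token latency when a model is split across GPUs:
# the weight-streaming work parallelises, but every token also pays a
# synchronisation cost that grows with the number of chips in lockstep.
# All figures are placeholders, not measurements.

SINGLE_CHIP_MS = 100.0  # assumed per-token time on one chip
SYNC_COST_MS = 4.0      # assumed extra communication cost per additional chip

def per_token_latency_ms(num_chips: int) -> float:
    parallel_part = SINGLE_CHIP_MS / num_chips
    comm_part = SYNC_COST_MS * (num_chips - 1)
    return parallel_part + comm_part

for n in (1, 2, 4, 8, 16):
    speedup = per_token_latency_ms(1) / per_token_latency_ms(n)
    print(f"{n:>2} chips: {per_token_latency_ms(n):6.1f} ms/token ({speedup:.1f}x speedup)")
```

The exact numbers don’t matter; the shape does. Splitting the work helps at first, then the synchronisation overhead eats the gains.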
Enter Cerebras: One Chip, No Compromise
Cerebras took a different approach. Instead of building small chips and networking them together, they built one enormous chip.
The Wafer-Scale Engine 3 (WSE-3) uses an entire silicon wafer as a single processor. At 46,225 square millimetres, it’s roughly 56 times larger than the biggest GPU. It contains 4 trillion transistors and 900,000 AI-optimised cores.
But the real advantage isn’t size. It’s architecture.
On-chip memory changes everything. The WSE-3 has 44 gigabytes of SRAM directly on the chip, with 21 petabytes per second of memory bandwidth. That’s not a typo. Twenty-one petabytes per second. Compared with the H100’s 3.35 terabytes per second, that’s roughly 6,000 times more bandwidth.
When model weights live on-chip, inference stops waiting for memory. The compute cores stay fed. Tokens stream out at rates that GPU clusters simply cannot match.
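Running the same bandwidth-ceiling arithmetic from earlier with the WSE-3’s published figure gives a sense of the jump. Treat it as an illustration rather than a benchmark: it assumes a hypothetical model small enough for its weights to sit entirely in the 44 GB of on-chip SRAM, and it ignores everything except bandwidth.

```python
# Same bandwidth-ceiling arithmetic as before, with the WSE-3's on-chip
# SRAM bandwidth swapped in. Illustrative only: it assumes the weights
# fit in the 44 GB of SRAM and ignores compute, batching, and KV-cache
# traffic. The 20B-parameter model size is a hypothetical example.

H100_BW_TB_S = 3.35       # HBM bandwidth, terabytes per second
WSE3_BW_TB_S = 21_000.0   # 21 petabytes per second, expressed in TB/s

def ceiling_tokens_per_s(params_billions, bandwidth_tb_s, bytes_per_param=2):
    return (bandwidth_tb_s * 1e12) / (params_billions * 1e9 * bytes_per_param)

params = 20  # 20B parameters ~= 40 GB of FP16 weights, small enough for SRAM
print(f"H100 ceiling:  ~{ceiling_tokens_per_s(params, H100_BW_TB_S):,.0f} tokens/s")
print(f"WSE-3 ceiling: ~{ceiling_tokens_per_s(params, WSE3_BW_TB_S):,.0f} tokens/s")
print(f"Raw bandwidth ratio: ~{WSE3_BW_TB_S / H100_BW_TB_S:,.0f}x")
```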
No network overhead. Because everything runs on a single wafer, there’s no inter-chip communication. No NVLink. No InfiniBand. No collective operations synchronising thousands of GPUs. The entire model executes in one place, with deterministic, predictable latency.
A beast of a chip
Cerebras claims this architecture delivers inference speeds 10–20x faster than GPU clusters for large language models. Independent benchmarks have largely backed up these numbers.
The Team Behind the Chip
Building a chip this large wasn’t supposed to work. Conventional semiconductor wisdom says you cut wafers into small dies because defects are inevitable. A single flaw ruins an entire chip. The larger the chip, the lower the yield, the higher the cost.
Cerebras solved this through redundancy and clever routing. Extra cores and interconnects allow software to route around defects. A few bad transistors don’t kill the chip; they get bypassed.
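A rough way to see why that works: under a textbook Poisson defect model (the defect density and spare-core budget below are assumptions, not Cerebras process data), a completely defect-free wafer at this size is essentially impossible, while a wafer that only needs its defects to stay under a generous spare budget is essentially guaranteed.

```python
# Toy yield model: Poisson-distributed defects over a wafer-scale die.
# Defect density and spare-core budget are illustrative assumptions,
# not Cerebras process data.
import math

DEFECT_DENSITY_PER_MM2 = 0.001   # assumed average manufacturing defects per mm^2
WAFER_AREA_MM2 = 46_225          # WSE-3 die area
CORES = 900_000                  # WSE-3 core count
SPARE_FRACTION = 0.01            # assume ~1% of cores are held as spares

mean_defects = DEFECT_DENSITY_PER_MM2 * WAFER_AREA_MM2

def poisson_cdf(k_max: int, mean: float) -> float:
    """P(X <= k_max) for X ~ Poisson(mean), computed in log space."""
    total = 0.0
    for k in range(k_max + 1):
        log_pmf = -mean + k * math.log(mean) - math.lgamma(k + 1)
        total += math.exp(log_pmf)
    return min(total, 1.0)

# Without redundancy, the whole wafer must be defect-free.
p_no_redundancy = math.exp(-mean_defects)

# With redundancy, the wafer survives as long as defects don't exceed the
# spare budget (assuming each defect disables at most one core and the
# fabric can route around it).
spares = int(CORES * SPARE_FRACTION)
p_with_redundancy = poisson_cdf(spares, mean_defects)

print(f"Expected defects per wafer: {mean_defects:.1f}")
print(f"Yield without redundancy:   {p_no_redundancy:.2e}")
print(f"Yield with {spares:,} spares: {p_with_redundancy:.6f}")
```

A perfect wafer is hopeless; a repairable one is routine. That’s the whole trick.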
The team that pulled this off had done something similar before.
Andrew Feldman, Cerebras CEO, founded SeaMicro in 2007 with a similarly audacious bet: low-power, high-density servers that packed 512 processors into a 10U chassis. The industry said it wouldn’t work. AMD bought SeaMicro for $334 million in 2012.
The Cerebras founding team is essentially SeaMicro reassembled. Gary Lauterbach, who led processor architecture, spent decades at Sun Microsystems designing SPARC chips. Sean Lie and Michael James brought ASIC design and hardware systems expertise. Jean-Philippe Fricker managed engineering execution.
This isn’t a startup taking a punt on an untested idea. It’s a team with a track record of challenging data centre orthodoxy and winning.
“Everyone told us it was impossible,” Feldman has said of the wafer-scale approach. “We decided to try anyway.”
The company has raised over $720 million, with investors including Benchmark Capital and Alpha Wave Global. They filed for an IPO in 2024, though regulatory scrutiny around their UAE partnership with G42 has delayed the listing. The OpenAI deal represents major commercial validation regardless.
What This Means for You
If Cerebras delivers on its benchmarks, the practical implications for ChatGPT and Codex are real.
Response times could drop from minutes to seconds. A task that currently takes 10 minutes might complete in under a minute. That’s the difference between batching requests and having a conversation.
Real-time pair programming becomes possible. When inference is fast enough, you can treat Codex like a collaborator rather than a background worker. Quick questions get quick answers. Iteration cycles tighten.
New workflows emerge. Developers currently adapt to slow Codex by saving complex tasks and using faster tools for quick fixes. With faster inference, that split may no longer be necessary.
The developers who adapted their workflows around “brilliant but slow” may find themselves adapting again, this time to “brilliant and fast.”
There are caveats. The partnership rolls out through 2028, so improvements will be gradual. Not all workloads will move to Cerebras immediately. And OpenAI hasn’t specified which products get upgraded first, though Sam Altman’s tweet strongly suggests Codex is near the front of the queue.
The Bigger Picture
This deal signals a broader shift in AI infrastructure strategy.
Training large models grabbed headlines for years. The race to build bigger GPU clusters, the billion-dollar compute budgets, the debates about scaling laws. But training happens once. Inference happens every time someone uses the model.
As AI products scale to hundreds of millions of users, inference costs and latency become the constraints that matter. A model that’s expensive or slow to run limits how it can be deployed, regardless of how capable it is.
OpenAI isn’t abandoning NVIDIA. They’re diversifying. GPUs for training, Cerebras for inference. Match the hardware to the workload.
Cerebras founder Andrew Feldman sees this as inevitable: “We’re not trying to be a little bit better than GPUs. We’re trying to be different. To change what’s possible in AI.”
For developers waiting on Codex responses, “different” sounds about right.
What Happens Next
The 750 megawatts of Cerebras capacity will come online in phases through 2028. OpenAI hasn’t published a detailed roadmap, but the partnership announcement makes the priority clear: low-latency inference for products where speed directly impacts user experience.
Codex fits that description perfectly.
For developers who’ve grown accustomed to the coffee-break cadence of current Codex, the change could be jarring. When the tool that was brilliant but slow becomes brilliant and fast, the excuse for context-switching disappears. The workflow adapts again.
Whether that’s a feature or a bug depends on how much you liked those coffee breaks.
But one thing’s for sure: the real Codex fanfare is just about to begin.
References
- OpenAI. “OpenAI partners with Cerebras.” 14 January 2026. https://openai.com/index/cerebras-partnership/
- Cerebras. “OpenAI Partners with Cerebras to Bring High-Speed Inference to the Mainstream.” 14 January 2026. https://www.cerebras.ai/blog/openai-partners-with-cerebras-to-bring-high-speed-inference-to-the-mainstream
- Cerebras. “Wafer-Scale Engine.” https://cerebras.ai/chip
- SemiAnalysis. “Cerebras Inference: 70x Faster Than GPUs.” https://www.semianalysis.com/p/cerebras-inference-70x-faster-than
- Tom’s Hardware. “Cerebras Launches the World’s Fastest AI Inference Solution.” https://www.tomshardware.com/tech-industry/artificial-intelligence/cerebras-launches-the-worlds-fastest-ai-inference-solution-claims-20x-faster-than-gpus