Look out, Jensen! With its TPUs, Google has shown time and time again that it’s not the size of your accelerators that matters but how efficiently you can scale them in production.
Now with its latest generation of Ironwood accelerators slated for general availability in the coming weeks, the Chocolate Factory not only has scale on its side but a tensor processing unit (TPU) with the grunt to give Nvidia’s Blackwell behemoths a run for their money.
First announced in April alongside a comically bad comparison to the El Capitan supercomputer — no, an Ironwood TPU Pod is not 24x faster than the Department of Energy’s biggest iron — Google’s TPU v7 accelerators are a major leap in performance over prior generations.
Historically, Google’s TPUs have paled in comparison to contemporary GPUs from the likes of Nvidia and more recently AMD in terms of raw FLOPS, memory capacity, and bandwidth, making up for this deficit by simply having more of them.
Google has offered its TPUs in pods — large, scale-up compute domains — containing hundreds or even thousands of chips. If additional compute is needed, users can then scale out to multiple pods.
With TPU v7, Google’s accelerators offer performance within spitting distance of Nvidia’s Blackwell GPUs, when normalizing floating point perf to the same precision.
Each Ironwood TPU boasts 4.6 petaFLOPS of dense FP8 performance, slightly higher than Nvidia’s B200 at 4.5 petaFLOPS and just shy of the 5 petaFLOPS delivered by the GPU giant’s more powerful and power-hungry GB200 and GB300 accelerators.
Feeding that compute is 192 GB of HBM3e memory delivering 7.4 TB/s of bandwidth, which again puts it in the same ballpark as Nvidia’s B200 with its 192 GB of HBM3e and 8 TB/s of memory bandwidth.
For chip-to-chip communication, each TPU features four ICI links, which provide 9.6 Tbps (1.2 TB/s) of aggregate bidirectional bandwidth, compared to 14.4 Tbps (1.8 TB/s) on the B200 and B300.
Put simply, Ironwood is Google’s most capable TPU ever, delivering performance 10x that of its TPU v5p, 4x that of its TPU v6e “Trillium” accelerators unveiled last year, and roughly matching that of Nvidia and AMD’s latest chips.
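For those keeping score, here is a quick sketch comparing the per-chip figures quoted above. It uses only the numbers cited in this piece, not independent benchmarks, and the rounding is ours:

```python
# Per-chip specs as cited above (dense FP8 petaFLOPS, HBM capacity and
# bandwidth, and aggregate chip-to-chip bandwidth). Vendors' own claims.
ironwood = {"fp8_pflops": 4.6, "hbm_gb": 192, "hbm_tbps": 7.4, "ici_tbps": 9.6}
b200     = {"fp8_pflops": 4.5, "hbm_gb": 192, "hbm_tbps": 8.0, "nvlink_tbps": 14.4}

print(ironwood["fp8_pflops"] / b200["fp8_pflops"])   # ~1.02x the dense FP8 compute
print(ironwood["hbm_tbps"] / b200["hbm_tbps"])       # ~0.93x the memory bandwidth
print(ironwood["ici_tbps"] / b200["nvlink_tbps"])    # ~0.67x the chip-to-chip bandwidth
```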
Performance meets scale
But, as we alluded to earlier, Google’s real trick is the ability to scale TPUs into truly enormous compute domains. Nvidia’s NVL72 rack systems stitch 72 of its latest Blackwell accelerators into a single compute domain using its proprietary NVLink interconnect tech. AMD will do something similar with its Helios racks and the MI450 series next year.
Ironwood, by comparison, is monstrous, with Google offering the chips in pods of 256 at the low end and 9,216 at the high end. If that isn’t enough, users with sufficiently deep pockets can then scale out to additional pods. Back in April, Google told us that its Jupiter datacenter network tech could theoretically support compute clusters of up to 43 TPU v7 pods — or roughly 400,000 accelerators. Having said that, while such scale may be supported, it’s not clear just how big Google’s TPU v7 clusters will be in practice.
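If you want to check Google's arithmetic, the headline figure falls straight out of the pod sizes quoted above:

```python
# Back-of-the-envelope check of the figures cited above (illustrative only).
chips_per_pod = 9_216   # largest Ironwood pod configuration
max_pods = 43           # pods Google says Jupiter could theoretically stitch together
print(chips_per_pod * max_pods)  # 396,288 -- i.e. "roughly 400,000 accelerators"
```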
To be clear, compute clusters containing hundreds of thousands of Nvidia GPUs do exist and in fact have become commonplace. The difference is that, up until the Blackwell generation, these clusters have been built from eight-way GPU boxes arranged in massive scale-out domains. Nvidia’s NVL72 increased the unit of compute by a factor of nine, but still falls far short of Google’s TPU pods.
Google’s approach to scale-up compute fabrics differs considerably from Nvidia’s. Where the GPU giant has opted for a large, relatively flat switch topology for its rack-scale platforms, which we’ve discussed at length here, Google employs a 3D torus topology, in which each chip connects directly to its neighbors in a three-dimensional mesh.
The topology eliminates the need for high-performance packet switches, which are expensive, power hungry, and, under heavy load, can introduce unwanted latency.
While a torus can eliminate switch latency, the mesh topology means more hops may be required for any one chip to talk to another. As the torus grows, so does the potential for chip-to-chip latency. By using switches, Nvidia and AMD are able to ensure their GPUs are at most two hops away from any other chip in the domain.
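To put rough numbers on that trade-off, here's a minimal sketch of worst-case hop counts in a cubic torus with wraparound links. The pod shapes are illustrative assumptions on our part, not Google's actual Ironwood topology:

```python
# Hypothetical sketch: worst-case chip-to-chip hops in an x*y*z torus where
# every axis wraps around, so the farthest chip along an axis of size n
# is n // 2 hops away.
def torus_max_hops(x: int, y: int, z: int) -> int:
    return x // 2 + y // 2 + z // 2

print(torus_max_hops(4, 4, 16))    # 256-chip pod   -> 12 hops worst case
print(torus_max_hops(16, 24, 24))  # 9,216-chip pod -> 32 hops worst case

# A single-tier switched fabric, by contrast, keeps every accelerator within
# a couple of hops of any other, regardless of how many chips hang off it.
```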
As we understand it, which is better depends on the workload. Some workloads may benefit from large multi-hop topologies like the 2D and 3D toruses used in Google’s TPU pods, while others may perform better on the smaller switched compute domains afforded by Nvidia and AMD’s rack designs.
Because of this, Google employs a different kind of switching tech, which allows it to slice and dice its TPU pods into various shapes and sizes in order to better suit its own internal and customer workloads.
Rather than the packet switches you may be familiar with, Google employs optical circuit switches (OCS). These are more akin to the telephone switchboards of the 20th century. OCS appliances use various methods, MEMS devices being one, to patch one TPU to another. And because that connection amounts to physically patching one port through to another, it introduces little if any latency.
As an added benefit, OCS also helps with fault tolerance: if a TPU fails, the OCS appliances can drop it from the mesh and swap in a working part.
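Conceptually, that recovery step is just re-patching a cross-connect map. A toy illustration, assuming a simple port-to-port mapping (our sketch, not Google's implementation):

```python
# Toy model of an OCS as a programmable patch panel: circuits are entries in a
# port-to-port map, and a failed TPU is handled by re-patching its circuits
# onto a spare. Names like "tpu_0045" are made up for illustration.
cross_connects = {"tpu_0012": "tpu_0013", "tpu_0045": "tpu_0046"}

def replace_failed(mapping: dict, failed: str, spare: str) -> dict:
    """Re-patch every circuit touching the failed chip onto the spare."""
    return {
        (spare if src == failed else src): (spare if dst == failed else dst)
        for src, dst in mapping.items()
    }

print(replace_failed(cross_connects, failed="tpu_0045", spare="tpu_9000"))
# {'tpu_0012': 'tpu_0013', 'tpu_9000': 'tpu_0046'}
```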
Winning over the competition
Google has been using 2D and 3D toruses in conjunction with OCS appliances in its TPU pods since at least 2021, when TPU v4 made its debut. Google is also no stranger to operating massive compute fabrics in production.
Its TPU v4 supports pods up to 4,096 chips in size, while its TPU v5p more than doubled that to 8,960. So the jump to 9,216-chip pods with Ironwood shouldn’t be a stretch for Google to pull off.
The availability of these massive compute domains has certainly caught the attention of major model builders, including those for whom Google’s Gemini models are a direct competitor. Anthropic is among Google’s largest customers, having announced plans to utilize up to a million TPUs to train and serve its next generation of Claude models.
Anthropic’s embrace of Google’s TPU tech isn’t surprising when you consider that the model dev is also deploying its workloads across hundreds of thousands of Amazon’s Trainium 2 accelerators under Project Rainier, which also utilize 2D and 3D torus mesh topologies in their compute fabrics.
While Nvidia CEO Jensen Huang may play down the threat of AI ASICs to his GPU empire, it’s hard to ignore the fact that chips from the likes of Google, Amazon, and others are catching up quickly in terms of hardware capabilities and network scalability, with software often ending up being the deciding factor.
Perhaps this is why analysts keep bringing the question up quarter after quarter. ®