Amazon last week revealed its Trainium3 UltraServer rack systems, and if your first thought was "boy that looks a lot like Nvidia’s GB200 NVL72," your eyes aren’t deceiving you.
As the AI boom enters its fourth year, the infrastructure driving much of the bubble has really started to look the same.
Left: Half of a Trainium3 UltraServer, Center: Nvidia’s GB200 NVL72, Right: AMD’s MI400-powered Helios rack
Amazon has deployed a ton of Nvidia’s GB200 and GB300 NVL72 racks, and given the visual similarities to its Trainium3 UltraServers, we wouldn’t be the least bit surprised to find out that large portions of the racks are shared between the two.
In fact, Amazon has already announced that, come Trainium4, it’ll be able to slide its custom compute blades directly into the same MGX chassis used by Nvidia’s GPUs, so clearly we’re headed in this direction.
This makes business sense: At the scale AWS operates, the fewer one-off parts the cloud titan has to wrangle, the better. For Amazon, it’s better to have one modular rack architecture than a separate one for each chip in the datacenter.
This is one of the reasons that hyperscalers like Amazon and Meta founded standards bodies like the Open Compute Project (OCP) in the first place. As it happens, back in October, Nvidia contributed its MGX reference designs to OCP, while AMD and Meta announced a new double-width rack based around the House of Zen’s Helios system.
But it’s not just the racks that look the same now – the compute and network fabrics do as well. Speaking at re:Invent on Thursday, Peter DeSantis showed off the Trainium3 compute blade, which pairs a Graviton CPU with four Trainium3 accelerators and a pair of Nitro data processing units. Up until now, AWS’ Trainium systems had used x86 CPUs from Intel.
Top left: AWS Trainium3 UltraServer sled, Bottom left: AMD Helios sled, Right: Nvidia GB200 NVL72 sled
This configuration bears more than a passing resemblance to the compute blades found in AMD and Nvidia’s rack systems. The former combines four MI400-series GPUs with a single Venice CPU and a pair of SmartNICs, either from its Pensando networking division or from one of its partners. The one difference we’ll note is that AMD has opted for a double-wide OpenRack design. Nvidia’s GB300 follows a similar formula but uses two Grace CPUs rather than one.
Switched scale-up fabrics are taking over
Amazon’s Trn3 UltraServers employ 36 of these compute blades, which are spread across what appear to be two MGX-style racks. The 144 accelerators on board are stitched together using Amazon’s all-new NeuronSwitch interconnect tech. From the renders we’ve seen, each UltraServer packs about 20 of them. Unfortunately, AWS isn’t quite ready to discuss the specific topology the fabric uses.
Again, we see a similar configuration used in Nvidia and AMD’s rack systems. Nvidia’s GB200 and GB300 NVL72 racks use 18 switches spread across nine switch sleds. From what we’ve gathered, AMD is using a dozen 102.4 Tbps Ethernet switches spread across six double-wide blades.
These high-speed fabrics are what allow the three vendors to pool the compute and memory resources of 72 or 144 chips into what is functionally one giant rack-sized accelerator.
This diagram shows the switched NVLink interconnect topology used by Nvidia’s GB200 NVL72, but the same basic architecture is being used by AMD and AWS with their Helios and Trainium3 UltraServer rack systems.
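To make the appeal concrete, here’s a rough Python sketch of our own (the accelerator and switch counts are illustrative, not confirmed figures, and this isn’t any vendor’s documented wiring): in a single-tier switched fabric, every chip sits a constant two hops from every other, no matter how many of them share the rack.

```python
# Toy model of a single-tier switched scale-up fabric: our own sketch,
# not a vendor's documented wiring. Every accelerator links to every switch,
# so any chip reaches any other in a constant two hops
# (accelerator -> switch -> accelerator), which is what lets 72 or 144 chips
# behave like one pooled, rack-sized accelerator.

ACCELERATORS = 144  # a Trainium3 UltraServer-sized domain
SWITCHES = 18       # illustrative only; AWS hasn't detailed its topology

def route(src: int, dst: int, via: int = 0) -> list[str]:
    """Two-hop route between any pair of accelerators through one switch."""
    return [f"accel{src}", f"switch{via % SWITCHES}", f"accel{dst}"]

# Hop count is the same for next-door neighbours and for chips
# at opposite ends of the rack
for a, b in [(0, 1), (0, ACCELERATORS - 1)]:
    print(f"accel{a} -> accel{b}: {len(route(a, b)) - 1} hops")
```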
While the system architectures are broadly the same, the protocols used by each are anything but. AWS is using NeuronSwitch, while AMD is tunneling the UALink protocol over Ethernet. Nvidia, meanwhile, is using its NVLink and NVSwitch tech.
However, it appears NeuronSwitch, at least in its current form, may be short-lived, with Amazon announcing this week that it plans to use both UALink and NVLink Fusion in its next-gen Trainium4 accelerators.
While switched fabrics have become rather popular in recent years, they’re not the only option. In fact, all the way up through its Trainium2 accelerators, Amazon had employed a compute mesh using 2D and 3D torus topologies.
Here’s how Trainium2 accelerators connect to one another in Amazon’s Trn2 UltraServer
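For contrast, here’s a toy sketch of how a torus mesh hangs together (the 4x4 dimensions are ours, for illustration, not Amazon’s actual layout): each chip talks directly only to its nearest neighbours, with the edges wrapping around, so traffic between distant chips has to hop its way across the mesh.

```python
# Toy 2D torus mesh of the kind AWS used up through Trainium2. The 4x4
# dimensions are made up for illustration, not Amazon's actual layout.
# Each accelerator wires directly only to its nearest neighbours, with the
# edges wrapping around, so distant chips are several hops apart.

W, H = 4, 4  # hypothetical 4x4 torus: 16 accelerators

def neighbours(x: int, y: int) -> list[tuple[int, int]]:
    """Wrap-around (torus) neighbours of the accelerator at (x, y)."""
    return [((x - 1) % W, y), ((x + 1) % W, y),
            (x, (y - 1) % H), (x, (y + 1) % H)]

def hops(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Shortest hop count between two accelerators, taking the wrap-around shortcut."""
    dx = min(abs(a[0] - b[0]), W - abs(a[0] - b[0]))
    dy = min(abs(a[1] - b[1]), H - abs(a[1] - b[1]))
    return dx + dy

print(neighbours(0, 0))      # even a "corner" chip has four neighbours
print(hops((0, 0), (2, 2)))  # worst case on this little mesh: 4 hops
# On a switched fabric the answer would be a constant 2, however big the domain.
```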
According to Nafea Bshara, co-founder of AWS’s Annapurna Labs division, both topologies have their benefits, but for the heavy workloads we’re now seeing, he argues that switched scale-up fabrics are the way to go.
"We moved from a 3D Torus, which is very good, by the way, for large models, and very good for training, to a switch topology," he told El Reg.
"Inference has two parts, prefill and decode. For prefill, the switch doesn’t make as much of a difference," he said. "In the decode, because you’re doing token by token generation, we want to go as wide as we can so we can leverage all the aggregate memory with very low latency."
Using NeuronSwitch, AWS is able to do this. The benefit, Bshara notes, is most evident when running at larger batch sizes. "If you’re running low batch, you may not need a switch. What the switch allows us to do is keep the low latency while maximizing the concurrency," he said.
The downside, of course, is complexity: switched fabrics need switches, and meshes don’t. Switches do offer the potential for fewer overall hops, and therefore lower latency, but so far these scale-up fabrics haven’t stretched much beyond 144 accelerators.
The odd man out
Google’s 7th-gen Ironwood TPU clusters use 2D and 3D toruses, which can scale to 9,216 TPUs in a single compute domain.
One of the reasons Google is able to do this is that it’s using optics, which Nvidia, AMD, and presumably AWS have avoided due to their higher power consumption. From what we gather, some of that power penalty is offset by the lack of packet switches.
The Chocolate Factory does rather famously use optical circuit switches, but they have more in common with an old-school telephone switchboard than with a packet switch. These appliances are essentially an automated patch panel for optics, and allow Google to slice up its TPU pods into smaller clusters based on the workload.
Optical circuit switching (OCS) also addresses one of the bigger headaches of mesh topologies: failures. If a TPU fails, OCS allows Google to drop it from the pod and slot in a fresh one, all with the push of a button.
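To illustrate the idea, here’s a simplified sketch of our own (emphatically not Google’s actual control plane): the OCS amounts to a rewritable table of which TPU hangs off which fibre port, and swapping a dead chip for a spare is just an update to that table.

```python
# Simplified sketch of the "automated patch panel" idea. This is our own
# illustration, not Google's actual OCS control plane. The circuit switch
# holds a mapping of which TPU sits behind which fibre port; dropping a
# failed chip and splicing in a spare is just rewriting that mapping,
# with no packet switching involved.

circuit_map = {f"port{i}": f"tpu{i}" for i in range(8)}  # tiny toy pod
spares = ["tpu100", "tpu101"]

def replace_failed(port: str) -> None:
    """Re-patch a fabric port from a failed TPU to a spare one."""
    failed = circuit_map[port]
    circuit_map[port] = spares.pop(0)
    print(f"{port}: {failed} -> {circuit_map[port]}")

replace_failed("port3")  # say tpu3 dies: the pod keeps its shape, just with a new chip
print(circuit_map)
```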
However, with Amazon’s move to switched compute fabrics, Google is now one of the only major infrastructure providers using torus topologies in its AI inference and training clusters. ®