

Chip designs are busting out beyond the reticle limits of lithography machines, making chiplets and high-bandwidth, in-package die-to-die interconnects inevitable. And AI training workloads are busting through datacenter walls, making scale out networks to lash together datacenters or even multiple regions into a single, logical datacenter just as inevitable.
Given all of the attention that leaf and spine and now super spine network architectures have been getting for the past decade, you might be thinking it is a bit retro for any switch or router supplier to be talking about big, beefy modular switches like the “Magnum” InfiniBand beast, with 3,456 ports, that Andy Bechtolsheim invented at Kealia for the Constellation supercomputer; Kealia was eventually acquired by Sun Microsystems and the switch shipped in June 2007. The idea of such behemoth modular switches was to cut down on the use of network cables, which are expensive, and to make what is effectively a one-hop leaf/spine network inside of a box that presents a huge radix from which to build even larger networks.
AI clusters commonly have 50,000 to 100,000 accelerators and are pushing on up to 1 million or more in the coming months and years, but leaf/spine setups have trouble getting to this scale. A two-tier leaf/spine network with 256 leafs and 40 spines can scale to 4,608 ports at 800 Gb/sec. A three-tier leaf/spine/super-spine network runs out of ports at 27,648 GPUs, and that takes 270 super-spines, 540 spines, and 1,728 leafs. (This is for non-blocking network topologies.) If you want to lash together more GPUs than that with the currently shipping ASICs from Broadcom, Nvidia, or Cisco Systems, then you need to make modular switches from them or keep adding more aggregation tiers, which radically increases the number of hops between accelerators and therefore latency.
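To make the port math concrete, here is a minimal sketch of the arithmetic behind those figures. The 18 ports of 800 Gb/sec per leaf and the 16 Tb/sec of fabric uplink per leaf come from the Jericho 3 numbers discussed later in this story; the 16 endpoint ports per leaf in the three-tier case is simply what 27,648 endpoints spread over 1,728 leafs implies. Treat this as an illustration of the counting, not as Arista's sizing rules.

```python
# Endpoint counting for non-blocking leaf/spine fabrics, using the figures
# cited above. The helper functions are illustrative, not a sizing tool.

def endpoints(leafs: int, down_ports_per_leaf: int) -> int:
    """Endpoints supported is simply leafs times downlink ports per leaf."""
    return leafs * down_ports_per_leaf

def is_non_blocking(down_gbps: float, up_gbps: float) -> bool:
    """A tier is non-blocking when its uplink bandwidth covers its downlink bandwidth."""
    return up_gbps >= down_gbps

# Two-tier example from above: 256 leafs with 18 x 800 Gb/sec endpoint ports each.
print(endpoints(256, 18))                  # 4,608
print(is_non_blocking(18 * 800, 16_000))   # True: 14.4 Tb/sec down vs 16 Tb/sec of fabric

# Three-tier example: 1,728 leafs topping out at 27,648 endpoints implies 16 per leaf.
print(endpoints(1728, 16))                 # 27,648
```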
This is unacceptable, of course, which is why when Bechtolsheim founded Kealia and created the Magnum switch he had a modular design with thousands of ports. This was in an era when linking together a few thousand CPU nodes was a supercomputer. To be fair, Sun was very aggressive about supercomputing in the mid-2000s, and added ClearSpeed’s revolutionary CSX600 floating point accelerators to the CPU nodes to radically boost their FP64 and FP32 performance. (Notably, these were used in the original “Tsubame” supercomputer at Tokyo Institute of Technology.) So the analogy with today’s GPU-accelerated machines actually holds. The scale has, as we said, gone absolutely mad, however.
What we need is an Infinity Fabric that spans datacenters, gluing them together into massive compute complexes that can hold hundreds of thousands to millions of GPUs or XPUs in a single cluster. (When we say “we,” we mean “them,” as in the model builders, notably OpenAI, Anthropic, and Google.) But to make that cross-datacenter Infinity Fabric, as well as to build larger compute domains within a datacenter, in some cases we need modular switches today for the same reason that Bechtolsheim, co-founder and chief architect at Arista Networks, knew we needed them more than two decades ago: to get rid of as many expensive cables as possible and to increase the radix of the switching and routing domains inside the datacenter and across them.
The thing is, you don’t have to wait for the “Jericho 4” ASIC, which was announced by Broadcom in August, and its related “Qumran 4” edge chips and “Ramon” fabric chips to do it. You can do it with the Jericho 3+ ASICs, as Brendan Gibbs, vice president of AI, routing, and switching platforms at Arista Networks, explains to The Next Platform. All of these Jericho-class chips are known as the StrataDNX family and include deep packet buffers to deal with congestion.
To be specific, the new Arista 7800R4 modular switches are based on the Jericho 3+ ASIC, and importantly for scale across AI networks, they include the same HyperPort capability that was revealed for the Jericho 4 chips this summer for delivery next year. With HyperPorts, four 800 Gb/sec ports are lashed together to look like a single 3.2 Tb/sec port, offering 44 percent lower AI job completion time compared to spreading traffic over four independent 800 Gb/sec ports with ECMP. With HyperPorts, to get a certain level of bandwidth, you have one-fourth the number of managed ports, which can drive the utilization of the links up by 70 percent, according to Broadcom.
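Here is a small sketch of what that bundling does to the port bookkeeping, assuming four 800 Gb/sec members per HyperPort as described above. The 102.4 Tb/sec bandwidth target is just an example we picked, and the completion-time and utilization percentages are Broadcom's claims, not something this arithmetic demonstrates.

```python
# Port bookkeeping for HyperPort bundling: four 800 Gb/sec physical ports are
# managed as one 3.2 Tb/sec logical port. Purely illustrative arithmetic.

MEMBER_SPEED_GBPS = 800
MEMBERS_PER_HYPERPORT = 4

def hyperport_plan(target_tbps: float) -> dict:
    """Physical ports versus managed logical ports for a given bandwidth target."""
    physical_ports = int(target_tbps * 1000 // MEMBER_SPEED_GBPS)
    managed_ports = physical_ports // MEMBERS_PER_HYPERPORT
    return {
        "physical_800g_ports": physical_ports,
        "managed_hyperports": managed_ports,
        "hyperport_speed_tbps": MEMBER_SPEED_GBPS * MEMBERS_PER_HYPERPORT / 1000,
    }

print(hyperport_plan(102.4))
# {'physical_800g_ports': 128, 'managed_hyperports': 32, 'hyperport_speed_tbps': 3.2}
```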
This is how you use the very high radix of a modular switch – meaning the number of ports presented at the front panel for a given amount of effective aggregate bandwidth – to build a scale-across network. The deep HBM buffers help you cope with momentary congestion, but in the long run, radix might be more important.
“With those two value attributes – the buffering and the port capacity in radix – it is up to customer choice to get the shallow buffer, what I call hierarchical switch architectures, or the deep HBM buffer on package,” explains Gibbs. “You are only going to need that HBM if you overrun the buffer. So it’s like an insurance policy. But if you tune your network to the point that you are guaranteeing packets and you are sure that you are never going to exceed that shallow buffer, then you don’t invoke the deep buffer. You can choose a modular chassis because you like deep buffering today, but you still have the other big advantage of port radix and that value doesn’t go away. Some customers want to deploy these modular chassis for the port radix and also want to engineer their workload so they stay within that ultra-low latency, shallow buffer on the chip. Even if customers don’t use the deep buffering of these modular boxes, the port capacity gives them the lowest cost infrastructure, the most efficiency.”
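Gibbs’ insurance policy framing can be illustrated with a toy queue model: the deep HBM buffer only comes into play when a burst outruns what the on-chip SRAM can absorb before the port drains it. The buffer size, drain rate, and burst shapes below are invented for illustration and are not Jericho 3 specifications.

```python
# Toy model of shallow-buffer versus deep-buffer behavior under an incast burst.
# All numbers are made up for illustration; they are not Jericho 3 specs.

SRAM_BYTES = 64 * 2**20        # assume 64 MB of on-chip packet buffer
DRAIN_BYTES_PER_US = 100_000   # assume the congested port drains 100 KB per microsecond

def spills_to_hbm(arrival_bytes_per_us: float, burst_us: int) -> bool:
    """Does a burst of this rate and duration overrun the on-chip buffer?"""
    backlog = 0.0
    for _ in range(burst_us):
        backlog = max(backlog + arrival_bytes_per_us - DRAIN_BYTES_PER_US, 0.0)
        if backlog > SRAM_BYTES:
            return True            # congestion spills into the deep HBM buffer
    return False                   # the shallow on-chip buffer absorbed the burst

print(spills_to_hbm(150_000, 500))   # mild burst: absorbed on chip (False)
print(spills_to_hbm(400_000, 500))   # heavy incast: deep buffer invoked (True)
```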
The idea is that it is cheaper to do a modular chassis full of switch chips interlinked on system boards than try to do three or four tiers of leaf–spine–superspine–hyperspine–ultraspine with electrical and optical cables interlinking them all. PCB traces do not fail anywhere close to the rate of optical transceivers or electrical or fiber optic cables.
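A rough way to see the cabling argument is to count the optics: in a non-blocking two-tier fabric built from fixed boxes, every endpoint port needs a matching uplink to the spine layer, and every one of those cables needs a transceiver at each end, while inside a modular chassis those same links ride the backplane. The sketch below assumes one pluggable transceiver per cable end and ignores breakouts and copper DACs; the 576-port example matches the biggest 7800R4 configuration described later, but the counting itself is our simplification.

```python
# Counting the leaf-to-spine optics avoided by folding the spine into a chassis.
# Simplified: one transceiver per cable end, no breakouts or copper cables.

def optics_needed(endpoint_ports: int, spine_in_chassis: bool) -> dict:
    """Leaf-to-spine cables and transceivers for a non-blocking two-tier fabric."""
    # Non-blocking means one uplink's worth of bandwidth per endpoint port, so
    # cables between tiers equal endpoint ports when the tiers are separate boxes.
    # Inside a modular chassis those links become PCB traces and connectors.
    cables = 0 if spine_in_chassis else endpoint_ports
    return {"leaf_to_spine_cables": cables, "transceivers": cables * 2}

print(optics_needed(576, spine_in_chassis=False))  # built from fixed leaf and spine boxes
print(optics_needed(576, spine_in_chassis=True))   # the same radix inside one chassis
```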
Breaking The Walls With Jericho
The Jericho 3 switch ASIC is etched in 5 nanometer processes from Taiwan Semiconductor Manufacturing Co, and we do not know much about it or about the Jericho 3+ chip that Gibbs says Arista Networks is using in the R4 generation of its switches. The Jericho 3 chip has 28.8 Tb/sec of aggregate bandwidth and is for leaf switches or line cards in modular switches. It has 304 SerDes running at 100 Gb/sec with PAM-4 encoding, with 144 SerDes being used for downlinks and 160 SerDes being used to uplink to 51.2 Tb/sec Ramon 3 spine fabric switches. The Ramon 3 is also used in a super spine layer, and that is how Arista Networks can get 32,768 GPUs interlinked in a three-tier network.
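The SerDes split above works out like this; the observation that the fabric side carries a bit more raw capacity than the network side is just our arithmetic, and the reason we suggest for it – headroom for the cell-based fabric's overhead – is an assumption on our part, not something Broadcom has specified.

```python
# Arithmetic on the Jericho 3 SerDes split cited above: 304 lanes at 100 Gb/sec
# PAM-4, with 144 facing the network and 160 facing the Ramon 3 fabric.

SERDES_GBPS = 100
network_serdes = 144
fabric_serdes = 160

network_tbps = network_serdes * SERDES_GBPS / 1000   # 14.4 Tb/sec toward endpoints
fabric_tbps = fabric_serdes * SERDES_GBPS / 1000     # 16.0 Tb/sec toward Ramon 3 spines

print(f"network-facing: {network_tbps} Tb/sec, fabric-facing: {fabric_tbps} Tb/sec")
# The fabric side has more raw capacity than the network side, presumably to
# leave headroom for the cell-based fabric's overhead (our assumption).
```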
The Jericho 3-AI, which we detailed back in April 2023 and which Arista Networks has been shipping in R4 generation switches for more than a year now, has optimizations to support packet spraying and other advanced features tuned for AI workloads. We are not sure what makes a Jericho 3+ different from a Jericho 3-AI, and we can find no spec sheet on it.
Here is what the R4 family of switches from Arista Networks looks like, and remember that all of the Jericho 3-class chips have SRAM buffers on chip and deep buffers in HBM stacked DRAM on the ASIC package:
The family starts with the fixed-port 7020R4 leaf switches with 10 Gb/sec and 25 Gb/sec downlinks and 100 Gb/sec uplinks:
The 7020R4 switches are aimed at top of rack jobs in the datacenter where deep buffers are useful and where high bandwidth (relatively speaking) is not a priority, but where energy savings compared to the 7020R3 line are. The 7020R4 burns 50 percent less juice per port than the 7020R3.
The 7280R4 presents 32 ports running at 800 Gb/sec and is based on the Jericho 3+ chip. Here are its specs:
The energy savings are less impressive with the 7280R4 leaf and spine switches, which offer only a 20 percent to 25 percent reduction in juice consumption compared to the 7280R3 switches, as you can see from the specs above.
For larger, disaggregated leaf/spine networks (which means you build the functionality of a low-end to midrange modular switch out of fixed-port switches and unite and manage it all with software, creating a kind of virtual modular switch), Arista Networks has already launched the 7700R4 line, which has 18 ports running at 800 Gb/sec and can scale to 27,648 endpoints in a leaf/spine network. The 7700R4 uses the Jericho 3-AI ASIC.
That leaves the big bad switch, the 7800R4 modular boxes, which come in four different chassis configurations with capacity for four, eight, twelve, or sixteen line cards linked in a non-blocking fashion. Take a look:
These switches are used to build large two-tier Clos networks – the preferred topology among the hyperscalers and the cloud builders for datacenter-scale networks – and they are also aimed at datacenter interconnect workloads where huge amounts of bandwidth in aggregated pipes are important.
The 7800R4 with the full-on sixteen line cards can drive 1,152 ports running at 400 Gb/sec or 576 ports running at 800 Gb/sec, and it offers 65 percent lower power consumption than the 7800R3 based on the Jericho 2 ASICs from Broadcom. The 7800R4 line card, which uses the Jericho 3+ ASIC, has 36 ports running at 800 Gb/sec and offers OSFP and QSFP-DD connectors.
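The chassis math falls straight out of those line card numbers. A quick sketch, with the only assumption being that the 1,152-port figure comes from running each 800 Gb/sec port as two 400 Gb/sec ports:

```python
# Port and front-panel bandwidth math for the largest 7800R4 chassis.

ports_per_line_card = 36     # 800 Gb/sec OSFP/QSFP-DD ports per Jericho 3+ line card
line_cards = 16              # the biggest of the four chassis configurations

ports_800g = ports_per_line_card * line_cards   # 576 ports at 800 Gb/sec
ports_400g = ports_800g * 2                     # 1,152 ports if each runs as 2 x 400G
front_panel_tbps = ports_800g * 800 / 1000      # 460.8 Tb/sec of front-panel bandwidth

print(ports_800g, ports_400g, front_panel_tbps)  # 576 1152 460.8
```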
It is funny to think of a modular beast like the 7800R4 being used as a leaf switch, but this is the only way to get to 100,000 XPUs or GPUs and beyond.
The 7800R4 line card is available now, and the one that supports HyperPort quad-bandwidth aggregation will be available in the first quarter of next year, well ahead of the Jericho 4 with its own HyperPorts, which we do not expect to see in products until around this time next year. The 7280R4 switches with 100 Gb/sec or 800 Gb/sec ports are both available now. The 7020R4 variants with either 10 Gb/sec or 25 Gb/sec ports will both be available in the first quarter.