AI search provider Perplexity’s research wing has developed a new set of software optimizations that allows models at or near the trillion-parameter mark to run efficiently across older, cheaper hardware using a variety of existing network technologies, including Amazon’s proprietary Elastic Fabric Adapter.
These innovations, detailed in a paper published this week and released on GitHub for further scrutiny, present a novel approach to addressing one of the biggest challenges in serving large-scale mixture of experts (MoE) models: memory and network latency.
Mo parameters, mo problems
MoE models, like DeepSeek V3 and R1 or Moonshot AI’s Kimi K2, are big, ranging from 671 billion to 1 trillion parameters. This means they’re too large to run on eight-GPU systems using older H100 or H200 GPUs at scale. Sure, in some cases you might be able to fit the model weights, but you won’t have enough memory left over for the key value caches (the model’s short-term memory) required to serve it at any reasonable scale.
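As a rough back-of-the-envelope sketch of why that is (the figures below are illustrative assumptions, not numbers from Perplexity's paper), consider the memory budget of a single eight-way H200 box:

```python
# Rough memory-headroom estimate for serving an MoE model on one 8x H200 node.
# All figures are illustrative assumptions, not numbers from Perplexity's paper.

H200_MEM_GB = 141          # HBM3e capacity per H200
GPUS_PER_NODE = 8

def kv_headroom_gb(param_count_b: float, bytes_per_param: float) -> float:
    """Memory left over for KV cache after loading the weights, in GB."""
    total_mem_gb = H200_MEM_GB * GPUS_PER_NODE       # ~1,128GB per node
    weights_gb = param_count_b * bytes_per_param     # 1B params * N bytes ~= N GB
    return total_mem_gb - weights_gb

# DeepSeek V3: ~671B params. At FP8 (1 byte per param) the weights alone eat
# ~671GB, leaving roughly 457GB for KV cache, activations, and overhead.
print(f"DeepSeek V3 @ FP8: {kv_headroom_gb(671, 1):.0f} GB left for KV cache")

# Kimi K2: ~1T params. The FP8 weights (~1,000GB) nearly fill the node,
# leaving little room to batch concurrent requests at any reasonable scale.
print(f"Kimi K2     @ FP8: {kv_headroom_gb(1000, 1):.0f} GB left for KV cache")
```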
To get around this, you either need bigger systems or you have to shard your models across multiple smaller ones.
The easy answer would be to deploy these models on Nvidia’s GB200 or GB300 NVL72 rack systems, which essentially function as one great big server with 72 GPUs, each packing 192GB or 288GB of memory, on board, more than enough for even larger multi-trillion-parameter LLMs.
Unfortunately, these systems are expensive, in extremely high demand, and may not be available in every geography — cough, cough, China. By comparison, older H100 or H200-based systems are plentiful and comparatively cheap, but require distributing models across multiple nodes, which has traditionally incurred steep performance penalties.
Those penalties are further exacerbated by the shift from dense models, where the entirety of the weights is read from memory for every token generated, to sparse MoE models, where requests are routed to a smaller subset of weights, which we call experts. Each token (think word fragment or punctuation mark) may be generated by a different set of experts.
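For the curious, here's a minimal sketch of that routing step, using a softmax gate that picks the top-k experts for each token. The expert count, top-k value, and dimensions below are stand-ins for illustration, not any particular model's real configuration:

```python
import numpy as np

# Minimal top-k MoE gating sketch. Expert count, top-k, and hidden size are
# illustrative stand-ins, not DeepSeek's or Kimi K2's actual configuration.
NUM_EXPERTS, TOP_K, HIDDEN = 64, 4, 512

rng = np.random.default_rng(0)
gate_w = rng.standard_normal((HIDDEN, NUM_EXPERTS))    # router weights

def route(token_hidden: np.ndarray):
    """Return the expert ids and normalized weights for one token."""
    logits = token_hidden @ gate_w                      # score every expert
    top = np.argsort(logits)[-TOP_K:]                   # keep only the top-k
    weights = np.exp(logits[top] - logits[top].max())
    return top, weights / weights.sum()                 # softmax over the chosen few

token = rng.standard_normal(HIDDEN)
experts, weights = route(token)
# Only TOP_K of NUM_EXPERTS experts run for this token. That sparsity is what
# saves memory bandwidth, but those experts may live on different GPUs.
print(experts, weights.round(3))
```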
This architecture has the benefit of reducing the amount of memory bandwidth required to achieve a desired level of interactivity. On the flip side, it is also significantly chattier from a networking standpoint.
For a single node or rack system, high-speed interconnects like NVLink or AMD’s Infinity Fabric can easily accommodate the additional traffic. But for models distributed across multiple nodes, model experts may be running on GPUs in different systems connected by interconnects that are 7 to 14x slower.
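To put rough numbers on that gap, here's a quick bandwidth-only comparison of how long the same dispatch payload takes to cross an intra-node link versus a 400 Gbps NIC. The payload size and the intra-node figure are assumptions chosen for illustration, not measurements from the paper:

```python
# Illustrative transfer-time comparison for an MoE dispatch payload.
# The payload size and intra-node link speed are assumptions for the sake of
# arithmetic, not measurements from Perplexity's paper.

PAYLOAD_MB = 8                   # hypothetical per-dispatch activation payload
INTRA_NODE_GBPS = 3_600          # assumed NVLink-class link, per direction
NIC_GBPS = 400                   # 400 Gbps NIC, per the article

def transfer_us(payload_mb: float, link_gbps: float) -> float:
    """Bandwidth-only time to move the payload over the link, in microseconds."""
    return payload_mb * 8e6 / (link_gbps * 1e9) * 1e6

for name, gbps in [("intra-node", INTRA_NODE_GBPS), ("400G NIC", NIC_GBPS)]:
    print(f"{name:10s}: {transfer_us(PAYLOAD_MB, gbps):7.1f} us")

# With these assumed figures the NIC path is roughly 9x slower on bandwidth
# alone, consistent with the 7 to 14x gap cited above, before any protocol or
# software overhead is added on top.
```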
To get around this, the developers behind DeepSeek V3 built DeepEP — the EP here standing for expert parallelism — a software framework designed to minimize the performance hit associated with running its models across multiple H800-based systems connected using Nvidia’s ConnectX NICs.
Cutting the EFA overhead
The problem, as you might have surmised, is that not everyone is using Nvidia’s NICs in their compute environments. Amazon Web Services (AWS) is a prime example.
Rather than standard Ethernet or Nvidia’s InfiniBand interconnect tech, AWS has developed its own networking protocol, which it calls Elastic Fabric Adapter (EFA).
Like Nvidia’s ConnectX-7 NICs commonly deployed in its Hopper generation of HGX and DGX systems, EFA supports up to 400 Gbps of aggregate bandwidth. But as Perplexity points out in its research, these NICs fall short in several notable areas.
For one, the AI search provider notes that EFA trails Nvidia’s NICs at the message sizes exchanged during MoE dispatch and combine. Second, EFA lacks support for GPUDirect Async, a technology that allows the NIC to bypass the host CPU and communicate directly with the GPU. Because of this, EFA incurs a latency penalty in certain workloads, as data has to be proxied through the CPU first.
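A toy latency model makes the cost of that extra hop concrete. The individual figures below are invented purely for illustration; only the structure, one path with a host detour and one without, comes from the article:

```python
# Toy latency model for the two send paths described above. The numbers are
# invented for illustration; only the structure (an extra host hop when the
# NIC cannot read GPU memory directly) reflects the article.

GPU_TO_NIC_US = 2     # assumed: NIC pulls the send buffer straight from GPU memory
GPU_TO_CPU_US = 5     # assumed: stage the data into host memory first
CPU_TO_NIC_US = 3     # assumed: CPU proxy hands the buffer to the NIC
WIRE_US = 10          # assumed: time on the wire, identical for both paths

direct_path_us = GPU_TO_NIC_US + WIRE_US                    # GPUDirect Async style
proxied_path_us = GPU_TO_CPU_US + CPU_TO_NIC_US + WIRE_US   # CPU-proxied EFA path

print(f"direct : {direct_path_us} us")
print(f"proxied: {proxied_path_us} us")
# The proxied path pays the host detour on every dispatch and combine, which is
# the kind of overhead any EFA-based MoE kernel has to contend with.
```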
To get around this, Perplexity has developed a new set of kernels — optimized software routines that handle communication between GPUs — which the company claims achieves even lower latency than DeepSeek’s DeepEP on ConnectX-7 NICs, and makes using EFA for distributed inference on MoE models viable.
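The communication pattern those kernels have to handle boils down to an all-to-all dispatch of tokens to the ranks hosting their chosen experts, followed by a combine that gathers the results back. Here's a simplified, CPU-only sketch of that pattern; it's a conceptual illustration, not Perplexity's actual kernel API:

```python
import numpy as np

# Conceptual sketch of expert-parallel dispatch and combine for an MoE layer.
# This illustrates the communication pattern, not Perplexity's kernels.
NUM_RANKS, EXPERTS_PER_RANK, TOP_K, HIDDEN = 4, 8, 2, 16
NUM_EXPERTS = NUM_RANKS * EXPERTS_PER_RANK

rng = np.random.default_rng(1)
tokens = rng.standard_normal((6, HIDDEN))                        # a tiny batch
chosen = np.stack([rng.choice(NUM_EXPERTS, size=TOP_K, replace=False)
                   for _ in range(len(tokens))])                 # router output

# Dispatch: bucket each token copy by the rank hosting its chosen expert.
# In a real deployment this is the all-to-all that crosses NVLink or the NIC.
send_buckets = {rank: [] for rank in range(NUM_RANKS)}
for tok_id, experts in enumerate(chosen):
    for expert in experts:
        send_buckets[expert // EXPERTS_PER_RANK].append((tok_id, expert))

# Expert compute: each rank processes the tokens it received (stubbed out here).
partial_outputs = {}
for rank, items in send_buckets.items():
    for tok_id, expert in items:
        partial_outputs.setdefault(tok_id, []).append(tokens[tok_id] * 0.5)

# Combine: the reverse all-to-all returns each token's partial results for summing.
outputs = np.stack([np.sum(partial_outputs[t], axis=0) for t in range(len(tokens))])
print(outputs.shape)   # (6, 16): one combined activation per token
```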
Compared to DeepSeek’s existing DeepEP libraries, Perplexity’s kernels achieved slightly better performance in certain metrics when running on Nvidia’s ConnectX-7, while also bringing EFA latencies down to acceptable levels.
To validate these results, Perplexity tested the kernels on its in-house inference engine, running both DeepSeek V3 and Kimi K2 on a series of AWS H200-based p5en instances with EFA handling inter-node communications.
While DeepSeek V3 isn’t a trillion-parameter model, weighing in at just under 700 billion parameters, it is small enough to fit onto a single H200 instance, so a single-node deployment serves as the baseline for measuring the multi-node gains.
In testing, Perplexity compared the performance of a single eight-GPU system against multi-instance setups with 16 (two instances) or 32 GPUs (four instances). While performance at low and high batch sizes remained fairly consistent, Perplexity observed that higher degrees of expert parallelism in the multi-node arrangements allowed for higher performance at medium batch sizes.
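One way to see why the extra expert parallelism helps in that middle regime: spreading the routed experts across more GPUs shrinks each card's share of the weights, leaving more memory for KV cache and concurrent requests. A quick sketch of the arithmetic (the 256 routed experts per MoE layer matches DeepSeek V3's published configuration; treating all experts as equal-sized is a simplification):

```python
# How higher expert parallelism spreads an MoE layer's routed experts over GPUs.
# 256 routed experts per layer matches DeepSeek V3's published configuration;
# treating every expert as equal-sized is a simplification.

ROUTED_EXPERTS = 256

for label, gpus in [("1 node ", 8), ("2 nodes", 16), ("4 nodes", 32)]:
    per_gpu = ROUTED_EXPERTS / gpus
    print(f"{label}: {gpus:2d} GPUs -> {per_gpu:5.1f} experts per GPU")

# Fewer experts per GPU means a smaller slice of the weights resident on each
# card, freeing HBM for KV cache, at the cost of more inter-node dispatch and
# combine traffic.
```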
Compared to a single-node baseline, Perplexity’s optimized kernels delivered a substantial performance uplift when distributing the model across two- and four-node configurations.
These performance characteristics also extended to larger models like Kimi K2, which is too large to fit on a single instance. Despite the bandwidth limitations compared to Nvidia’s NVLink or AMD’s Infinity Fabric, which can be as much as 14x faster than Ethernet, Perplexity was still able to demonstrate meaningful performance gains at medium batch sizes for multi-node inference.
Perplexity’s optimized EFA kernels also showed a performance uplift at medium batch sizes with the larger, 1-trillion-parameter Kimi K2 model.
Work to further optimize Perplexity’s kernels for Amazon’s EFA network tech is ongoing. The AI search provider says it’s following updates to Amazon’s libfabric libraries in order to reduce data plane overheads, and plans to experiment with efa-direct to further curb latency and improve overall performance.
However, the real benefit may be for the folks who can now sweat their existing hardware longer, or take advantage of discounted instance types on the largest cloud provider in the world, without missing out on next-gen frontier models. ®