Enabling Trillion-Parameter Models on AWS EFA
research.perplexity.ai

At Perplexity, we use the best models for our product, our APIs, and our research teams. Large open-source Mixture-of-Experts models such as Kimi-K2 pose particular challenges: even the largest single inference nodes, with 8x NVIDIA H200 GPUs, cannot accommodate them efficiently, necessitating multi-node deployments. We present a set of kernels for expert parallelism that achieve state-of-the-art latencies on ConnectX-7, exceeding the performance of DeepEP. The same kernels are also the first to achieve viable latencies on AWS Elastic Fabric Adapter (EFA), enabling trillion-parameter model deployments.
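
To make the capacity constraint concrete, here is a rough back-of-envelope sketch. The figures are approximations assumed for illustration (roughly 1 trillion parameters at FP8, about 141 GB of HBM per H200) and are not taken from the post; the point is only that the weights alone consume nearly all of a single node's memory, leaving little headroom for KV cache and activations.

```python
# Illustrative back-of-envelope estimate (approximate, assumed figures):
# why a single 8x H200 node struggles to serve a ~1T-parameter MoE model.

H200_HBM_GB = 141        # approx. HBM per H200 GPU
GPUS_PER_NODE = 8
TOTAL_PARAMS_B = 1000    # ~1 trillion parameters (Kimi-K2 scale, assumed)
BYTES_PER_PARAM = 1.0    # FP8 weights

node_hbm_gb = H200_HBM_GB * GPUS_PER_NODE          # ~1128 GB of HBM per node
weights_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM      # ~1000 GB of weights alone
headroom_gb = node_hbm_gb - weights_gb             # what's left for KV cache,
                                                   # activations, and buffers

print(f"node HBM:  {node_hbm_gb:.0f} GB")
print(f"weights:   {weights_gb:.0f} GB (FP8)")
print(f"headroom:  {headroom_gb:.0f} GB")
# ~128 GB of headroom across 8 GPUs is too little for long-context KV cache
# at realistic batch sizes, which is why multi-node expert parallelism
# (and hence fast inter-node communication) becomes necessary.
```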

Try our kernels on GitHub and read the full research paper on arXiv.

Introduction

Mixture-of-Experts (…
