Explorations of RDMA in LLM Systems

Last week, our team summarized some recent progress we made on point-to-point communication for LLM systems and posted a paper on arXiv. We also open-sourced the code on GitHub.

We built an RDMA communication library based on Unordered Reliable Datagram (URD) semantics. It runs on both AWS EFA and NVIDIA ConnectX. We applied the library to three scenarios: KvCache transfer in disaggregated inference, model-parameter updates in RL post-training, and MoE communication. During decode, the MoE kernel runs slightly faster than DeepEP on ConnectX-7, and on EFA it delivers the first performance that is actually usable in practice.

In this post, I want to share the backstory — the motivation, the design decisions, and some fun debugging moments along the way. If you want the full technical details, check the paper, source code, and linked blog posts at the end.

The Pain of Collective Communication

Collective communication has been battle-tested in structured parallelism (data parallelism, tensor parallelism, etc.), but in some newer scenarios it feels awkward to use — direct RDMA P2P is often simpler and faster. Why awkward?

First, collectives require a fixed “world” of participants. Nodes can’t be added or removed. This is a production nightmare. In disaggregated inference, Prefillers and Decoders need to exchange KvCache. But real production traffic fluctuates — replica count must scale up and down. Machines also fail. If the communication group is static, you can neither scale elastically nor replace bad nodes.

Second, initializing the collective world is blocking and requires every participant to join. So every time you scale up or down, the whole world must pause. Sure, you can hack together async world-construction logic, but it’s painful and error-prone.

Third, collectives guarantee global ordering semantics. This simplifies application logic, but networks inherently deliver messages out of order, so the library may need extra buffering or synchronization to preserve ordering. The funny part is that some applications don’t even want this guarantee. In KvCache transfer, for example, we only care that all pages eventually arrive; the order doesn’t matter. The same goes for RL post-training weight updates: as long as all model parameters arrive, the order is irrelevant.

Fourth, collectives require all participants to share the same tensor shape and dtype. This can hurt code ergonomics and sometimes kills performance. E.g., using collectives for RPC forces you to always send the maximum possible message size. MoE is even worse — each rank sends a variable number of tokens to every other rank. With collectives, you must size everything for the worst case (all tokens routed to the same rank), which can be 64× larger than the average — completely wasteful in something like DP64-EP64.
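
To make the 64× figure concrete, here is a back-of-the-envelope sketch. The batch size and top-k are made-up illustrative numbers (not measurements from the paper), and the worst case assumes a single rank hosts enough experts to absorb every routed copy.

```python
from typing import Tuple


def recv_token_copies(world_size: int, tokens_per_rank: int, top_k: int) -> Tuple[int, int]:
    """Return (average, worst_case) token copies one rank can receive per dispatch."""
    total_copies = world_size * tokens_per_rank * top_k
    average = total_copies // world_size  # routing spread evenly over all ranks
    worst_case = total_copies             # every copy routed to this one rank
    return average, worst_case


# Hypothetical decode-style numbers for a DP64-EP64 setup.
avg, worst = recv_token_copies(world_size=64, tokens_per_rank=128, top_k=8)
print(f"average={avg} worst_case={worst} ratio={worst // avg}x")
```

A fixed-shape collective has to allocate and move the worst-case buffer on every step, even though typical traffic sits near the average.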

The Pain of RDMA

So why do people still cling to collectives? I think there are two main reasons:

Lack of a simple, easy-to-use RDMA library.
RDMA libraries are basically all tied to NVIDIA’s ConnectX NICs.

The joke “NVIDIA GPUs have fast network speed” started as a meme in the 2010s, but after NVIDIA acquired Mellanox, it became literal truth. In our measurements, ConnectX really is fast, and you can just buy it off the shelf. Fast, available, and backed by massive ecosystem support: no wonder everyone builds around ConnectX.

Cloud vendors, on the other hand, have good reasons to build their own NICs: virtualization, security isolation, CPU offload, etc. As ML and HPC workloads grew, they slowly patched in more features, aiming to stay compatible with existing ecosystems. For example, AWS has a dedicated NCCL plugin that lets NCCL run on EFA.

But being able to run NCCL doesn’t mean you can run other communication libraries. Great projects like Mooncake or DeepEP can’t run on EFA at all. In March, we tried building an MoE communication kernel using NVSHMEM, but performance was terrible. Even with IBGDA on ConnectX-7, it was far slower than DeepEP, which uses the raw mlx5 driver. On EFA it was 40× slower, which made it completely unusable.

Why is RDMA code so hard to port across NICs? A few reasons:

1. Programs rely on protocol-level ordering. Most RDMA code uses the RC (Reliable Connection) protocol, which has in-order delivery. EFA, however, uses SRD (Scalable Reliable Datagram) — reliable but unordered.

If your code assumes ordering (e.g., marking the final packet to signal completion), it simply breaks under SRD. Libraries like NVSHMEM must add buffering, synchronization, or message reordering to emulate ordering — which tanks performance.

2. NIC bandwidth differs. ConnectX-7 can do 400 Gbps on a single card. EFA on p5 is 100 Gbps, and on p5en is 200 Gbps. You need multiple EFAs to match a single ConnectX-7.

3. NIC features differ. IBGDA lets GPUs directly initiate NIC operations — but only ConnectX supports it. Without it, you need CPU mediation to initiate “GPU-side” RDMA, and depending on the software design, this can add tons of overhead.

4. High-performance code often uses non-portable interfaces. DeepEP, for example, technically uses NVSHMEM, but only as a launcher. All real polling and NIC interaction is via low-level mlx5dv, which is ConnectX-only.
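
Point 1 is easy to see in a toy simulation. The sketch below is not real RDMA code: the packet tuples and function names are hypothetical stand-ins. It models a receiver that treats a "last packet" flag as the completion signal, then runs it over an in-order transport (RC-like) and a shuffled one (SRD-like).

```python
import random


def deliver(packets, ordered, rng):
    """Return packets in arrival order; an unordered transport may permute them."""
    arrived = list(packets)
    if not ordered:
        rng.shuffle(arrived)
    return arrived


def bytes_seen_at_last_flag(arrived):
    """Receiver that assumes ordering: it declares completion at the 'last' flag."""
    seen = 0
    for payload_len, is_last in arrived:
        seen += payload_len
        if is_last:
            return seen
    return seen


# Eight 4 KiB packets; only the final one carries the completion flag.
packets = [(4096, i == 7) for i in range(8)]
total = sum(length for length, _ in packets)
rng = random.Random(0)

# In-order delivery: the flag really does mean "everything has arrived".
assert bytes_seen_at_last_flag(deliver(packets, True, rng)) == total

# Unordered delivery: the flag often shows up before the rest of the data.
early = sum(
    bytes_seen_at_last_flag(deliver(packets, False, rng)) < total
    for _ in range(1000)
)
print(f"declared complete too early in {early}/1000 unordered trials")
```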

How We Approached the Problem

Given all this, our strategy was to take the intersection of features supported across NIC architectures and build a general RDMA library on top:

Reliable delivery: rely on NIC/hardware for retransmission, avoid doing it in software.
Unordered delivery: embrace the hardware reality and drop ordering semantics.
Support two-sided SEND/RECV for RPC.
Support one-sided WRITE_IMM for maximum performance plus 32-bit immediate notification.
ImmCounter synchronization model: every WRITE carries a user-specified immediate; the receiver counts immediates and considers the transfer complete when the counter reaches the expected value.
Provide high-level optimized interfaces for common patterns (multi-page WRITE, multi-receiver WRITE, etc.).
Support multi-NIC aggregation: split by byte ranges, by page lists, by receiver lists, etc.
Host-proxy GPU RDMA: allow CUDA Graphs to issue RDMA by bouncing through shared memory between CPU and GPU.
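
As a rough illustration of how the ImmCounter model and byte-range splitting fit together, here is a pure-Python simulation. It is a sketch under assumptions, not the library's API: `split_byte_range`, the `ImmCounter` class, and the immediate value are hypothetical names chosen for this example, and a slice copy stands in for the one-sided WRITE.

```python
import random
from collections import defaultdict


def split_byte_range(total_bytes, num_nics):
    """Split [0, total_bytes) into near-equal contiguous chunks, one per NIC."""
    base, extra = divmod(total_bytes, num_nics)
    chunks, offset = [], 0
    for nic in range(num_nics):
        length = base + (1 if nic < extra else 0)
        if length:
            chunks.append((nic, offset, length))
        offset += length
    return chunks


class ImmCounter:
    """Receiver-side counter keyed by the immediate value carried by each WRITE."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.expected = {}

    def expect(self, imm, count):
        self.expected[imm] = count

    def on_write_imm(self, imm):
        """Record one arrived WRITE; return True once the expected count is reached."""
        self.counts[imm] += 1
        return self.counts[imm] == self.expected.get(imm)


src = bytes(i % 251 for i in range(100_003))
dst = bytearray(len(src))
writes = split_byte_range(len(src), num_nics=4)

IMM = 0x1234  # arbitrary tag identifying this transfer
counter = ImmCounter()
counter.expect(IMM, len(writes))

random.Random(0).shuffle(writes)  # chunks may land in any order
done = False
for nic, offset, length in writes:
    dst[offset:offset + length] = src[offset:offset + length]  # stand-in for WRITE
    done = counter.on_write_imm(IMM)

assert done and bytes(dst) == src
```

Because completion depends only on how many immediates have arrived, the receiver never needs to know which chunk came first or which NIC carried it.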

Note that this library intentionally does not preserve compatibility with existing RDMA programs. We changed the underlying contract; applications must adapt to our more relaxed model to get good performance.

This whole thing started for KvCache transfers only, but later we realized it worked super well for RL weight updates — and unexpectedly, it turned out to be excellent for MoE communication too.

MoE Kernel Story

This takes us back to May at MLSys. During a chat with friends from ByteDance, we learned an important fact: CPU↔GPU PCIe latency is only ~2 μs. I ran the GDRCopy benchmark — yep, that’s real.

So after the conference I thought:

“We could use the RDMA library to implement an NVSHMEM-like mechanism ourselves. MoE takes a few hundred microseconds anyway — paying 15 μs for a host-proxy path is fine.”

Then I realized something else: if EFA is unordered, simply porting DeepEP or our previous NVSHMEM implementation may leave performance on the table. To get the best results, you need to design the kernel for unordered delivery from day one.

My teammate wasn’t satisfied with the previous NVSHMEM MoE kernel either, so he took on the task. He absolutely crushed it — a brand-new MoE kernel in a short time.

When the new kernel ran on EFA, we were excited — but we had no idea whether the performance was bad because of our implementation or because EFA itself was slow. So the next goal: run it on ConnectX-7 and compare against DeepEP.

We estimated that if we could get within 15 μs of DeepEP (the PCIe + CPU overhead), we could call it done.

To do that, we first needed ConnectX-7 machines. After asking around everywhere, NVIDIA generously offered a 64-GPU dev cluster. Huge thanks to them.

With hardware in hand, I started adding ConnectX-7 support to our RDMA library. I initially assumed libfabric could handle ConnectX, but nothing worked. Folks from libfabric and NVIDIA both said ConnectX wasn’t a development focus — even if it did run, performance wouldn’t be good. So I wrote a verbs backend myself. Fortunately our abstraction layers were clean, so only the lowest hardware-dependent layer had to change.

While implementing the verbs backend, I ended up comparing SRD and RC side by side. Even though EFA is slower, I actually like SRD’s programming model better:

SRD is datagram-based: you can send messages directly as long as you know the address. RC requires connection setup. Setting up RC requires exchanging QPN/PSN, which itself requires another communication channel. I ended up using UD to bootstrap RC.
I considered whether to maintain connections in the network library or in the application. Ultimately, I decided to let the network library handle the complexity. If I already find maintaining connections painful while working on the network library itself, it would be even worse for application developers.
libfabric receives WRITE_IMM immediates without consuming RECVs; verbs consumes RECVs, and in FIFO order. This means WRITE_IMM and two-sided RECV cannot share the same QP. I solved this by using two RC QPs — one for one-sided ops, one for two-sided ops.
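
The last point can be modeled in a few lines. This is a toy model of a verbs receive queue, not real verbs code: `ToyQP` and the buffer names are hypothetical. It relies only on the behavior described above, namely that an incoming WRITE_IMM consumes a posted RECV just like a SEND does, and that RECVs are consumed in FIFO order.

```python
from collections import deque


class ToyQP:
    """Toy queue pair whose posted RECVs are consumed strictly in FIFO order."""

    def __init__(self, name):
        self.name = name
        self.recv_queue = deque()

    def post_recv(self, buffer_name):
        self.recv_queue.append(buffer_name)

    def on_incoming(self, op):
        """Both SEND and WRITE_IMM consume the RECV at the head of the queue."""
        assert op in ("SEND", "WRITE_IMM")
        return op, self.recv_queue.popleft()


# Shared QP: a WRITE_IMM that arrives first eats the buffer posted for the RPC.
shared = ToyQP("shared")
shared.post_recv("rpc_buffer")
shared.post_recv("imm_slot")
shared_result = [shared.on_incoming("WRITE_IMM"), shared.on_incoming("SEND")]

# Two QPs: each kind of traffic only ever consumes its own RECVs.
one_sided, two_sided = ToyQP("one_sided"), ToyQP("two_sided")
one_sided.post_recv("imm_slot")
two_sided.post_recv("rpc_buffer")
split_result = [one_sided.on_incoming("WRITE_IMM"), two_sided.on_incoming("SEND")]

print("shared QP:", shared_result)
print("split QPs:", split_result)
```

With a shared QP the SEND ends up matched to the slot meant for immediates, which is why separating one-sided and two-sided traffic onto different QPs is the simpler fix.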

After getting point-to-point tests working, I nervously ran the MoE kernel on ConnectX-7 — and shockingly, it worked on the first try. A great validation of the library’s portability.

Then my teammate and I started optimizing:

algorithmic parallelism
RDMA/NVLink synchronization
PTX tuning
load/store optimizations
reducing CPU allocations
adding multi-receiver optimized APIs
ConnectX-specific and EFA-specific paths
etc.

In the end, to our surprise, decode on ConnectX-7 became faster than DeepEP. To keep the comparison fair, I also ran DeepEP under our benchmarking code, and the numbers matched those reported by DeepEP’s official benchmark program.

Even better, many optimizations carried over to EFA as well, so performance improved significantly there too. Huge thanks to AWS and especially the EFA team — their responses were incredibly fast and super helpful.

After decode, we also tested prefill. Our implementation uses the same code path for decode and prefill, and we haven’t optimized prefill specifically yet. We can saturate RDMA bandwidth, but DeepEP is still much faster.

The weird part: DeepEP’s measured latency is lower than the theoretical lower bound based on bytes transferred and NIC bandwidth. After double-checking everything, the only explanation was: DeepEP sends fewer bytes.

We later confirmed this with folks from DeepEP. During dispatch: if one token is routed to multiple ranks on the same machine, DeepEP sends it once over RDMA, then NVLink forwards internally. During combine: if multiple local ranks send the same token to the same remote rank, DeepEP does a partial reduction via NVLink before sending a single token over RDMA.
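
A small counting sketch shows why this reduces RDMA traffic during dispatch. It is not DeepEP's code: the function name, the routing table, and the 8-ranks-per-node layout are assumptions made for illustration.

```python
def rdma_token_sends(routes, ranks_per_node, src_rank):
    """Count token copies crossing RDMA for one source rank during dispatch.

    routes: one list of destination ranks per token.
    Returns (per_rank_sends, per_node_sends).
    """
    src_node = src_rank // ranks_per_node
    per_rank = per_node = 0
    for dst_ranks in routes:
        # Ranks on the sender's own node are reachable over NVLink, not RDMA.
        remote = {r for r in dst_ranks if r // ranks_per_node != src_node}
        per_rank += len(remote)                                 # one send per remote rank
        per_node += len({r // ranks_per_node for r in remote})  # one send per remote node
    return per_rank, per_node


# Hypothetical routing for three tokens held by rank 0, with 8 ranks per node.
routes = [
    [8, 9, 10, 17],    # three ranks on node 1, one on node 2
    [0, 1, 8, 15],     # two local ranks, two ranks on node 1
    [24, 25, 26, 27],  # four ranks on node 3
]
per_rank, per_node = rdma_token_sends(routes, ranks_per_node=8, src_rank=0)
print(f"RDMA sends: per-rank={per_rank}, per-node={per_node}")
```

In this example, sending once per destination node and forwarding over NVLink cuts the RDMA sends from 10 to 4, which is how a measured latency can dip below a bound computed from per-rank byte counts.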

It’s an incredibly clever design. This also explains why DeepSeek-V3/R1 limits expert routing to at most four nodes.

Final Thoughts

After finishing this round of optimizations, we wrote up our findings and open-sourced the library. Here are the links:

arXiv 2510.27656: RDMA Point-to-Point Communication for LLM Systems
GitHub: pplx-garden
Company blog posts:
RDMA Point-to-Point Communication for LLM Systems
Enabling Trillion-Parameter Models on AWS EFA
Weight Transfer for RL Post-Training in under 2 seconds
Disaggregated Prefill and Decode
My personal blog:
Journey to 2-second Inter-node RL Weight Transfer and A Quick Follow-up
Harnessing 3200 Gbps Network

I’ve learned so much from talking to peers in the MLSys community and from open-source projects. Writing blogs internally and externally, and pushing for open sourcing our code, feels like my small way of giving back to the community.

In less than a year, I learned an enormous amount about RDMA. Last December, I didn’t even know how to get RDMA working on AWS. I spent Christmas break figuring out how to use EFA, and now we have a fairly complete solution. Our tiny 4-person Inference Team (grew to 7 recently) shipped a ton of work over the past 18 months. When I joined in April last year we were still relying entirely on TensorRT-LLM. Then we built our own inference engine, routing layer, kernels, networking stack… At first I was just engineering known optimizations properly; then we started doing engineering innovations; now we’re publishing things that feel like actual research contributions. The journey has far exceeded what I expected.

If you’re in the U.S. and interested in systems optimization at our company, feel free to reach out. If you’re interested in RL post-training, AI product, or other roles, I can help refer you.
