Communication-Semantic-Aware RDMA Loss Recovery for QP-scalable Hyperscale AI Training
Current artificial intelligence (AI) infrastructures widely adopt Remote Direct Memory Access (RDMA) to support high-performance communication. Training trillion-parameter models involves frequent collective communication operations, such as All-Reduce and All-to-All, which generate intensive RDMA traffic. Existing RDMA deployments predominantly use the reliable connection (RC) model, where each process pair requires a dedicated queue pair (QP)....
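To make the QP-scalability concern concrete, here is a minimal back-of-the-envelope sketch. It assumes the worst case of a full mesh, where every process holds an RC connection to every other process (as an All-to-All can require), and one QP per peer on each side. The function name `rc_qp_counts` and the parameter `qps_per_pair` are illustrative and do not come from the article.

```python
def rc_qp_counts(num_processes: int, qps_per_pair: int = 1) -> tuple[int, int]:
    """Return (QPs per process, QPs cluster-wide) for a full mesh of RC connections.

    Assumption: each process opens `qps_per_pair` QPs toward every other process,
    so each connected pair consumes that many QPs on both endpoints.
    """
    if num_processes < 1 or qps_per_pair < 1:
        raise ValueError("num_processes and qps_per_pair must be positive")
    per_process = (num_processes - 1) * qps_per_pair  # one set of QPs per peer
    total = num_processes * per_process               # summed over all processes
    return per_process, total


if __name__ == "__main__":
    # Hypothetical job sizes, chosen only to show the growth rate.
    for n in (8, 1024, 65536):
        per_process, total = rc_qp_counts(n)
        print(f"{n} processes: {per_process} QPs per process, {total} QPs in total")
```

Under these assumptions the per-process QP count grows linearly with job size and the cluster-wide count grows quadratically, which is why a dedicated QP per process pair becomes hard to sustain at hyperscale.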