RDMACell: Token-Based Flowcell-Level RDMA Load Balancing for Large-Scale AI Training
Remote Direct Memory Access (RDMA) is a core technology for high-performance data center networks. However, its default Equal-Cost Multi-Path (ECMP) load-balancing mechanism often suffers from severe performance degradation due to hash collisions and elephant flows. Existing solutions have obvious limitations: flowlet-based approaches lack the necessary time gaps to trigger flowlet switching, while packet-level approaches suffer from severe pa...
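The hash-collision problem named above can be illustrated with a minimal sketch. It assumes a CRC32 hash over the flow's 5-tuple, whereas real switches use vendor-specific hash functions. The helper names `ecmp_path` and `path_loads`, the addresses, and the flow sizes are hypothetical and do not come from the article. Because the path depends only on the 5-tuple, every packet of a flow stays on one uplink, so two elephant flows that hash to the same uplink share it for their whole lifetime.

```python
import zlib


def ecmp_path(five_tuple, num_paths):
    """Pick an uplink by hashing the flow's 5-tuple (static per-flow ECMP).

    CRC32 is an assumption for illustration, not a specific switch's hash.
    """
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % num_paths


def path_loads(flows, num_paths):
    """Return (bytes per path, flow count per path).

    `flows` maps a 5-tuple to the number of bytes that flow offers.
    """
    load = {p: 0 for p in range(num_paths)}
    count = {p: 0 for p in range(num_paths)}
    for five_tuple, size in flows.items():
        p = ecmp_path(five_tuple, num_paths)
        load[p] += size
        count[p] += 1
    return load, count


# Hypothetical elephant flows: (src IP, dst IP, protocol, src port, dst port).
# Protocol 17 is UDP and 4791 is the RoCEv2 destination port.
flows = {
    ("10.0.0.1", "10.0.1.1", 17, 49152 + i, 4791): 10**9
    for i in range(5)
}

load, count = path_loads(flows, num_paths=4)
for p in sorted(load):
    print(f"path {p}: {count[p]} flow(s), {load[p]} bytes")
```

With five flows and four paths, at least one path must carry two or more elephant flows no matter how good the hash is, and in practice collisions also appear well before the number of flows exceeds the number of paths.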