Abstract: The rise of distributed AI and large-scale applications has changed the communication patterns of data-center and supercomputer interconnection networks, producing severe incast and in-network congestion scenarios that challenge existing congestion control mechanisms such as injection throttling (e.g., DCQCN) and congestion isolation (CI). DCQCN provides scalable traffic rate adjustment for congesting flows at the end nodes, but it reacts slowly; CI effectively isolates these flows in special network resources, but it requires extra logic in the switches. Using the two together diminishes their individual drawbacks, yet it leads to false identification and signaling of congestion scenarios, excessive throttling, and inefficient use of network resources. In this paper, we propose a new CI mechanism, called Improved Congestion Isolation (ICI), which efficiently combines CI and DCQCN: the information about the isolated congesting flows guides the ECN marking performed by DCQCN so that victim flows are not marked. This coordination reduces false-positive congestion detection, suppresses unnecessary closed-loop feedback (i.e., wrong congestion notifications), and improves responsiveness to communication microbursts. Evaluated under diverse traffic patterns, including incast and data-center workloads, ICI reduces the number of generated BECNs by up to 32x and improves tail latency by up to 31%, while maintaining high throughput and scalability.
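The marking idea in the abstract can be illustrated with a toy sketch. This is not the ICI algorithm, whose details the abstract does not give; it only contrasts threshold-only ECN marking with marking that is additionally gated on whether a flow is already isolated. The function names, the packet-count threshold, and the `isolated_flows` set are all assumptions made for illustration.

```python
# Toy contrast between threshold-only ECN marking and marking guided by
# congestion-isolation state. All names and values here are hypothetical;
# this is an illustration of the idea, not the paper's mechanism.

def mark_threshold_only(queue_depth, ecn_threshold):
    """Mark any packet that finds the queue above the threshold."""
    return queue_depth > ecn_threshold


def mark_isolation_guided(flow_id, queue_depth, ecn_threshold, isolated_flows):
    """Mark only if the queue is above the threshold AND the flow has
    already been identified (isolated) as a congesting flow."""
    return queue_depth > ecn_threshold and flow_id in isolated_flows


def count_marks(packets, ecn_threshold, isolated_flows):
    """packets is a list of (flow_id, queue_depth_seen_on_arrival) pairs.
    Returns (marks under threshold-only, marks under isolation-guided)."""
    plain = sum(mark_threshold_only(depth, ecn_threshold) for _, depth in packets)
    guided = sum(
        mark_isolation_guided(flow, depth, ecn_threshold, isolated_flows)
        for flow, depth in packets
    )
    return plain, guided
```

In this sketch, a victim flow that shares a deep queue with an incast flow is marked under the threshold-only rule but not under the guided rule, which is the kind of false-positive notification the abstract says ICI aims to reduce.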
| Comments: | 26 pages, 6 figures |
| Subjects: | Networking and Internet Architecture (cs.NI) |
| Cite as: | arXiv:2511.04639 [cs.NI] |
| (or arXiv:2511.04639v1 [cs.NI] for this version) | |
| https://doi.org/10.48550/arXiv.2511.04639 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Alberto Merino [v1] Thu, 6 Nov 2025 18:33:27 UTC (421 KB)