Communications in Data Centers: Optimization of Data Transmission for Distributed AI Training

Efficient transmission of large volumes of data between geographically distributed computing nodes is a fundamental challenge in high-performance computing and data processing. A key limitation in such systems is suboptimal utilization of network bandwidth, which leads to significant time losses, idle time on expensive resources, and reduced overall efficiency of distributed systems such as computing clusters.
To address this issue, researchers at the Wireless Networks Laboratory, working on a project with an industrial partner, have developed a new method for controlling the data rate of end devices. The proposed method enables optimal allocation of the available bandwidth; its implementation significantly increases data rates and reduces latency, particularly under the heterogeneous traffic conditions typical of real-world scenarios.
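The article does not disclose the details of the developed method, but the general idea of fairly dividing a shared link among flows with heterogeneous demands can be illustrated with a classic max-min fair allocation (water-filling): small flows receive exactly what they request, and the leftover capacity is split evenly among the remaining flows. This is an illustrative sketch, not the laboratory's actual algorithm; the function name and units are hypothetical.

```python
def max_min_fair_allocation(demands, capacity):
    """Illustrative max-min fair split of link capacity among flows.

    demands  -- requested rates per flow (e.g. in Gbit/s)
    capacity -- total link capacity in the same units
    Returns a per-flow allocation whose sum never exceeds capacity.
    """
    alloc = [0.0] * len(demands)
    remaining = capacity
    # Process flows from smallest demand to largest.
    active = sorted(range(len(demands)), key=lambda i: demands[i])
    while active:
        share = remaining / len(active)  # equal share of what is left
        i = active[0]
        if demands[i] <= share:
            # Small flow: fully satisfy it and redistribute the rest.
            alloc[i] = demands[i]
            remaining -= demands[i]
            active.pop(0)
        else:
            # All remaining flows are unsatisfied: split evenly.
            for j in active:
                alloc[j] = share
            break
    return alloc
```

For example, three flows demanding 2, 8, and 10 Gbit/s on a 12 Gbit/s link receive 2, 5, and 5 Gbit/s respectively: the small flow is satisfied in full, and the two large flows share the remainder equally.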
The practical significance of this work is most evident in scenarios where distributed systems require continuous data synchronization. Applying the developed method minimizes system downtime and dramatically improves the overall efficiency of coordinated computational processes, which is critically important for tasks such as distributed training of large language models.
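To see why bandwidth utilization dominates such workloads, a back-of-envelope estimate helps: each synchronization step in data-parallel training must move a full set of gradients between nodes. The figures below (model size, precision, link speed) are assumed for illustration only and are not taken from the project.

```python
# Back-of-envelope cost of one gradient synchronization step.
# All figures are illustrative assumptions, not project data.
params = 70e9           # assumed model size: 70B parameters
bytes_per_param = 2     # fp16 gradients
link_gbps = 100         # assumed inter-node link speed, Gbit/s

payload_gb = params * bytes_per_param / 1e9   # 140 GB of gradients
seconds = payload_gb * 8 / link_gbps          # naive transfer time
print(f"{seconds:.1f} s per sync step")       # ~11.2 s at full link rate
```

At anything below full link utilization this per-step cost grows proportionally, which is why even modest improvements in bandwidth allocation translate directly into shorter training times.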
