Regular Paper

Orchestrating In-Network Aggregation for Distributed Machine Learning via In-Band Network Telemetry

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

Abstract

Distributed machine learning systems train models through iterative exchanges of updates between parallel workers and the parameter server. To expedite these transmissions, in-network aggregation merges updates as packets are forwarded through programmable switches, reducing the traffic on bottleneck links. However, existing in-network aggregation schemes do not select the most suitable switches for varying worker distributions and fail to capture the dynamic network status. Using the status derived from in-band network telemetry, we select the best aggregation switches by formulating an optimization problem whose objective is to minimize the transmission latency. Although the problem is a non-linear integer program, through careful transformations we obtain a substitute with totally unimodular constraints and a separable convex objective, which is solved to yield the integral optimum. We implement our in-network aggregation protocol and rebuild the in-band network telemetry protocol on real devices, i.e., a Barefoot Wedge100BF switch and Dell servers. We then evaluate our proposed AGG algorithm; the results indicate that the completion time of the related coflows decreases by 40% on average compared with other strategies, an improvement of at least 30% over the state of the art.
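To make the transformation in the abstract concrete, the following is a minimal sketch, not the paper's formulation: it poses a toy worker-to-switch assignment under a hypothetical quadratic latency model f_j(y) = c_j·y², piecewise-linearizes the separable convex objective into unit-capacity "slots" whose marginal costs are non-decreasing, and solves a single linear program over the resulting flow-style (totally unimodular) constraints; by Meyer's classical result, the LP vertex is already integral. The topology, latency coefficients, and all names are illustrative assumptions.

```python
# A minimal sketch of the described transformation, NOT the paper's
# implementation: the separable convex objective is linearized into unit
# "slots" with non-decreasing marginal costs, and the flow-style constraint
# matrix is totally unimodular, so one LP yields the integral optimum.
import numpy as np
from scipy.optimize import linprog

# Toy instance (illustrative assumptions): reach[i] lists the programmable
# switches worker i can route its update through.
reach = [[0, 1], [0], [0, 1], [1]]
W, S = len(reach), 2
c_sw = [1.0, 2.0]  # assumed per-switch latency coefficients

def f(j, y):
    """Assumed convex latency of switch j when y workers aggregate on it."""
    return c_sw[j] * y * y

# Variables: one x per allowed (worker, switch) pair, then W unit "slots"
# per switch that meter its aggregated load.
pairs = [(i, j) for i in range(W) for j in reach[i]]
nx = len(pairs)
n = nx + S * W

def zidx(j, k):
    return nx + j * W + k

# Slot k of switch j costs the marginal latency f(j, k+1) - f(j, k);
# convexity makes these marginals non-decreasing, so slots fill in order
# and the sum of used slot costs equals the original convex objective.
cost = np.zeros(n)
for j in range(S):
    for k in range(W):
        cost[zidx(j, k)] = f(j, k + 1) - f(j, k)

# Flow conservation: each worker picks exactly one reachable switch, and the
# load entering a switch leaves through its unit-capacity slots. This
# incidence-style matrix is totally unimodular.
A_eq = np.zeros((W + S, n))
b_eq = np.zeros(W + S)
for p, (i, j) in enumerate(pairs):
    A_eq[i, p] = 1.0
    A_eq[W + j, p] = 1.0
for j in range(S):
    for k in range(W):
        A_eq[W + j, zidx(j, k)] = -1.0
b_eq[:W] = 1.0

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * n, method="highs")
assignment = res.x[:nx].round().astype(int)  # integral at an LP vertex
for p, (i, j) in enumerate(pairs):
    if assignment[p]:
        print(f"worker {i} -> switch {j}")
print("total latency objective:", res.fun)  # 11.0 for this toy instance
```

On this toy instance the LP assigns workers 0, 1, and 2 to switch 0 and worker 3 to switch 1, matching the brute-force optimum of 11; no rounding or branch-and-bound is needed, which is the point of the totally unimodular substitute.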

Electronic Supplementary Material

JCST-2304-13342-Highlights.pdf (664.6 KB)

Journal of Computer Science and Technology
Pages 196-214
Cite this article:
Ji M-T, Jin Y-B, Qian Z-Z, et al. Orchestrating In-Network Aggregation for Distributed Machine Learning via In-Band Network Telemetry. Journal of Computer Science and Technology, 2025, 40(1): 196-214. https://doi.org/10.1007/s11390-024-3342-y