Regular Paper

Orchestrating In-Network Aggregation for Distributed Machine Learning via In-Band Network Telemetry

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

Abstract

Distributed machine learning systems train models through iterative exchanges of updates between parallel workers and the parameter server. To expedite these transmissions, in-network aggregation merges updates as packets are forwarded through programmable switches, reducing the traffic on bottleneck links. However, existing in-network aggregation schemes do not select the most suitable switches for varying worker distributions and fail to capture the dynamic network status. Using the status derived from in-band network telemetry, we select the best aggregation switches by formulating an optimization problem whose objective is to minimize the transmission latency. Although the problem is a non-linear integer program, through careful transformations we obtain a substitute with totally unimodular constraints and a separable convex objective, which is solved to yield the integral optimum. We implement our in-network aggregation protocol and rebuild the in-band network telemetry protocol on real devices, i.e., a Barefoot Wedge100BF switch and Dell servers. We then evaluate our proposed AGG algorithm; the results indicate that the completion time of the related coflows decreases by 40% on average compared with other strategies, an improvement of at least 30% over the state of the art.
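To make the transformation in the abstract concrete, the following is a minimal sketch, not the paper's formulation: it poses a toy worker-to-switch assignment under a hypothetical quadratic latency model f_j(y) = c_j·y², piecewise-linearizes the separable convex objective into unit-capacity "slots" whose marginal costs are non-decreasing, and solves a single linear program over the resulting flow-style (totally unimodular) constraints; by Meyer's classical result, the LP vertex is already integral. The topology, latency coefficients, and all names are illustrative assumptions.

```python
# A minimal sketch of the described transformation, NOT the paper's
# implementation: the separable convex objective is linearized into unit
# "slots" with non-decreasing marginal costs, and the flow-style constraint
# matrix is totally unimodular, so one LP yields the integral optimum.
import numpy as np
from scipy.optimize import linprog

# Toy instance (illustrative assumptions): reach[i] lists the programmable
# switches worker i can route its update through.
reach = [[0, 1], [0], [0, 1], [1]]
W, S = len(reach), 2
c_sw = [1.0, 2.0]  # assumed per-switch latency coefficients

def f(j, y):
    """Assumed convex latency of switch j when y workers aggregate on it."""
    return c_sw[j] * y * y

# Variables: one x per allowed (worker, switch) pair, then W unit "slots"
# per switch that meter its aggregated load.
pairs = [(i, j) for i in range(W) for j in reach[i]]
nx = len(pairs)
n = nx + S * W

def zidx(j, k):
    return nx + j * W + k

# Slot k of switch j costs the marginal latency f(j, k+1) - f(j, k);
# convexity makes these marginals non-decreasing, so slots fill in order
# and the sum of used slot costs equals the original convex objective.
cost = np.zeros(n)
for j in range(S):
    for k in range(W):
        cost[zidx(j, k)] = f(j, k + 1) - f(j, k)

# Flow conservation: each worker picks exactly one reachable switch, and the
# load entering a switch leaves through its unit-capacity slots. This
# incidence-style matrix is totally unimodular.
A_eq = np.zeros((W + S, n))
b_eq = np.zeros(W + S)
for p, (i, j) in enumerate(pairs):
    A_eq[i, p] = 1.0
    A_eq[W + j, p] = 1.0
for j in range(S):
    for k in range(W):
        A_eq[W + j, zidx(j, k)] = -1.0
b_eq[:W] = 1.0

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * n, method="highs")
assignment = res.x[:nx].round().astype(int)  # integral at an LP vertex
for p, (i, j) in enumerate(pairs):
    if assignment[p]:
        print(f"worker {i} -> switch {j}")
print("total latency objective:", res.fun)  # 11.0 for this toy instance
```

On this toy instance the LP assigns workers 0, 1, and 2 to switch 0 and worker 3 to switch 1, matching the brute-force optimum of 11; no rounding or branch-and-bound is needed, which is the point of the totally unimodular substitute.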

Electronic Supplementary Material

JCST-2304-13342-Highlights.pdf (664.6 KB)

Journal of Computer Science and Technology
Pages 196-214
Cite this article:
Ji M-T, Jin Y-B, Qian Z-Z, et al. Orchestrating In-Network Aggregation for Distributed Machine Learning via In-Band Network Telemetry. Journal of Computer Science and Technology, 2025, 40(1): 196-214. https://doi.org/10.1007/s11390-024-3342-y