Distributed machine learning systems train models through iterative updates exchanged between parallel workers and a parameter server. To expedite these transmissions, in-network aggregation combines updates as packets are forwarded through programmable switches, reducing the traffic carried over bottleneck links. However, existing in-network aggregation schemes do not select the most suitable switches for varying worker distributions and fail to capture dynamic network status. Using the status obtained from in-band network telemetry, we select aggregation switches by solving an optimization problem that we formulate with the objective of minimizing transmission latency. Although the problem is a non-linear integer program, careful transformations yield an equivalent problem with totally unimodular constraints and a separable convex objective, which can be solved to obtain the integral optimum. We implement our in-network aggregation protocol and rebuild the in-band network telemetry protocol on real devices, namely a Barefoot Wedge100BF switch and Dell servers. Evaluations of our proposed AGG algorithm show that the completion time of the related coflows decreases by 40% on average compared with alternative strategies, an improvement of at least 30% over the state of the art.
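The transformation sketched in the abstract follows the classic observation of Meyer (1977) that a separable convex objective over totally unimodular constraints can be minimized exactly by a single linear program, using a lambda-representation of each convex term at integer breakpoints. The following is a minimal, self-contained sketch of that trick on a toy switch-selection instance; the sizes W and S, the latency() cost, and the overall formulation are illustrative assumptions for this example, not the paper's actual model.

import numpy as np
from scipy.optimize import linprog

W, S = 6, 3                  # workers, candidate aggregation switches
CAP = W                      # a switch's load ranges over breakpoints 0..CAP

def latency(s, load):
    # Assumed convex per-switch latency under a given aggregated load.
    return (1.0 + 0.2 * s) * load ** 2

# Variables: x[w,s] in [0,1] (worker-to-switch assignment) followed by
# lam[s,k] (lambda weights over integer load breakpoints k = 0..CAP).
nx = W * S
nl = S * (CAP + 1)

def xi(w, s):
    return w * S + s

def li(s, k):
    return nx + s * (CAP + 1) + k

# Cost lives only on the lambda variables: sum_k latency(s, k) * lam[s,k].
c = np.zeros(nx + nl)
for s in range(S):
    for k in range(CAP + 1):
        c[li(s, k)] = latency(s, k)

A_eq, b_eq = [], []
for w in range(W):           # each worker is assigned to exactly one switch
    row = np.zeros(nx + nl)
    row[[xi(w, s) for s in range(S)]] = 1.0
    A_eq.append(row)
    b_eq.append(1.0)
for s in range(S):           # link a switch's load to its breakpoints:
    row = np.zeros(nx + nl)  #   sum_w x[w,s] - sum_k k * lam[s,k] = 0
    for w in range(W):
        row[xi(w, s)] = 1.0
    for k in range(CAP + 1):
        row[li(s, k)] = -float(k)
    A_eq.append(row)
    b_eq.append(0.0)
for s in range(S):           # lambdas form a convex combination per switch
    row = np.zeros(nx + nl)
    for k in range(CAP + 1):
        row[li(s, k)] = 1.0
    A_eq.append(row)
    b_eq.append(1.0)

# Dual simplex returns a vertex solution; with the totally unimodular
# bipartite assignment structure and convex breakpoint costs, that
# vertex is integral, so no branch-and-bound is needed.
res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0.0, 1.0)] * (nx + nl), method="highs-ds")
assignment = res.x[:nx].reshape(W, S).round().astype(int)
print("minimum total latency:", res.fun)
print("worker-to-switch assignment:\n", assignment)

Because the piecewise-linear objective built from the lambda breakpoints agrees with the true convex objective at every integral load, the LP optimum coincides with the integer optimum, which is what makes the substitute problem in the abstract solvable without any integer programming machinery.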