Deep learning has become the cornerstone of artificial intelligence and plays an increasingly important role in industry and everyday life. However, as the problems being tackled grow more complex, deep learning models grow correspondingly intricate, leading to a proliferation of large language models with enormous numbers of parameters. Pipeline model parallelism (PMP) has emerged as one of the mainstream approaches to the challenge of training such "big models". This paper presents a comprehensive review of PMP. It covers the basic concepts and main challenges of PMP, compares synchronous and asynchronous pipeline schedules, and discusses the main techniques for achieving load balance in both intra-node and inter-node training. Furthermore, the main techniques for optimizing computation, storage, and communication are presented, and potential research directions are discussed.
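To make the core mechanism concrete, the following is a minimal sketch of the dataflow in a synchronous (GPipe-style) pipeline schedule: the model is split into sequential stages, the mini-batch is divided into micro-batches, and each micro-batch flows through the stages in turn. The toy model, the two-stage split, and the micro-batch count are illustrative assumptions rather than details from the paper; a real PMP runtime would place each stage on its own device and overlap the stages' work on different micro-batches to shrink the pipeline bubble.

```python
# Minimal sketch of the micro-batch dataflow in synchronous
# (GPipe-style) pipeline model parallelism. The toy model, the
# two-stage split, and the micro-batch count are illustrative
# assumptions, not taken from the surveyed systems.
import torch
import torch.nn as nn

# A toy model partitioned into two sequential pipeline stages.
# In a real deployment each stage would live on its own accelerator.
stage0 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
stage1 = nn.Sequential(nn.Linear(64, 10))


def pipeline_forward(x: torch.Tensor, num_microbatches: int = 4) -> torch.Tensor:
    """Run a mini-batch through the stages as a sequence of micro-batches.

    This loop is sequential and only illustrates the dataflow; the point
    of PMP is that, with stages on separate devices, stage1 can process
    micro-batch i while stage0 already works on micro-batch i+1.
    """
    microbatches = torch.chunk(x, num_microbatches, dim=0)
    outputs = []
    for mb in microbatches:
        h = stage0(mb)   # stage 0 (device 0 in a multi-device setup)
        y = stage1(h)    # stage 1 (device 1), overlapped in a real schedule
        outputs.append(y)
    return torch.cat(outputs, dim=0)


if __name__ == "__main__":
    batch = torch.randn(16, 32)
    logits = pipeline_forward(batch)
    print(logits.shape)  # torch.Size([16, 10])
```

In an asynchronous schedule, by contrast, a stage starts the forward pass of the next micro-batch before the corresponding backward pass has returned, trading weight staleness for a smaller pipeline bubble; the survey compares these two families of schedules in detail.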