[11]
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI Blog, 2019, 1(8): 9.
[14]
Chen M, Radford A, Child R, Wu J, Jun H, Luan D, Sutskever I. Generative pretraining from pixels. In Proc. the 37th International Conference on Machine Learning, Jul. 2020, Article No. 158.
[17]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In Proc. the 31st International Conference on Neural Information Processing Systems, Dec. 2017, pp.6000–6010.
[23]
Assran M, Loizou N, Ballas N, Rabbat M. Stochastic gradient push for distributed deep learning. In Proc. the 36th International Conference on Machine Learning, Jun. 2019, pp.344–353.
[24]
Dean J, Corrado G S, Monga R et al. Large scale distributed deep networks. In Proc. the 25th International Conference on Neural Information Processing Systems, Dec. 2012, pp.1223–1231.
[25]
Shazeer N, Cheng Y L, Parmar N et al. Mesh-TensorFlow: Deep learning for supercomputers. In Proc. the 32nd International Conference on Neural Information Processing Systems, Dec. 2018, pp.10435–10444.
[26]
Jia Z H, Zaharia M, Aiken A. Beyond data and model parallelism for deep neural networks. In Proc. the 2019 SysML Conference, Apr. 2019, pp.1–13.
[29]
Huang Y P, Cheng Y L, Bapna A, Firat O, Chen M X, Chen D H, Lee H, Ngiam J, Le Q V, Wu Y H, Chen Z F. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Proc. the 33rd International Conference on Neural Information Processing Systems, Dec. 2019, Article No. 10.
[36]
Kumar S. Introduction to Parallel Programming. Cambridge University Press, 2022.
[37]
Abadi M, Barham P, Chen J N et al. TensorFlow: A system for large-scale machine learning. In Proc. the 12th USENIX Conference on Operating Systems Design and Implementation, Nov. 2016, pp.265–283.
[38]
Paszke A, Gross S, Massa F et al. PyTorch: An imperative style, high-performance deep learning library. In Proc. the 33rd International Conference on Neural Information Processing Systems, Dec. 2019, Article No. 721.
[40]
Li M, Andersen D G, Park J W, Smola A J, Ahmed A, Josifovski V, Long J, Shekita E J, Su B Y. Scaling distributed machine learning with the parameter server. In Proc. the 11th USENIX Conference on Operating Systems Design and Implementation, Oct. 2014, pp.583–598.
[43]
Alistarh D, Grubic D, Li J Z, Tomioka R, Vojnovic M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proc. the 31st International Conference on Neural Information Processing Systems, Dec. 2017, pp.1707–1718.
[44]
Jia Z H, Lin S N, Qi C R, Aiken A. Exploring hidden dimensions in parallelizing convolutional neural networks. In Proc. the 35th International Conference on Machine Learning, Jul. 2018, pp.2279–2288.
[49]
Mirhoseini A, Pham H, Le Q V, Steiner B, Larsen R, Zhou Y F, Kumar N, Norouzi M, Bengio S, Dean J. Device placement optimization with reinforcement learning. In Proc. the 34th International Conference on Machine Learning, Aug. 2017, pp.2430–2439.
[56]
Qi P, Wan X, Huang G, Lin M. Zero bubble pipeline parallelism. In Proc. the 12th International Conference on Learning Representations, May 2024.
[58]
Narayanan D, Phanishayee A, Shi K Y, Chen X, Zaharia M. Memory-efficient pipeline-parallel DNN training. In Proc. the 38th International Conference on Machine Learning, Jul. 2021, pp.7937–7947.
[62]
Yang P C, Zhang X M, Zhang W P, Yang M, Wei H. Group-based interleaved pipeline parallelism for large-scale DNN training. In Proc. the 10th International Conference on Learning Representations, Apr. 2022.
[64]
Sutskever I, Martens J, Dahl G, Hinton G. On the importance of initialization and momentum in deep learning. In Proc. the 30th International Conference on Machine Learning, Jun. 2013, pp.III-1139–III-1147.
[65]
Kingma D P, Ba J. Adam: A method for stochastic optimization. In Proc. the 3rd International Conference on Learning Representations, May 2015.
[67]
Zhang S X, Choromanska A, LeCun Y. Deep learning with elastic averaging SGD. In Proc. the 28th International Conference on Neural Information Processing Systems, Dec. 2015, pp.685–693.
[70]
Zheng L M, Li Z H, Zhang H, Zhuang Y H, Chen Z F, Huang Y P, Wang Y D, Xu Y Z, Zhuo D Y, Xing E P, Gonzalez J E, Stoica I. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In Proc. the 16th USENIX Symposium on Operating Systems Design and Implementation, Jul. 2022, pp.559–578.
[72]
Unger C, Jia Z H, Wu W et al. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In Proc. the 16th USENIX Symposium on Operating Systems Design and Implementation, Jul. 2022, pp.267–284.
[74]
Osawa K, Li S, Hoefler T. PipeFisher: Efficient training of large language models using pipelining and Fisher information matrices. In Proc. the 6th Conference on Machine Learning and Systems, May 2023.
[75]
Tarnawski J, Narayanan D, Phanishayee A. Piper: Multidimensional planner for DNN parallelization. In Proc. the 35th International Conference on Neural Information Processing Systems, Dec. 2021, Article No. 1902.
[78]
Kim T, Kim H, Yu G I, Chun B G. BPipe: Memory-balanced pipeline parallelism for training large language models. In Proc. the 40th International Conference on Machine Learning, Jul. 2023, Article No. 682.
[85]
Shi N C, Li D W, Hong M Y, Sun R Y. RMSprop converges with proper hyperparameter. In Proc. the 9th International Conference on Learning Representations, May 2021.
[87]
Zhuang J T, Tang T, Ding Y F et al. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. In Proc. the 34th International Conference on Neural Information Processing Systems, Dec. 2020, pp.18795–18806.
[92]
Park J H, Yun G, Yi C M, Nguyen N T, Lee S, Choi J, Noh S H, Choi Y R. HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism. In Proc. the 2020 USENIX Annual Technical Conference, Jul. 2020, Article No. 21.