[11]
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI Blog, 2019, 1(8): 9.
[14]
Chen M, Radford A, Child R, Wu J, Jun H, Luan D, Sutskever I. Generative pretraining from pixels. In Proc. the 37th International Conference on Machine Learning, Jul. 2020, Article No. 158.
[17]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In Proc. the 31st International Conference on Neural Information Processing Systems, Dec. 2017, pp.6000–6010.
[23]
Assran M, Loizou N, Ballas N, Rabbat M. Stochastic gradient push for distributed deep learning. In Proc. the 36th International Conference on Machine Learning, Jun. 2019, pp.344–353.
[24]
Dean J, Corrado G S, Monga R et al. Large scale distributed deep networks. In Proc. the 25th International Conference on Neural Information Processing Systems, Dec. 2012, pp.1223–1231.
[25]
Shazeer N, Cheng Y L, Parmar N et al. Mesh-TensorFlow: Deep learning for supercomputers. In Proc. the 32nd International Conference on Neural Information Processing Systems, Dec. 2018, pp.10435–10444.
[26]
Jia Z H, Zaharia M, Aiken A. Beyond data and model parallelism for deep neural networks. In Proc. the 2019 SysML Conference, Apr. 2019, pp.1–13.
[29]
Huang Y P, Cheng Y L, Bapna A, Firat O, Chen M X, Chen D H, Lee H, Ngiam J, Le Q V, Wu Y H, Chen Z F. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Proc. the 33rd International Conference on Neural Information Processing Systems, Dec. 2019, Article No. 10.
[36]
Kumar S. Introduction to Parallel Programming. Cambridge University Press, 2022.
[37]
Abadi M, Barham P, Chen J N et al. TensorFlow: A system for large-scale machine learning. In Proc. the 12th USENIX Conference on Operating Systems Design and Implementation, Nov. 2016, pp.265–283.
[38]
Paszke A, Gross S, Massa F et al. PyTorch: An imperative style, high-performance deep learning library. In Proc. the 33rd International Conference on Neural Information Processing Systems, Dec. 2019, Article No. 721.
[40]
Li M, Andersen D G, Park J W, Smola A J, Ahmed A, Josifovski V, Long J, Shekita E J, Su B Y. Scaling distributed machine learning with the parameter server. In Proc. the 11th USENIX Conference on Operating Systems Design and Implementation, Oct. 2014, pp.583–598.
[43]
Alistarh D, Grubic D, Li J Z, Tomioka R, Vojnovic M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proc. the 31st International Conference on Neural Information Processing Systems, Dec. 2017, pp.1707–1718.
[44]
Jia Z H, Lin S N, Qi C R, Aiken A. Exploring hidden dimensions in parallelizing convolutional neural networks. In Proc. the 35th International Conference on Machine Learning, Jul. 2018, pp.2279–2288.
[49]
Mirhoseini A, Pham H, Le Q V, Steiner B, Larsen R, Zhou Y F, Kumar N, Norouzi M, Bengio S, Dean J. Device placement optimization with reinforcement learning. In Proc. the 34th International Conference on Machine Learning, Aug. 2017, pp.2430–2439.
[56]
Qi P, Wan X, Huang G, Lin M. Zero bubble pipeline parallelism. In Proc. the 12th International Conference on Learning Representations, May 2024.
[58]
Narayanan D, Phanishayee A, Shi K Y, Chen X, Zaharia M. Memory-efficient pipeline-parallel DNN training. In Proc. the 38th International Conference on Machine Learning, Jul. 2021, pp.7937–7947.
[62]
Yang P C, Zhang X M, Zhang W P, Yang M, Wei H. Group-based interleaved pipeline parallelism for large-scale DNN training. In Proc. the 10th International Conference on Learning Representations, Apr. 2022.
[64]
Sutskever I, Martens J, Dahl G, Hinton G. On the importance of initialization and momentum in deep learning. In Proc. the 30th International Conference on Machine Learning, Jun. 2013, pp.III-1139–III-1147.
[65]
Kingma D P, Ba J. Adam: A method for stochastic optimization. In Proc. the 3rd International Conference on Learning Representations, May 2015.
[67]
Zhang S X, Choromanska A, LeCun Y. Deep learning with elastic averaging SGD. In Proc. the 28th International Conference on Neural Information Processing Systems, Dec. 2015, pp.685–693.
[70]
Zheng L M, Li Z H, Zhang H, Zhuang Y H, Chen Z F, Huang Y P, Wang Y D, Xu Y Z, Zhuo D Y, Xing E P, Gonzalez J E, Stoica I. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In Proc. the 16th USENIX Symposium on Operating Systems Design and Implementation, Jul. 2022, pp.559–578.
[72]
Unger C, Jia Z H, Wu W et al. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In Proc. the 16th USENIX Symposium on Operating Systems Design and Implementation, Jul. 2022, pp.267–284.
[74]
Osawa K, Li S, Hoefler T. PipeFisher: Efficient training of large language models using pipelining and Fisher information matrices. In Proc. the 6th Conference on Machine Learning and Systems, May 2023.
[75]
Tarnawski J, Narayanan D, Phanishayee A. Piper: Multidimensional planner for DNN parallelization. In Proc. the 35th International Conference on Neural Information Processing Systems, Dec. 2021, Article No. 1902.
[78]
Kim T, Kim H, Yu G I, Chun B G. BPipe: Memory-balanced pipeline parallelism for training large language models. In Proc. the 40th International Conference on Machine Learning, Jul. 2023, Article No. 682.
[85]
Shi N C, Li D W, Hong M Y, Sun R Y. RMSprop converges with proper hyperparameter. In Proc. the 9th International Conference on Learning Representations, May 2021.
[87]
Zhuang J T, Tang T, Ding Y F et al. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. In Proc. the 34th International Conference on Neural Information Processing Systems, Dec. 2020, pp.18795–18806.
[92]
Park J H, Yun G, Yi C M, Nguyen N T, Lee S, Choi J, Noh S H, Choi Y R. HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism. In Proc. the 2020 USENIX Annual Technical Conference, Jul. 2020, Article No. 21.