In this paper, we present a comprehensive overview of artificial intelligence (AI) computing systems for large language model (LLM) training. The rapid advancement of LLMs in recent years, coupled with the widespread adoption of models and applications such as BERT, ChatGPT, and DeepSeek, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only models, and briefly analyze their training and inference processes to emphasize their substantial demand for computational resources. These workloads depend heavily on AI-specific accelerators such as GPUs (graphics processing units), TPUs (tensor processing units), and MLUs (machine learning units). However, as the gap widens between the growing complexity of LLMs and the capabilities of current accelerators, it becomes essential to adopt heterogeneous computing systems optimized for distributed environments to meet the increasing computational and memory requirements of LLM training. We delve into the execution and scheduling of LLM algorithms, underlining the critical roles of distributed computing strategies, memory management enhancements, and computational efficiency improvements. This paper clarifies the complex interplay among algorithm design, hardware infrastructure, and software optimization, and provides an in-depth understanding of both the software and hardware infrastructure supporting LLM training, offering insights into the challenges and potential avenues for future development and deployment.