Open Access | Just Accepted

Accelerating Distributed Training of Large Concurrent-Branch Models through Bidirectional Pipeline Coordination

Zan Zong1, Yuyang Chen2, Qi Zhang1, Daming Zhao1, Jianjiang Li3, Yijun Jing4, Jidong Zhai1 (✉)

1 Department of Computer Science and Technology, Tsinghua University, Beijing, China

2 Shanghai AI Laboratory, Shanghai, China

3 School of Computer & Communication Engineering, University of Science and Technology Beijing, China

4 School of Mathematics, Sun Yat-sen University, China


Abstract

Large models have been widely used in fields such as natural language processing and information retrieval. As large models have developed, not only has the parameter scale increased, but the model architecture has also become more complex. For example, multi-modal transformer-based models typically contain concurrent branches, which we denote as concurrent-branch models (CBMs). Many CBMs have grown to tens of billions of parameters and require distributed resources for training. Existing distributed training systems cannot fully handle this type of model architecture because of the interactions between branches. Inspired by the unbalanced resource usage of pipeline parallelism, we organize different branches with a fine-grained bidirectional pipeline schedule of communication and computation. However, improper coordination between branches leads to idle computation time and low training efficiency. In this paper, we present Flexpipe, a pipeline engine for concurrent-branch models. We first introduce branch-aware pipeline parallelism to make full use of the concurrent characteristics of the model architecture. Then, based on a multi-branch pipeline simulator, we propose an adaptive interaction coordinator that facilitates low-overhead branch interactions during distributed model training. We evaluate our approach on popular concurrent-branch models combined with modern training systems. Compared with Chimera, the experimental results show that our method improves the end-to-end training throughput by 20% on average.
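To make the bidirectional scheduling idea concrete, the toy Python sketch below (our illustration, not the paper's Flexpipe implementation) simulates two branches traversing the same pipeline stages in opposite directions, with a greedy scheduler that assigns at most one task per stage per step. All names and parameter values (NUM_STAGES, MICROBATCHES, stage_order, simulate) are illustrative assumptions.

```python
# Toy multi-branch pipeline simulator sketch (illustrative, not the paper's code):
# branch "A" flows through stages 0 -> S-1 while branch "B" flows S-1 -> 0,
# so the two branches tend to fill each other's pipeline bubbles.

NUM_STAGES = 4     # pipeline depth (assumed value)
MICROBATCHES = 4   # micro-batches per branch (assumed value)

def stage_order(branch):
    """Branch 'A' traverses stages 0..S-1; branch 'B' traverses S-1..0."""
    s = range(NUM_STAGES)
    return list(s) if branch == "A" else list(reversed(s))

def simulate():
    # progress[(branch, mb)] = how many stages this micro-batch has completed
    progress = {(b, m): 0 for b in "AB" for m in range(MICROBATCHES)}
    timeline = []
    while any(p < NUM_STAGES for p in progress.values()):
        busy = {}  # stage -> task label executed at this step
        # Prefer in-flight micro-batches (largest progress first) to keep
        # the pipeline draining; ties fall back to insertion order.
        for (branch, mb), p in sorted(progress.items(),
                                      key=lambda kv: kv[1], reverse=True):
            if p == NUM_STAGES:
                continue  # this micro-batch is done
            stage = stage_order(branch)[p]
            if stage not in busy:  # each stage runs one task per step
                busy[stage] = f"{branch}{mb}"
                progress[(branch, mb)] = p + 1
        timeline.append([busy.get(s, "idle") for s in range(NUM_STAGES)])
    return timeline

timeline = simulate()
for t, row in enumerate(timeline):
    print(f"step {t}: " + " | ".join(f"{w:>4}" for w in row))
idle = sum(row.count("idle") for row in timeline)
print(f"idle stage-steps: {idle} / {len(timeline) * NUM_STAGES}")
```

Running this shows that once both branches are in flight, the interior steps keep every stage busy, which echoes the bubble-filling intuition behind the bidirectional schedule; the coordination problem the paper addresses arises when branch interactions perturb such a schedule.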

Tsinghua Science and Technology
Cite this article:
Zong Z, Chen Y, Zhang Q, et al. Accelerating Distributed Training of Large Concurrent-Branch Models through Bidirectional Pipeline Coordination. Tsinghua Science and Technology, 2025, https://doi.org/10.26599/TST.2024.9010233