Open Access | Just Accepted

Accelerating Distributed Training of Large Concurrent-Branch Models through Bidirectional Pipeline Coordination

Zan Zong1, Yuyang Chen2, Qi Zhang1, Daming Zhao1, Jianjiang Li3, Yijun Jing4, Jidong Zhai1 (✉)

1 Department of Computer Science and Technology, Tsinghua University, Beijing, China

2 Shanghai AI Laboratory, Shanghai, China

3 School of Computer & Communication Engineering, University of Science and Technology Beijing, China

4 School of Mathematics, Sun Yat-sen University, China


Abstract

Large models have been widely used in fields such as natural language processing and information retrieval. As large models have developed, not only has the parameter scale increased, but the model architecture has also become more complex. For example, multi-modal transformer-based models typically contain concurrent branches, which we denote as concurrent-branch models (CBMs). Many CBMs have grown to tens of billions of parameters and require distributed resources for training. Existing distributed training systems cannot fully handle this type of model architecture because of the interactions between branches. Inspired by the unbalanced resource usage of pipeline parallelism, we organize different branches with a fine-grained bidirectional pipeline schedule of communication and computation. However, improper coordination between branches leads to idle computation time and low training efficiency. In this paper, we present Flexpipe, a pipeline engine for concurrent-branch models. We first introduce branch-aware pipeline parallelism to make full use of the concurrent characteristics of the model architecture. Then, based on a multi-branch pipeline simulator, we propose an adaptive interaction coordinator that facilitates low-overhead branch interactions during distributed model training. We evaluate our approach on popular concurrent-branch models combined with modern training systems. Compared with Chimera, the experimental results show that our method improves the end-to-end training throughput by 20% on average.
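To make the bidirectional scheduling idea concrete, the toy Python sketch below (our illustration, not the paper's Flexpipe implementation) simulates two branches traversing the same pipeline stages in opposite directions, with a greedy scheduler that assigns at most one task per stage per step. All names and parameter values (NUM_STAGES, MICROBATCHES, stage_order, simulate) are illustrative assumptions.

```python
# Toy multi-branch pipeline simulator sketch (illustrative, not the paper's code):
# branch "A" flows through stages 0 -> S-1 while branch "B" flows S-1 -> 0,
# so the two branches tend to fill each other's pipeline bubbles.

NUM_STAGES = 4     # pipeline depth (assumed value)
MICROBATCHES = 4   # micro-batches per branch (assumed value)

def stage_order(branch):
    """Branch 'A' traverses stages 0..S-1; branch 'B' traverses S-1..0."""
    s = range(NUM_STAGES)
    return list(s) if branch == "A" else list(reversed(s))

def simulate():
    # progress[(branch, mb)] = how many stages this micro-batch has completed
    progress = {(b, m): 0 for b in "AB" for m in range(MICROBATCHES)}
    timeline = []
    while any(p < NUM_STAGES for p in progress.values()):
        busy = {}  # stage -> task label executed at this step
        # Prefer in-flight micro-batches (largest progress first) to keep
        # the pipeline draining; ties fall back to insertion order.
        for (branch, mb), p in sorted(progress.items(),
                                      key=lambda kv: kv[1], reverse=True):
            if p == NUM_STAGES:
                continue  # this micro-batch is done
            stage = stage_order(branch)[p]
            if stage not in busy:  # each stage runs one task per step
                busy[stage] = f"{branch}{mb}"
                progress[(branch, mb)] = p + 1
        timeline.append([busy.get(s, "idle") for s in range(NUM_STAGES)])
    return timeline

timeline = simulate()
for t, row in enumerate(timeline):
    print(f"step {t}: " + " | ".join(f"{w:>4}" for w in row))
idle = sum(row.count("idle") for row in timeline)
print(f"idle stage-steps: {idle} / {len(timeline) * NUM_STAGES}")
```

Running this shows that once both branches are in flight, the interior steps keep every stage busy, which echoes the bubble-filling intuition behind the bidirectional schedule; the coordination problem the paper addresses arises when branch interactions perturb such a schedule.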

Tsinghua Science and Technology
Cite this article:
Zong Z, Chen Y, Zhang Q, et al. Accelerating Distributed Training of Large Concurrent-Branch Models through Bidirectional Pipeline Coordination. Tsinghua Science and Technology, 2025, https://doi.org/10.26599/TST.2024.9010233