Large models have been widely used in natural language processing, information retrieval, and other fields. As large models have developed, not only has the parameter scale increased, but the model architecture has also become more complex. For example, multi-modal transformer-based models typically contain concurrent branches; we denote such a model as a concurrent branch model (CBM). Many CBMs have grown to tens of billions of parameters and require distributed resources for training. Existing distributed training systems cannot fully handle this type of model architecture because of the interactions between branches. Inspired by the unbalanced resource usage of pipeline parallelism, we prefer to organize different branches with a fine-grained bidirectional pipeline schedule of communication and computation. However, improper coordination between branches leads to idle computation time and low training efficiency. In this paper, we present Flexpipe, a pipeline engine for concurrent-branch models. We first introduce a branch-aware pipeline parallelism to make full use of the concurrent characteristics of the model architecture. Then, based on a multi-branch pipeline simulator, we propose an adaptive interaction coordinator, which facilitates low-overhead branch interactions during distributed model training. We evaluate our approach on popular concurrent branch models combined with modern training systems. The experimental results show that, compared with Chimera, our method improves the end-to-end training throughput by 20% on average.
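
To make the coordination problem concrete, the following sketch simulates two concurrent branches that must synchronize at a cross-branch interaction point after every micro-batch; the branch names, per-stage costs, and interaction cost are illustrative assumptions, not Flexpipe's actual scheduler.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    fwd: float   # forward time per micro-batch (ms), assumed value
    bwd: float   # backward time per micro-batch (ms), assumed value

def makespan(branch_a, branch_b, micro_batches, interact_cost=0.5):
    """Estimate end-to-end time when two concurrent branches must synchronize
    at a cross-branch interaction point after every micro-batch."""
    t_a = t_b = idle = 0.0
    for _ in range(micro_batches):
        # each branch advances through its own stages independently...
        t_a += sum(s.fwd + s.bwd for s in branch_a)
        t_b += sum(s.fwd + s.bwd for s in branch_b)
        # ...then the faster branch sits idle until the slower one arrives
        sync = max(t_a, t_b)
        idle += sync - min(t_a, t_b)
        t_a = t_b = sync + interact_cost
    return t_a, idle

if __name__ == "__main__":
    vision = [Stage(2.0, 4.0), Stage(2.5, 5.0)]   # hypothetical vision branch
    text = [Stage(1.5, 3.0), Stage(1.5, 3.0)]     # hypothetical text branch
    total, idle = makespan(vision, text, micro_batches=8)
    print(f"makespan {total:.1f} ms, idle time on the faster branch {idle:.1f} ms")
```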


As the computational demands driven by large model technologies continue to grow rapidly, leveraging GPU hardware to expedite parallel training has become a commonly used strategy. When the computational resources within a single cluster are insufficient for large-model training, the hybrid utilization of heterogeneous acceleration hardware has emerged as a promising technical solution. The utilization of heterogeneous acceleration hardware and the scheduling of diverse cloud resources have therefore attracted considerable interest. However, these computing resources are often geographically distributed. Due to the lack of awareness of heterogeneous devices and network topologies, existing parallel training frameworks struggle to effectively leverage mixed GPU resources across constrained networks. To boost the computing capability of connected heterogeneous clusters, we propose HGTrainer, an optimizer that plans heterogeneous parallel strategies across distributed clusters for large model training. HGTrainer adaptively saturates heterogeneous clusters by expanding the tunable parallelism space for heterogeneous accelerators while remaining aware of the relatively lower inter-cluster bandwidth. To achieve this goal, we formulate the model partitioning problem among heterogeneous hardware and introduce a hierarchical search algorithm to solve the optimization problem. In addition, a mixed-precision pipeline method reduces the cost of inter-cluster communication. We evaluate HGTrainer on heterogeneous connected clusters with popular large language models. The experimental results show that HGTrainer improves training throughput by 1.49× on average on the mixed heterogeneous cluster compared with the state-of-the-art Metis.
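
As a rough illustration of partitioning a model across clusters with different speeds and a slow inter-cluster link, the sketch below enumerates layer cut points under a toy cost model; the speeds, FLOP counts, and link cost are made-up values, and this is not HGTrainer's hierarchical search algorithm.

```python
def stage_time(layer_ids, flops_per_layer, cluster_speed):
    """Time for one pipeline stage: total layer FLOPs divided by cluster speed."""
    return sum(flops_per_layer[i] for i in layer_ids) / cluster_speed

def best_split(flops_per_layer, speed_a, speed_b, link_cost):
    """Try every cut point and keep the one minimizing the slower stage plus
    the activation transfer over the inter-cluster link."""
    n = len(flops_per_layer)
    best_cut, best_time = None, float("inf")
    for cut in range(1, n):
        t_a = stage_time(range(0, cut), flops_per_layer, speed_a)
        t_b = stage_time(range(cut, n), flops_per_layer, speed_b)
        total = max(t_a, t_b) + link_cost
        if total < best_time:
            best_cut, best_time = cut, total
    return best_cut, best_time

if __name__ == "__main__":
    flops = [4.0] * 8 + [8.0] * 8        # a model whose later layers are heavier
    cut, t = best_split(flops, speed_a=2.0, speed_b=1.0, link_cost=3.0)
    print(f"cut after layer {cut}, estimated per-iteration time {t:.1f}")
```
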
Unified programming models can effectively improve program portability on various heterogeneous high-performance computers. Existing unified programming models devote considerable effort to code portability but are still far from achieving good performance portability. In this paper, we present a preliminary design of a performance-portable unified programming model covering four aspects: programming language, programming abstraction, compilation optimization, and scheduling system. Specifically, domain-specific languages introduce domain knowledge to decouple the optimizations for different applications and architectures. The unified programming abstraction captures the common features of different architectures to support common optimizations. Multi-level compilation optimization enables comprehensive performance optimization based on multi-level intermediate representations. A resource-aware lightweight runtime scheduling system improves the resource utilization of heterogeneous computers. This is a perspective paper presenting our viewpoints on programming models for emerging heterogeneous systems.
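
A minimal way to picture the "single source, per-architecture tuning" idea behind performance portability is a kernel whose code is written once while backend-specific scheduling knobs are selected at launch time; the backends and tile sizes below are hypothetical and do not reflect the paper's proposed language or abstraction.

```python
BACKEND_TILES = {"gpu": 128, "many_core": 64, "cpu": 16}   # assumed tuning knobs

def saxpy(a, x, y, backend="cpu"):
    """Single-source kernel; only the scheduling parameter (tile size) changes
    across architectures, not the kernel body itself."""
    tile = BACKEND_TILES[backend]
    out = [0.0] * len(x)
    for start in range(0, len(x), tile):
        for i in range(start, min(start + tile, len(x))):
            out[i] = a * x[i] + y[i]
    return out

print(saxpy(2.0, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], backend="gpu"))
```
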
Supercomputers provide enormous computing power for large applications. Traditional supercomputers have mainly targeted scientific computing problems. However, other applications impose new requirements on both supercomputer software and hardware designs. The New Generation Sunway supercomputer has an inefficient memory allocator when running in dynamic mode. This study develops an efficient memory allocator, SWAlloc, that reduces the memory allocation time of the brain-scale pretrained model training framework BaGuaLu by up to 75,839 times. Evaluations using PARSEC also show that SWAlloc can speed up memory allocation by up to 51 times (36% on average). SWAlloc has been deployed on the New Generation Sunway supercomputer for use by various large applications, including SWPytorch and SWTensorFlow.
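
To illustrate why a specialized allocator can matter for workloads that repeatedly request similarly sized buffers, the sketch below compares naive allocation with a simple free-list pool; this is only an assumed illustration of the general idea, not SWAlloc's design.

```python
import time

class PoolAllocator:
    """Reuses freed buffers of the same size instead of asking the system
    allocator every time."""
    def __init__(self):
        self.free_lists = {}                 # size -> list of reusable buffers

    def alloc(self, size):
        pool = self.free_lists.setdefault(size, [])
        return pool.pop() if pool else bytearray(size)

    def free(self, size, buf):
        self.free_lists[size].append(buf)

def benchmark(n=2000, size=1 << 20):
    t0 = time.perf_counter()
    for _ in range(n):
        buf = bytearray(size)                # fresh 1 MiB allocation every time
    t1 = time.perf_counter()

    pool = PoolAllocator()
    for _ in range(n):
        buf = pool.alloc(size)
        pool.free(size, buf)                 # buffer goes back to the free list
    t2 = time.perf_counter()
    print(f"naive: {t1 - t0:.3f}s  pooled: {t2 - t1:.3f}s")

if __name__ == "__main__":
    benchmark()
```
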

The plethora of complex Artificial Intelligence (AI) algorithms and available High-Performance Computing (HPC) power stimulates the rapid development of AI components with heterogeneous designs. Consequently, the need for cross-stack performance benchmarking of AI-HPC systems has rapidly emerged. In particular, the de facto HPC benchmark, LINPACK, cannot reflect AI computing power and input/output performance without a representative workload. Current popular AI benchmarks, such as MLPerf, have a fixed problem size and therefore limited scalability. To address these issues, we propose an end-to-end benchmark suite utilizing automated machine learning, which not only represents real AI scenarios but is also auto-adaptively scalable to machines of various scales. We implement the algorithms in a highly parallel and flexible way to ensure efficiency and optimization potential on diverse systems with customizable configurations. We utilize Operations Per Second (OPS), measured in an analytical and systematic manner, as the major metric to quantify AI performance. We perform evaluations on various systems to verify the benchmark’s stability and scalability, from 4 nodes with 32 NVIDIA Tesla T4 GPUs (56.1 Tera-OPS measured) up to 512 nodes with 4096 Huawei Ascend 910 accelerators (194.53 Peta-OPS measured), and the results show near-linear weak scalability. With a flexible workload and a single metric, AIPerf can easily scale on and rank AI-HPC systems, providing a powerful benchmark suite for the coming supercomputing era.
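
A simplified view of an Operations-Per-Second style metric is to convert each node's processed samples into operations and sum the per-node rates over a common measurement window; the sketch below uses made-up per-sample FLOP counts and is not AIPerf's actual accounting.

```python
def node_ops(flops_per_sample, samples_processed, elapsed_seconds):
    """Operations-per-second contributed by one node over the measured window."""
    return flops_per_sample * samples_processed / elapsed_seconds

def cluster_ops(per_node_reports):
    """Cluster-level throughput: sum of per-node rates measured over the
    same window."""
    return sum(node_ops(*report) for report in per_node_reports)

if __name__ == "__main__":
    # e.g. 4 nodes, each processing 2,000 samples of a model assumed to cost
    # 1.2 TFLOPs per sample, within a 60-second window (all numbers made up)
    reports = [(1.2e12, 2000, 60.0)] * 4
    print(f"{cluster_ops(reports) / 1e12:.1f} Tera-OPS")
```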