The growing demand for semiconductor device simulation poses a major challenge for large-scale electronic structure calculations. Among various methods, the linearly scaling three-dimensional fragment (LS3DF) method exhibits excellent scalability in large-scale simulations. Based on algorithmic and system-level optimizations, we propose a highly scalable and highly efficient implementation of LS3DF on the Sugon supercomputer, a domestic supercomputer equipped with deep computing units. In terms of algorithmic optimizations, the original all-band conjugate gradient algorithm is refined to achieve faster convergence, and mixed-precision computing is adopted to increase overall efficiency. In terms of system-level optimizations, the original two-layer parallel structure is replaced by a coarse-grained parallel method. Optimization strategies such as multi-stream execution, kernel fusion, and redundant computation removal are proposed to further increase utilization of the computational power provided by the heterogeneous machines. As a result, our optimized LS3DF can scale to a silicon system of 10 million atoms, attaining a peak performance of 34.8 PFLOPS (21.2% of the theoretical peak). All the improvements can be adapted to next-generation supercomputers for larger simulations.
Agile hardware development methodology has been widely adopted over the past decade. Despite the research progress, the industry still doubts its applicability, especially for the functional verification of complicated processor chips. Functional verification commonly employs a simulation-based method: the design under test is co-simulated with a reference model, and the consistency of their outcomes is checked given the same input stimuli. We observe limited collaboration and information exchange between the design and verification processes, which leads to significant inefficiencies when the conventional functional verification workflow is applied to agile development. In this paper, we propose workflow integration with collaborative task delegation and dynamic information exchange as the design principles to effectively address the challenges of functional verification under the agile development model. Based on workflow integration, we enhance the functional verification workflows with a series of novel methodologies and toolchains. The diff-rule based agile verification methodology (DRAV) reduces the overhead of building reference models by using runtime execution information from designs under test. We present DiffTest, the RISC-V implementation of DRAV, which adopts information probes to extract internal design behaviors for co-simulation and debugging. It further integrates two plugins: XFUZZ, for effective test generation guided by design coverage metrics, and LightSSS, for efficient fault analysis triggered by co-simulation mismatches. We present the integrated workflows for agile hardware development and demonstrate their effectiveness in designing and verifying RISC-V processors, with 33 functional bugs found in NutShell. We also illustrate the efficiency of the proposed toolchains with a case study on a functional bug in the L2 cache of XiangShan.
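The co-simulation check described in this abstract can be illustrated with a minimal sketch. It is not DiffTest or DRAV: `ToyCore`, its two-register state, and the injected bug are hypothetical stand-ins chosen only to show a design under test and a reference model being stepped in lockstep on the same stimuli and compared after every instruction.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCore:
    """Hypothetical accumulator machine standing in for a DUT or a reference model."""
    regs: dict = field(default_factory=lambda: {"acc": 0, "pc": 0})
    buggy: bool = False  # assumption: lets us inject a functional bug into the "DUT"

    def step(self, instr):
        """Execute one (op, value) instruction and return a copy of the visible state."""
        op, val = instr
        if op == "add":
            self.regs["acc"] += val
        elif op == "sub":
            # The injected bug: a buggy core silently drops the subtraction.
            if not self.buggy:
                self.regs["acc"] -= val
        self.regs["pc"] += 1
        return dict(self.regs)


def cosimulate(dut, ref, program):
    """Step both models in lockstep; return the index of the first mismatch, or None."""
    for i, instr in enumerate(program):
        if dut.step(instr) != ref.step(instr):
            return i
    return None


program = [("add", 3), ("sub", 1), ("add", 2)]
first_mismatch = cosimulate(ToyCore(buggy=True), ToyCore(), program)
```

Reporting the index of the first diverging instruction, rather than only a pass/fail result, mirrors the idea that a mismatch is the trigger for further fault analysis.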
Genomic sequence alignment is the most critical and time-consuming step in genomic analysis. Alignment algorithms generally follow a seed-and-extend model. Acceleration of the extension phase (e.g., the Smith-Waterman algorithm) has been well explored in computing-centric architectures on field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and graphics processing units (GPUs). The seeding phase, however, is even more critical than the extension phase, and it is memory-bound: it suffers from fine-grained random memory access and limited parallelism on conventional systems. In this paper, we argue that the processing-in-memory (PIM) concept could be a viable solution to these problems. This paper describes "PIM-Align", an application-driven near-data processing architecture for sequence alignment. To achieve performance proportional to memory capacity by taking advantage of 3D-stacked dynamic random access memory (DRAM) technology, we propose a lightweight message mechanism between different memory partitions and a specialized hardware prefetcher for the memory access patterns of sequence alignment. Our evaluation shows that the proposed architecture achieves speedups of 20x and 1820x over the best available ASIC implementation and the software running on a 32-thread CPU, respectively.
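The seed-and-extend model mentioned in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the PIM-Align design: the function names, the k-mer length, and the mismatch threshold are hypothetical, and the extension step is a simple mismatch count rather than Smith-Waterman.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(reference, read, k=4, max_mismatches=1):
    """Return sorted (start, mismatches) pairs for candidate alignments of read."""
    index = build_index(reference, k)
    hits = set()
    for offset in range(len(read) - k + 1):
        # Seeding: look up each k-mer of the read in the index.
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extension (simplified): count mismatches over the full read length.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(window, read))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


reference = "ACGTACGTTAGCCGATAGGCT"
exact = seed_and_extend(reference, "TAGCCGAT")
near = seed_and_extend(reference, "TAGCAGAT")
```

The index lookups in the seeding loop touch unrelated positions for consecutive k-mers, which is the fine-grained random access pattern that makes seeding memory-bound on conventional systems.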