Cache performance is a critical design constraint for modern many-core systems. Because the cache often behaves as a "black box", it is difficult to reason about cache behavior from software and to match running code to the underlying hardware. To better support code optimization, we need to understand and characterize the cache behavior. While cache performance characterization is heavily studied on traditional
Transcendental functions are important in various high-performance computing applications. Because these functions are time-consuming and the vector units on modern processors are becoming wider and more scalable, there is an increasing demand for developing and using vector transcendental functions in such performance-hungry applications. However, the performance of vector transcendental functions, as well as their accuracy, remains largely unexplored. To address this issue, we perform a comprehensive evaluation of two vector math libraries based on Single Instruction Multiple Data (SIMD) intrinsics on two ARMv8-compatible processors. We first design dedicated microbenchmarks that help us understand the performance behavior of vector transcendental functions. Then, we propose a piecewise, quantitative evaluation method with a set of metrics to quantify their performance and accuracy. By analyzing the experimental results, we find that vector transcendental functions achieve good speedups thanks to vectorization and algorithm optimization. Moreover, vector math libraries can replace scalar math libraries in many cases because of their improved performance and satisfactory accuracy. Despite this, the implementations of vector math libraries are still immature and need further optimization, and our evaluation reveals feasible optimization solutions for future vector math libraries.
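As an illustration of what a piecewise accuracy evaluation might look like, the sketch below splits an input range into pieces and reports the worst error per piece, measured in units in the last place (ULP) of single precision. The abstract does not say which metrics the authors use, so the ULP metric, the helper names (`to_float32`, `ulp32`, `piecewise_max_ulp_error`), and the sampling scheme are all assumptions made for this example, not the paper's method.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def ulp32(x):
    """Spacing between adjacent binary32 values near a finite x."""
    if x == 0.0:
        return 2.0 ** -149  # smallest positive subnormal binary32
    exponent = math.frexp(abs(x))[1]  # abs(x) = m * 2**exponent, 0.5 <= m < 1
    # Clamp the exponent so subnormal inputs share the fixed spacing 2**-149.
    return 2.0 ** (max(exponent, -125) - 24)


def piecewise_max_ulp_error(approx, reference, lo, hi, pieces=4, samples=1000):
    """Return (start, end, worst ULP error) for each equal-width piece.

    `approx` is the function under test and `reference` a higher-precision
    implementation; both are hypothetical stand-ins supplied by the caller.
    """
    width = (hi - lo) / pieces
    results = []
    for p in range(pieces):
        start = lo + p * width
        worst = 0.0
        for i in range(samples):
            x = to_float32(start + width * i / samples)
            ref = reference(x)
            err = abs(approx(x) - ref) / ulp32(ref)
            worst = max(worst, err)
        results.append((start, start + width, worst))
    return results
```

For example, `piecewise_max_ulp_error(lambda x: to_float32(math.exp(x)), math.exp, -1.0, 1.0)` evaluates a stand-in single-precision `exp` against the double-precision one. Reporting the error per piece rather than one global maximum makes it easier to see which input intervals an implementation handles poorly.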
This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that impact high-performance computing applications. We provide insights into the memory-relevant performance behaviours of the Phytium 2000+ system through micro-benchmarking. With the help of the well-known roofline model, we analyze the Phytium 2000+ system, taking both memory accesses and computations into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.
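For readers unfamiliar with the roofline model mentioned above, it bounds a kernel's attainable performance by the smaller of the machine's peak compute rate and the product of peak memory bandwidth and the kernel's arithmetic intensity (floating-point operations per byte moved). The following is a minimal sketch of that bound; the function names are chosen for this example, and the figures in the usage note are placeholders, not measurements of Phytium 2000+.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, arithmetic_intensity):
    """Roofline bound in GFLOP/s.

    peak_gflops          -- peak compute rate of the machine (GFLOP/s)
    peak_bandwidth_gbs   -- peak memory bandwidth (GB/s)
    arithmetic_intensity -- kernel FLOPs per byte of memory traffic
    """
    if arithmetic_intensity <= 0:
        raise ValueError("arithmetic intensity must be positive")
    memory_roof = peak_bandwidth_gbs * arithmetic_intensity
    return min(peak_gflops, memory_roof)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / peak_bandwidth_gbs
```

With placeholder values of 100 GFLOP/s peak compute and 50 GB/s peak bandwidth, a kernel at 1 FLOP/byte is memory-bound at `attainable_gflops(100.0, 50.0, 1.0)`, that is 50 GFLOP/s, while any kernel above the ridge point of 2 FLOP/byte is limited by the compute roof instead.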