Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+

Jian-Bin Fang; Xiang-Ke Liao; Chun Huang; De-Zun Dong

doi:10.1007/s11390-020-0741-6

Regular Paper

College of Computer, National University of Defense Technology, Changsha 410073, China

Abstract

This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that affect high-performance computing applications. Through micro-benchmarking, we provide insights into the memory-related performance behaviours of the Phytium 2000+ system. Using the well-known roofline model, we then analyze the system with both memory accesses and computations taken into account. Based on the knowledge gained from these micro-benchmarks, we evaluate two applications and use them to assess the capabilities of the Phytium 2000+ system. The results show that the ARMv8-based many-core system is capable of delivering high performance for a wide range of scientific kernels.
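As an illustration of the roofline analysis mentioned above, the following minimal sketch computes the attainable performance of a kernel as the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The function names and all numeric values are hypothetical placeholders chosen for this example; they are not measurements of the Phytium 2000+ system.

```python
def roofline_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Attainable GFLOP/s under the basic roofline model.

    peak_gflops   -- peak compute rate in GFLOP/s
    bandwidth_gbs -- sustained memory bandwidth in GB/s
    intensity     -- arithmetic intensity of the kernel in FLOP/byte
    """
    if peak_gflops <= 0 or bandwidth_gbs <= 0 or intensity < 0:
        raise ValueError("peak and bandwidth must be positive, intensity non-negative")
    # A kernel is limited either by memory traffic or by the compute peak.
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which a kernel becomes compute-bound."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    # Placeholder machine parameters, NOT measured Phytium 2000+ values.
    PEAK = 500.0       # GFLOP/s (assumed)
    BANDWIDTH = 100.0  # GB/s (assumed)

    for ai in (0.125, 1.0, 5.0, 16.0):
        bound = "memory" if ai < ridge_point(PEAK, BANDWIDTH) else "compute"
        print(f"AI={ai:6.3f} FLOP/byte -> {roofline_gflops(PEAK, BANDWIDTH, ai):7.1f} GFLOP/s ({bound}-bound)")
```

In this model, kernels with low arithmetic intensity, such as sparse matrix-vector multiplication, sit on the bandwidth-limited slope, while dense kernels such as matrix multiplication can approach the compute ceiling.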

Keywords

many-core architecture; memory-centric design; performance evaluation

Electronic Supplementary Material

jcst-36-1-33-Highlights.pdf (477.9 KB)

Journal of Computer Science and Technology

Volume 36, Issue 1, January 2021

Pages 33-43

DOI: 10.1007/s11390-020-0741-6

Cite this article:

Fang J-B, Liao X-K, Huang C, et al. Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+. Journal of Computer Science and Technology, 2021, 36(1): 33-43. https://doi.org/10.1007/s11390-020-0741-6