Regular Paper

wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems

College of Computer Science, National University of Defense Technology, Changsha 410073, China
School of Computing, University of Leeds, Leeds, LS2 9JT, U.K.

Abstract

Cache performance is a critical design constraint for modern many-core systems. Because the cache largely operates as a "black box", it is difficult for software to reason about cache behavior and to match the running code to the underlying hardware. Supporting code optimization therefore requires understanding and characterizing that behavior. While cache performance characterization has been studied extensively on traditional x86 architectures, little work addresses the cache implementations of emerging ARMv8-based many-cores. This paper presents a comprehensive evaluation of the cache architecture design of three representative ARMv8 many-cores: Phytium 2000+, ThunderX2, and Kunpeng 920 (KP920). To this end, we develop wrBench, a micro-benchmark suite that measures the realized latency and bandwidth of caches at different levels of the memory hierarchy during core-to-core communication. Our evaluation reports inter-core latency and bandwidth at different cache levels and in different coherency states for the three processors, with the quantitative results presented in tables. By analyzing these data, we characterize the caches and coherency protocols of each processor. We also provide discussions and guidelines for optimizing memory access on ARMv8 many-cores.
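As an illustration of the kind of measurement the abstract describes, the sketch below shows pointer chasing, a common way to time dependent memory accesses. It is not taken from wrBench; the function names (`build_chain`, `chase`, `cycle_length`) and the working-set sizes are hypothetical, chosen only for this example. In Python the interpreter overhead dominates the timing, so the numbers do not reflect real cache latency. A real micro-benchmark would typically be written in C or assembly, pin threads to specific cores, and place the data in a chosen coherency state before timing.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list `nxt` where following i -> nxt[i] visits all n slots in one cycle.

    Uses Sattolo's algorithm, which produces a random single-cycle permutation,
    so the access order is irregular and every slot is touched exactly once per lap.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j in [0, i): choosing strictly below i forces one cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent accesses.

    Returns (final index, average seconds per access). Each access depends on
    the result of the previous one, so accesses cannot overlap.
    """
    idx = start
    t0 = time.perf_counter()
    for _ in range(steps):
        idx = nxt[idx]
    elapsed = time.perf_counter() - t0
    return idx, elapsed / steps


def cycle_length(nxt, start=0):
    """Count the hops needed to return to `start` (sanity check for the chain)."""
    idx, hops = nxt[start], 1
    while idx != start:
        idx = nxt[idx]
        hops += 1
    return hops


if __name__ == "__main__":
    # Hypothetical working-set sizes; a real benchmark would size these
    # to fit each cache level of the target processor.
    for n in (1 << 10, 1 << 14, 1 << 18):
        chain = build_chain(n)
        _, per_step = chase(chain, steps=200_000)
        print(f"{n:>8} slots: {per_step * 1e9:.1f} ns per dependent access")
```

The single-cycle construction matters because a chain with short sub-cycles would keep revisiting a small subset of slots and understate the working set being measured.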

Electronic Supplementary Material

JCST-2012-11251-Highlights.pdf (704.3 KB)


Journal of Computer Science and Technology
Pages 1323-1338
Cite this article:
Gao W-R, Fang J-B, Huang C, et al. wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems. Journal of Computer Science and Technology, 2023, 38(6): 1323-1338. https://doi.org/10.1007/s11390-021-1251-x


Received: 31 December 2020
Accepted: 14 November 2021
Published: 15 November 2023
© Institute of Computing Technology, Chinese Academy of Sciences 2023