wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems

Wan-Rong Gao; Jian-Bin Fang; Chun Huang; Chuan-Fu Xu; Zheng Wang

doi:10.1007/s11390-021-1251-x


Regular Paper


Wan-Rong Gao¹, Jian-Bin Fang¹, Chun Huang¹, Chuan-Fu Xu¹, Zheng Wang²

¹ College of Computer Science, National University of Defense Technology, Changsha 410073, China

² School of Computing, University of Leeds, Leeds, LS2 9JT, U.K.


Abstract

Cache performance is a critical design constraint for modern many-core systems. Because the cache often works as a "black box", it is difficult for software developers to reason about cache behavior and to match running software to the underlying hardware. To better support code optimization, we need to understand and characterize this behavior. While cache performance characterization has been studied extensively on traditional x86 architectures, there is little work on understanding the cache implementations of emerging ARMv8-based many-cores. This paper presents a comprehensive study that evaluates the cache architecture design of three representative ARMv8 many-cores: Phytium 2000+, ThunderX2, and Kunpeng 920 (KP920). To this end, we develop wrBench, a micro-benchmark suite that measures the achieved latency and bandwidth of caches at different levels of the memory hierarchy during core-to-core communication. Our evaluation reports inter-core latency and bandwidth for different cache levels and coherency states on the three processors, with the quantitative performance data presented in tables. By analyzing these data, we infer the characteristics of the caches and coherency protocols of each processor. The paper also provides discussions and guidelines for optimizing memory access on ARMv8 many-cores.
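The abstract does not show how wrBench measures latency. As a rough illustration only, the sketch below uses pointer chasing, a common micro-benchmarking technique in which every load depends on the previous one, so the average time per load approximates the access latency of whichever cache level holds the working set. It is single-threaded and is not the authors' implementation; measuring core-to-core latency in a given coherency state would also need a second, pinned thread to place the cache lines first, which is omitted here. The names `build_chain`, `chase`, `ns_per_load`, and the 64-byte `LINE_BYTES` value are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define LINE_BYTES 64  /* assumed cache-line size; check your platform */

/* Link `slots` entries, spaced `stride` elements apart, into one random
 * cycle using Sattolo's algorithm. buf[p] holds the position of the next
 * entry to visit, so one lap of the chain touches every entry once.
 * rand() is not perfectly uniform, but any j in [0, i-1] keeps the
 * result a single cycle. */
static void build_chain(size_t *buf, size_t slots, size_t stride, unsigned seed)
{
    assert(stride > 0);
    for (size_t i = 0; i < slots; i++)
        buf[i * stride] = i * stride;
    if (slots < 2)
        return;
    srand(seed);
    for (size_t i = slots - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;          /* j in [0, i-1] */
        size_t tmp = buf[i * stride];
        buf[i * stride] = buf[j * stride];
        buf[j * stride] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads and return the final
 * position, which the caller must keep so the loop is not optimized away. */
static size_t chase(const size_t *buf, size_t start, size_t steps)
{
    size_t pos = start;
    for (size_t s = 0; s < steps; s++)
        pos = buf[pos];
    return pos;
}

/* Average wall-clock nanoseconds per dependent load over `steps` loads. */
static double ns_per_load(const size_t *buf, size_t steps, size_t *sink)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *sink = chase(buf, 0, steps);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

To use it, allocate a buffer of `slots * stride` elements with `stride = LINE_BYTES / sizeof(size_t)`, so that each entry sits on its own cache line, call `build_chain`, run one warm-up lap with `chase`, and then call `ns_per_load` with a step count much larger than `slots`. Choosing `slots * LINE_BYTES` to fit within one cache level but not the level above it targets that level. The random order is there to defeat hardware prefetchers, which would otherwise hide the latency of a sequential walk.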

Keywords

ARMv8; many-core; cache architecture; microbenchmark; core-to-core communication

Electronic Supplementary Material


JCST-2012-11251-Highlights.pdf (704.3 KB)


Journal of Computer Science and Technology

Volume 38, Issue 6, November 2023

Pages 1323-1338

DOI: 10.1007/s11390-021-1251-x

Cite this article:

Gao W-R, Fang J-B, Huang C, et al. wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems. Journal of Computer Science and Technology, 2023, 38(6): 1323-1338. https://doi.org/10.1007/s11390-021-1251-x
