Graph processing is a vital component of many AI and big data applications. However, due to its poor locality and complex data access patterns, graph processing is also a notorious performance bottleneck in these applications. In this work, we propose to enhance graph processing applications by exploiting fine-grained memory access patterns with a dual-path architecture, built on top of existing software-based graph optimizations. We first identify that memory accesses to the offset, edge, and state arrays exhibit distinct locality and have distinct impacts on performance. We then introduce the Skyway architecture, which consists of two primary components: 1) a dedicated direct data path between the core and memory that transfers state array elements efficiently, and 2) data-type-aware fine-grained memory-side row buffers that serve both the newly designed direct data path and the regular memory-hierarchy data path. Skyway improves overall performance by reducing memory access interference and improving data access efficiency, with minimal hardware overhead. We evaluate Skyway on a set of diverse algorithms using large real-world graphs. On a simulated four-core system, Skyway improves performance by 23% on average over the best-performing graph-specialized hardware optimizations.
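To make the three access streams named in the abstract concrete, the following is a minimal sketch of a pull-style CSR traversal kernel. It is illustrative only, not the paper's implementation: the array names (offsets, edges, state) mirror the abstract's terminology, and the kernel is a generic neighbor-sum iteration chosen as an assumption to show why the offset array is read sequentially, the edge array is streamed, and the state array is accessed at effectively random indices, the pattern the dedicated direct data path targets.

```cpp
#include <cstdint>
#include <vector>

// Illustrative CSR pull-style iteration (not Skyway's code).
// offsets: |V|+1 entries, read sequentially (good locality).
// edges:   |E| entries, streamed over each neighbor list (good spatial locality).
// state:   |V| entries, indexed by neighbor IDs (fine-grained, poor locality).
void pull_iteration(const std::vector<uint64_t>& offsets,
                    const std::vector<uint32_t>& edges,
                    const std::vector<float>& state_old,
                    std::vector<float>& state_new) {
    const std::size_t num_vertices = state_new.size();
    for (std::size_t v = 0; v < num_vertices; ++v) {
        float acc = 0.0f;
        // Neighbor range of v comes from two consecutive offset entries.
        for (uint64_t e = offsets[v]; e < offsets[v + 1]; ++e) {
            const uint32_t u = edges[e];  // streaming access to the edge array
            acc += state_old[u];          // fine-grained access to the state array
        }
        state_new[v] = acc;
    }
}
```

Under this sketch's assumptions, each state-array access touches only a few bytes at an unpredictable address, so fetching a full cache line per access wastes bandwidth and pollutes the cache; this is the behavior that motivates routing state array elements over a separate fine-grained path.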