Regular Paper

Skyway: Accelerate Graph Applications with a Dual-Path Architecture and Fine-Grained Data Management

Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
University of Chinese Academy of Sciences, Beijing 100049, China
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100045, China
Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616, U.S.A.

Abstract

Graph processing is a vital component of many AI and big data applications. However, due to its poor locality and complex data access patterns, graph processing is also a known performance killer of AI and big data applications. In this work, we propose to enhance graph processing applications by leveraging fine-grained memory access patterns with a dual-path architecture on top of existing software-based graph optimizations. We first identify that memory accesses to the offset, edge, and state arrays have distinct locality and distinct impact on performance. We then introduce the Skyway architecture, which consists of two primary components: 1) a dedicated direct data path between the core and memory to transfer state array elements efficiently, and 2) data-type-aware fine-grained memory-side row buffer hardware serving both the newly designed direct data path and the regular memory hierarchy data path. Skyway improves overall performance by reducing memory access interference and improving data access efficiency with minimal overhead. We evaluate Skyway on a set of diverse algorithms using large real-world graphs. On a simulated four-core system, Skyway improves performance by 23% on average over the best-performing graph-specialized hardware optimizations.
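
To make the three-array access pattern concrete, the following is a minimal C++ sketch of a push-style traversal over a standard CSR (compressed sparse row) graph. The names (CSRGraph, contrib, state, push_iteration) are illustrative assumptions, not taken from the paper: the offset array is scanned sequentially, the edge array is streamed, and the state array is indexed by neighbor ID, which is the irregular access stream that a scheme like Skyway's direct data path targets.

    #include <cstdint>
    #include <vector>

    // Minimal CSR graph: offsets[] and edges[] are read sequentially or
    // streamed (good spatial locality), while the per-vertex state array is
    // indexed by destination vertex (irregular, poor locality).
    struct CSRGraph {
        std::vector<uint32_t> offsets; // size = num_vertices + 1
        std::vector<uint32_t> edges;   // size = num_edges, neighbor IDs
    };

    // One push-style iteration of a PageRank-like update. contrib[] holds each
    // vertex's outgoing contribution; state[] accumulates incoming values and
    // is the irregularly accessed per-vertex property array.
    void push_iteration(const CSRGraph& g,
                        const std::vector<double>& contrib,
                        std::vector<double>& state) {
        const uint32_t n = static_cast<uint32_t>(g.offsets.size()) - 1;
        for (uint32_t v = 0; v < n; ++v) {                          // offsets: sequential
            for (uint32_t i = g.offsets[v]; i < g.offsets[v + 1]; ++i) {
                uint32_t dst = g.edges[i];                          // edges: streaming
                state[dst] += contrib[v];                           // state: random access
            }
        }
    }

Because the state[dst] accesses dominate cache misses and gain little from large cache-line or row-buffer granularity, separating them from the offset/edge streams is what motivates a dedicated fine-grained path.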

Electronic Supplementary Material

Download File(s)
JCST-2210-12939-Highlights.pdf (172.6 KB)

Journal of Computer Science and Technology
Pages 871-894
Cite this article:
Zou M, Zhang M-Z, Wang R-J, et al. Skyway: Accelerate Graph Applications with a Dual-Path Architecture and Fine-Grained Data Management. Journal of Computer Science and Technology, 2024, 39(4): 871-894. https://doi.org/10.1007/s11390-023-2939-x

Received: 28 October 2022
Accepted: 14 October 2023
Published: 20 September 2024
© Institute of Computing Technology, Chinese Academy of Sciences 2024