Regular Paper

P3DC: Reducing DRAM Cache Hit Latency by Hybrid Mappings

Ye Chi (1,2,3,4,5), Ren-Tong Guo (1,2,3,4), Xiao-Fei Liao (1,2,3,4, corresponding author), Hai-Kun Liu (1,2,3,4), Jianhui Yue (6)

1. National Engineering Research Center for Big Data Technology and System, Wuhan 430074, China
2. Services Computing Technology and System Laboratory, Wuhan 430074, China
3. Cluster and Grid Computing Laboratory, Wuhan 430074, China
4. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
5. School of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China
6. Department of Computer Science, Michigan Technological University, Houghton 49931-1295, U.S.A.

Abstract

Die-stacked dynamic random access memory (DRAM) caches are increasingly advocated to bridge the performance gap between the on-chip cache and the main memory. To fully realize their potential, it is essential to improve the DRAM cache hit rate while lowering the cache hit latency. To combine the high hit rate of set-associative mapping with the low hit latency of direct mapping, we propose a partial direct-mapped die-stacked DRAM cache called P3DC. This design is motivated by a key observation: applying a unified mapping policy to different types of blocks cannot achieve a high cache hit rate and low hit latency simultaneously. To address this problem, P3DC classifies data blocks into leading blocks and following blocks, and places them at static positions and dynamic positions, respectively, in a unified set-associative structure. We also propose a replacement policy that balances the miss penalty and the temporal locality of different blocks. In addition, P3DC provides a policy to mitigate cache thrashing due to block type variations. Experimental results demonstrate that P3DC reduces the cache hit latency by 20.5% while achieving a hit rate similar to that of typical set-associative caches. P3DC improves the instructions per cycle (IPC) by up to 66% (12% on average) compared with the state-of-the-art direct-mapped cache BEAR, and by up to 19% (6% on average) compared with the tag-data decoupled set-associative cache DEC-A8.
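The core idea of the abstract — leading blocks at static, direct-mapped positions (one tag probe on lookup) and following blocks placed set-associatively among the remaining ways — can be illustrated with a minimal sketch. All names, the set/way sizes, the leading-block test, and the replacement choice below are illustrative assumptions, not the paper's actual design:

```python
# Hypothetical sketch of a P3DC-style hybrid mapping.
# Assumptions: NUM_WAYS-block pages, the first block of a page is "leading",
# way 0 is the static position; replacement is simplified to first-fit.

NUM_SETS = 4
NUM_WAYS = 4

class CacheSet:
    def __init__(self):
        self.tags = [None] * NUM_WAYS  # one tag per way

cache = [CacheSet() for _ in range(NUM_SETS)]

def set_index(block_addr):
    return block_addr % NUM_SETS

def is_leading(block_addr):
    # Assumed classification: first block of each page is the leading block.
    return block_addr % NUM_WAYS == 0

def lookup(block_addr):
    s = cache[set_index(block_addr)]
    if is_leading(block_addr):
        # Leading blocks are direct-mapped to a static way:
        # a single tag probe decides hit/miss (low hit latency).
        return 0 if s.tags[0] == block_addr else None
    # Following blocks occupy dynamic positions among the remaining
    # ways, so all of them are probed (higher hit rate, more probes).
    for way in range(1, NUM_WAYS):
        if s.tags[way] == block_addr:
            return way
    return None

def install(block_addr):
    s = cache[set_index(block_addr)]
    if is_leading(block_addr):
        s.tags[0] = block_addr  # static position
    else:
        # Simplified replacement: first empty following way, else way 1.
        victim = 1
        for way in range(1, NUM_WAYS):
            if s.tags[way] is None:
                victim = way
                break
        s.tags[victim] = block_addr
```

A real design would add the paper's replacement policy balancing miss penalty against temporal locality, and its anti-thrashing policy for blocks whose type changes; this sketch only shows why leading-block hits need a single probe while following-block hits need an associative search.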

Electronic Supplementary Material

JCST-2206-12561-Highlights.pdf (546.4 KB)

Journal of Computer Science and Technology
Pages 1341-1360
Cite this article:
Chi Y, Guo R-T, Liao X-F, et al. P3DC: Reducing DRAM Cache Hit Latency by Hybrid Mappings. Journal of Computer Science and Technology, 2024, 39(6): 1341-1360. https://doi.org/10.1007/s11390-023-2561-y