Regular Paper

P3DC: Reducing DRAM Cache Hit Latency by Hybrid Mappings

Ye Chi (1,2,3,4,5), Ren-Tong Guo (1,2,3,4), Xiao-Fei Liao (1,2,3,4, corresponding author), Hai-Kun Liu (1,2,3,4), Jianhui Yue (6)

1. National Engineering Research Center for Big Data Technology and System, Wuhan 430074, China
2. Services Computing Technology and System Laboratory, Wuhan 430074, China
3. Cluster and Grid Computing Laboratory, Wuhan 430074, China
4. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
5. School of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China
6. Department of Computer Science, Michigan Technological University, Houghton 49931-1295, U.S.A.

Abstract

Die-stacked dynamic random access memory (DRAM) caches are increasingly advocated to bridge the performance gap between the on-chip cache and the main memory. To fully realize their potential, it is essential to improve the DRAM cache hit rate while lowering the cache hit latency. To combine the high hit rate of set-associative mapping with the low hit latency of direct mapping, we propose a partial direct-mapped die-stacked DRAM cache called P3DC. This design is motivated by a key observation: applying a unified mapping policy to different types of blocks cannot achieve a high cache hit rate and low hit latency simultaneously. To address this problem, P3DC classifies data blocks into leading blocks and following blocks, and places them at static positions and dynamic positions, respectively, in a unified set-associative structure. We also propose a replacement policy that balances the miss penalty and the temporal locality of different blocks. In addition, P3DC provides a policy to mitigate cache thrashing due to block type variations. Experimental results demonstrate that P3DC reduces the cache hit latency by 20.5% while achieving a hit rate similar to that of typical set-associative caches. P3DC improves the instructions per cycle (IPC) by up to 66% (12% on average) compared with the state-of-the-art direct-mapped cache BEAR, and by up to 19% (6% on average) compared with the tag-data decoupled set-associative cache DEC-A8.
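The core idea of the abstract — leading blocks at static, direct-mapped positions (one tag probe on lookup) and following blocks placed set-associatively among the remaining ways — can be illustrated with a minimal sketch. All names, the set/way sizes, the leading-block test, and the replacement choice below are illustrative assumptions, not the paper's actual design:

```python
# Hypothetical sketch of a P3DC-style hybrid mapping.
# Assumptions: NUM_WAYS-block pages, the first block of a page is "leading",
# way 0 is the static position; replacement is simplified to first-fit.

NUM_SETS = 4
NUM_WAYS = 4

class CacheSet:
    def __init__(self):
        self.tags = [None] * NUM_WAYS  # one tag per way

cache = [CacheSet() for _ in range(NUM_SETS)]

def set_index(block_addr):
    return block_addr % NUM_SETS

def is_leading(block_addr):
    # Assumed classification: first block of each page is the leading block.
    return block_addr % NUM_WAYS == 0

def lookup(block_addr):
    s = cache[set_index(block_addr)]
    if is_leading(block_addr):
        # Leading blocks are direct-mapped to a static way:
        # a single tag probe decides hit/miss (low hit latency).
        return 0 if s.tags[0] == block_addr else None
    # Following blocks occupy dynamic positions among the remaining
    # ways, so all of them are probed (higher hit rate, more probes).
    for way in range(1, NUM_WAYS):
        if s.tags[way] == block_addr:
            return way
    return None

def install(block_addr):
    s = cache[set_index(block_addr)]
    if is_leading(block_addr):
        s.tags[0] = block_addr  # static position
    else:
        # Simplified replacement: first empty following way, else way 1.
        victim = 1
        for way in range(1, NUM_WAYS):
            if s.tags[way] is None:
                victim = way
                break
        s.tags[victim] = block_addr
```

A real design would add the paper's replacement policy balancing miss penalty against temporal locality, and its anti-thrashing policy for blocks whose type changes; this sketch only shows why leading-block hits need a single probe while following-block hits need an associative search.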

Electronic Supplementary Material

JCST-2206-12561-Highlights.pdf (546.4 KB)

Journal of Computer Science and Technology
Pages 1341-1360
Cite this article:
Chi Y, Guo R-T, Liao X-F, et al. P3DC: Reducing DRAM Cache Hit Latency by Hybrid Mappings. Journal of Computer Science and Technology, 2024, 39(6): 1341-1360. https://doi.org/10.1007/s11390-023-2561-y