Die-stacked dynamic random access memory (DRAM) caches are increasingly advocated to bridge the performance gap between the on-chip cache and the main memory. To fully realize their potential, it is essential to improve DRAM cache hit rate and lower its cache hit latency. In order to take advantage of the high hit-rate of set-association and the low hit latency of direct-mapping at the same time, we propose a partial direct-mapped die-stacked DRAM cache called P3DC. This design is motivated by a key observation, i.e., applying a unified mapping policy to different types of blocks cannot achieve a high cache hit rate and low hit latency simultaneously. To address this problem, P3DC classifies data blocks into leading blocks and following blocks, and places them at static positions and dynamic positions, respectively, in a unified set-associative structure. We also propose a replacement policy to balance the miss penalty and the temporal locality of different blocks. In addition, P3DC provides a policy to mitigate cache thrashing due to block type variations. Experimental results demonstrate that P3DC can reduce the cache hit latency by 20.5% while achieving a similar cache hit rate compared with typical set-associative caches. P3DC improves the instructions per cycle (IPC) by up to 66% (12% on average) compared with the state-of-the-art direct-mapped cache—BEAR, and by up to 19% (6% on average) compared with the tag-data decoupled set-associative cache—DEC-A8.
Many systems have been built to employ the delta-based iterative execution model to support iterative algorithms on distributed platforms by exploiting the sparse computational dependencies between data items of these iterative algorithms in a synchronous or asynchronous approach. However, for large-scale iterative algorithms, existing synchronous solutions suffer from slow convergence speed and load imbalance, because of the strict barrier between iterations; while existing asynchronous approaches induce excessive redundant communication and computation cost as a result of being barrier-free. In view of the performance trade-off between these two approaches, this paper designs an efficient execution manager, called Aiter-R, which can be integrated into existing delta-based iterative processing systems to efficiently support the execution of delta-based iterative algorithms, by using our proposed group-based iterative execution approach. It can efficiently and correctly explore the middle ground of the two extremes. A heuristic scheduling algorithm is further proposed to allow an iterative algorithm to adaptively choose its trade-off point so as to achieve the maximum efficiency. Experimental results show that Aiter-R strikes a good balance between the synchronous and asynchronous policies and outperforms state-of-the-art solutions. It reduces the execution time by up to 54.1% and 84.6% in comparison with existing asynchronous and the synchronous models, respectively.
Non-Volatile Main Memories (NVMMs) have recently emerged as a promising technology for future memory systems. Generally, NVMMs have many desirable properties such as high density, byte-addressability, non-volatility, low cost, and energy efficiency, at the expense of high write latency, high write power consumption, and limited write endurance. NVMMs have become a competitive alternative of Dynamic Random Access Memory (DRAM), and will fundamentally change the landscape of memory systems. They bring many research opportunities as well as challenges on system architectural designs, memory management in operating systems (OSes), and programming models for hybrid memory systems. In this article, we first revisit the landscape of emerging NVMM technologies, and then survey the state-of-the-art studies of NVMM technologies. We classify those studies with a taxonomy according to different dimensions such as memory architectures, data persistence, performance improvement, energy saving, and wear leveling. Second, to demonstrate the best practices in building NVMM systems, we introduce our recent work of hybrid memory system designs from the dimensions of architectures, systems, and applications. At last, we present our vision of future research directions of NVMMs and shed some light on design challenges and opportunities.