Review

The Memory-Bounded Speedup Model and Its Impacts in Computing

Department of Computer Science, Illinois Institute of Technology, Chicago 60616, U.S.A.

Abstract

With the surge of big data applications and the worsening of the memory-wall problem, the memory system, rather than the computing unit, has become the commonly recognized major concern of computing. This “memory-centric” understanding, however, had a humble beginning. More than three decades ago, the memory-bounded speedup model was the first model to recognize memory as a bound on computing, providing a general bound of speedup and a computing-memory trade-off formulation. The memory-bounded model was well received even then: it was quickly introduced in several advanced computer architecture and parallel computing textbooks of the 1990s as a must-know for scalable computing, including Prof. Kai Hwang’s book “Scalable Parallel Computing”, in which he presented the memory-bounded speedup model as Sun-Ni’s Law, alongside Amdahl’s Law and Gustafson’s Law. Through the years, the impacts of this model have grown far beyond parallel processing and into the fundamentals of computing. In this article, we revisit the memory-bounded speedup model and discuss its progress and impacts in depth, to make a unique contribution to this special issue, to stimulate new solutions for big data applications, and to promote data-centric thinking and rethinking.
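For readers unfamiliar with the three laws named in the abstract, their standard formulations can be summarized as follows (a sketch using the notation common in the literature: $f$ is the parallelizable fraction of the workload, $p$ the number of processors, and $g(p)$ the factor by which the parallel workload grows as memory capacity scales with $p$):

```latex
% Amdahl's Law: fixed problem size
S_{\mathrm{Amdahl}}(p) = \frac{1}{(1-f) + f/p}

% Gustafson's Law: fixed execution time (scaled speedup)
S_{\mathrm{Gustafson}}(p) = (1-f) + f \cdot p

% Sun-Ni's Law: memory-bounded speedup
S_{\mathrm{Sun\text{-}Ni}}(p) = \frac{(1-f) + f \cdot g(p)}{(1-f) + f \cdot g(p)/p}
```

Note that Sun-Ni’s Law subsumes the other two as special cases: setting $g(p) = 1$ (the workload is fixed regardless of available memory) recovers Amdahl’s Law, while $g(p) = p$ (the workload grows linearly with memory) recovers Gustafson’s Law.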

Electronic Supplementary Material

JCST-2210-12911-Highlights.pdf (447.3 KB)

References

[1]
Wulf W A, McKee S A. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 1995, 23(1): 20-24. DOI: 10.1145/216585.216588.
[2]
Sun X H, Ni L M. Scalable problems and memory-bounded speedup. Journal of Parallel and Distributed Computing, 1993, 19(1): 27-37. DOI: 10.1006/jpdc.1993.1087.
[3]
Sun X H, Ni L M. Another view on parallel speedup. In Proc. the 1990 ACM/IEEE Conference on Supercomputing, Nov. 1990, pp.324-333. DOI: 10.1109/SUPERC.1990.130037.
[4]
Amdahl G M. Validity of the single processor approach to achieving large scale computing capabilities. In Proc. the Spring Joint Computer Conference, Apr. 1967, pp.483-485. DOI: 10.1145/1465482.1465560.
[5]
Gustafson J L. Reevaluating Amdahl’s law. Communications of the ACM, 1988, 31(5): 532-533. DOI: 10.1145/42411.42415.
[6]
Bashe C J, Johnson L R, Palmer J H, Pugh E W. IBM’s Early Computers. MIT Press, 1986.
[7]
Sun X H, Chen Y. Reevaluating Amdahl’s law in the multicore era. Journal of Parallel and Distributed Computing, 2010, 70(2): 183-188. DOI: 10.1016/j.jpdc.2009.05.002.
[8]
Pan C Y, Naeemi A. System-level optimization and benchmarking of graphene PN junction logic system based on empirical CPI model. In Proc. the IEEE International Conference on IC Design & Technology, Jun. 2012. DOI: 10.1109/ICICDT.2012.6232850.
[9]
Kogge P M. Hardware Evolution Trends of Extreme Scale Computing. Technical Report, University of Notre Dame, South Bend, 2011.
[10]
Hennessy J L, Patterson D A. Computer Architecture: A Quantitative Approach (6th edition). Elsevier, 2017.
[11]
Liu Y H, Sun X H. LPM: A systematic methodology for concurrent data access pattern optimization from a matching perspective. IEEE Trans. Parallel and Distributed Systems, 2019, 30(11): 2478-2493. DOI: 10.1109/TPDS.2019.2912573.
[12]
Lo Y J, Williams S, Straalen B V, Ligocki T J, Cordery M J, Wright N J, Hall M W, Oliker L. Roofline model toolkit: A practical tool for architectural and program analysis. In Proc. the 5th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Nov. 2014, pp.129-148. DOI: 10.1007/978-3-319-17248-4_7.
[13]
Saini S, Chang J, Jin H Q. Performance evaluation of the Intel Sandy Bridge based NASA Pleiades using scientific and engineering applications. In Proc. the 4th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Nov. 2013, pp.25-51. DOI: 10.1007/978-3-319-10214-6_2.
[14]
Sun X H, Gustafson J L. Toward a better parallel performance metric. Parallel Computing, 1991, 17(10/11): 1093-1109. DOI: 10.1016/S0167-8191(05)80028-6.
[15]
Kumar V, Singh V. Scalability of parallel algorithms for the all-pairs shortest-path problem. Journal of Parallel and Distributed Computing, 1991, 13(2): 124-138. DOI: 10.1016/0743-7315(91)90083-L.
[16]
Kumar V, Grama A, Gupta A, Karypis G. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin-Cummings, 1994.
[17]
Sun X H, Chen Y, Wu M. Scalability of heterogeneous computing. In Proc. the International Conference on Parallel Processing (ICPP’05), Jun. 2005, pp.557-564. DOI: 10.1109/ICPP.2005.69.
[18]
Sun X H, Rover D T. Scalability of parallel algorithm-machine combinations. IEEE Trans. Parallel and Distributed Systems, 1994, 5(6): 599-613. DOI: 10.1109/71.285606.
[19]
Sun X H, Pantano M, Fahringer T. Integrated range comparison for data-parallel compilation systems. IEEE Trans. Parallel and Distributed Systems, 1999, 10(5): 448-458. DOI: 10.1109/71.770134.
[20]
Sun X H. Scalability versus execution time in scalable systems. Journal of Parallel and Distributed Computing, 2002, 62(2): 173-192. DOI: 10.1006/jpdc.2001.1773.
[21]
Hill M D, Marty M R. Amdahl’s law in the multicore era. Computer, 2008, 41(7): 33-38. DOI: 10.1109/MC.2008.209.
[22]
Sun X H, Chen Y, Byna S. Scalable computing in the multicore era. In Proc. the 2008 International Symposium on Parallel Architectures, Algorithms and Programming, Sept. 2008.
[23]
Dwork C, Goldberg A, Naor M. On memory-bound functions for fighting spam. In Proc. the 23rd Annual International Cryptology Conference, Aug. 2003, pp.426-444. DOI: 10.1007/978-3-540-45146-4_25.
[24]
Abadi M, Burrows M, Manasse M, Wobber T. Moderately hard, memory-bound functions. ACM Trans. Internet Technology, 2005, 5(2): 299-327. DOI: 10.1145/1064340.1064341.
[25]
Hart P E, Nilsson N J, Raphael B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Systems Science and Cybernetics, 1968, 4(2): 100-107. DOI: 10.1109/TSSC.1968.300136.
[26]
Korf R E. Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence, 1985, 27(1): 97-109. DOI: 10.1016/0004-3702(85)90084-0.
[27]
Korf R E, Reid M, Edelkamp S. Time complexity of iterative-deepening-A*. Artificial Intelligence, 2001, 129(1/2): 199-218. DOI: 10.1016/S0004-3702(01)00094-7.
[28]
Russell S. Efficient memory-bounded search methods. In Proc. the 10th European Conference on Artificial Intelligence, Aug. 1992.
[29]
Lovinger J, Zhang X Q. Enhanced simplified memory-bounded a star (SMA*+). In Proc. the 3rd Global Conference on Artificial Intelligence, Oct. 2017, pp.202-212. DOI: 10.29007/v7zc.
[30]
Seuken S, Zilberstein S. Memory-bounded dynamic programming for DEC-POMDPs. In Proc. the 20th International Joint Conference on Artificial Intelligence, Jan. 2007, pp.2009-2015.
[31]
Seuken S, Zilberstein S. Improved memory-bounded dynamic programming for decentralized POMDPs. arXiv: 1206.5295, 2012. https://arxiv.org/abs/1206.5295, Dec. 2022.
[32]
Chen Z Y, Zhang W X, Deng Y C, Chen D D, Li Q. RMB-DPOP: Refining MB-DPOP by reducing redundant inferences. arXiv: 2002.10641, 2020. https://doi.org/10.48550/arXiv.2002.10641, Dec. 2022.
[33]
Brito I, Meseguer P. Improving DPOP with function filtering. In Proc. the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, May 2010, pp.141-148.
[34]
Petcu A, Faltings B. ODPOP: An algorithm for open/distributed constraint optimization. In Proc. the 21st National Conference on Artificial Intelligence, Jul. 2006, pp.703-708.
[35]
Petcu A, Faltings B. A hybrid of inference and local search for distributed combinatorial optimization. In Proc. the IEEE/WIC/ACM International Conference on Intelligent Agent Technology (IAT’07), Nov. 2007, pp.342-348. DOI: 10.1109/IAT.2007.12.
[36]
Petcu A, Faltings B. MB-DPOP: A new memory-bounded algorithm for distributed optimization. In Proc. the 20th International Joint Conference on Artificial Intelligence, Jan. 2007, pp.1452-1457.
[37]
Williams S W. Auto-tuning performance on multicore computers [Ph.D. Thesis]. University of California, Berkeley, 2008.
[38]
Williams S, Waterman A, Patterson D. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 2009, 52(4): 65-76. DOI: 10.1145/1498765.1498785.
[39]
Lu X Y, Wang R J, Sun X H. APAC: An accurate and adaptive prefetch framework with concurrent memory access analysis. In Proc. the 38th IEEE International Conference on Computer Design (ICCD), Oct. 2020, pp.222-229. DOI: 10.1109/ICCD50377.2020.00048.
[40]
Lu X Y, Wang R J, Sun X H. Premier: A concurrency-aware pseudo-partitioning framework for shared last-level cache. In Proc. the 39th IEEE International Conference on Computer Design (ICCD), Oct. 2021, pp.391-394. DOI: 10.1109/ICCD53106.2021.00068.
[41]
Liu J, Espina P, Sun X H. A study on modeling and optimization of memory systems. Journal of Computer Science and Technology, 2021, 36(1): 71-89. DOI: 10.1007/s11390-021-0771-8.
[42]
Glew A. MLP yes! ILP no. In Proc. ASPLOS Wild and Crazy Idea Session, Oct. 1998.
[43]
Qureshi M K, Lynch D N, Mutlu O, Patt Y N. A case for MLP-aware cache replacement. In Proc. the 33rd International Symposium on Computer Architecture (ISCA’06), Jun. 2006, pp.167-178. DOI: 10.1109/ISCA.2006.5.
[44]
Sun X H, Wang D W. Concurrent average memory access time. Computer, 2014, 47(5): 74-80. DOI: 10.1109/MC.2013.227.
[45]
Najafi H, Lu X, Liu J, Sun X H. A generalized model for modern hierarchical memory system. In Proc. Winter Simulation Conference (WSC), Dec. 2022.
[46]
Lu X, Wang R, Sun X H. CARE: A concurrency-aware enhanced lightweight cache management framework. In Proc. the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA), Feb. 25–Mar. 1, 2023.
[47]
Yan L, Zhang M Z, Wang R J, Chen X M, Zou X Q, Lu X Y, Han Y H, Sun X H. CoPIM: A concurrency-aware PIM workload offloading architecture for graph applications. In Proc. IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Jul. 2021. DOI: 10.1109/ISLPED52811.2021.9502483.
[48]
Zhang N, Jiang C T, Sun X H, Song S L. Evaluating GPGPU memory performance through the C-AMAT model. In Proc. the Workshop on Memory Centric Programming for HPC, Nov. 2017, pp.35-39. DOI: 10.1145/3145617.3158214.
[49]
Kannan S, Gavrilovska A, Schwan K, Milojicic D, Talwar V. Using active NVRAM for I/O staging. In Proc. the 2nd International Workshop on Petascal Data Analytics: Challenges and Opportunities, Nov. 2011, pp.15-22. DOI: 10.1145/2110205.2110209.
[50]
Caulfield A M, Grupp L M, Swanson S. Gordon: Using flash memory to build fast, power-efficient clusters for data-intensive applications. ACM SIGPLAN Notices, 2009, 44(3): 217-228. DOI: 10.1145/1508284.1508270.
[51]
Reed D A, Dongarra J. Exascale computing and big data. Communications of the ACM, 2015, 58(7): 56-68. DOI: 10.1145/2699414.
[52]
Shalf J, Dosanjh S, Morrison J. Exascale computing technology challenges. In Proc. the 9th International Conference on High Performance Computing for Computational Science, Jun. 2010. DOI: 10.1007/978-3-642-19328-6_1.
[53]
Kougkas A, Devarajan H, Sun X H. Hermes: A heterogeneous-aware multi-tiered distributed I/O buffering system. In Proc. the 27th International Symposium on High-Performance Parallel and Distributed Computing, Jun. 2018, pp.219-230. DOI: 10.1145/3208040.3208059.
[54]
Kougkas A, Devarajan H, Sun X H. I/O acceleration via multi-tiered data buffering and prefetching. Journal of Computer Science and Technology, 2020, 35(1): 92-120. DOI: 10.1007/s11390-020-9781-1.
[55]
Tissenbaum M, Sheldon J, Abelson H. From computational thinking to computational action. Communications of the ACM, 2019, 62(3): 34-36. DOI: 10.1145/3265747.
[56]
Liu Y H, Sun X H, Wang Y, Bao Y G. HCDA: From computational thinking to a generalized thinking paradigm. Communications of the ACM, 2021, 64(5): 66-75. DOI: 10.1145/3418291.
[57]
Owens J D, Houston M, Luebke D, Green S, Stone J E, Phillips J C. GPU computing. Proceedings of the IEEE, 2008, 96(5): 879-899. DOI: 10.1109/JPROC.2008.917757.
[58]
Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107-113. DOI: 10.1145/1327452.1327492.
[59]
Momose H, Kaneko T, Asai T. Systems and circuits for AI chips and their trends. Japanese Journal of Applied Physics, 2020, 59(5): 050502. DOI: 10.35848/1347-4065/ab839f.
[60]
Singh G, Alser M, Cali D S, Diamantopoulos D, Gómez-Luna J, Corporaal H, Mutlu O. FPGA-based near-memory acceleration of modern data-intensive applications. IEEE Micro, 2021, 41(4): 39-48. DOI: 10.1109/MM.2021.3088396.
[61]
Choi Y K, Santillana C, Shen Y J, Darwiche A, Cong J. FPGA acceleration of probabilistic sentential decision diagrams with high-level synthesis. ACM Trans. Reconfigurable Technology and Systems, 2022. DOI: 10.1145/3561514.
[62]
Ghose S, Boroumand A, Kim J S, Gómez-Luna J, Mutlu O. Processing-in-memory: A workload-driven perspective. IBM Journal of Research and Development, 2019, 63(6): Article No. 3. DOI: 10.1147/JRD.2019.2934048.
[63]
Ghiasi N M, Park J, Mustafa H, Kim J, Olgun A, Gollwitzer A, Cali D S, Firtina C, Mao H Y, Alserr N A, Ausavarungnirun R, Vijaykumar N, Alser M, Mutlu O. GenStore: A high-performance in-storage processing system for genome sequence analysis. In Proc. the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Feb. 2022, pp.635-654. DOI: 10.1145/3503222.3507702.
[64]
Mutlu O. Intelligent architectures for intelligent computing systems. In Proc. the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Feb. 2021, pp.318-323. DOI: 10.23919/DATE51398.2021.9474073.
[65]
Sun X H, Liu Y H. Utilizing concurrency: A new theory for memory wall. In Proc. the 29th International Workshop on Languages and Compilers for Parallel Computing, Sept. 2016, pp.18-23. DOI: 10.1007/978-3-319-52709-3_2.
[66]
Kougkas A, Devarajan H, Lofstead J, Sun X H. LABIOS: A distributed label-based I/O system. In Proc. the 28th International Symposium on High-Performance Parallel and Distributed Computing, Jun. 2019, pp.13-24. DOI: 10.1145/3307681.3325405.
[67]
Logan L, Garcia J C, Lofstead J, Sun X H, Kougkas A. LabStor: A modular and extensible platform for developing high-performance, customized I/O stacks in userspace. In Proc. the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’22), Nov. 2022, pp.309-323.
[68]
Hwang K, Xu Z W. Scalable Parallel Computing: Technology, Architecture, Programming. McGraw-Hill, 1998.
[69]
Hwang K. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, 1993.
Journal of Computer Science and Technology
Pages 64-79
Cite this article:
Sun X-H, Lu X. The Memory-Bounded Speedup Model and Its Impacts in Computing. Journal of Computer Science and Technology, 2023, 38(1): 64-79. https://doi.org/10.1007/s11390-022-2911-1


Received: 17 October 2022
Revised: 12 November 2022
Accepted: 01 December 2022
Published: 28 February 2023
© Institute of Computing Technology, Chinese Academy of Sciences 2023