AI Chat Paper
Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.
{{lang === 'zh_CN' ? '文章概述' : 'Summary'}}
{{lang === 'en_US' ? '中' : 'Eng'}}
Chat more with AI
PDF (1.3 MB)
Collect
Submit Manuscript AI Chat Paper
Show Outline
Outline
Show full outline
Hide outline
Outline
Show full outline
Hide outline
Open Access

Exploitation of Locality for Energy Efficiency for Breadth First Search in Fine-Grain Execution Models

Department of Electrical and Computer Engineering, University of Delaware, Newark, DE 19711, USA
Show Author Information

Abstract

In the upcoming exa-scale era, the exploitation of data locality in parallel programs is very important because it benefits both program performance and energy efficiency. However, this is a hard topic for graph algorithms such as the Breadth First Search (BFS) due to the irregular data access patterns. This study analyzes the exploitation of data locality in the BFS and its impact on the energy efficiency with the Codelet fine-grain dataflow-inspired execution model. The Codelet Model more efficiently exploits data locality than the OpenMP-like execution models which traditionally focus on coarse-grain parallelism inside loops. A BFS algorithm is then given to exploit the locality between two loop iterations that belong to two different loops (inter-loop locality). This kind of locality can be exploited by the Codelet Model but not by traditional coarse-grain execution models like OpenMP. Tests were performed on fsim which is a simulation platform developed by Intel for the Ubiquitous High Performance Computing (UHPC) project to design future exa-scale architectures. The results show that this BFS algorithm saves up to 7% of the dynamic energy for memory accesses compared to a BFS implementation based on OpenMP loop scheduling.

References

[1]
P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, ExaScale computing study: Technology challenges in achieving exaScale systems, DARPA IPTO, Air Force Research Labs, Tech. Rep., Sep. 2008.
[2]
J. Torrellas, Architectures for extreme-scale computing, Computer, vol. 42, no. 11, pp. 28-35, 2009.
[4]
J. Cuvillo, W. Zhu, Z. Hu, and G. R. Gao, Tiny threads: A thread virtual machine for the cyclops64 cellular architecture, presented at the Fifth Workshop on Massively Parallel Processing, in Conjuction with 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), Denver, CO, USA, 2005.
[5]
R. Knauerhase, R. Cledat, and J. Teller, For extreme parallelism, your os is sooooo last-millennium, in Proc. of the 4th USENIX Conference on Hot Topics in Parallelism, Berkeley, CA, USA: USENIX Association, 2012, pp. 3.
[6]
N. Satish, C. Kim, J. Chhugani, and P. Dubey, Large-scale energy-efficient graph traversal: A path to efficient data-intensive supercomputing, in Proc. of the International Conference on High Performance Computing, Networking, Storage and Analysis, Los Alamitos, CA, USA, 2012, pp. 14:1-14:11.
[7]
The graph 500 list, http://www.graph500.org/, 2012.
[8]
D. P. Scarpazza, O. Villa, and F. Petrini, Efficient breadth-first search on the cell/be processor, IEEE Trans. Parallel Distrib. Syst., vol. 19, no. 10, pp. 1381-1395, 2008.
[9]
S. Zuckerman, J. Suetterlein, R. Knauerhase, and G. R. Gao, Using a “codelet” program execution model for exascale machines: Position paper, in Proc. of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era, New York, NY, USA, 2011, pp. 64-69.
[10]
DARPA-BAA-10-37, UHPC: Ubiquitous high performance computing, http://www.darpa.mil/Our_Work/MTO/Programs/Ubiquitous_High_Performance_Computing_(UHPC).aspx, 2012.
[11]
A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, Measurement and analysis of online social networks, in Proc. of the 7th ACM SIGCOMM Conference on Internet Measurement, New York, NY, USA, 2007, pp. 29-42.
[12]
A. Sud, E. Andersen, S. Curtis, M. Lin, and D. Manocha, Real-time path planning for virtual agents in dynamic environments, in ACM SIGGRAPH 2008 Classes, New York, NY, USA, 2008, pp. 55:1-55:9.
[13]
B. Awerbuch and R. Gallager, A new distributed algorithm to find breadth first search trees, Information Theory, IEEE Transactions on, vol. 33, no. 3, pp. 315-322, May 1987.
[14]
G. Y. Ananth, V. Kumar, and P. Pardalos, Parallel processing of discrete optimization problems, in In Encyclopedia of Microcomputers, Marcel Dekker Inc, 1993, pp. 129-147.
[15]
A. Yoo, E. Chow, K. Henderson, W. McLendon, B. Hendrickson, and U. Catalyurek, A scalable distributed parallel breadth-first search algorithm on bluegene, in Proc. of the 2005 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 2005, pp. 25.
[16]
D. A. Bader and K. Madduri, Designing multithreaded algorithms for breadth-first search and st-connectivity on the cray mta-2, in Proc. of the 2006 International Conference on Parallel Processing, Washington, DC, USA, 2006, pp. 523-530.
[17]
T. St. John, J. B. Dennis, and G. R. Gao, Massively parallel breadth first search using a tree-structured memory model, in Proc. of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores, New York, NY, USA, 2012, pp. 115-123.
[18]
V. Agarwal, F. Petrini, D. Pasetto, and D. A. Bader, Scalable graph exploration on multicore processors, in Proc. of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, Washington, DC, USA, 2010, pp. 1-11.
[19]
J. Chhugani, N. Satish, C. Kim, J. Sewall, and P. Dubey, Fast and efficient graph traversal algorithm for cpus: Maximizing single-node efficiency, in Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, 2012, pp. 378-389. .
[20]
A. Buluç and K. Madduri, Parallel breadth-first search on distributed memory systems, in Proc. of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, New York, NY, USA, 2011, pp. 65:1-65:12.
[21]
K. Ueno and T. Suzumura, Highly scalable graph search for the graph500 benchmark, in Proc. of the 21st International Symposium on High-Performance Parallel and Distributed Computing, New York, NY, USA, 2012, pp. 149-160.
Tsinghua Science and Technology
Pages 636-646
Cite this article:
Chen C, Koliai S, Gao G. Exploitation of Locality for Energy Efficiency for Breadth First Search in Fine-Grain Execution Models. Tsinghua Science and Technology, 2013, 18(6): 636-646. https://doi.org/10.1109/TST.2013.6678909

485

Views

11

Downloads

2

Crossref

N/A

Web of Science

4

Scopus

0

CSCD

Altmetrics

Received: 18 January 2013
Revised: 02 April 2013
Accepted: 20 April 2013
Published: 06 December 2013
© The author(s) 2013
Return