Regular Paper

Accelerating Data Transfer in Dataflow Architectures Through a Look-Ahead Acknowledgment Mechanism

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Beijing Smartchip Microelectronics Technology Company Limited, Beijing 100000, China
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100190, China

Abstract

The dataflow architecture, characterized by the absence of redundant unified control logic, has been shown to outperform the control-flow architecture in both computational performance and power efficiency, especially for applications in high-performance computing (HPC). The high computational efficiency of dataflow systems is achieved by allowing program kernels to be activated simultaneously, so a proper acknowledgment mechanism is required to distinguish data that logically belongs to different contexts. Possible solutions include the tagged-token matching mechanism, in which data is sent before acknowledgments are received and retried after rejection, and the handshake mechanism, in which data is sent only after acknowledgments are received. However, both mechanisms suffer from inefficient data transfer and increased area cost, and the performance of the dataflow architecture depends heavily on the efficiency of data transfer. To optimize data transfer in existing dataflow architectures with a minimal increase in area and power cost, we propose a Look-Ahead Acknowledgment (LAA) mechanism. LAA accelerates the execution flow by speculatively acknowledging ahead without penalties. Our simulation analysis, using a handshake mechanism as the baseline, shows that LAA increases the average utilization of computational units by 23.9%, reduces the average execution time by 17.4%, and increases the average power efficiency of dataflow processors by 22.4%. Crucially, our approach increases the area and power consumption of the on-chip logic by less than 0.9%. These results suggest that Look-Ahead Acknowledgment is an effective improvement for data transfer in existing dataflow architectures.
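The abstract contrasts the handshake mechanism, in which data is sent only after an acknowledgment has returned, with the proposed Look-Ahead Acknowledgment, which issues acknowledgments speculatively so that transfers need not wait out the round trip. The sketch below illustrates that difference as a simple cycle-count comparison; it is not the paper's implementation, and the latency values, token counts, and function names are hypothetical assumptions introduced only for illustration.

```python
# Minimal illustrative sketch (hypothetical model, not the paper's design):
# compare the cycles needed to move N tokens between two processing elements
# under (a) a strict handshake, where each send waits for the previous
# token's acknowledgment to return, and (b) a look-ahead acknowledgment,
# where acks are issued ahead of consumption so sends can be pipelined.

LINK_LATENCY = 4   # assumed cycles for data or an ack to cross the network
N_TOKENS = 16      # assumed number of tokens to transfer

def handshake_cycles(n, lat):
    # Each token pays a full round trip: send (lat) + wait for ack (lat)
    # before the next token may be injected.
    return n * (2 * lat)

def look_ahead_cycles(n, lat):
    # Acks arrive ahead of need, so after the pipeline fills the producer
    # injects one token per cycle; only the first token and the final ack
    # pay the link latency.
    return lat + (n - 1) + lat

if __name__ == "__main__":
    hs = handshake_cycles(N_TOKENS, LINK_LATENCY)
    laa = look_ahead_cycles(N_TOKENS, LINK_LATENCY)
    print(f"handshake:  {hs} cycles")
    print(f"look-ahead: {laa} cycles ({100 * (hs - laa) / hs:.1f}% fewer)")
```

Under these assumptions the look-ahead variant hides the acknowledgment round trip behind streaming transfers, which is the intuition behind the utilization and execution-time improvements reported in the abstract.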

Electronic Supplementary Material

0555_ESM.pdf (162.7 KB)

Journal of Computer Science and Technology
Pages 942-959
Cite this article:
Feng Y-J, Li D-J, Tan X, et al. Accelerating Data Transfer in Dataflow Architectures Through a Look-Ahead Acknowledgment Mechanism. Journal of Computer Science and Technology, 2022, 37(4): 942-959. https://doi.org/10.1007/s11390-020-0555-6


Received: 15 April 2020
Revised: 29 October 2020
Accepted: 17 December 2020
Published: 25 July 2022
©Institute of Computing Technology, Chinese Academy of Sciences 2022