Regular Paper

Accelerating Data Transfer in Dataflow Architectures Through a Look-Ahead Acknowledgment Mechanism

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Beijing Smartchip Microelectronics Technology Company Limited, Beijing 100000, China
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100190, China

Abstract

The dataflow architecture, characterized by the absence of redundant unified control logic, has been shown to outperform the control-flow architecture in both computational performance and power efficiency, especially for applications in high-performance computing (HPC). The high computational efficiency of dataflow systems is achieved by allowing program kernels to be activated simultaneously, so a proper acknowledgment mechanism is required to distinguish data that logically belongs to different contexts. Possible solutions include the tagged-token matching mechanism, in which data is sent before acknowledgments are received and retried after rejection, and the handshake mechanism, in which data is sent only after acknowledgments are received. However, both mechanisms suffer from inefficient data transfer and increased area cost, and the performance of the dataflow architecture depends heavily on the efficiency of data transfer. To optimize data transfer in existing dataflow architectures with a minimal increase in area and power cost, we propose a Look-Ahead Acknowledgment (LAA) mechanism. LAA accelerates the execution flow by speculatively acknowledging ahead without penalties. Our simulation analysis, using a handshake mechanism as the baseline, shows that LAA increases the average utilization of computational units by 23.9%, reduces the average execution time by 17.4%, and increases the average power efficiency of dataflow processors by 22.4%. Crucially, our approach increases the area and power consumption of the on-chip logic by less than 0.9%. These results suggest that Look-Ahead Acknowledgment is an effective improvement for data transfer in existing dataflow architectures.
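The abstract contrasts the handshake mechanism, in which data is sent only after an acknowledgment has returned, with the proposed Look-Ahead Acknowledgment, which issues acknowledgments speculatively so that transfers need not wait out the round trip. The sketch below illustrates that difference as a simple cycle-count comparison; it is not the paper's implementation, and the latency values, token counts, and function names are hypothetical assumptions introduced only for illustration.

```python
# Minimal illustrative sketch (hypothetical model, not the paper's design):
# compare the cycles needed to move N tokens between two processing elements
# under (a) a strict handshake, where each send waits for the previous
# token's acknowledgment to return, and (b) a look-ahead acknowledgment,
# where acks are issued ahead of consumption so sends can be pipelined.

LINK_LATENCY = 4   # assumed cycles for data or an ack to cross the network
N_TOKENS = 16      # assumed number of tokens to transfer

def handshake_cycles(n, lat):
    # Each token pays a full round trip: send (lat) + wait for ack (lat)
    # before the next token may be injected.
    return n * (2 * lat)

def look_ahead_cycles(n, lat):
    # Acks arrive ahead of need, so after the pipeline fills the producer
    # injects one token per cycle; only the first token and the final ack
    # pay the link latency.
    return lat + (n - 1) + lat

if __name__ == "__main__":
    hs = handshake_cycles(N_TOKENS, LINK_LATENCY)
    laa = look_ahead_cycles(N_TOKENS, LINK_LATENCY)
    print(f"handshake:  {hs} cycles")
    print(f"look-ahead: {laa} cycles ({100 * (hs - laa) / hs:.1f}% fewer)")
```

Under these assumptions the look-ahead variant hides the acknowledgment round trip behind streaming transfers, which is the intuition behind the utilization and execution-time improvements reported in the abstract.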

Electronic Supplementary Material

0555_ESM.pdf (162.7 KB)

Journal of Computer Science and Technology
Pages 942-959
Cite this article:
Feng Y-J, Li D-J, Tan X, et al. Accelerating Data Transfer in Dataflow Architectures Through a Look-Ahead Acknowledgment Mechanism. Journal of Computer Science and Technology, 2022, 37(4): 942-959. https://doi.org/10.1007/s11390-020-0555-6


Received: 15 April 2020
Revised: 29 October 2020
Accepted: 17 December 2020
Published: 25 July 2022
©Institute of Computing Technology, Chinese Academy of Sciences 2022