[2]
Qian Y, Li X, Ihara S, Zeng L, Kaiser J, Süß T, Brinkmann A. A configurable rule based classful token bucket filter network request scheduler for the Lustre file system. In Proc. the 2017 International Conference for High Performance Computing, November 2017, Article No. 6.
[3]
Rajachandrasekar R, Moody A, Mohror K, Panda D K. A 1 PB/s file system to checkpoint three million MPI tasks. In Proc. the 22nd International Symposium on High-Performance Parallel and Distributed Computing, June 2013, pp.143-154.
[4]
Schroeder B, Lagisetty R, Merchant A. Flash reliability in production: The expected and the unexpected. In Proc. the 14th USENIX Conference on File and Storage Technologies, February 2016, pp.67-80.
[5]
Meza J, Wu Q, Kumar S, Mutlu O. A large-scale study of flash memory failures in the field. In Proc. the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, June 2015, pp.177-190.
[6]
Narayanan I, Wang D, Jeon M, Sharma B, Caulfield L, Sivasubramaniam A, Cutler B, Liu J, Khessib B M, Vaid K. SSD failures in datacenters: What? When? and Why? In Proc. the 9th ACM International Systems and Storage Conference, June 2016, Article No. 7.
[7]
Welch B, Noer G. Optimizing a hybrid SSD/HDD HPC storage system based on file size distributions. In Proc. the 29th IEEE Symposium on Mass Storage Systems and Technologies, May 2013, Article No. 29.
[8]
Liu N, Cope J, Carns P H, Carothers C D, Ross R B, Grider G, Crume A, Maltzahn C. On the role of burst buffers in leadership-class storage systems. In Proc. the 28th IEEE Symposium on Mass Storage Systems and Technologies, April 2012, Article No. 5.
[9]
Qian Y, Li X, Ihara S, Dilger A, Thomaz C, Wang S, Cheng W, Li C, Zeng L, Wang F, Feng D, Süß T, Brinkmann A. LPCC: Hierarchical persistent client caching for Lustre. In Proc. the 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2019.
[10]
Vef M A, Moti N, Süß T, Tocci T, Nou R, Miranda A, Cortes T, Brinkmann A. GekkoFS — A temporary distributed file system for HPC applications. In Proc. the 2018 IEEE International Conference on Cluster Computing, September 2018, pp.319-324.
[11]
Wang T, Mohror K, Moody A, Sato K, Yu W. An ephemeral burst-buffer file system for scientific applications. In Proc. the 2016 International Conference for High Performance Computing, November 2016, pp.807-818.
[17]
Zhang Z, Barbary K, Nothaft F A, Sparks E R, Zahn O, Franklin M J, Patterson D A, Perlmutter S. Scientific computing meets big data technology: An astronomy use case. In Proc. the 2015 IEEE International Conference on Big Data, October 2015, pp.918-927.
[18]
Shvachko K, Kuang H, Radia S, Chansler R. The Hadoop distributed file system. In Proc. the 26th IEEE Symposium on Mass Storage Systems and Technologies, May 2010, Article No. 9.
[25]
Fox G C, Qiu J, Jha S et al. Big data, simulations and HPC convergence. In Lecture Notes in Computer Science 10044, Rabl T, Nambiar R, Baru C, Bhandarkar M, Poess M, Pyne S (eds.), Springer-Verlag, 2015, pp.3-17.
[26]
Wasi-ur-Rahman M, Lu X, Islam N S, Rajachandrasekar R, Panda D K. High-performance design of YARN MapReduce on modern HPC clusters with Lustre and RDMA. In Proc. the 2015 IEEE International Parallel and Distributed Processing Symposium, May 2015, pp.291-300.
[28]
Philp I R. Software failures and the road to a petaflop machine. In Proc. the 1st Workshop on High Performance Computing Reliability Issues, February 2005.
[29]
Petrini F. Scaling to thousands of processors with Buffered Coscheduling. In Proc. the 2002 Scaling to New Heights Workshop, May 2002.
[30]
Congiu G, Narasimhamurthy S, Süß T, Brinkmann A. Improving collective I/O performance using non-volatile memory devices. In Proc. the 2016 IEEE International Conference on Cluster Computing, September 2016, pp.120-129.
[31]
Moody A, Bronevetsky G, Mohror K, de Supinski B R. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proc. the 2010 Conference on High Performance Computing Networking, Storage and Analysis, November 2010, Article No. 22.
[32]
Islam T Z, Mohror K, Bagchi S, Moody A, de Supinski B R, Eigenmann R. McrEngine: A scalable checkpointing system using data-aware aggregation and compression. In Proc. the 2012 Conference on High Performance Computing Networking, Storage and Analysis, November 2012, Article No. 17.
[33]
Kaiser J, Gad R, Süß T, Padua F, Nagel L, Brinkmann A. Deduplication potential of HPC applications’ checkpoints. In Proc. the 2016 IEEE International Conference on Cluster Computing, September 2016, pp.413-422.
[34]
Zhu Y, Chowdhury F, Fu H, Moody A, Mohror K, Sato K, Yu W. Entropy-aware I/O pipelining for large-scale deep learning on HPC systems. In Proc. the 26th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, September 2018, pp.145-156.
[37]
Kurth T, Treichler S, Romero J et al. Exascale deep learning for climate analytics. In Proc. the 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2018, Article No. 51.
[39]
Chen D, Eisley N, Heidelberger P, Senger R M, Sugawara Y, Kumar S, Salapura V, Satterfield D L, Steinmacher-Burow B D, Parker J J. The IBM Blue Gene/Q interconnection network and message unit. In Proc. the 2011 Conference on High Performance Computing Networking, Storage and Analysis, November 2011, Article No. 26.
[40]
Faanes G, Bataineh A, Roweth D, Court T, Froese E, Alverson R, Johnson T, Kopnick J, Higgins M, Reinhard J. Cray Cascade: A scalable HPC system based on a Dragonfly network. In Proc. the 2012 Conference on High Performance Computing Networking, Storage and Analysis, November 2012, Article No. 103.
[42]
Latham R, Ross R B, Thakur R. Can MPI be used for persistent parallel services? In Proc. the 13th European Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting, September 2006, pp.275-284.
[43]
Grun P, Hefty S, Sur S, Goodell D, Russell R D, Pritchard H, Squyres J M. A brief introduction to the OpenFabrics interfaces — A new network API for maximizing high performance application efficiency. In Proc. the 23rd IEEE Annual Symposium on High-Performance Interconnects, August 2015, pp.34-39.
[44]
Shamis P, Venkata M G, Lopez M G et al. UCX: An open source framework for HPC network APIs and beyond. In Proc. the 23rd IEEE Annual Symposium on High-Performance Interconnects, August 2015, pp.40-43.
[46]
Soumagne J, Kimpe D, Zounmevo J A, Chaarawi M, Koziol Q, Afsahi A, Ross R B. Mercury: Enabling remote procedure call for high-performance computing. In Proc. the 2013 IEEE International Conference on Cluster Computing, September 2013, Article No. 50.
[47]
Oldfield R, Widener P M, Maccabe A B, Ward L, Kordenbrock T. Efficient data-movement for lightweight I/O. In Proc. the 2006 IEEE International Conference on Cluster Computing, September 2006, Article No. 60.
[48]
Wheeler K B, Murphy R C, Thain D. Qthreads: An API for programming with millions of lightweight threads. In Proc. the 22nd IEEE International Symposium on Parallel and Distributed Processing, April 2008.
[49]
Nakashima J, Taura K. MassiveThreads: A thread library for high productivity languages. In Concurrent Objects and Beyond — Papers Dedicated to Akinori Yonezawa on the Occasion of His 65th Birthday, Agha G, Igarashi A, Kobayashi N, Masuhara H, Matsuoka S, Shibayama E, Taura K (eds.), Springer, 2014, pp.222-238.
[51]
Dorier M, Carns P H, Harms K et al. Methodology for the rapid development of scalable HPC data services. In Proc. the 3rd IEEE/ACM International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, November 2018, pp.76-87.
[52]
Carns P H, Jenkins J, Cranor C D, Atchley S, Seo S, Snyder S, Ross R B. Enabling NVM for data-intensive scientific services. In Proc. the 4th Workshop on Interactions of NVM/Flash with Operating Systems and Workloads, November 2016, Article No. 4.
[54]
Lofstead J F, Klasky S, Schwan K, Podhorszki N, Jin C. Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS). In Proc. the 6th International Workshop on Challenges of Large Applications in Distributed Environments, June 2008, pp.15-24.
[55]
Moore M, Bonnie D, Ligon B, Marshall M, Ligon W, Mills N, Quarles E, Sampson S, Yang S, Wilson B. OrangeFS: Advancing PVFS. In Proc. the 9th USENIX Conference on File and Storage Technologies, February 2011.
[56]
Volos H, Nalli S, Panneerselvam S, Varadarajan V, Saxena P, Swift M M. Aerie: Flexible file-system interfaces to storage-class memory. In Proc. the 9th Eurosys Conference, April 2014, Article No. 14.
[57]
Zheng Q, Cranor C D, Guo D, Ganger G R, Amvrosiadis G, Gibson G A, Settlemyer B W, Grider G, Guo F. Scaling embedded in-situ indexing with DeltaFS. In Proc. the 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2018, Article No. 3.
[58]
Kelly S M, Brightwell R. Software architecture of the light weight kernel, Catamount. In Proc. the 2005 Cray User Group Annual Technical Conference, May 2005, pp.16-19.
[59]
Rajgarhia A, Gehani A. Performance and extension of user space file systems. In Proc. the 2010 ACM Symposium on Applied Computing, March 2010, pp.206-213.
[60]
Vangoor B K R, Tarasov V, Zadok E. To FUSE or not to FUSE: Performance of user-space file systems. In Proc. the 15th USENIX Conference on File and Storage Technologies, February 2017, pp.59-72.
[61]
Henson V, van de Ven A, Gud A, Brown Z. Chunkfs: Using divide-and-conquer to improve file system reliability and repair. In Proc. the 2nd Workshop on Hot Topics in System Dependability, November 2006, Article No. 8.
[63]
Zhao D, Zhang Z, Zhou X, Li T, Wang K, Kimpe D, Carns P H, Ross R B, Raicu I. FusionFS: Toward supporting data intensive scientific applications on extreme-scale high-performance computing systems. In Proc. the 2014 IEEE International Conference on Big Data, October 2014, pp.61-70.
[65]
Lensing P H, Cortes T, Brinkmann A. Direct lookup and hash-based metadata placement for local file systems. In Proc. the 6th Annual International Systems and Storage Conference, June 2013, Article No. 5.
[66]
Lensing P H, Cortes T, Hughes J, Brinkmann A. File system scalability with highly decentralized metadata on independent storage devices. In Proc. the 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 2016, pp.366-375.
[67]
Carns P H, Ligon III W B, Ross R B, Thakur R. PVFS: A parallel file system for Linux clusters. In Proc. the 4th Annual Linux Showcase & Conference, October 2000, Article No. 4.
[68]
Dong S, Callaghan M, Galanis L, Borthakur D, Savor T, Strum M. Optimizing space amplification in RocksDB. In Proc. the 8th Biennial Conference on Innovative Data Systems Research, January 2017, Article No. 30.
[69]
Oral S, Dillow D A, Fuller D, Hill J, Leverman D, Vazhkudai S S, Wang F, Kim Y, Rogers J, Simmons J, Miller R. OLCF’s 1 TB/s, next-generation Lustre file system. In Proc. the 2013 Cray User Group Conference, April 2013.
[70]
Greenberg H, Bent J, Grider G. MDHIM: A parallel key/value framework for HPC. In Proc. the 7th USENIX Workshop on Hot Topics in Storage and File Systems, July 2015, Article No. 10.
[71]
Karger D R, Lehman E, Leighton F T, Panigrahy R, Levine M S, Lewin D. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proc. the 29th Annual ACM Symposium on the Theory of Computing, May 1997, pp.654-663.
[72]
Bent J, Gibson G A, Grider G, McClelland B, Nowoczynski P, Nunez J, Polte M, Wingate M. PLFS: A checkpoint filesystem for parallel applications. In Proc. the 2009 ACM/IEEE Conference on High Performance Computing, November 2009, Article No. 26.
[74]
Yildiz O, Dorier M, Ibrahim S, Ross R B, Antoniu G. On the root causes of cross-application I/O interference in HPC storage systems. In Proc. the 2016 IEEE International Parallel and Distributed Processing Symposium, May 2016, pp.750-759.
[75]
Lofstead J F, Zheng F, Liu Q, Klasky S, Oldfield R, Kordenbrock T, Schwan K, Wolf M. Managing variability in the IO performance of petascale storage systems. In Proc. the 2010 Conference on High Performance Computing Networking, Storage and Analysis, November 2010, Article No. 35.
[76]
Xie B, Chase J S, Dillow D, Drokin O, Klasky S, Oral S, Podhorszki N. Characterizing output bottlenecks in a supercomputer. In Proc. the 2012 Conference on High Performance Computing Networking, Storage and Analysis, November 2012, Article No. 8.
[79]
Kougkas A, Devarajan H, Sun X, Lofstead J F. Harmonia: An interference-aware dynamic I/O scheduler for shared non-volatile burst buffers. In Proc. the 2018 IEEE International Conference on Cluster Computing, September 2018, pp.290-301.
[80]
Dong B, Byna S, Wu K, Prabhat, Johansen H, Johnson J N, Keen N. Data elevator: Low-contention data movement in hierarchical storage system. In Proc. the 23rd IEEE International Conference on High Performance Computing, December 2016, pp.152-161.
[81]
Miranda A, Jackson A, Tocci T, Panourgias I, Nou R. NORNS: Extending Slurm to support data-driven workflows through asynchronous data staging. In Proc. the 2019 IEEE International Conference on Cluster Computing, September 2019.
[82]
Subedi P, Davis P E, Duan S, Klasky S, Kolla H, Parashar M. Stacker: An autonomic data movement engine for extreme-scale data staging-based in-situ workflows. In Proc. the 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2018, Article No. 73.
[83]
Wang T, Oral S, Pritchard M, Wang B, Yu W. TRIO: Burst buffer based I/O orchestration. In Proc. the 2015 IEEE International Conference on Cluster Computing, September 2015, pp.194-203.
[84]
Thapaliya S, Bangalore P, Lofstead J F, Mohror K, Moody A. Managing I/O interference in a shared burst buffer system. In Proc. the 45th International Conference on Parallel Processing, August 2016, pp.416-425.
[85]
Soysal M, Berghoff M, Klusácek D, Streit A. On the quality of wall time estimates for resource allocation prediction. In Proc. the 48th International Conference on Parallel Processing, August 2019, Article No. 23.
[86]
Folk M, Heber G, Koziol Q, Pourmal E, Robinson D. An overview of the HDF5 technology suite and its applications. In Proc. the 2011 EDBT/ICDT Workshop on Array Databases, March 2011, pp.36-47.
[87]
Li J, Liao W, Choudhary A N, Ross R B, Thakur R, Gropp W, Latham R, Siegel A R, Gallagher B, Zingale M. Parallel netCDF: A high-performance scientific I/O interface. In Proc. the 2003 ACM/IEEE Conference on High Performance Networking and Computing, November 2003, Article No. 39.