Data partitioning techniques are pivotal for optimal data placement across storage devices, thereby enhancing resource utilization and overall system throughput. However, designing effective partition schemes faces multiple challenges, including the cluster environment, storage device characteristics, optimization objectives, and the balance between partition quality and computational efficiency. Furthermore, dynamic environments necessitate robust partition detection mechanisms. This paper presents a comprehensive survey structured around partition deployment environments, outlining the distinguishing features and applicability of various partitioning strategies and showing how these challenges are addressed. We discuss partitioning features pertaining to database schema, table data, workload, and runtime metrics. We then examine the partition generation process, dividing it into initialization and optimization stages. A comparative analysis of partition generation and update algorithms is provided, emphasizing their suitability for different scenarios and optimization objectives. Additionally, we illustrate the applications of partitioning in prevalent database products and suggest potential future research directions and solutions. This survey aims to foster the implementation, deployment, and updating of high-quality partitions for specific system scenarios.