AI Chat Paper
Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.
{{lang === 'zh_CN' ? '文章概述' : 'Summary'}}
{{lang === 'en_US' ? '中' : 'Eng'}}
Chat more with AI
Article Link
Collect
Submit Manuscript
Show Outline
Outline
Show full outline
Hide outline
Outline
Show full outline
Hide outline
Survey

Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning Techniques

School of Information, Renmin University of China, Beijing 100872, China
Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education, Beijing 100872, China
Show Author Information

Abstract

Data partitioning techniques are pivotal for optimal data placement across storage devices, thereby enhancing resource utilization and overall system throughput. However, the design of effective partition schemes faces multiple challenges, including considerations of the cluster environment, storage device characteristics, optimization objectives, and the balance between partition quality and computational efficiency. Furthermore, dynamic environments necessitate robust partition detection mechanisms. This paper presents a comprehensive survey structured around partition deployment environments, outlining the distinguishing features and applicability of various partitioning strategies while delving into how these challenges are addressed. We discuss partitioning features pertaining to database schema, table data, workload, and runtime metrics. We then delve into the partition generation process, segmenting it into initialization and optimization stages. A comparative analysis of partition generation and update algorithms is provided, emphasizing their suitability for different scenarios and optimization objectives. Additionally, we illustrate the applications of partitioning in prevalent database products and suggest potential future research directions and solutions. This survey aims to foster the implementation, deployment, and updating of high-quality partitions for specific system scenarios.

Electronic Supplementary Material

Video
3538-Video.mp4
Download File(s)
JCST-2306-13538-Highlights.pdf (126.9 KB)

References

[1]

Melnik S, Gubarev A, Long J J et al. Dremel: A decade of interactive SQL analysis at web scale. Proceedings of the VLDB Endowment, 2020, 13(12): 3461–3472. DOI: 10.14778/3415478. 3415568.

[2]
Bayer R, McCreight E. Organization and maintenance of large ordered indices. In Proc. the 1970 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control, Nov. 1970, pp.107–141. DOI: 10.1145/1734663.1734671.
[3]

Bentley J L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18(9): 509–517. DOI: 10.1145/361002.361007.

[4]
Guttman A. R-trees: A dynamic index structure for spatial searching. In Proc. the 1984 ACM SIGMOD International Conference on Management of Data, Jun. 1984, pp.47–57. DOI: 10.1145/602259.602266.
[5]
Yuan H T, Li G L, Feng L, Sun J, Han Y. Automatic view generation with deep learning and reinforcement learning. In Proc. the 36th IEEE International Conference on Data Engineering, Apr. 2020, pp.1501–1512. DOI: 10.1109/ICDE48307.2020.00133.
[6]
Han Y, Li G L, Yuan H T, Sun J. An autonomous materialized view management system with deep reinforcement learning. In Proc. the 37th IEEE International Conference on Data Engineering, Apr. 2021, pp.2159–2164. DOI: 10.1109/ICDE51399.2021.00217.
[7]

Zhang H, Chen G, Ooi B C, Tan K L, Zhang M H. In-memory big data management and processing: A survey. IEEE Trans. Knowledge and Data Engineering, 2015, 27(7): 1920–1948. DOI: 10.1109/TKDE.2015.2427795.

[8]

Mahmud M S, Huang J Z, Salloum S et al. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining and Analytics, 2020, 3(2): 85–101. DOI: 10.26599/BDMA.2019.9020015.

[9]
Sun L W, Franklin M J, Krishnan S, Xin R S. Fine-grained partitioning for aggressive data skipping. In Proc. the 2014 ACM SIGMOD International Conference on Management of Data, Jun. 2014, pp.1115–1126. DOI: 10.1145/2588555.2610515.
[10]

Aly A M, Mahmood A R, Hassan M S, Aref W G, Ouzzani M, Elmeleegy H, Qadah T. AQWA: Adaptive query workload aware partitioning of big spatial data. Proceedings of the VLDB Endowment, 2015, 8(13): 2062–2073. DOI: 10.14778/2831360.2831361.

[11]
Aly A M, Elmeleegy H, Qi Y, Aref W. Kangaroo: Workload-aware processing of range data and range queries in Hadoop. In Proc. the 9th ACM International Conference on Web Search and Data Mining, Feb. 2016, pp.397–406. DOI: 10.1145/2835776.2835841.
[12]
Shanbhag A, Jindal A, Madden S, Quiane J, Elmore A J. A robust partitioning scheme for ad-hoc query workloads. In Proc. the 2017 Symposium on Cloud Computing, Sept. 2017, pp.229–241. DOI: 10.1145/3127479.3131613.
[13]

Lu Y, Shanbhag A, Jindal A, Madden S. AdaptDB: Adaptive partitioning for distributed joins. Proceedings of the VLDB Endowment, 2017, 10(5): 589–600. DOI: 10.14778/3055540.3055551.

[14]
Yang Z H, Chandramouli B, Wang C et al. Qd-tree: Learning data layouts for big data analytics. In Proc. the 2020 ACM SIGMOD International Conference on Management of Data, Jun. 2020, pp.193–208. DOI: 10.1145/3318464.3389770.
[15]
Ding J L, Minhas U F, Chandramouli B et al. Instance-optimized data layouts for cloud analytics workloads. In Proc. the 2021 International Conference on Management of Data, Jun. 2021, pp.418–431. DOI: 10.1145/3448016.3457270.
[16]
Li Z, Yiu M L, Chan T N. PAW: Data partitioning meets workload variance. In Proc. the 38th IEEE International Conference on Data Engineering, May 2022, pp.123–135. DOI: 10.1109/icde53745.2022.00014.
[17]
Rao J, Zhang C, Megiddo N, Lohman G. Automating physical database design in a parallel database. In Proc. the 2002 ACM SIGMOD International Conference on Management of Data, Jun. 2002, pp.558–569. DOI: 10.1145/564691.564757.
[18]
Agrawal S, Chu E, Narasayya V. Automatic physical design tuning: Workload as a sequence. In Proc. the 2006 ACM SIGMOD International Conference on Management of Data, Jun. 2006, pp.683–694. DOI: 10.1145/1142473.1142549.
[19]
Eadon G, Chong E I, Shankar S, Raghavan A, Srinivasan J, Das S. Supporting table partitioning by reference in oracle. In Proc. the 2008 ACM SIGMOD International Conference on Management of Data, Jun. 2008, pp.1111–1122. DOI: 10.1145/1376616.1376727.
[20]

Hauglid J O, Ryeng N H, Nørvåg K. DYFRAM: Dynamic fragmentation and replica management in distributed database systems. Distributed and Parallel Databases, 2010, 28(2): 157–185. DOI: 10.1007/s10619-010-7068-1.

[21]

Curino C, Jones E, Zhang Y, Madden S. Schism: A workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 2010, 3(1/2): 48–57. DOI: 10.14778/1920841.1920853.

[22]
Nehme R, Bruno N. Automated partitioning design in parallel database systems. In Proc. the 2011 ACM SIGMOD International Conference on Management of Data, Jun. 2011, pp.1137–1148. DOI: 10.1145/1989323.1989444.
[23]
Pavlo A, Curino C, Zdonik S. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proc. the 2012 ACM SIGMOD International Conference on Management of Data, May 2012, pp.61–72. DOI: 10.1145/2213836.2213844.
[24]
Liroz-Gistau M, Akbarinia R, Pacitti E et al. Dynamic workload-based partitioning algorithms for continuously growing databases. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XII, Hameurlain A, Küng J, Wagner R (eds.), Springer, 2013, pp.105–128. DOI: 10.1007/978-3-642-45315-1_5.
[25]
Quamar A, Kumar K A, Deshpande A. 2013. SWORD: Scalable workload-aware data placement for transactional workloads. In Proc. the 16th International Conference on Extending Database Technology, Mar. 2013, pp.430–441. DOI: 10.1145/2452376.2452427.
[26]

Taft R, Mansour E, Serafini M, Duggan J, Elmore A J, Aboulnaga A, Pavlo A, Stonebraker M. E-store: Fine-grained elastic partitioning for distributed transaction processing systems. Proceedings of the VLDB Endowment, 2014, 8(3): 245–256. DOI: 10.14778/2735508.2735514.

[27]
Chen K J, Zhou Y L, Cao Y. Online data partitioning in distributed database systems. In Proc. the 18th International Conference on Extending Database Technology, Mar. 2015, pp.1–12. DOI: 10.5441/002/edbt.2015.02.
[28]
Zamanian E, Binnig C, Salama A. Locality-aware partitioning in parallel database systems. In Proc. the 2015 ACM SIGMOD International Conference on Management of Data, May 2015, pp.17–30. DOI: 10.1145/2723372.2723718.
[29]
Fetai I, Murezzan D, Schuldt H. Workload-driven adaptive data partitioning and distribution—The Cumulus approach. In Proc. the 2015 IEEE International Conference on Big Data, Oct. 29–Nov. 1, 2015, pp.1688–1697. DOI: 10.1109/BigData.2015.7363940.
[30]

Serafini M, Taft R, Elmore A J et al. Clay: Fine-grained adaptive partitioning for general database schemas. Proceedings of the VLDB Endowment, 2016, 10(4): 445–456. DOI: 10.14778/3025111. 3025125.

[31]
Marcus R, Papaemmanouil O, Semenova S, Garber S. NashDB: An end-to-end economic method for elastic database fragmentation, replication, and provisioning. In Proc. the 2018 International Conference on Management of Data, May 2018, pp.1253–1267. DOI: 10.1145/3183713.3196935.
[32]
Nam Y M, Kim M S, Han D. A graph-based database partitioning method for parallel OLAP query processing. In Proc. the 34th IEEE International Conference on Data Engineering, Apr. 2018, pp.1025–1036. DOI: 10.1109/ICDE.2018.00096.
[33]

Parchas P, Naamad Y, Van Bouwel P, Faloutsos C, Petropoulos M. Fast and effective distribution-key recommendation for amazon redshift. Proceedings of the VLDB Endowment, 2020, 13(12): 2411–2423. DOI: 10.14778/3407 790.3407834.

[34]
Hilprecht B, Binnig C, Röhm U. Learning a partitioning advisor for cloud databases. In Proc. the 2020 ACM SIGMOD International Conference on Management of Data, Jun. 2020, pp.143–157. DOI: 10.1145/3318464.3389704.
[35]
Brendle M, Weber N, Valiyev M, May N, Schulze R, Böhm A, Moerkotte G, Grossniklaus M. SAHARA: Memory footprint reduction of cloud databases with automated table partitioning. In Proc. the 25th International Conference on Extending Database Technology, Mar. 29–Apr. 1, 2022. DOI: 10.5441/002/edbt.2022.02.
[36]
Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. In Proc. the 20th International Conference on Very Large Data Bases, Sept. 1994, pp.487–499.
[37]

Ward J H Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 1963, 58(301): 236–244. DOI: 10.1080/01621459.1963. 10500845.

[38]
Roussopoulos N, Kelley S, Vincent F. Nearest neighbor queries. In Proc. the 1995 ACM SIGMOD International Conference on Management of Data, May 1995, pp.71–79. DOI: 10.1145/223784.223794.
[39]

Sacca D, Wiederhold G. Database partitioning in a cluster of processors. ACM Trans. Database Systems, 1985, 10(1): 29–56. DOI: 10.1145/3148.3161.

[40]
Copeland G, Alexander W, Boughter E, Keller T. Data placement in Bubba. In Proc. the 1988 ACM SIGMOD International Conference on Management of Data, Jun. 1988, pp.99–108. DOI: 10.1145/50202.50213.
[41]
Stöhr T, Märtens H, Rahm E. Multi-dimensional database allocation for parallel data warehouses. In Proc. the 26th International Conference on Very Large Data Bases, Sept. 2000, pp.273–284.
[42]
Bruno N, Chaudhuri S. An online approach to physical design tuning. In Proc. the 23rd IEEE International Conference on Data Engineering, Apr. 2007, pp.826–835. DOI: 10.1109/ICDE.2007.367928.
[43]
Garcia-Alvarado C, Raghavan V, Narayanan S, Waas F M. Automatic data placement in MPP databases. In Proc. the IEEE 28th International Conference on Data Engineering Workshops, Apr. 2012, pp.322–327. DOI: 10.1109/ICDEW.2012.45.
[44]
Karypis G, Kumar V. METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices. Technical Report, TR 97-061, Univeristy of Minnesota, 1997. https://hdl.handle.net/11299/215346, Mar. 2024.
[45]
Kuhn H W. The Hungarian method for the assignment problem. In 50 Years of Integer Programming 1958-2008, Jünger M, Liebling T M, Naddef D, Nemhauser G L, Pulleyblank W R, Reinelt G, Rinaldi G, Wolsey L A (eds.), Springer, 2010, pp.29–47. DOI: 10.1007/978-3-540-68279-0_2.
[46]

Costa E, Costa C, Santos M Y. Evaluating partitioning and bucketing strategies for hive-based big data warehousing systems. Journal of Big Data, 2019, 6(1): 34. DOI: 10.1186/s40537-019-0196-1.

[47]
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M. Playing Atari with deep reinforcement learning. arXiv: 1312.5602, 2013. https://arxiv.org/abs/1312.5602, Mar. 2024.
[48]

Kallman R, Kimura H, Natkins J, Pavlo A, Rasin A, Zdonik S, Jones E P C, Madden S, Stonebraker M, Zhang Y, Hugg J, Abadi D J. H-store: A high-performance, distributed main memory transaction processing system. Proceedings of the VLDB Endowment, 2008, 1(2): 1496–1499. DOI: 10.14778/1454159.1454211.

[49]
Hoffer J A, Severance D G. The use of cluster analysis in physical data base design. In Proc. the 1st International Conference on Very Large Data Bases, Sept. 1975, pp.69–86. DOI: 10.1145/1282480.1282486.
[50]

Navathe S, Ceri S, Wiederhold G, Dou J L. Vertical partitioning algorithms for database design. ACM Trans. Database Systems, 1984, 9(4): 680–710. DOI: 10.1145/1994.2209.

[51]

Navathe S B, Ra M. Vertical partitioning for database design: A graphical algorithm. ACM SIGMOD Record, 1989, 18(2): 440–450. DOI: 10.1145/66926.66966.

[52]

Chu W W, Ieong I T. A transaction-based approach to vertical partitioning for relational database systems. IEEE Trans. Software Engineering, 1993, 19(8): 804–812. DOI: 10.1109/32.238583.

[53]
Ng V, Law D M, Gorla N, Chan C K. Applying genetic algorithms in database partitioning. In Proc. the 2003 ACM Symposium on Applied Computing, Mar. 2003, pp.544–549. DOI: 10.1145/952532.952639.
[54]
Hankins R A, Patel J M. Data morphing: An adaptive, cache-conscious storage technique. In Proceedings 2003 VLDB Conference, Freytag J C, Lockemann P, Abiteboul S, Carey M, Selinger P, Heuer A (eds.), Elsevier, 2003, pp.417–428. DOI: 10.1016/B978-012722442-8/50044-6.
[55]
Papadomanolakis S, Ailamaki A. AutoPart: Automating schema design for large scientific databases using data partitioning. In Proc. the 16th International Conference on Scientific and Statistical Database Management, Jun. 2004, pp.383–392. DOI: 10.1109/SSDM.2004.1311234.
[56]
Agrawal S, Narasayya V, Yang B. Integrating vertical and horizontal partitioning into automated physical database design. In Proc. the 2004 ACM SIGMOD International Conference on Management of Data, Jun. 2004, pp.359–370. DOI: 10.1145/1007568.1007609.
[57]

Gorla N, Betty P W Y. Vertical fragmentation in databases using data-mining technique. International Journal of Data Warehousing and Mining (IJDWM), 2008, 4(3): 35–53. DOI: 10.4018/jdwm.2008070103.

[58]
Rodríguez L, Li X O. A support-based vertical partitioning method for database design. In Proc. the 8th International Conference on Electrical Engineering, Computing Science and Automatic Control, Oct. 2011. DOI: 10.1109/ICEEE.2011.6106682.
[59]
Jindal A, Dittrich J. Relax and let the database do the partitioning online. In Proc. the 5th International Workshop on Enabling Real-Time Business Intelligence, Sept. 2011, pp.65–80. DOI: 10.1007/978-3-642-33500-6_5.
[60]
Li L Z, Gruenwald L. Self-managing online partitioner for databases (SMOPD): A vertical database partitioning system with a fully automatic online approach. In Proc. the 17th International Database Engineering & Applications Symposium, Oct. 2013, pp.168–173. DOI: 10.1145/2513591.2513649.
[61]
Rodríguez L, Li X O, Cuevas-Rasgado A D, García-Lamont F. DYVEP: An active database system with vertical partitioning functionality. In Proc. the 10th IEEE International Conference on Networking, Sensing and Control, Apr. 2013, pp.457–462. DOI: 10.1109/ICNSC.2013.6548782.
[62]
Li L Z, Gruenwald L. SMOPD-C: An autonomous vertical partitioning technique for distributed databases on cluster computers. In Proc. the 15th IEEE International Conference on Information Reuse and Integration, Aug. 2014, pp.171–178. DOI: 10.1109/IRI.2014.7051887.
[63]

Sun L W, Franklin M J, Wang J N, Wu E. Skipping-oriented partitioning for columnar layouts. Proceedings of the VLDB Endowment, 2016, 10(4): 421–432. DOI: 10.14778/3025111.3025123.

[64]

Huang Y F, Lai C J. Integrating frequent pattern clustering and branch-and-bound approaches for data partitioning. Information Sciences, 2016, 328: 288–301. DOI: 10.1016/j.ins.2015.08.047.

[65]

Rodríguez-Mazahua L, Alor-Hernández G, Li X O, Cervantes J, López-Chau A. Active rule base development for dynamic vertical partitioning of multimedia databases. Journal of Intelligent Information Systems, 2017, 48(2): 421–451. DOI: 10.1007/s10844-016-0420-9.

[66]
Durand G C, Pinnecke M, Piriyev R et al. GridFormation: Towards self-driven online data partitioning using reinforcement learning. In Proc. the 1st International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Jun. 2018, Article No. 1. DOI: 10.1145/3211954.3211956.
[67]
Durand G C, Piriyev R, Pinnecke M et al. Automated vertical partitioning with deep reinforcement learning. In New Trends in Databases and Information Systems, Welzer T et al. (eds.), Springer, 2019, pp.126–134. DOI: 10.1007/978-3-030-30278-8_16.
[68]

Liu P J, Li H Y, Wang T Y et al. Multi-stage method for online vertical data partitioning based on spectral clustering. Journal of Software, 2023, 34(6): 2804–2832. DOI: 10.13328/j.cnki.jos. 006496.

[69]

Grund M, Krüger J, Plattner H, Zeier A, Cudre-Mauroux P, Madden S. HYRISE: A main memory hybrid storage engine. Proceedings of the VLDB Endowment, 2010, 4(2): 105–116. DOI: 10.14778/1921071.1921077.

[70]
Jindal A, Quiané-Ruiz J A, Dittrich J. Trojan data layouts: Right shoes for a running elephant. In Proc. the 2nd ACM Symposium on Cloud Computing, Oct. 2011, Article No. 21. DOI: 10.1145/2038916.2038937.
[71]
Gu X Y, Yang X F, Wang W P, Jin Y, Meng D. CHAC: An effective attribute clustering algorithm for large-scale data processing. In Proc. the 7th International Conference on Networking, Architecture, and Storage, Jun. 2012, pp.94–98. DOI: 10.1109/NAS.2012.16.
[72]
Arulraj J, Pavlo A, Menon P. Bridging the archipelago between row-stores and column-stores for hybrid workloads. In Proc. the 2016 International Conference on Management of Data, Jun. 2016, pp.583–598. DOI: 10.1145/2882903.2915231.
[73]

Athanassoulis M, Bøgh K S, Idreos S. Optimal column layout for hybrid workloads. Proceedings of the VLDB Endowment, 2019, 12(13): 2393–2407. DOI: 10.14778/3358701.3358707.

[74]

McCormick W T, Schweitzer P J, White T W. Problem decomposition and data reorganization by a clustering technique. Operations Research, 1972, 20(5): 993–1009. DOI: 10.1287/opre.20.5.993.

[75]
Ailamaki A, DeWitt D J, Hill M D, Skounakis M. Weaving relations for cache performance. In Proc. the 27th International Conference on Very Large Data Bases, Sept. 2001, pp.169–180.
[76]
Li L Z, Gruenwald L. Autonomous database partitioning using data mining on single computers and cluster computers. In Proc. the 16th International Database Engineering & Applications Sysmposium, Aug. 2012, pp.32–41. DOI: 10.1145/2351476.2351481.
[77]
van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning. In Proc. the 30th AAAI Conference on Artificial Intelligence, Feb. 2016, pp.2094–2100. DOI: 10.1609/aaai.v30i1.10295.
[78]

Jindal A, Palatinus E, Pavlov V, Dittrich J. A comparison of knives for bread slicing. Proceedings of the VLDB Endowment, 2013, 6(6): 361–372. DOI: 10.14778/2536336.2536338.

[79]

Al-Kateb M, Sinclair P, Au G, Ballinger C. Hybrid row-column partitioning in Teradata®. Proceedings of the VLDB Endowment, 2016, 9(13): 1353–1364. DOI: 10.14778/ 3007263.3007273.

[80]

Pinnecke M, Durand G C, Broneske D, Zoun R, Saake G. GridTables: A One-Size-Fits-Most H2TAP data store. Datenbank-Spektrum, 2020, 20(1): 43–56. DOI: 10.1007/s13222-019-00330-x.

[81]
Kang D H, Jiang R C, Blanas S. Jigsaw: A data storage and query processing engine for irregular table partitioning. In Proc. the 2021 International Conference on Management of Data, Jun. 2021, pp.898–911. DOI: 10.1145/3448016.3457547.
[82]
Abebe M, Lazu H, Daudjee K. Proteus: Autonomous adaptive storage for mixed workloads. In Proc. the 2022 International Conference on Management of Data, Jun. 2022, pp.700–714. DOI: 10.1145/3514221.3517834.
[83]

Wang J Y, Chai C L, Liu J B, Li G L. FACE: A normalizing flow based cardinality estimator. Proceedings of the VLDB Endowment, 2021, 15(1): 72–84. DOI: 10.14778/3485450.3485458.

Journal of Computer Science and Technology
Pages 346-368
Cite this article:
Liu P-J, Li C-P, Chen H. Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning Techniques. Journal of Computer Science and Technology, 2024, 39(2): 346-368. https://doi.org/10.1007/s11390-024-3538-1

297

Views

1

Crossref

0

Web of Science

1

Scopus

0

CSCD

Altmetrics

Received: 21 June 2023
Accepted: 29 February 2024
Published: 30 March 2024
© Institute of Computing Technology, Chinese Academy of Sciences 2024
Return