AI Chat Paper
Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.
{{lang === 'zh_CN' ? '文章概述' : 'Summary'}}
{{lang === 'en_US' ? '中' : 'Eng'}}
Chat more with AI
Article Link
Collect
Submit Manuscript
Show Outline
Outline
Show full outline
Hide outline
Outline
Show full outline
Hide outline
Regular Paper

Efficient Partitioning Method for Optimizing the Compression on Array Data

Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China
Show Author Information

Abstract

Array partitioning is an important research problem in array management area, since the partitioning strategies have important influence on storage, query evaluation, and other components in array management systems. Meanwhile, compression is highly needed for the array data due to its growing volume. Observing that array partitioning can affect the compression performance significantly, this paper aims to design an efficient partitioning method for array data to optimize the compression performance. As far as we know, there still lacks research efforts on this problem. In this paper, the problem of array partitioning for optimizing the compression performance (PPCP for short) is firstly proposed. We adopt a popular compression technique which allows to process queries on the compressed data without decompression. Secondly, because the above problem is NP-hard, two essential principles for exploring the partitioning solution are introduced, which can explain the core idea of the partitioning algorithms proposed by us. The first principle shows that the compression performance can be improved if an array can be partitioned into two parts with different sparsities. The second principle introduces a greedy strategy which can well support the selection of the partitioning positions heuristically. Supported by the two principles, two greedy strategy based array partitioning algorithms are designed for the independent case and the dependent case respectively. Observing the expensive cost of the algorithm for the dependent case, a further optimization based on random sampling and dimension grouping is proposed to achieve linear time cost. Finally, the experiments are conducted on both synthetic and real-life data, and the results show that the two proposed partitioning algorithms achieve better performance on both compression and query evaluation.

Electronic Supplementary Material

Download File(s)
jcst-37-5-1049-Highlights.pdf (131.6 KB)

References

[1]
Duggan J, Stonebraker M. Incremental elasticity for array databases. In Proc. the 2014 ACM SIGMOD International Conference on Management of Data, Jun. 2014, pp. 409-420. DOI: 10.1145/2588555.2588569.
[2]
Li J, Rotem D, Wong H K. A new compression method with fast searching on large databases. In Proc. the 13th International Conference on Very Large Data Bases, Sept. 1987, pp. 311-318.
[3]
Wang J, Lin C, Papakonstantinou Y, Swanson S. An experimental study of bitmap compression vs. inverted list compression. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp. 993-1008. DOI: 10.1145/3035918.3064007.
[4]

Damme P, Ungethüm A, Hildebrandt J, Habich D, Lehner W. From a comprehensive experimental survey to a cost-based selection strategy for lightweight integer compression algorithms. ACM Transactions on Database Systems, 2019, 44(3): Article No. 9. DOI: 10.1145/3323991.

[5]
Sarawagi S, Stonebraker M. Efficient organization of large multidimensional arrays. In Proc. the 1994 International Conference on Data Engineering, Feb. 1994, pp. 328-336. DOI: 10.1109/ICDE.1994.283048.
[6]
Otoo E J, Rotem D, Seshadri S. Optimal chunking of large multidimensional arrays for data warehousing. In Proc. the 10th International Workshop on Data Warehousing and OLAP, Nov. 2007, pp. 25-32. DOI: 10.1145/1317331.1317337.
[7]
Stonebraker M, Brown P, Poliakov A, Raman S. The architecture of SciDB. In Proc. the 23rd International Conference on Scientific and Statistical Database Management, Jul. 2011, pp. 1-16. DOI: 10.1007/978-3-642-22351-8_1.
[8]
Soroush E, Balazinska M, Wang D. ArrayStore: A storage manager for complex parallel array processing. In Proc. the ACM SIGMOD International Conference on Management of Data, Jun. 2011, pp. 253-264. DOI: 10.1145/1989323.1989351.
[9]
Rusu F, Cheng Y. A survey on array storage, query languages, and systems. arXiv: 1302.0103, 2013. https://arxiv.org/abs/1302.0103, Feb. 2022.
[10]
Chang C, Moon B, Acharya A, Shock C, Sussman A, Saltz J. Titan: A high-performance remote-sensing database. In Proc. the 13th International Conference on Data Engineering, April 1997, pp. 375-384. DOI: 10.1109/ICDE.1997.581883.
[11]

Marathe A P, Salem K. Query processing techniques for arrays. The VLDB Journal, 2002, 11(1): 68-91. DOI: 10.1007/s007780200062.

[12]
Brown P G. Overview of SciDB: Large scale array storage, processing and analysis. In Proc. the ACM SIGMOD International Conference on Management of Data, Jun. 2010, pp. 963-968. DOI: 10.1145/1807167.1807271.
[13]

Papadopoulos S, Datta K, Madden S, Mattson T. The TileDB array data storage manager. Proceedings of the VLDB Endowment, 2016, 10(4): 349-360. DOI: 10.14778/3025111.3025117.

[14]

Kaddoura M, Ranka S, Wang A. Array decompositions for nonuniform computational environments. Journal of Parallel and Distributed Computing, 1996, 36(2): 91-105. DOI: 10.1006/jpdc.1996.0092.

[15]

Muthukrishnan S, Suel T. Approximation algorithms for array partitioning problems. Journal of Algorithms, 2005, 54(1): 85-104. DOI: 10.1016/j.jalgor.2003.11.006.

[16]

Moudani W, Hussein M, Moukhtar M, Mora-Camino F. Intelligent data compression approach in multidimensional data warehouse. International Journal on Computer Science and Engineering, 2011, 3(2): 951-960.

[17]
Rahmani M M, Al-Mahmud A, Hossen M A, Rahman M, Ahmed M R, Sohan M F. A comparative analysis of traditional and modern data compression schemes for large multi-dimensional extendible array. In Proc. the 2019 International Conference on Electrical, Computer and Communication Engineering, Feb. 2019. DOI: 10.1109/ECACE.2019.8679182.
[18]

Cormode G, Garofalakis M, Haas P J et al. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends® in Databases, 2011, 4(1/2/3): 1-294. DOI: 10.1561/1900000004.

[19]

Vitter J S. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 1985, 11(1): 37-57. DOI: 10.1145/3147.3165.

[20]
Deshpande A, Garofalakis M, Rastogi R. Independence is good: Dependency-based histogram synopses for high-dimensional data. In Proc. the 2001 ACM SIGMOD International Conference on Management of Data, May 2001, pp. 199-210. DOI: 10.1145/375663.375685.
[21]
Reiner B, Hahn K, Höfling G, Baumann P. Hierarchical storage support and management for large-scale multidimensional array database management systems. In Proc. the 13th International Conference on Database and Expert Systems Applications, Sept. 2002, pp. 689-700. DOI: 10.1007/3-540-46146-9_68.
[22]
Chai C, Li G, Li J, Deng D, Feng J. Cost-effective crowdsourced entity resolution: A partial-order approach. In Proc. the 2016 International Conference on Management of Data, Jun. 26-Jul. 1, 2016, pp. 969-984. DOI: 10.1145/2882903.2915252.
[23]

Chai C, Liu J, Tang N, Li G, Luo Y. Selective data acquisition in the wild for model charging. Proceedings of the VLDB Endowment, 2022, 15(7): 1466-1478. DOI: 10.14778/3523210.3523223.

[24]
Liu J, Chai C, Luo Y, Lou Y, Feng J, Tang N. Feature augmentation with reinforcement learning. In Proc. the 2022 IEEE International Conference on Data Engineering, May 2022, pp. 3360-3372. DOI: 10.1109/ICDE53745.2022.00317.
[25]
Manne F, Sørevik T. Partitioning an array onto a mesh of processors. In Proc. the 3rd International Workshop on Applied Parallel Computing, Aug. 1996, pp. 467-477. DOI: 10.1007/3-540-62095-8_50.
[26]

Mingozzi A, Ricciardelli S, Spadoni M. Partitioning a matrix to minimize the maximum cost. Discrete Applied Mathematics, 1995, 62(1/2/3): 221-248. DOI: 10.1016/0166-218X(94)00154-6.

[27]

Nicol D M. Rectilinear partitioning of irregular data parallel computations. Journal of Parallel and Distributed Computing, 1994, 23(2): 119-134. DOI: 10.1006/jpdc.1994.1126.

[28]

Saule E, Baş E Ö, Çatalyürek Ü V. Load-balancing spatially located computations using rectangular partitions. Journal of Parallel and Distributed Computing, 2012, 72(10): 1201-1214. DOI: 10.1016/j.jpdc.2012.05.013.

[29]

Roth M A, Van Horn S J. Database compression. SIGMOD Record, 1993, 22(3): 31-39. DOI: 10.1145/163090.163096.

[30]

Lemire D, Boytsov L. Decoding billions of integers per second through vectorization. Software–Practice and Experience, 2015, 45(1): 1-29. DOI: 10.1002/spe.2203.

[31]
Schlegel B, Gemulla R, Lehner W. Fast integer compression using SIMD instructions. In Proc. the 6th International Workshop on Data Management on New Hardware, June 2010, pp. 34-40. DOI: 10.1145/1869389.1869394.
[32]
Antoshenkov G. Byte-aligned bitmap compression. In Proc. the 1995 Data Compression Conference, Mar. 1995, pp. 476. DOI: 10.1109/DCC.1995.515586.
[33]

Wu K, Otoo E J, Shoshani A. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 2006, 31(1): 1-38. DOI: 10.1145/1132863.1132864.

[34]

Lemire D, Kaser O, Aouiche K. Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering, 2010, 69(1): 3-28. DOI: 10.1016/j.datak.2009.08.006.

[35]

Colantonio A, Di Pietro R. CONCISE: Compressed 'n' composable integer set. Information Processing Letters, 2010, 110(16): 644-650. DOI: 10.1016/j.ipl.2010.05.018.

[36]

Chambi S, Lemire D, Kaser O, Godin R. Better bitmap performance with roaring bitmaps. Software–Practice and Experience, 2016, 46(5): 709-719. DOI: 10.1002/spe.2325.

[37]
Li G, Chai C, Fan J, Weng X, Li J, Zheng Y, Li Y, Yu X, Zhang X, Yuan H. CDB: Optimizing queries with crowd-based selections and joins. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp. 1463-1478. DOI: 10.1145/3035918.3064036.
[38]
Chai C, Cao L, Li G, Li J, Luo Y, Madden S. Human-in-the-loop outlier detection. In Proc. the 2020 International Conference on Management of Data, Jun. 2020, pp. 19-33. DOI: 10.1145/3318464.3389772.
[39]

Huffman D A. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 1952, 40(9): 1098-1101. DOI: 10.1109/JRPROC.1952.273898.

[40]

Witten I H, Neal R M, Cleary J G. Arithmetic coding for data compression. Communications of the ACM, 1987, 30(6): 520-540. DOI: 10.1145/214762.214771.

[41]

Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 1977, 23(3): 337-343. DOI: 10.1109/TIT.1977.1055714.

[42]
Agarwal R, Khandelwal A, Stoica I. Succinct: Enabling queries on compressed data. In Proc. the 12th USENIX Symposium on Networked Systems Design and Implementation, May 2015, pp. 337-350.
[43]
Mutlu O, Shen X, Zhai J, Zhang F. Potential of a method for text analytics directly on compressed data. Technical Report, North Carolina State University, 2017. https://repository.lib.ncsu.edu/bitstream/handle/1840.20/35582/TR-2017-4.pdf?sequence=1&isAllowed=y, Feb. 2022.
[44]
Zhang F, Zhai J, Shen X, Mutlu O, Chen W. Zwift: A programming framework for high performance text analytics on compressed data. In Proc. the 32nd International Conference on Supercomputing, Jun. 2018, pp. 195-206. DOI: 10.1145/3205289.3205325.
[45]

Nevill-Manning C G, Witten I H. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 1997, 7: 67-82. DOI: 10.1613/jair.374.

[46]
Zhang F, Pan Z, Zhou Y, Zhai J, Shen X, Mutlu O, Du X. GTADOC: Enabling efficient GPU-based text analytics without decompression. In Proc. the 2021 IEEE International Conference on Data Engineering, Apr. 2021, pp. 1679-1690. DOI: 10.1109/ICDE51399.2021.00148.
[47]

Zhang F, Zhai J, Shen X, Mutlu O, Du X. POCLib: A high-performance framework for enabling near orthogonal processing on compression. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(2): 459-475. DOI: 10.1109/TPDS.2021.3093234.

Journal of Computer Science and Technology
Pages 1049-1067
Cite this article:
Han S, Liu X-M, Li J-Z. Efficient Partitioning Method for Optimizing the Compression on Array Data. Journal of Computer Science and Technology, 2022, 37(5): 1049-1067. https://doi.org/10.1007/s11390-022-2371-7

462

Views

2

Crossref

0

Web of Science

1

Scopus

0

CSCD

Altmetrics

Received: 31 March 2022
Accepted: 21 September 2022
Published: 30 September 2022
©Institute of Computing Technology, Chinese Academy of Sciences 2022
Return