AI Chat Paper
Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.
{{lang === 'zh_CN' ? '文章概述' : 'Summary'}}
{{lang === 'en_US' ? '中' : 'Eng'}}
Chat more with AI
Article Link
Collect
Submit Manuscript
Show Outline
Outline
Show full outline
Hide outline
Outline
Show full outline
Hide outline
Regular Paper

Minimum Epsilon-Kernel Computation for Large-Scale Data Processing

School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518000, China
Show Author Information

Abstract

Kernel is a kind of data summary which is elaborately extracted from a large dataset. Given a problem, the solution obtained from the kernel is an approximate version of the solution obtained from the whole dataset with a provable approximate ratio. It is widely used in geometric optimization, clustering, and approximate query processing, etc., for scaling them up to massive data. In this paper, we focus on the minimum ε-kernel (MK) computation that asks for a kernel of the smallest size for large-scale data processing. For the open problem presented by Wang et al. that whether the minimum ε-coreset (MC) problem and the MK problem can be reduced to each other, we first formalize the MK problem and analyze its complexity. Due to the NP-hardness of the MK problem in three or higher dimensions, an approximate algorithm, namely Set Cover-Based Minimum ε-Kernel algorithm (SCMK), is developed to solve it. We prove that the MC problem and the MK problem can be Turing-reduced to each other. Then, we discuss the update of MK under insertion and deletion operations, respectively. Finally, a randomized algorithm, called the Randomized Algorithm of Set Cover-Based Minimum ε-Kernel algorithm (RA-SCMK), is utilized to further reduce the complexity of SCMK. The efficiency and effectiveness of SCMK and RA-SCMK are verified by experimental results on real-world and synthetic datasets. Experiments show that the kernel sizes of SCMK are 2x and 17.6x smaller than those of an ANN-based method on real-world and synthetic datasets, respectively. The speedup ratio of SCMK over the ANN-based method is 5.67 on synthetic datasets. RA-SCMK runs up to three times faster than SCMK on synthetic datasets.

Electronic Supplementary Material

Download File(s)
jcst-37-6-1398-Highlights.pdf (1.1 MB)

References

[1]

Agarwal P K, Har-Peled S, Varadarajan K R. Approximating extent measures of points. Journal of the ACM, 2004, 51(4): 606-635. DOI: 10.1145/1008731.1008736.

[2]
Wang Y, Li Y, Tan K L. Coresets for minimum enclosing balls over sliding windows. In Proc. the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2019, pp.314-323. DOI: 10.1145/3292500.3330826.
[3]
Bachem O, Lucic M, Krause A. Scalable k-means clustering via lightweight coresets. In Proc. the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2018, pp.1119-1127. DOI: 10.1145/3219819.3219973.
[4]
Har-Peled S, Mazumdar S. On coresets for k-means and k-median clustering. In Proc. the 36th Annual ACM Symposium on Theory of Computing, Jun. 2004, pp.291-300. DOI: 10.1145/1007352.1007400.
[5]
Yu A, Agarwal P K, Yang J. Processing a large number of continuous preference top-k queries. In Proc. the 2012 ACM SIGMOD International Conference on Management of Data, May 2012, pp.397-408. DOI: 10.1145/2213836.2213882.
[6]
Wang Y, Mathioudakis M, Li Y, Tan K L. Minimum coresets for maxima representation of multidimensional data. In Proc. the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Jun. 2021, pp.138-152. DOI: 10.1145/3452021.3458322.
[7]
Agarwal P K, Har-Peled S, Varadarajan K R. Geometric approximation via coresets. In Combinatorial and Computational Geometry, 2005, 52. http://library.msri.org/books/Book52/files/01agar.pdf, Oct. 2022.
[8]

Yu H, Agarwal P K, Poreddy R, Varadarajan K R. Practical methods for shape fitting and kinetic data structures using coresets. Algorithmica, 2008, 52(3): 378-402. DOI: 10.1007/s00453-007-9067-9.

[9]

Luo G, Wu K L, Yu P S. Answering linear optimization queries with an approximate stream index. Knowledge and Information Systems, 2009, 20(1): 95-121. DOI: 10.1007/s10115-008-0157-z.

[10]
Agarwal P K, Kumar N, Sintos S, Suri S. Efficient algorithms for k-regret minimizing sets. In Proc. the 16th International Symposium on Experimental Algorithms, Jun. 2017, Article No. 7. DOI: 10.4230/LIPIcs.SEA.2017.7.
[11]
Xie M, Wong R C W, Li J, Long C, Lall A. Efficient k-regret query algorithm with restriction-free bound for any dimensionality. In Proc. the 2018 International Conference on Management of Data, Jun. 2018, pp.959-974. DOI: 10.1145/3183713.3196903.
[12]
Zheng J, Wang Y, Wang X, Ma W. Continuous k-regret minimization queries: A dynamic coreset approach. IEEE Transactions on Knowledge and Data Engineering. DOI: 10.1109/TKDE.2022.3166835.
[13]

Chan T M. Faster core-set constructions and data-stream algorithms in fixed dimensions. Computational Geometry, 2006, 35(1/2): 20-35. DOI: 10.1016/j.comgeo.2005.10.002.

[14]
Arya S, Chan T M. Better ε-dependencies for offline approximate nearest neighbor search, Euclidean minimum spanning trees, and ε-kernels. In Proc. the 30th Annual Symposium on Computational Geometry, Jun. 2014, pp.416-425. DOI: 10.1145/2582112.2582161.
[15]
Arya S, Da Fonseca G D, Mount D M. Near-optimal ε-Kernel construction and related problems. In Proc. the 33rd International Symposium on Computational Geometry, Jul. 2017, Article No. 10. DOI: 10.4230/LIPIcs.SoCG.2017.10.
[16]

Chan T M. Applications of Chebyshev polynomials to low-dimensional computational geometry. Journal of Computational Geometry, 2018, 9(2): 3-20. DOI: 10.20382/jocg.v9i2a2.

[17]

Chan T M. Dynamic coresets. Discrete and Computational Geometry, 2009, 42(3): 469-488. DOI: 10.1007/s00454-009-9165-3.

[18]
Andoni A, Nguyen H L. Width of points in the streaming model. In Proc. the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms, Jan. 2012, pp.447-452. DOI: 10.1137/1.9781611973099.38.
[19]
Chan T M. Dynamic streaming algorithms for ε-kernels. In Proc. the 32nd International Symposium on Computational Geometry, Jun. 2016, Article No. 27. DOI: 10.4230/LIPIcs.SoCG.2016.27.
[20]

Guo H, Li J, Gao H. Data source selection for approximate query. Journal of Combinatorial Optimization, 2022, 44(4): 2443-2459. DOI: 10.1007/s10878-021-00760-y.

[21]

Suh N P. Complexity: Theory and Applications (1st edition). Oxford University Press, 2005.

[22]

Bentley J L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18(9): 509-517. DOI: 10.1145/361002.361007.

[23]

Yao A C C. On constructing minimum spanning trees in k-dimensional spaces and related problems. SIAM Journal on Computing, 1982, 11(4): 721-736. DOI: 10.1137/0211059.

Journal of Computer Science and Technology
Pages 1398-1411
Cite this article:
Guo H-J, Li J-Z, Gao H. Minimum Epsilon-Kernel Computation for Large-Scale Data Processing. Journal of Computer Science and Technology, 2022, 37(6): 1398-1411. https://doi.org/10.1007/s11390-022-2429-6

482

Views

1

Crossref

0

Web of Science

1

Scopus

0

CSCD

Altmetrics

Received: 15 April 2022
Accepted: 14 November 2022
Published: 30 November 2022
©Institute of Computing Technology, Chinese Academy of Sciences 2022
Return