With the advent of the big data era, a growing amount of duplicate data is expressed in different forms. Data deduplication technology, which reduces redundant storage and improves data quality, has never been more significant. Deduplication usually requires connecting multiple data tables and identifying different records that point to the same entity, especially when the data come from multiple sources. Active learning trains a model by selecting the data items with the maximum information divergence, which reduces the amount of data to be annotated and gives it unique advantages for big data annotation. However, most current active learning methods address only classical entity matching and are rarely applied to data deduplication tasks. To fill this research gap, we propose a novel graph deep active learning framework for data deduplication. The framework combines similarity algorithms with the bidirectional encoder representations from transformers (BERT) model to extract deep similarity features from multi-source data records. It is also the first to introduce a graph active learning strategy that builds a clean graph to filter the data that need to be labeled, which is then used to remove duplicate data while retaining the most information. Experimental results on real-world datasets demonstrate that the proposed method outperforms state-of-the-art active learning models on data deduplication tasks.
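As a rough illustration of the general idea of selecting informative record pairs for annotation, the sketch below scores candidate pairs by string similarity and picks the ones a matcher is least certain about. It is not the framework proposed in the paper: it uses no BERT features and no graph, and it swaps in plain entropy-based uncertainty sampling. The function names, the mean-similarity "matcher", and the sample records are all assumptions made for illustration.

```python
import math
from difflib import SequenceMatcher


def pair_features(rec_a, rec_b):
    """Per-field string similarity in [0, 1] for two records (dicts sharing the same keys)."""
    return [
        SequenceMatcher(None, str(rec_a[k]).lower(), str(rec_b[k]).lower()).ratio()
        for k in sorted(rec_a)
    ]


def match_probability(features):
    """Hypothetical stand-in for a trained matcher: the mean field similarity."""
    return sum(features) / len(features)


def binary_entropy(p):
    """Entropy (in bits) of a match/non-match prediction with match probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_labeling(pairs, budget):
    """Return indices of the `budget` pairs whose predicted label is most uncertain."""
    scored = [
        (binary_entropy(match_probability(pair_features(a, b))), i)
        for i, (a, b) in enumerate(pairs)
    ]
    # Highest entropy first; ties broken by original position.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:budget]]


# Made-up candidate pairs: an exact duplicate, a borderline case, a clear non-match.
pairs = [
    ({"name": "abc"}, {"name": "abc"}),
    ({"name": "abcd"}, {"name": "abxy"}),
    ({"name": "abc"}, {"name": "xyz"}),
]
to_label = select_for_labeling(pairs, budget=1)
```

With these sample pairs, the borderline pair is the one sent for annotation, since the exact duplicate and the clear non-match carry no uncertainty under this toy matcher. A real system would replace `match_probability` with a learned model and retrain it after each round of labels.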
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).