Open Access

Graph Deep Active Learning Framework for Data Deduplication

School of Computing and Artificial Intelligence, Southwest Jiaotong University, and also with the Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, Chengdu 611756, China
College of Information and Electrical Engineering, Asia University, Chongsheng 41359, China

Abstract

With the advent of the big data era, increasing amounts of duplicate data are expressed in different forms. To reduce redundant storage and improve data quality, data deduplication has never been more significant. It is usually necessary to join multiple data tables and identify different records that point to the same entity, especially in the case of multi-source data deduplication. Active learning trains the model by selecting the data items with the maximum information divergence, reducing the amount of data that must be annotated, which gives it unique advantages for big data annotation. However, most current active learning methods address only classical entity matching and are rarely applied to data deduplication tasks. To fill this research gap, we propose a novel graph deep active learning framework for data deduplication. It combines similarity algorithms with the bidirectional encoder representations from transformers (BERT) model to extract deep similarity features of multi-source data records, and it is the first to introduce a graph active learning strategy that builds a clean graph to filter the data that needs to be labeled, which is then used to delete duplicate data while retaining the most information. Experimental results on real-world datasets demonstrate that the proposed method outperforms state-of-the-art active learning models on data deduplication tasks.
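The pipeline the abstract outlines — score candidate record pairs, link confident matches into a graph, and route ambiguous pairs to an annotator — can be sketched as follows. This is a minimal illustration only: a token-set Jaccard similarity stands in for the paper's BERT-based deep similarity features, the thresholds and record strings are invented, and the "pairs to label" set is a crude proxy for the graph active learning selection strategy.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (a stand-in for deep similarity features)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def dedup_with_query_set(records, hi=0.8, lo=0.4):
    """Link pairs scoring above `hi` into duplicate clusters (union-find),
    and collect pairs in (lo, hi] as ambiguous candidates an active
    learner would send out for labeling."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    to_label = []
    for i, j in combinations(range(len(records)), 2):
        s = jaccard(records[i], records[j])
        if s > hi:
            union(i, j)            # confident duplicate: merge clusters
        elif s > lo:
            to_label.append((i, j, round(s, 2)))  # ambiguous: query the oracle

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values()), to_label

records = [
    "apple iphone 13 128gb blue",
    "iphone 13 128gb blue apple",   # same entity, different token order
    "apple iphone 13 pro 256gb",    # near-duplicate, should be queried
    "samsung galaxy s22",
]
clusters, to_label = dedup_with_query_set(records)
```

Here the first two records merge into one cluster, while the "pro" variant falls in the ambiguous band and is queued for human labeling rather than being auto-merged — mirroring, in miniature, how an active learning loop spends its annotation budget only where the model is uncertain.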

Big Data Mining and Analytics
Pages 753-764
Cite this article:
Cao H, Du S, Hu J, et al. Graph Deep Active Learning Framework for Data Deduplication. Big Data Mining and Analytics, 2024, 7(3): 753-764. https://doi.org/10.26599/BDMA.2023.9020040


Received: 04 September 2023
Revised: 17 November 2023
Accepted: 07 December 2023
Published: 28 August 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
