Regular Paper

Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China

Abstract

Data quality issues have attracted widespread attention because dirty data degrades the results of data mining and machine learning. Knowing how data quality relates to the accuracy of results would help in selecting an appropriate model for data of a given quality and in deciding what share of the data to clean. However, little research has explored this relationship. Motivated by this, this paper experimentally compares the effects of missing, inconsistent, and conflicting data on classification and clustering models. The experimental results show that the impact of dirty data depends on the error type, the error rate, and the data size. Based on these findings, we suggest that users apply our proposed metrics, sensibility and data quality inflection point, to model selection and data cleaning.
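
The kind of experiment the abstract describes can be illustrated with a minimal sketch. It is not the authors' protocol: the synthetic two-class data, the mean imputation, the nearest-centroid classifier, and every function name below are assumptions made for illustration. Only one error type (missing values) is covered, and the paper's sensibility and data quality inflection point metrics are not computed here.

```python
import random


def make_data(n, rng):
    """Synthetic 2-D data (assumed): class 0 centred at (0, 0), class 1 at (3, 3)."""
    data = []
    for i in range(n):
        label = i % 2
        centre = 3.0 * label
        data.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return data


def inject_missing(data, rate, rng):
    """Copy the data, replacing each attribute value by None with probability `rate`."""
    return [([None if rng.random() < rate else v for v in x], y) for x, y in data]


def impute_mean(data):
    """Fill None with the column mean of the observed values (0.0 if none observed)."""
    dims = len(data[0][0])
    means = []
    for d in range(dims):
        seen = [x[d] for x, _ in data if x[d] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [([means[d] if x[d] is None else x[d] for d in range(dims)], y)
            for x, y in data]


def fit_centroids(data):
    """Nearest-centroid 'training': the mean vector of each class."""
    sums, counts = {}, {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def accuracy_at(rate, seed=0):
    """Accuracy when both training and test data have missing values at `rate`."""
    rng = random.Random(seed)
    train = impute_mean(inject_missing(make_data(400, rng), rate, rng))
    test = impute_mean(inject_missing(make_data(200, rng), rate, rng))
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, x) == y for x, y in test)
    return hits / len(test)


if __name__ == "__main__":
    # Sweep the error rate to see how accuracy changes with it.
    for rate in (0.0, 0.2, 0.4, 0.6, 0.8):
        print(f"missing rate {rate:.1f}: accuracy {accuracy_at(rate):.3f}")
```

Repeating the sweep for other error types, other models, and other data sizes would give the accuracy-versus-error-rate curves that a comparison of this kind relies on.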

Electronic Supplementary Material

jcst-36-4-806-Highlights.pdf (161.4 KB)

Journal of Computer Science and Technology
Pages 806-821
Cite this article:
Qi Z-X, Wang H-Z, Wang A-J. Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation. Journal of Computer Science and Technology, 2021, 36(4): 806-821. https://doi.org/10.1007/s11390-021-1344-6


Received: 31 January 2021
Accepted: 27 June 2021
Published: 05 July 2021
©Institute of Computing Technology, Chinese Academy of Sciences 2021