Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation

Zhi-Xin Qi; Hong-Zhi Wang; An-Jie Wang

doi:10.1007/s11390-021-1344-6

Regular Paper

Zhi-Xin Qi, Hong-Zhi Wang, An-Jie Wang

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China

Abstract

Data quality issues have attracted widespread attention because dirty data degrades the results of data mining and machine learning. Understanding the relationship between data quality and result accuracy can guide two decisions: selecting an appropriate model for data of a given quality, and determining what share of the data to clean. However, little research has explored this relationship. Motivated by this gap, this paper experimentally compares the effects of missing, inconsistent, and conflicting data on classification and clustering models. The experimental results show that the impact of dirty data depends on the error type, the error rate, and the data size. Based on these findings, we suggest that users apply our two proposed metrics, sensibility and the data quality inflection point, to model selection and data cleaning.
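The kind of experiment the abstract describes, varying an error rate and observing model accuracy, can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' protocol: the synthetic two-class data, mean imputation, the nearest-centroid classifier, and every function name (`make_data`, `inject_missing`, `impute_mean`, `fit_centroids`, `accuracy_at`) are hypothetical choices made for this sketch. Only missing values are covered, and the paper's sensibility and data quality inflection point metrics are not computed, since the abstract does not define them.

```python
import random

def make_data(n, seed=0):
    """Two Gaussian classes in 2-D; a synthetic stand-in, not the paper's data."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 3.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def inject_missing(rows, rate, seed=1):
    """Replace each attribute value with None with probability `rate`."""
    rng = random.Random(seed)
    return [([None if rng.random() < rate else v for v in x], y) for x, y in rows]

def impute_mean(rows):
    """Fill None with the column mean of the observed values (assumed repair)."""
    dims = len(rows[0][0])
    means = []
    for d in range(dims):
        seen = [x[d] for x, _ in rows if x[d] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [([means[d] if x[d] is None else x[d] for d in range(dims)], y)
            for x, y in rows]

def fit_centroids(rows):
    """Nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, y in rows:
        s = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            s[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def accuracy_at(rate, train, test):
    """Train on data dirtied at `rate`, then score on the clean test set."""
    dirty = impute_mean(inject_missing(train, rate))
    model = fit_centroids(dirty)
    return sum(predict(model, x) == y for x, y in test) / len(test)

train = make_data(400, seed=0)
test = make_data(200, seed=42)
results = {rate: accuracy_at(rate, train, test) for rate in (0.0, 0.1, 0.3, 0.5)}
for rate, acc in results.items():
    print(f"missing rate {rate:.1f}: accuracy {acc:.3f}")
```

Repeating the sweep with other models, error types, and training-set sizes would give the kind of comparison the abstract refers to; the rates and sizes used here are arbitrary.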

Keywords

classification; clustering; data quality; data cleaning; model selection

Electronic Supplementary Material

jcst-36-4-806-Highlights.pdf (161.4 KB)

Journal of Computer Science and Technology

Volume 36, Issue 4, July 2021

Pages 806-821

DOI: 10.1007/s11390-021-1344-6

Cite this article:

Qi Z-X, Wang H-Z, Wang A-J. Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation. Journal of Computer Science and Technology, 2021, 36(4): 806-821. https://doi.org/10.1007/s11390-021-1344-6

Received: 31 January 2021

Accepted: 27 June 2021

Published: 05 July 2021