Regular Paper

Random Subspace Sampling for Classification with Missing Data

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China

Abstract

Many real-world datasets unavoidably contain missing values, so classification with missing data must be handled carefully: inadequate treatment of missing values will cause large errors. In this paper, we propose a random subspace sampling method, RSS, which fills in missing items by sampling from the corresponding feature histogram distributions in random subspaces and is effective and efficient at different levels of missing data. Unlike most established approaches, RSS does not train on fixed imputed datasets. Instead, we design a dynamic training strategy in which the filled values change through resampling during training. Moreover, thanks to the sampling strategy, we design an ensemble testing strategy that combines the results of multiple runs of a single model, which is more efficient and resource-saving than previous ensemble methods. Finally, we combine these two strategies with the random subspace method, which makes our estimations more robust and accurate. Experimental studies validate the effectiveness of the proposed RSS method.
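The sampling idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`fit_histograms`, `sample_fill`, `ensemble_predict`), the bin count, and the uniform draw within a bin are all assumptions made for illustration, and the random subspace component of RSS is left out entirely. The sketch only shows (a) filling missing entries by sampling from per-feature histograms, which can be repeated at every training epoch, and (b) averaging several resampled predictions of a single model at test time.

```python
import numpy as np

def fit_histograms(X, bins=10):
    """Build one histogram per feature from its observed (non-NaN) entries.

    Assumes every feature has at least one observed value.
    """
    hists = []
    for j in range(X.shape[1]):
        col = X[:, j]
        observed = col[~np.isnan(col)]
        counts, edges = np.histogram(observed, bins=bins)
        hists.append((counts / counts.sum(), edges))
    return hists

def sample_fill(X, hists, rng):
    """Return a copy of X where each NaN is replaced by a histogram draw."""
    X_filled = X.copy()
    for j, (probs, edges) in enumerate(hists):
        missing = np.isnan(X[:, j])
        n_missing = int(missing.sum())
        if n_missing == 0:
            continue
        # Pick a bin according to its empirical probability,
        # then draw uniformly inside that bin (an illustrative choice).
        b = rng.choice(len(probs), size=n_missing, p=probs)
        X_filled[missing, j] = rng.uniform(edges[b], edges[b + 1])
    return X_filled

def ensemble_predict(predict_proba, X, hists, rng, n_runs=10):
    """Average one model's class probabilities over several resampled fills."""
    runs = [predict_proba(sample_fill(X, hists, rng)) for _ in range(n_runs)]
    return np.mean(runs, axis=0)
```

Under these assumptions, a dynamic training loop would call `sample_fill(X_train, hists, rng)` afresh at each epoch instead of imputing once, and `predict_proba` stands for any hypothetical callable that maps a filled matrix to class probabilities.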

Electronic Supplementary Material

JCST-2105-11611-Highlights.pdf (340.6 KB)

Journal of Computer Science and Technology
Pages 472-486
Cite this article:
Cao Y-H, Wu J-X. Random Subspace Sampling for Classification with Missing Data. Journal of Computer Science and Technology, 2024, 39(2): 472-486. https://doi.org/10.1007/s11390-023-1611-9

Received: 26 May 2021
Accepted: 04 February 2023
Published: 30 March 2024
© Institute of Computing Technology, Chinese Academy of Sciences 2024