Open Access

Novel and Efficient Randomized Algorithms for Feature Selection

Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA.

Abstract

Feature selection is a crucial problem in efficient machine learning, and it also greatly contributes to the explainability of machine-driven decisions. Methods such as decision trees and the Least Absolute Shrinkage and Selection Operator (LASSO) can select features during training. However, these embedded approaches can only be applied to a small subset of machine learning models. Wrapper-based methods can select features independently of the machine learning model, but they often suffer from a high computational cost. To enhance their efficiency, many randomized algorithms have been designed. In this paper, we propose automatic breadth searching and attention searching adjustment approaches to further speed up randomized wrapper-based feature selection. We provide a theoretical analysis of the computational complexity of our algorithms and explain their generic parallelizability. We conduct experiments on both synthetic and real datasets with different machine learning base models. The results show that, compared with existing approaches, our proposed techniques can locate a more meaningful set of features with high efficiency.
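
As context for the wrapper-based methods the abstract discusses, the sketch below illustrates the generic idea of randomized wrapper-based feature selection: random feature subsets are scored with an arbitrary base model, and the best-scoring subset is kept. This is a minimal illustration of the class of algorithms being accelerated, not the paper's automatic breadth searching or attention searching adjustment techniques; the function `randomized_wrapper_select` and its parameters are assumptions introduced here for illustration, with scikit-learn supplying the base model and cross-validation.

```python
# Minimal sketch (illustrative, not the authors' algorithm) of generic
# randomized wrapper-based feature selection: score random feature
# subsets with a base model and keep the best one.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def randomized_wrapper_select(X, y, subset_size, n_trials=100,
                              base_model=None, seed=None):
    """Return the best-scoring random feature subset and its CV score."""
    rng = np.random.default_rng(seed)
    model = base_model if base_model is not None else LinearRegression()
    n_features = X.shape[1]
    best_score, best_subset = -np.inf, None
    for _ in range(n_trials):
        # Draw a candidate subset of features uniformly at random.
        subset = rng.choice(n_features, size=subset_size, replace=False)
        # Evaluate the base model using only the candidate features.
        score = cross_val_score(model, X[:, subset], y, cv=3).mean()
        if score > best_score:
            best_score, best_subset = score, subset
    return np.sort(best_subset), best_score

# Example: recover informative features from a synthetic regression dataset.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=0.1, random_state=0)
subset, score = randomized_wrapper_select(X, y, subset_size=5,
                                          n_trials=50, seed=0)
print("selected features:", subset, "CV R^2:", round(score, 3))
```

Because each trial is independent, the trials can be evaluated in parallel, which reflects the generic parallelizability of randomized wrapper-based methods that the abstract notes.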

Big Data Mining and Analytics
Pages 208-224
Cite this article:
Wang Z, Xiao X, Rajasekaran S. Novel and Efficient Randomized Algorithms for Feature Selection. Big Data Mining and Analytics, 2020, 3(3): 208-224. https://doi.org/10.26599/BDMA.2020.9020005


Received: 15 April 2020
Revised: 06 May 2020
Accepted: 07 May 2020
Published: 16 July 2020
© The author(s) 2020

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
