Regular Paper

Efficient Model Store and Reuse in an OLML Database System

Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China Beijing 100872, China
School of Information, Renmin University of China, Beijing 100872, China

Abstract

Deep learning has achieved significant improvements on various machine learning tasks by introducing a wide spectrum of neural network models. Yet these neural network models require a tremendous amount of labeled training data, which is prohibitively expensive to obtain in practice. In this paper, we propose the OnLine Machine Learning (OLML) database, which stores trained models and reuses them in new training tasks to achieve a better training effect with a small amount of training data. An efficient model reuse algorithm, AdaReuse, is developed in the OLML database. Specifically, AdaReuse first estimates the reuse potential of trained models from domain relatedness and model quality, so that a group of trained models with high reuse potential for the training task can be selected efficiently. The selected models are then trained iteratively to encourage diversity, and a better training effect is achieved by ensembling them. We evaluate AdaReuse on two types of natural language processing (NLP) tasks, and the results show that AdaReuse improves the training effect significantly compared with training models from scratch when training data is limited. Based on AdaReuse, we implement an OLML database prototype system that accepts a training task as an SQL-like query and automatically generates a training plan by selecting and reusing trained models. Usability studies illustrate that the OLML database can properly store trained models and reuse them efficiently in new training tasks.
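
The abstract describes the AdaReuse workflow only at a high level. Below is a minimal sketch, in Python, of how such a selection-and-ensemble loop could look. Every name in it (TrainedModel, reuse_potential, select_models, ensemble_predict), the scoring formula, and the SQL-like query syntax are illustrative assumptions, not the OLML system's actual API; the real AdaReuse estimator of domain relatedness and model quality may differ.

# Hypothetical sketch of an AdaReuse-style selection-and-ensemble workflow.
# Names, scoring, and the SQL-like syntax are assumptions for illustration only.

from dataclasses import dataclass
from typing import Callable, List, Sequence

# A training task might be submitted to the OLML database with an SQL-like
# query of roughly this shape (hypothetical syntax):
EXAMPLE_QUERY = """
TRAIN CLASSIFIER sentiment_model
ON reviews(text, label)
REUSE MODELS FROM model_store
"""

@dataclass
class TrainedModel:
    name: str
    domain_relatedness: float      # estimated similarity to the new task's domain, in [0, 1]
    quality: float                 # e.g. validation accuracy on the source task, in [0, 1]
    predict: Callable[[str], int]  # prediction function after fine-tuning

def reuse_potential(model: TrainedModel) -> float:
    """Combine domain relatedness and model quality into a single reuse score."""
    return model.domain_relatedness * model.quality

def select_models(store: Sequence[TrainedModel], k: int) -> List[TrainedModel]:
    """Pick the k stored models with the highest estimated reuse potential."""
    return sorted(store, key=reuse_potential, reverse=True)[:k]

def ensemble_predict(models: Sequence[TrainedModel], text: str) -> int:
    """Majority vote over the selected (and ideally diverse) models."""
    votes = [m.predict(text) for m in models]
    return max(set(votes), key=votes.count)

if __name__ == "__main__":
    store = [
        TrainedModel("news_sentiment", 0.8, 0.90, lambda t: 1),
        TrainedModel("product_reviews", 0.9, 0.70, lambda t: 0),
        TrainedModel("weather_ner", 0.1, 0.95, lambda t: 1),
    ]
    selected = select_models(store, k=2)
    # In the full workflow each selected model would be fine-tuned iteratively
    # on the small labelled target set to encourage diversity before
    # ensembling; that step is elided here.
    print([m.name for m in selected])
    print(ensemble_predict(selected, "great battery life"))

The sketch scores each stored model by the product of its domain relatedness and quality, keeps the top-k models, and combines their predictions by majority vote; the iterative fine-tuning that encourages diversity is only indicated in a comment.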

Electronic Supplementary Material

Download File(s)
jcst-36-4-792-Highlights.pdf (354.3 KB)

Journal of Computer Science and Technology
Pages 792-805
Cite this article:
Cui J-W, Lu W, Zhao X, et al. Efficient Model Store and Reuse in an OLML Database System. Journal of Computer Science and Technology, 2021, 36(4): 792-805. https://doi.org/10.1007/s11390-021-1353-5

Received: 04 February 2021
Accepted: 27 June 2021
Published: 05 July 2021
©Institute of Computing Technology, Chinese Academy of Sciences 2021