Regular Paper

Efficient Model Store and Reuse in an OLML Database System

Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872, China
School of Information, Renmin University of China, Beijing 100872, China

Abstract

Deep learning has delivered significant improvements on a wide range of machine learning tasks by introducing a broad spectrum of neural network models. Yet these models require a tremendous amount of labeled training data, which is prohibitively expensive to obtain in practice. In this paper, we propose the OnLine Machine Learning (OLML) database, which stores trained models and reuses them in new training tasks to achieve a better training effect with only a small amount of training data. An efficient model reuse algorithm, AdaReuse, is developed in the OLML database. Specifically, AdaReuse first estimates the reuse potential of trained models from domain relatedness and model quality, through which a group of trained models with high reuse potential for the training task can be selected efficiently. Then, multiple selected models are trained iteratively to encourage diversity, so that a better training effect can be achieved by ensembling them. We evaluate AdaReuse on two types of natural language processing (NLP) tasks, and the results show that AdaReuse significantly improves the training effect compared with models trained from scratch when the training data is limited. Based on AdaReuse, we implement an OLML database prototype system that accepts a training task as an SQL-like query and automatically generates a training plan by selecting and reusing trained models. Usability studies illustrate that the OLML database can properly store trained models and reuse them efficiently in new training tasks.
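The workflow described in the abstract can be pictured with a short sketch: rank stored models by an estimated reuse potential that combines domain relatedness and model quality, fine-tune the top candidates on the small labeled set, and combine their predictions by ensemble. The sketch below is only an illustration under these assumptions; the names (ReuseCandidate, reuse_potential, adareuse, fine_tune, predict) are hypothetical placeholders rather than the authors' actual API, and the simple weighted score and majority vote stand in for the paper's own estimators.

```python
# Minimal sketch of the AdaReuse idea described in the abstract (hypothetical
# API): score stored models, fine-tune the top-k on the small labeled set,
# and return a majority-vote ensemble predictor.

from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ReuseCandidate:
    name: str
    domain_relatedness: float  # similarity between source and target domains, in [0, 1]
    quality: float             # e.g. validation accuracy of the stored model, in [0, 1]
    fine_tune: Callable[[Sequence], "ReuseCandidate"]  # returns a fine-tuned copy
    predict: Callable[[object], int]                   # label prediction for one example


def reuse_potential(c: ReuseCandidate, alpha: float = 0.5) -> float:
    """Score a stored model by mixing domain relatedness and model quality."""
    return alpha * c.domain_relatedness + (1.0 - alpha) * c.quality


def adareuse(candidates: List[ReuseCandidate],
             train_data: Sequence,
             k: int = 3) -> Callable[[object], int]:
    """Select the k most promising stored models, fine-tune each on the
    small training set, and return a majority-vote ensemble predictor."""
    ranked = sorted(candidates, key=reuse_potential, reverse=True)[:k]
    tuned = [c.fine_tune(train_data) for c in ranked]

    def ensemble_predict(x: object) -> int:
        votes = Counter(m.predict(x) for m in tuned)
        return votes.most_common(1)[0][0]

    return ensemble_predict
```

In the prototype system described in the abstract, such a plan would be triggered by an SQL-like training query; the sketch above covers only the selection and ensembling steps, not the query interface itself.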

Electronic Supplementary Material

jcst-36-4-792-Highlights.pdf (354.3 KB)

Journal of Computer Science and Technology
Pages 792-805
Cite this article:
Cui J-W, Lu W, Zhao X, et al. Efficient Model Store and Reuse in an OLML Database System. Journal of Computer Science and Technology, 2021, 36(4): 792-805. https://doi.org/10.1007/s11390-021-1353-5

Received: 04 February 2021
Accepted: 27 June 2021
Published: 05 July 2021
©Institute of Computing Technology, Chinese Academy of Sciences 2021