Regular Paper

Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation

School of Computer Science, Fudan University, Shanghai 200433, China
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China
Huawei Technologies Co., Ltd., Hangzhou 310052, China

Abstract

Fine-tuning pre-trained language models such as BERT has become an effective approach in natural language processing (NLP), yielding state-of-the-art results on many downstream tasks. Recent studies on adapting BERT to new tasks mainly focus on modifying the model structure, re-designing the pre-training tasks, and leveraging external data and knowledge; the fine-tuning strategy itself has yet to be fully explored. In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation. The self-ensemble mechanism builds the teacher model by integrating checkpoints from an experience pool. To transfer knowledge from the teacher model to the student model efficiently, we further apply knowledge distillation, which we call self-distillation because the distilled knowledge comes from the model itself across the time dimension. Experiments on the GLUE benchmark and a text classification benchmark show that our proposed approach significantly improves the adaptation of BERT without any external data or knowledge. We conduct exhaustive experiments to investigate the effect of the self-ensemble and self-distillation mechanisms, and our proposed approach achieves a new state-of-the-art result on the SNLI dataset.
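The two mechanisms above can be illustrated with a minimal toy sketch: a teacher is formed by averaging the parameters of recent student checkpoints kept in a pool, and a self-distillation term pulls the student's outputs toward the teacher's. This is only an assumed, simplified illustration on a linear toy model; the pool size `K`, the loss weight `LAMBDA`, the learning rate, and the use of a plain MSE term are stand-ins, not the paper's exact training recipe.

```python
import numpy as np
from collections import deque

K = 3          # checkpoints kept in the experience pool (assumed value)
LAMBDA = 1.0   # weight of the self-distillation term (assumed value)

def forward(weights, x):
    """Toy stand-in for BERT: logits = W @ x."""
    return weights @ x

rng = np.random.default_rng(0)
student = rng.normal(size=(2, 4))
pool = deque(maxlen=K)   # experience pool of recent student checkpoints

for step in range(5):
    pool.append(student.copy())
    # Self-ensemble: the teacher is the parameter average of the pool.
    teacher = np.mean(np.stack(pool), axis=0)

    x = rng.normal(size=4)
    s_logits = forward(student, x)
    t_logits = forward(teacher, x)

    # Self-distillation: penalize distance between student and teacher
    # outputs (the task loss on labels is omitted in this toy example).
    distill_loss = LAMBDA * float(np.mean((s_logits - t_logits) ** 2))

    # Gradient of the MSE term w.r.t. the student weights, then a step.
    grad = np.outer(2.0 * (s_logits - t_logits) / s_logits.size, x)
    student -= 0.1 * grad
```

Because the teacher lags behind the student as a running average of its own past, the distillation signal comes "from the model itself through the time dimension", requiring no external teacher network.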

Electronic Supplementary Material

JCST-2010-11119-Highlights.pdf (397.8 KB)

Journal of Computer Science and Technology
Pages 853-866
Cite this article:
Xu Y-G, Qiu X-P, Zhou L-G, et al. Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation. Journal of Computer Science and Technology, 2023, 38(4): 853-866. https://doi.org/10.1007/s11390-021-1119-0