Regular Paper

Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation

School of Computer Science, Fudan University, Shanghai 200433, China
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China
Huawei Technologies Co., Ltd., Hangzhou 310052, China

Abstract

Fine-tuning pre-trained language models such as BERT has become an effective approach in natural language processing (NLP), yielding state-of-the-art results on many downstream tasks. Recent studies on adapting BERT to new tasks mainly focus on modifying the model structure, re-designing the pre-training tasks, and leveraging external data and knowledge; the fine-tuning strategy itself has yet to be fully explored. In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation. The self-ensemble mechanism builds the teacher model by integrating checkpoints from an experience pool. To transfer knowledge from the teacher model to the student model efficiently, we further apply knowledge distillation, which we call self-distillation because the distilled knowledge comes from the model itself across the time dimension. Experiments on the GLUE benchmark and a text classification benchmark show that our proposed approach significantly improves the adaptation of BERT without any external data or knowledge. We conduct exhaustive experiments to investigate the effect of the self-ensemble and self-distillation mechanisms, and our proposed approach achieves a new state-of-the-art result on the SNLI dataset.
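The two mechanisms above can be illustrated with a minimal toy sketch: a teacher is formed by averaging the parameters of recent student checkpoints kept in a pool, and a self-distillation term pulls the student's outputs toward the teacher's. This is only an assumed, simplified illustration on a linear toy model; the pool size `K`, the loss weight `LAMBDA`, the learning rate, and the use of a plain MSE term are stand-ins, not the paper's exact training recipe.

```python
import numpy as np
from collections import deque

K = 3          # checkpoints kept in the experience pool (assumed value)
LAMBDA = 1.0   # weight of the self-distillation term (assumed value)

def forward(weights, x):
    """Toy stand-in for BERT: logits = W @ x."""
    return weights @ x

rng = np.random.default_rng(0)
student = rng.normal(size=(2, 4))
pool = deque(maxlen=K)   # experience pool of recent student checkpoints

for step in range(5):
    pool.append(student.copy())
    # Self-ensemble: the teacher is the parameter average of the pool.
    teacher = np.mean(np.stack(pool), axis=0)

    x = rng.normal(size=4)
    s_logits = forward(student, x)
    t_logits = forward(teacher, x)

    # Self-distillation: penalize distance between student and teacher
    # outputs (the task loss on labels is omitted in this toy example).
    distill_loss = LAMBDA * float(np.mean((s_logits - t_logits) ** 2))

    # Gradient of the MSE term w.r.t. the student weights, then a step.
    grad = np.outer(2.0 * (s_logits - t_logits) / s_logits.size, x)
    student -= 0.1 * grad
```

Because the teacher lags behind the student as a running average of its own past, the distillation signal comes "from the model itself through the time dimension", requiring no external teacher network.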

Electronic Supplementary Material

JCST-2010-11119-Highlights.pdf (397.8 KB)

Journal of Computer Science and Technology
Pages 853-866
Cite this article:
Xu Y-G, Qiu X-P, Zhou L-G, et al. Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation. Journal of Computer Science and Technology, 2023, 38(4): 853-866. https://doi.org/10.1007/s11390-021-1119-0