Open Access

A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition

Xueyang Wu1, Rongzhong Lian2, Di Jiang2, Yuanfeng Song2, Weiwei Zhao2, Qian Xu1,2, Qiang Yang1,2
1. Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China.
2. WeBank Co. Ltd., Shenzhen 518057, China.

Abstract

Robustness is a long-standing challenge for automatic speech recognition (ASR), because any deployed ASR system encounters speech that is far noisier than the clean corpora it is trained on, and it is impractical to annotate every type of noisy environment. In this work, we propose a novel phonetic-semantic pre-training (PSP) framework that improves the robustness of ASR in practical noisy environments by seamlessly integrating pre-training, self-supervised learning, and fine-tuning. PSP consists of three fundamental stages. First, a phone-to-word transducer (PWT) is pre-trained to map generated phone sequences to the target text using only unpaired text data. Second, the PWT continues training on more challenging data produced by an empirical phone-perturbation heuristic, with an additional self-supervised signal obtained by recovering the tainted phones. Third, the resulting PWT is fine-tuned on real-world speech data. Experiments on two real-life datasets collected from industrial scenarios show that PSP improves the traditional ASR pipeline with relative character error rate (CER) reductions of 28.63% and 26.38%, respectively, and experiments on synthetic datasets demonstrate its robustness against highly noisy speech.
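The second stage of PSP relies on a phone-perturbation heuristic that corrupts clean phone sequences so the PWT can learn to recover from acoustically plausible errors before it ever sees real speech. The abstract does not spell out that heuristic, so the Python sketch below only illustrates the idea under stated assumptions: the confusion table, the perturbation probabilities, and the names CONFUSABLE and perturb_phones are hypothetical, not the authors' implementation.

import random

# Hypothetical table of acoustically confusable phones (illustrative only).
CONFUSABLE = {
    "b": ["p", "d"],
    "p": ["b", "t"],
    "s": ["sh", "z"],
    "zh": ["z", "j"],
    "n": ["m", "ng"],
}

def perturb_phones(phones, p_sub=0.10, p_del=0.05, p_ins=0.05):
    """Corrupt a clean phone sequence with substitutions, deletions, and
    insertions. Returns the noisy sequence plus a mask marking tainted
    positions, which can serve as the self-supervised recovery target."""
    noisy, tainted = [], []
    for ph in phones:
        r = random.random()
        if r < p_del:                      # deletion: drop the phone entirely
            continue
        if r < p_del + p_sub and ph in CONFUSABLE:
            noisy.append(random.choice(CONFUSABLE[ph]))   # substitution
            tainted.append(True)
        else:
            noisy.append(ph)
            tainted.append(False)
        if random.random() < p_ins:        # insertion: add a spurious phone
            noisy.append(random.choice(CONFUSABLE.get(ph, [ph])))
            tainted.append(True)
    return noisy, tainted

if __name__ == "__main__":
    clean = ["n", "i", "h", "ao", "sh", "i", "j", "ie"]   # toy phone sequence
    noisy, mask = perturb_phones(clean)
    print("clean :", clean)
    print("noisy :", noisy)
    print("mask  :", mask)

In a pipeline like the one described above, the (noisy phones, original text) pairs would extend the unpaired-text pre-training of the PWT, while predicting the tainted positions or the original phones at those positions would provide the self-supervised signal, before the model is finally fine-tuned on real speech.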

CAAI Artificial Intelligence Research
Pages 1-7
Cite this article:
Wu X, Lian R, Jiang D, et al. A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition. CAAI Artificial Intelligence Research, 2022, 1(1): 1-7. https://doi.org/10.26599/AIR.2022.9150001

Received: 05 November 2021
Revised: 28 March 2022
Accepted: 02 April 2022
Published: 28 August 2022
© The author(s) 2022

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
