Article | Open Access

Pretraining Enhanced RNN Transducer

Junyu Lu1, Rongzhong Lian1, Di Jiang1 (corresponding author), Yuanfeng Song1, Zhiyang Su2, Victor Junqiu Wei2, Lin Yang2
1 WeBank Co., Ltd., Shenzhen 518000, China
2 Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China

Abstract

Recurrent neural network transducer (RNN-T) is an important branch of current end-to-end automatic speech recognition (ASR). Various promising approaches have been designed to improve the RNN-T architecture; however, few studies have explored the effectiveness of pretraining methods within this framework. In this paper, we introduce the pretrained acoustic extractor (PAE) and the pretrained linguistic network (PLN) to enhance the Conformer long short-term memory (Conformer-LSTM) transducer. First, we construct the input of the acoustic encoder from two different latent representations: one extracted by the PAE from the raw waveform, and the other obtained from a filter-bank transformation. Second, we fuse an extra semantic feature from the PLN into the joint network to reduce illogical and homophonic errors. Compared with previous work, our approach obtains pretrained representations that improve model generalization. Evaluation on two large-scale datasets demonstrates that the proposed approaches outperform existing ones.
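
The abstract describes two fusion points: the latent representation produced by the pretrained acoustic extractor is combined with filter-bank features at the acoustic encoder input, and a semantic feature from the pretrained linguistic network is injected into the RNN-T joint network. The PyTorch sketch below illustrates one plausible way to wire these fusions; it is not the authors' implementation, and the module names, dimensions, and concatenation-based fusion are assumptions made only for illustration.

# A minimal sketch (PyTorch), not the authors' implementation. Module names,
# dimensions, and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class FusedAcousticFrontend(nn.Module):
    # Combines the latent representation from a pretrained acoustic extractor
    # (PAE) with filter-bank features to form the acoustic encoder input.
    def __init__(self, pae: nn.Module, fbank_dim: int = 80, pae_dim: int = 512, out_dim: int = 256):
        super().__init__()
        self.pae = pae  # e.g., a frozen wav2vec-style waveform encoder (assumed)
        self.proj = nn.Linear(fbank_dim + pae_dim, out_dim)

    def forward(self, waveform: torch.Tensor, fbank: torch.Tensor) -> torch.Tensor:
        # Assumes the PAE is configured so its frame rate matches the
        # filter-bank frames T.
        latent = self.pae(waveform)                 # (B, T, pae_dim)
        fused = torch.cat([latent, fbank], dim=-1)  # (B, T, pae_dim + fbank_dim)
        return self.proj(fused)                     # fed to the Conformer encoder

class SemanticJointNetwork(nn.Module):
    # RNN-T joint network that also receives a semantic feature from the
    # pretrained linguistic network (PLN), alongside the usual acoustic and
    # prediction-network inputs.
    def __init__(self, enc_dim: int, pred_dim: int, sem_dim: int, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(enc_dim + pred_dim + sem_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, enc: torch.Tensor, pred: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # enc:  (B, T, 1, enc_dim)   acoustic encoder output
        # pred: (B, 1, U, pred_dim)  prediction (LSTM) network output
        # sem:  (B, 1, U, sem_dim)   semantic feature from the PLN
        T, U = enc.size(1), pred.size(2)
        enc = enc.expand(-1, -1, U, -1)
        pred = pred.expand(-1, T, -1, -1)
        sem = sem.expand(-1, T, -1, -1)
        return self.joint(torch.cat([enc, pred, sem], dim=-1))  # (B, T, U, vocab_size)

Concatenation followed by a linear projection is only one simple fusion choice; additive or gated fusion would also be consistent with the description in the abstract.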

CAAI Artificial Intelligence Research
Article number: 9150039
Cite this article:
Lu J, Lian R, Jiang D, et al. Pretraining Enhanced RNN Transducer. CAAI Artificial Intelligence Research, 2024, 3: 9150039. https://doi.org/10.26599/AIR.2024.9150039

Received: 22 February 2024
Revised: 28 June 2024
Accepted: 23 July 2024
Published: 11 September 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
