Article | Open Access

Pretraining Enhanced RNN Transducer

Junyu Lu1, Rongzhong Lian1, Di Jiang1 (corresponding author), Yuanfeng Song1, Zhiyang Su2, Victor Junqiu Wei2, Lin Yang2
1 WeBank Co., Ltd., Shenzhen 518000, China
2 Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China

Abstract

Recurrent neural network transducer (RNN-T) is an important branch of current end-to-end automatic speech recognition (ASR). Various promising approaches have been designed to improve the RNN-T architecture; however, few studies have explored the effectiveness of pretraining methods within this framework. In this paper, we introduce the pretrained acoustic extractor (PAE) and the pretrained linguistic network (PLN) to enhance the Conformer long short-term memory (Conformer-LSTM) transducer. First, we construct the input of the acoustic encoder from two different latent representations: one extracted by the PAE from the raw waveform, and the other obtained from a filter-bank transformation. Second, we fuse an extra semantic feature from the PLN into the joint network to reduce illogical and homophonic errors. Compared with previous work, our approach obtains pretrained representations that improve model generalization. Evaluation on two large-scale datasets demonstrates that the proposed approaches outperform existing ones.
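
The abstract describes two fusion points: the latent representation produced by the pretrained acoustic extractor is combined with filter-bank features at the acoustic encoder input, and a semantic feature from the pretrained linguistic network is injected into the RNN-T joint network. The PyTorch sketch below illustrates one plausible way to wire these fusions; it is not the authors' implementation, and the module names, dimensions, and concatenation-based fusion are assumptions made only for illustration.

# A minimal sketch (PyTorch), not the authors' implementation. Module names,
# dimensions, and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class FusedAcousticFrontend(nn.Module):
    # Combines the latent representation from a pretrained acoustic extractor
    # (PAE) with filter-bank features to form the acoustic encoder input.
    def __init__(self, pae: nn.Module, fbank_dim: int = 80, pae_dim: int = 512, out_dim: int = 256):
        super().__init__()
        self.pae = pae  # e.g., a frozen wav2vec-style waveform encoder (assumed)
        self.proj = nn.Linear(fbank_dim + pae_dim, out_dim)

    def forward(self, waveform: torch.Tensor, fbank: torch.Tensor) -> torch.Tensor:
        # Assumes the PAE is configured so its frame rate matches the
        # filter-bank frames T.
        latent = self.pae(waveform)                 # (B, T, pae_dim)
        fused = torch.cat([latent, fbank], dim=-1)  # (B, T, pae_dim + fbank_dim)
        return self.proj(fused)                     # fed to the Conformer encoder

class SemanticJointNetwork(nn.Module):
    # RNN-T joint network that also receives a semantic feature from the
    # pretrained linguistic network (PLN), alongside the usual acoustic and
    # prediction-network inputs.
    def __init__(self, enc_dim: int, pred_dim: int, sem_dim: int, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(enc_dim + pred_dim + sem_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, enc: torch.Tensor, pred: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # enc:  (B, T, 1, enc_dim)   acoustic encoder output
        # pred: (B, 1, U, pred_dim)  prediction (LSTM) network output
        # sem:  (B, 1, U, sem_dim)   semantic feature from the PLN
        T, U = enc.size(1), pred.size(2)
        enc = enc.expand(-1, -1, U, -1)
        pred = pred.expand(-1, T, -1, -1)
        sem = sem.expand(-1, T, -1, -1)
        return self.joint(torch.cat([enc, pred, sem], dim=-1))  # (B, T, U, vocab_size)

Concatenation followed by a linear projection is only one simple fusion choice; additive or gated fusion would also be consistent with the description in the abstract.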

CAAI Artificial Intelligence Research
Article number: 9150039
Cite this article:
Lu J, Lian R, Jiang D, et al. Pretraining Enhanced RNN Transducer. CAAI Artificial Intelligence Research, 2024, 3: 9150039. https://doi.org/10.26599/AIR.2024.9150039

Received: 22 February 2024
Revised: 28 June 2024
Accepted: 23 July 2024
Published: 11 September 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
