[1]
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al., Deep Speech: Scaling up end-to-end speech recognition, arXiv preprint arXiv: 1412.5567, 2014.
[2]
D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al., Deep Speech 2: End-to-end speech recognition in English and Mandarin, in Proc. 33rd Int. Conf. Machine Learning, New York, NY, USA, 2016, pp. 173–182.
[3]
Y. Song, D. Jiang, X. Wu, Q. Xu, R. C. W. Wong, and Q. Yang, Topic-aware dialogue speech recognition with transfer learning, in Proc. 20th Annu. Conf. Int. Speech Communication Association (INTERSPEECH 2019), Graz, Austria, 2019, pp. 829–833.
[5]
Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, et al., Streaming end-to-end speech recognition for mobile devices, arXiv preprint arXiv: 1811.06621, 2018.
[6]
A. Graves, Sequence transduction with recurrent neural networks, arXiv preprint arXiv: 1211.3711, 2012.
[7]
K. Rao, H. Sak, and R. Prabhavalkar, Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer, in Proc. 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 2017, pp. 193–199.
[8]
S. Schneider, A. Baevski, R. Collobert, and M. Auli, wav2vec: Unsupervised pre-training for speech recognition, arXiv preprint arXiv: 1904.05862, 2019.
[9]
J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv: 1810.04805, 2018.
[10]
A. Baevski, H. Zhou, A. Mohamed, and M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, in Proc. 34th Conf. Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada, 2020, pp. 12449–12460.
[11]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Proc. 31st Conf. Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 2017, pp. 5998–6008.
[12]
Y. Zhang, J. Qin, D. S. Park, W. Han, C. C. Chiu, R. Pang, Q. V. Le, and Y. Wu, Pushing the limits of semi-supervised learning for automatic speech recognition, arXiv preprint arXiv: 2010.10504, 2020.
[13]
A. Gulati, J. Qin, C. C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., Conformer: Convolution-augmented Transformer for speech recognition, in Proc. 21st Annu. Conf. Int. Speech Communication Association (INTERSPEECH 2020), Shanghai, China, 2020, pp. 5036–5040.
[14]
A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks, in Proc. 23rd Int. Conf. Machine Learning, Pittsburgh, PA, USA, 2006, pp. 369–376.
[15]
H. Hu, R. Zhao, J. Li, L. Lu, and Y. Gong, Exploring pre-training with alignments for RNN transducer based end-to-end speech recognition, in Proc. 2020 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 7079–7083.
[20]
Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss, in Proc. 2020 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 7829–7833.
[21]
Y. Song, D. Jiang, X. Zhao, Q. Xu, R. C. W. Wong, L. Fan, and Q. Yang, L2RS: A learning-to-rescore mechanism for automatic speech recognition, arXiv preprint arXiv: 1910.11496, 2019.
[22]
B. Zhang, D. Wu, Z. Yao, X. Wang, F. Yu, C. Yang, L. Guo, Y. Hu, L. Xie, and X. Lei, Unified streaming and non-streaming two-pass end-to-end model for speech recognition, arXiv preprint arXiv: 2012.05481, 2020.
[23]
Z. Yao, D. Wu, X. Wang, B. Zhang, F. Yu, C. Yang, Z. Peng, X. Chen, L. Xie, and X. Lei, WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit, in Proc. 22nd Annu. Conf. Int. Speech Communication Association (INTERSPEECH 2021), Brno, Czech Republic, 2021, pp. 4054–4058.
[25]
Y. Song, X. Huang, X. Zhao, D. Jiang, and R. C. W. Wong, Multimodal N-best list rescoring with weakly supervised pre-training in hybrid speech recognition, in Proc. 2021 IEEE Int. Conf. Data Mining (ICDM), Auckland, New Zealand, 2021, pp. 1336–1341.
[26]
C. Tan, D. Jiang, J. Peng, X. Wu, Q. Xu, and Q. Yang, A de novo divide-and-merge paradigm for acoustic model optimization in automatic speech recognition, in Proc. 29th Int. Joint Conf. Artificial Intelligence, Yokohama, Japan, 2020, pp. 3709–3715.
[27]
C. Tan, D. Jiang, H. Mo, J. Peng, Y. Tong, W. Zhao, C. Chen, R. Lian, Y. Song, and Q. Xu, Federated acoustic model optimization for automatic speech recognition, in Proc. 25th Int. Conf. Database Systems for Advanced Applications, Jeju, Republic of Korea, 2020, pp. 771–774.
[28]
T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, Learning the speech front-end with raw waveform CLDNNs, in Proc. 16th Annu. Conf. Int. Speech Communication Association, Dresden, Germany, 2015, pp. 1–5.
[29]
S. W. Fu, Y. Tsao, X. Lu, and H. Kawai, Raw waveform-based speech enhancement by fully convolutional networks, in Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conf. (APSIPA ASC), Kuala Lumpur, Malaysia, 2017, pp. 6–12.
[30]
N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux, End-to-end speech recognition from the raw waveform, arXiv preprint arXiv: 1806.07098, 2018.
[31]
M. W. Y. Lam, J. Wang, C. Weng, D. Su, and D. Yu, Raw waveform encoder with multi-scale globally attentive locally recurrent networks for end-to-end speech recognition, arXiv preprint arXiv: 2106.04275, 2021.
[32]
M. Ghodsi, X. Liu, J. Apfel, R. Cabrera, and E. Weinstein, RNN-transducer with stateless prediction network, in Proc. 2020 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 7049–7053.
[33]
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, in Proc. 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 2015, pp. 5206–5210.
[34]
H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline, in Proc. 2017 20th Conf. Oriental Chapter Int. Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul, Republic of Korea, 2017, pp. 1–5.
[35]
K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 770–778.
[36]
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv: 1802.05365, 2018.
[37]
A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving language understanding by generative pre-training, https://openai.com/index/language-unsupervised, 2018.
[38]
Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, in Proc. 33rd Conf. Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, 2019, pp. 5753–5763.
[39]
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, et al., RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv: 1907.11692, 2019.
[40]
D. Jiang, C. Zhang, and Y. Song, Probabilistic Topic Models: Foundation and Application, Singapore: Springer Nature, 2023.
[43]
T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, Microsoft COCO: Common objects in context, in Proc. 13th European Conf. Computer Vision (ECCV), Zurich, Switzerland, 2014, pp. 740–755.
[45]
D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli, 3D human pose estimation in video with temporal convolutions and semi-supervised training, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 7753–7762.
[46]
M. Ravanelli and Y. Bengio, Learning speaker representations with mutual information, arXiv preprint arXiv: 1812.00271, 2018.
[48]
V. Mitra, V. Kowtha, H. Y. S. Chien, E. Azemi, and C. Avendano, Pre-trained model representations and their robustness against noise for speech emotion analysis, in Proc. 2023 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1–5.
[49]
B. Han, Z. Lv, A. Jiang, W. Huang, Z. Chen, Y. Deng, J. Ding, C. Lu, W. Q. Zhang, P. Fan, et al., Exploring large scale pre-trained models for robust machine anomalous sound detection, in Proc. 2024 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 2024, pp. 1326–1330.
[51]
A. van den Oord, Y. Li, and O. Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv: 1807.03748, 2018.
[52]
J. Kunze, L. Kirsch, I. Kurenkov, A. Krug, J. Johannsmeier, and S. Stober, Transfer learning for speech recognition on a budget, arXiv preprint arXiv: 1706.00290, 2017.
[54]
J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, End-to-end continuous speech recognition using attention-based recurrent NN: First results, arXiv preprint arXiv: 1412.1602, 2014.
[56]
D. Hau and K. Chen, Exploring hierarchical speech representations with a deep convolutional neural network, in Proc. UK Annu. Workshop on Computational Intelligence (UKCI'11), Manchester, UK, 2011.
[58]
D. Hendrycks and K. Gimpel, Gaussian error linear units (GELUs), arXiv preprint arXiv: 1606.08415, 2016.
[59]
C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, On using monolingual corpora in neural machine translation, arXiv preprint arXiv: 1503.03535, 2015.
[60]
D. S. Park, W. Chan, Y. Zhang, C. C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, SpecAugment: A simple data augmentation method for automatic speech recognition, arXiv preprint arXiv: 1904.08779, 2019.
[61]
I. Loshchilov and F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv: 1711.05101, 2017.
[62]
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, in Proc. 33rd Conf. Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, 2019, pp. 8026–8037.
[63]
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, in Proc. 2020 Conf. Empirical Methods in Natural Language Processing: System Demonstrations, virtual, 2020, pp. 38–45.