Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination in real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is often used explicitly at the front-end of the target audio application. In this paper, we present an end-to-end learning solution that jointly optimises the models for audio enhancement (AE) and the subsequent application. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we use the sample-wise performance measure as an indication of sample importance. In experiments, we evaluate our training paradigm on four representative applications: ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications span speech and non-speech tasks, semantic and non-semantic features, and transient and global information. The experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios, across a wide range of computer audition tasks in everyday-life noisy environments.
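To make the training paradigm concrete, below is a minimal sketch of joint AE-plus-task optimisation with sample-wise importance weighting, written in PyTorch. The abstract does not specify the exact loss formulation, so the module definitions, the weighting scheme (normalised per-sample task loss, so harder samples contribute more), and all names here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: jointly train an audio-enhancement (AE) front-end and a
# downstream task model, weighting each sample by its task performance.
# All modules and the weighting rule are assumptions for illustration.

class Enhancer(nn.Module):
    """Toy AE front-end: maps noisy features to enhanced features."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

class TaskHead(nn.Module):
    """Toy downstream application head (e.g., ASC or SCR classifier)."""
    def __init__(self, dim=64, n_classes=10):
        super().__init__()
        self.net = nn.Linear(dim, n_classes)

    def forward(self, x):
        return self.net(x)

enhancer, task = Enhancer(), TaskHead()
optim = torch.optim.Adam(list(enhancer.parameters()) + list(task.parameters()), lr=1e-4)
task_criterion = nn.CrossEntropyLoss(reduction="none")  # keep per-sample losses

# Stand-in batch: noisy input, clean reference, and task labels.
noisy = torch.randn(8, 64)
clean = torch.randn(8, 64)
labels = torch.randint(0, 10, (8,))

enhanced = enhancer(noisy)
per_sample_task = task_criterion(task(enhanced), labels)        # shape (batch,)
weights = (per_sample_task / per_sample_task.sum()).detach()    # importance from task loss
per_sample_ae = ((enhanced - clean) ** 2).mean(dim=1)           # per-sample AE loss
loss = (weights * (per_sample_task + per_sample_ae)).sum()      # weighted joint objective

optim.zero_grad()
loss.backward()
optim.step()
```

Note the `.detach()` on the weights: the per-sample loss serves only as an importance signal, so gradients flow through the joint loss terms but not through the weighting itself, one plausible way to keep the optimisation stable under this assumed formulation.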