Regular Paper

Audio Enhancement for Computer Audition—An Iterative Training Paradigm Using Sample Importance

Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg 86159, Germany
Chair of Health Informatics, München rechts der Isar, Technical University of Munich, Munich 81675, Germany
Munich Center for Machine Learning, Munich 80333, Germany
Huawei Technologies, Munich 80992, Germany
Munich Data Science Institute, Garching 85748, Germany
Group on Language, Audio and Music, Imperial College London, London SW7 2AZ, U.K.

Abstract

Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination in real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used as a front end to the target audio application. In this paper, we present an end-to-end learning solution that jointly optimises the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we use the sample-wise performance measure as an indication of sample importance. In our experiments, we evaluate the training paradigm on four representative applications: ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. Together, these applications span speech and non-speech tasks, semantic and non-semantic features, and transient and global information. The experimental results indicate that the proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios, for a wide range of computer audition tasks in everyday-life noisy environments.
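
As a rough illustration of the paradigm described above, the sketch below shows one way the enhancement model and a downstream task model could be optimised jointly, with each sample's enhancement loss re-weighted by the downstream model's current per-sample loss as a proxy for sample importance. All module architectures, loss choices, and the weighting function here are assumptions made for demonstration only and do not reproduce the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the audio enhancement (AE) module and a downstream classifier.
enhancer = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
classifier = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 10))

opt_ae = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
opt_task = torch.optim.Adam(classifier.parameters(), lr=1e-4)


def sample_importance(logits, labels):
    # Per-sample task loss, normalised to mean 1, used as an importance weight:
    # samples the downstream model currently struggles with pull the AE module harder.
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return per_sample / (per_sample.mean() + 1e-8)


def training_step(noisy, clean, labels):
    # 1) Update the AE module with an importance-weighted reconstruction loss.
    enhanced = enhancer(noisy)
    with torch.no_grad():
        weights = sample_importance(classifier(enhanced), labels)
    recon = F.mse_loss(enhanced, clean, reduction="none").mean(dim=-1)
    ae_loss = (weights * recon).mean()
    opt_ae.zero_grad()
    ae_loss.backward()
    opt_ae.step()

    # 2) Update the downstream model on the freshly enhanced input.
    logits = classifier(enhancer(noisy).detach())
    task_loss = F.cross_entropy(logits, labels)
    opt_task.zero_grad()
    task_loss.backward()
    opt_task.step()
    return ae_loss.item(), task_loss.item()


# Example call with random tensors standing in for a mini-batch of features.
noisy, clean = torch.randn(8, 256), torch.randn(8, 256)
labels = torch.randint(0, 10, (8,))
print(training_step(noisy, clean, labels))

Detaching the enhancer output in the second step keeps the two updates separate; an end-to-end variant could instead back-propagate the task loss through both models.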

Electronic Supplementary Material

JCST-2210-12934-Highlights.pdf (370.3 KB)

Journal of Computer Science and Technology
Pages 895-911
Cite this article:
Milling M, Liu S, Triantafyllopoulos A, et al. Audio Enhancement for Computer Audition—An Iterative Training Paradigm Using Sample Importance. Journal of Computer Science and Technology, 2024, 39(4): 895-911. https://doi.org/10.1007/s11390-024-2934-x


Received: 26 October 2022
Accepted: 30 June 2024
Published: 20 September 2024
© Institute of Computing Technology, Chinese Academy of Sciences 2024