Regular Paper

Audio Enhancement for Computer Audition—An Iterative Training Paradigm Using Sample Importance

Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg 86159, Germany
Chair of Health Informatics, München rechts der Isar, Technical University of Munich, Munich 81675, Germany
Munich Center for Machine Learning, Munich 80333, Germany
Huawei Technologies, Munich 80992, Germany
Munich Data Science Institute, Garching 85748, Germany
Group on Language, Audio and Music, Imperial College London, London SW7 2AZ, U.K.

Abstract

Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination in real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used as a front end to the target audio application. In this paper, we present an end-to-end learning solution that jointly optimises the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we use the sample-wise performance measure as an indication of sample importance. In our experiments, we evaluate the training paradigm on four representative applications: ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. Together, these applications span speech and non-speech tasks, semantic and non-semantic features, and transient and global information. The experimental results indicate that the proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios, for a wide range of computer audition tasks in everyday-life noisy environments.
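
As a rough illustration of the paradigm described above, the sketch below shows one way the enhancement model and a downstream task model could be optimised jointly, with each sample's enhancement loss re-weighted by the downstream model's current per-sample loss as a proxy for sample importance. All module architectures, loss choices, and the weighting function here are assumptions made for demonstration only and do not reproduce the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the audio enhancement (AE) module and a downstream classifier.
enhancer = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
classifier = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 10))

opt_ae = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
opt_task = torch.optim.Adam(classifier.parameters(), lr=1e-4)


def sample_importance(logits, labels):
    # Per-sample task loss, normalised to mean 1, used as an importance weight:
    # samples the downstream model currently struggles with pull the AE module harder.
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return per_sample / (per_sample.mean() + 1e-8)


def training_step(noisy, clean, labels):
    # 1) Update the AE module with an importance-weighted reconstruction loss.
    enhanced = enhancer(noisy)
    with torch.no_grad():
        weights = sample_importance(classifier(enhanced), labels)
    recon = F.mse_loss(enhanced, clean, reduction="none").mean(dim=-1)
    ae_loss = (weights * recon).mean()
    opt_ae.zero_grad()
    ae_loss.backward()
    opt_ae.step()

    # 2) Update the downstream model on the freshly enhanced input.
    logits = classifier(enhancer(noisy).detach())
    task_loss = F.cross_entropy(logits, labels)
    opt_task.zero_grad()
    task_loss.backward()
    opt_task.step()
    return ae_loss.item(), task_loss.item()


# Example call with random tensors standing in for a mini-batch of features.
noisy, clean = torch.randn(8, 256), torch.randn(8, 256)
labels = torch.randint(0, 10, (8,))
print(training_step(noisy, clean, labels))

Detaching the enhancer output in the second step keeps the two updates separate; an end-to-end variant could instead back-propagate the task loss through both models.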

Electronic Supplementary Material

JCST-2210-12934-Highlights.pdf (370.3 KB)

Journal of Computer Science and Technology
Pages 895-911
Cite this article:
Milling M, Liu S, Triantafyllopoulos A, et al. Audio Enhancement for Computer Audition—An Iterative Training Paradigm Using Sample Importance. Journal of Computer Science and Technology, 2024, 39(4): 895-911. https://doi.org/10.1007/s11390-024-2934-x


Received: 26 October 2022
Accepted: 30 June 2024
Published: 20 September 2024
© Institute of Computing Technology, Chinese Academy of Sciences 2024