Open Access

Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning

School of Computer Science, Northwestern Polytechnical University, Xi’an 710000, China
Department of Electrical and Computer Engineering, National University of Singapore, Singapore

Abstract

In neural speech enhancement, a mismatch exists between the training objective, i.e., the Mean-Square Error (MSE), and the perceptual quality evaluation metrics, i.e., the Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI). We propose a novel reinforcement learning algorithm and network architecture that incorporate a non-differentiable perceptual quality evaluation metric into the objective function through a dynamic filter module. Unlike the traditional dynamic filter implementation, which directly generates a convolution kernel, we use a filter generation agent to predict the parameters of a multivariate Gaussian distribution, from which we sample the convolution kernel. Experimental results show that the proposed reinforcement learning method clearly improves perceptual quality over supervised learning methods trained with the MSE objective function.
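
The mechanism described above can be summarized as a policy-gradient (REINFORCE) scheme: the sampled kernel is the agent's action, and the non-differentiable metric score computed on the enhanced speech serves as a reward that scales the gradient of the kernel's log-probability. The PyTorch sketch below illustrates this idea under simplifying assumptions; the diagonal Gaussian, the names FilterGenerationAgent, feat_dim, and kernel_size, and the random placeholder reward are all illustrative choices, not the authors' implementation.

import torch
import torch.nn as nn

class FilterGenerationAgent(nn.Module):
    # Predicts the mean and (diagonal) log-variance of a Gaussian over
    # convolution-kernel weights, then samples a kernel from it.
    def __init__(self, feat_dim, kernel_size):
        super().__init__()
        self.mu = nn.Linear(feat_dim, kernel_size)       # Gaussian mean
        self.log_var = nn.Linear(feat_dim, kernel_size)  # Gaussian log-variance

    def forward(self, features):
        mu = self.mu(features)
        std = torch.exp(0.5 * self.log_var(features))
        dist = torch.distributions.Normal(mu, std)
        kernel = dist.sample()                        # the agent's "action"
        log_prob = dist.log_prob(kernel).sum(dim=-1)  # for the policy gradient
        return kernel, log_prob

agent = FilterGenerationAgent(feat_dim=128, kernel_size=16)
features = torch.randn(4, 128)       # dummy per-utterance features
kernel, log_prob = agent(features)   # kernel would then filter the noisy waveform

# In practice the reward would be a metric score (e.g., PESQ) computed on the
# enhanced speech; a random placeholder stands in here. Subtracting the batch
# mean acts as a simple baseline to reduce gradient variance.
reward = torch.randn(4)
loss = -((reward - reward.mean()) * log_prob).mean()
loss.backward()

Because the metric enters only through the scalar reward, it never needs to be differentiable, which is what allows PESQ- or STOI-style scores to drive the training signal.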

Tsinghua Science and Technology
Pages 939-947
Cite this article:
Hao X, Xu C, Xie L, et al. Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning. Tsinghua Science and Technology, 2022, 27(6): 939-947. https://doi.org/10.26599/TST.2021.9010048


Received: 02 March 2021
Revised: 08 June 2021
Accepted: 12 July 2021
Published: 21 June 2022
© The author(s) 2022.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
