Open Access

WTASR: Wavelet Transformer for Automatic Speech Recognition of Indian Languages

Department of Electronics and Communication, GLA University, Mathura 281406, India
Chandigarh University, Mohali 140413, India

Abstract

Automatic speech recognition (ASR) systems translate speech signals into their corresponding text representation. This capability underpins a variety of applications, such as voice-enabled commands, assistive devices, and bots. However, efficient ASR technology for Indian languages remains significantly underdeveloped. In this paper, a wavelet transformer for automatic speech recognition (WTASR) of Indian languages is proposed. Because a speaker's speech varies over time, the signal contains both high- and low-frequency components at different instants; wavelets therefore enable the network to analyze the signal at multiple scales. The wavelet decomposition of the signal is fed into the network to generate the text. The transformer network comprises an encoder-decoder system for speech translation. The model is trained on an Indian language dataset to translate speech into the corresponding text. The proposed method is compared with other state-of-the-art methods. The results show that the proposed WTASR achieves a low word error rate and can be used for effective speech recognition of Indian languages.
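To make the pipeline in the abstract concrete, the sketch below (in Python, using PyWavelets and PyTorch) decomposes a speech signal into multiscale wavelet coefficients and feeds them to a small transformer encoder-decoder that emits text-token logits. This is a minimal illustration, not the authors' implementation: the wavelet family (db4), decomposition level, band-truncation stacking, model sizes, and the WaveletTransformerASR class are all illustrative assumptions, and positional encodings are omitted for brevity.

    # Hypothetical sketch of a wavelet-transformer ASR front end.
    # All hyperparameters are assumptions, not values from the paper.
    import numpy as np
    import pywt
    import torch
    import torch.nn as nn

    def wavelet_features(signal, wavelet="db4", level=3):
        """Decompose a 1-D speech signal into multiscale wavelet bands."""
        coeffs = pywt.wavedec(signal, wavelet, level=level)
        # Crude multiscale feature matrix: truncate every band to the
        # shortest band's length and stack bands as channels.
        min_len = min(len(c) for c in coeffs)
        feats = np.stack([c[:min_len] for c in coeffs], axis=-1)
        return torch.tensor(feats, dtype=torch.float32)

    class WaveletTransformerASR(nn.Module):
        """Toy encoder-decoder over wavelet features; emits token logits."""
        def __init__(self, in_dim, vocab_size, d_model=256):
            super().__init__()
            self.proj = nn.Linear(in_dim, d_model)      # wavelet bands -> model dim
            self.embed = nn.Embedding(vocab_size, d_model)
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=4,
                num_encoder_layers=4, num_decoder_layers=4,
                batch_first=True)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, feats, tokens):
            src = self.proj(feats)            # (B, T, d_model) encoder input
            tgt = self.embed(tokens)          # (B, U, d_model) decoder input
            dec = self.transformer(src, tgt)  # (B, U, d_model)
            return self.out(dec)              # (B, U, vocab_size) logits

    # Usage: 1 s of 16 kHz audio -> wavelet features -> logits over a
    # hypothetical 64-symbol character vocabulary.
    audio = np.random.randn(16000)
    feats = wavelet_features(audio).unsqueeze(0)        # (1, T, 4)
    model = WaveletTransformerASR(in_dim=feats.shape[-1], vocab_size=64)
    tokens = torch.zeros(1, 10, dtype=torch.long)       # dummy decoder tokens
    logits = model(feats, tokens)                       # (1, 10, 64)

The key design point the abstract argues for is visible in wavelet_features: each decomposition band captures the signal at a different scale, so the transformer attends over a multiscale representation rather than a single-resolution spectrogram.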

Big Data Mining and Analytics
Pages 85-91
Cite this article:
Choudhary T, Goyal V, Bansal A. WTASR: Wavelet Transformer for Automatic Speech Recognition of Indian Languages. Big Data Mining and Analytics, 2023, 6(1): 85-91. https://doi.org/10.26599/BDMA.2022.9020017

Received: 31 May 2022
Revised: 06 June 2022
Accepted: 21 June 2022
Published: 24 November 2022
© The author(s) 2023.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
