UZBEK SPEECH TRANSLATION USING FINE-TUNED XLSR WAV2VEC2 ON MOZILLA COMMON VOICE 10 DATASET

Abidova Shakhnoza Bakhodirovna

Sharipov Zohirjon Zokirjon ugli

Keywords: XLSR Wav2Vec2, Mozilla Common Voice, WER, CER, ASR, NMT, Google Translate RVC, Uzbek speech, Python, PyQt.


Abstract

This research paper explores the development and application of an XLSR
Wav2Vec2 model fine-tuned on the Mozilla Common Voice [1] dataset for Uzbek
speech recognition and translation into various languages. Fine-tuned on over
208,000 audio files, the model reached a training loss of 0.4241 and a
Word Error Rate (WER) of 0.2588. A key
innovation of our work is the integration of a desktop-based Graphical User Interface
(GUI), developed using Python and PyQt libraries, which facilitates the efficient
transcription and translation of Uzbek audio data. This interface leverages the Google
Translate API for real-time translation, demonstrating the model's effectiveness in
resource-constrained scenarios. Through data augmentation techniques such as speed
and pitch manipulation, noise injection, and cross-lingual data augmentation, we
significantly enhanced the Uzbek speech dataset, contributing to the advancement of
Automatic Speech Recognition (ASR) technology for low-resource languages. The
paper details our methodology, including the fine-tuning of the XLSR Wav2Vec2
model and the implementation of the desktop GUI, underscoring the potential for
improving multilingual communication and understanding.
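
To make the reported metric concrete, the following is a minimal sketch of how WER and the Character Error Rate (CER) listed in the keywords are conventionally computed from the edit distance between a reference transcript and a model hypothesis. It is not the evaluation code used in this work; the function names and the Uzbek example sentence are hypothetical and chosen only for illustration. Under this definition, a WER of 0.2588 corresponds to roughly 26 word-level edits per 100 reference words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example pair (not taken from the dataset): one word differs
# by a single character.
reference = "men bugun maktabga bordim"
hypothesis = "men bugun maktabda bordim"

print(f"WER = {wer(reference, hypothesis):.4f}")
print(f"CER = {cer(reference, hypothesis):.4f}")
```

In this example one of four reference words is wrong, while only one of 25 reference characters differs, which illustrates why CER is typically much lower than WER for the same output.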


References

[1] Rosana Ardila et al. “Common Voice: A Massively-Multilingual Speech Corpus”, 2020. arXiv:1912.06670 [cs.CL]

[2] Arun Babu et al. “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale”. In Proc. Interspeech 2022, 2022, pp. 2278–. DOI: 10.21437/Interspeech.2022-143

[3] Indunil Ramadasa et al. “Analysis of the effectiveness of using Google Translations API for NLP of Sinhalese” (Link_paper)

[4] Benjamin Barras et al. “SoX: Sound eXchange” (Link)

[5] Steffen Schneider, Alexei Baevski, Ronan Collobert and Michael Auli. “wav2vec: Unsupervised Pre-training for Speech Recognition”, arXiv:1904.05862 [cs.CL]

[6] Cheng Yi et al. “Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages”, 2021. arXiv:2012.12121 [cs.CL]

[7] Shukrullo Turgunov. “Fine tuned Uzbek ASR” (Link)

[8] Zhixing Tan, Shuo Wang, Zonghan Yang, Gang Chen, Xuancheng Huang, Maosong Sun, Yang Liu et al. “Neural machine translation: A review of methods, resources, and tools”

[9] https://www.python.org/

[10] MVC Framework Introduction