Text To Speech Synthesis for Afaan Oromoo Language Using Deep Learning Approach

SORESSA BEYENE

Abstract


Text-to-speech (TTS) synthesis generates speech from input text. TTS is important for assisting people with impairments and in the teaching and learning process. However, implementing TTS for the Afaan Oromoo language involves several challenges, including text processing, time-to-phoneme mapping, and acoustic modeling. A TTS system is therefore needed to support the development of the language. Using natural language processing, paired input text and speech from a prepared corpus are used to generate the desired speech waveforms. The text was normalized, and linguistic features were extracted with the Festival toolkit. Festival was also used to label the text and to generate utterances from scheme file parameters, and the phoneme-aligned labels were matched with the speech corpus for training and testing. Forced alignment was performed with the HTK toolkit, which produced state-level alignments with timestamps for acoustic feature extraction. This study focuses on a deep learning approach to TTS for Afaan Oromoo based on a bidirectional long short-term memory recurrent neural network (BLSTM-RNN). The RNN takes an input feature sequence and is used to build a duration model and an acoustic model. The BLSTM-based RNN was implemented with the PyTorch library in a Jupyter notebook to create the duration model and to generate speech samples from the trained acoustic model. We prepared a corpus of 1000 sentences with matching transcriptions from Afaan Oromoo speech recorded by a single female speaker (speaker-dependent), using 700 sentences for training and 300 for testing. Two evaluation techniques were used. First, the Mean Opinion Score (MOS) was used to assess the intelligibility and naturalness of the synthesized speech. Second, Mel Cepstral Distortion (MCD), which is widely used for objective evaluation of TTS models, was computed. The synthesized speech scored 3.77 for intelligibility and 3.76 for naturalness. In the objective evaluation on the 16 kHz speech corpus, the average MCD was 3.89 for speech generated by the BLSTM-based RNN and 3.71 for the waveforms generated with Merlin.
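
For readers unfamiliar with the objective metric named above, the following is a minimal sketch of one common way to compute Mel Cepstral Distortion. It is an illustration under stated assumptions, not the procedure used in this study: it assumes the reference and synthesized mel-cepstral frames are already time-aligned, that the 0th (energy) coefficient is excluded, and that the result is averaged over frames in dB. The function name and the input layout are hypothetical.

```python
import math


def mel_cepstral_distortion(reference, synthesized):
    """Mean MCD in dB over time-aligned frames (assumed convention).

    Each argument is a list of frames; each frame is a list of
    mel-cepstral coefficients. Coefficient 0 (energy) is skipped,
    which is a common but not universal choice.
    """
    if not reference or len(reference) != len(synthesized):
        raise ValueError("sequences must be non-empty and equally long")

    # Constant factor of the usual MCD definition: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref_frame, syn_frame in zip(reference, synthesized):
        # Squared Euclidean distance over coefficients 1..D
        squared = sum((r - s) ** 2
                      for r, s in zip(ref_frame[1:], syn_frame[1:]))
        total += scale * math.sqrt(squared)

    # Average the per-frame distortion over all frames
    return total / len(reference)
```

Lower values indicate that the synthesized cepstra are closer to the natural reference. In practice the two utterances usually differ in length, so an alignment step such as dynamic time warping would be needed before calling a function like this.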

Keywords: Text To Speech Synthesis, Mel Cepstral Distortion (MCD), Mean Opinion Score (MOS), Bidirectional Long Short Term Memory Recurrent Neural Network (BLSTM-RNN)

DOI: 10.7176/NMMC/101-02

Publication date: April 30th 2022



ISSN (Paper) 2224-3267 ISSN (Online) 2224-3275


This journal follows ISO 9001 management standard and licensed under a Creative Commons Attribution 3.0 License.

Copyright © www.iiste.org