Welcome to this step-by-step guide on building a Thai Text-to-Speech (TTS) system using the Tacotron2 model. With today’s advancements in AI, generating speech that sounds natural and expressive in different languages has become more accessible. Let’s dive into the magical world of TTS and see how we can create a voice that speaks Thai with a unique character!
Prerequisites
Before we begin our journey, ensure that you have the following installed:
- Python
- NeMo Toolkit (with the TTS extras)
- SoundFile library
Installation
First things first, we need to install the required libraries. Open your terminal and run the following command:
```shell
pip install "nemo_toolkit[tts]" soundfile
```

(The quotes around `nemo_toolkit[tts]` prevent some shells, such as zsh, from interpreting the square brackets.)
Building the TTS Model
Now, let’s get into the code! We will load a Tacotron2 model fine-tuned on Thai speech from the Common Voice dataset (the `lunarlist/tts-thai-last-step` checkpoint), paired with a UnivNet vocoder.
Think of creating this TTS model like assembling a team for a stage performance. Each member has a specific role that contributes to the overall presentation. In our case:
- Tacotron2Model is the scriptwriter, converting written Thai text into a storyline (spectrogram).
- UnivNetModel acts as the vocalist, singing the storyline in a way that sounds appealing and expressive.
Code Walkthrough
Below is the sequence of operations to build our TTS:
```python
import torch
import soundfile as sf
from nemo.collections.tts.models import Tacotron2Model, UnivNetModel

# Pick a device: GPU if available, otherwise CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the acoustic model (text -> spectrogram) and the vocoder (spectrogram -> audio)
model = Tacotron2Model.from_pretrained('lunarlist/tts-thai-last-step').to(device)
vocoder_model = UnivNetModel.from_pretrained(model_name='tts_en_libritts_univnet').to(device)

# Set the text to convert
text = "ภาษาไทย ง่าย นิด เดียว"

# Map each character to its index in the model's label set
dict_idx = {k: i for i, k in enumerate(model.hparams['cfg']['labels'])}

# Wrap the token ids with the checkpoint's start (66) and end (67) markers,
# skipping any character the label set does not cover
tokens = torch.tensor([[66] + [dict_idx[ch] for ch in text if ch in dict_idx] + [67]]).int().to(device)

# Generate a spectrogram and convert it into audio
spectrogram = model.generate_spectrogram(tokens=tokens)
audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)

# Save the audio to disk at Tacotron2's 22050 Hz sample rate
sf.write('speech.wav', audio.cpu().detach().numpy()[0], 22050)
```
In this code, we:
- Import the necessary libraries.
- Load our TTS and vocoding models.
- Define the Thai text we want to convert.
- Transform this text into a format our model can understand.
- Generate a spectrogram and then convert it into audio.
- Finally, save the audio to a file named speech.wav.
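To make the tokenization step concrete, here is a toy sketch of how `dict_idx` turns Thai characters into token ids. The label list below is hypothetical and much smaller than the real one, which comes from `model.hparams['cfg']['labels']`; in the real pipeline the id sequence is additionally wrapped with the start and end markers.

```python
# Hypothetical label set for illustration only; the real list is read from
# the pretrained checkpoint's configuration.
labels = [' ', 'ง', 'ด', 'น', 'ภ', 'ย', 'ว', 'ษ', 'า', 'เ', 'ี', 'ิ', '่', '้', 'ไ', 'ท']
dict_idx = {ch: i for i, ch in enumerate(labels)}

text = "ภาษาไทย"
# Look up each character, skipping any that the label set does not contain
token_ids = [dict_idx[ch] for ch in text if ch in dict_idx]
print(token_ids)  # [4, 8, 7, 8, 14, 15, 5]
```

Skipping unknown characters (rather than raising a `KeyError`) keeps the pipeline robust when the input contains punctuation or symbols the model was never trained on.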
Troubleshooting
If you encounter any issues while executing the code, here are some troubleshooting tips:
- Ensure all libraries are properly installed and updated.
- Check if the models are being loaded from the correct paths.
- Ensure that your CUDA drivers are correctly set up if using GPU.
- Adjust the audio sample rate if you’re facing playback issues.
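On the sample-rate point: if your player expects a different rate than the 22050 Hz the model produces, you can resample the waveform before saving it. Below is a minimal sketch using naive linear interpolation via NumPy; the `resample` helper is something we define ourselves here, and for production quality you would reach for a dedicated resampler such as the ones in scipy or librosa.

```python
import numpy as np

def resample(audio, sr_from, sr_to):
    """Naive linear-interpolation resampling (illustrative, not production-grade)."""
    n_out = int(len(audio) * sr_to / sr_from)
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz test tone
audio_44k = resample(audio, sr, 44100)
print(len(audio_44k))  # one second at 44100 Hz -> 44100 samples
```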
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Congratulations, you have successfully created a Thai TTS pipeline! This exciting technology not only opens up new frontiers in communication but also makes it easier for non-native speakers to interact with the Thai language.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

