How to Perform Text-to-Speech (TTS) with FastSpeech2 Using SpeechBrain

Feb 29, 2024 | Educational

Ready to break the barriers of communication with machine-generated speech? In today’s blog, we will explore how to utilize the state-of-the-art FastSpeech2 model trained on the LJSpeech dataset to create high-quality text-to-speech (TTS) outputs using the SpeechBrain toolkit. Let’s dive in!

Installation Steps for SpeechBrain

To get started, you’ll need to install SpeechBrain. Follow the steps below:

  • Clone the SpeechBrain repository:
    git clone https://github.com/speechbrain/speechbrain.git
  • Change into the speechbrain directory:
    cd speechbrain
  • Install the required packages:
    pip install -r requirements.txt
  • Install SpeechBrain as an editable package:
    pip install --editable .
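Once installed, a quick sanity check (a minimal sketch; it only assumes the package imports cleanly and exposes a version string, as current releases do) confirms that Python can see the toolkit:

import speechbrain
print(speechbrain.__version__)  # prints the installed SpeechBrain release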

We encourage you to read our tutorials and learn more about SpeechBrain.

Performing TTS with FastSpeech2

Now that you have the toolkit installed, it’s time to make some speech!

1. Text Input

Here’s how to convert text input into speech:

import torchaudio
from speechbrain.inference.TTS import FastSpeech2
from speechbrain.inference.vocoders import HIFIGAN

# Initialize TTS and Vocoder
fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech", savedir="pretrained_models/tts-fastspeech2-ljspeech")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="pretrained_models/tts-hifigan-ljspeech")

# Run the TTS with text input
input_text = "We're the leaders in this luckless change; though our own Baskerville; who was at work some years before them; went much on the same lines."
mel_output, durations, pitch, energy = fastspeech2.encode_text(
    [input_text], pace=1.0, pitch_rate=1.0, energy_rate=1.0
)

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

# Save the waveform
torchaudio.save("example_TTS_input_text.wav", waveforms.squeeze(1), 22050)

2. Phoneme Input

Alternatively, you can feed the model a sequence of ARPAbet phonemes directly, like so:

input_phonemes = ["W", "ER", "DH", "AH", "L", "IY", "D", "ER", "Z", "IH", "N", "DH", "IH", "S", "L", "AH", "K", "L", "AH", "S", "CH", "EY", "N", "JH", "spn", "DH", "OW", "AW", "ER", "OW", "N", "B", "AE", "S", "K", "ER", "V", "IH", "L", "spn", "HH", "UW", "W", "AA", "Z", "AE", "T", "W", "ER", "K", "S", "AH", "M", "Y", "IH", "R", "Z", "B", "IH", "F", "AO", "R", "DH", "EH", "M", "spn", "W", "EH", "N", "T", "M", "AH", "CH", "AA", "N", "DH", "AH", "S", "EY", "M", "L", "AY", "N", "Z", "spn"]

mel_output, durations, pitch, energy = fastspeech2.encode_phoneme(
    [input_phonemes], pace=1.0, pitch_rate=1.0, energy_rate=1.0
)

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

# Save the waveform
torchaudio.save("example_TTS_input_phoneme.wav", waveforms.squeeze(1), 22050)

3. Batch Input for Multiple Sentences

If you’re eager to convert multiple sentences in one go, here’s how:

items = [
    "A quick brown fox jumped over the lazy dog.",
    "How much wood would a woodchuck chuck?",
    "Never odd or even"
]

mel_outputs, durations, pitch, energy = fastspeech2.encode_text(
    items, pace=1.0, pitch_rate=1.0, energy_rate=1.0
)
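The batch call above returns one (padded) mel spectrogram per sentence. A minimal sketch of the remaining steps, reusing the hifi_gan and torchaudio objects from the earlier examples, might look like this:

# Running Vocoder (spectrogram-to-waveform) on the whole batch
waveforms = hifi_gan.decode_batch(mel_outputs)

# Save one file per input sentence; waveforms is shaped (batch, 1, samples)
for i in range(waveforms.shape[0]):
    torchaudio.save(f"example_TTS_batch_{i}.wav", waveforms[i], 22050)

Note that shorter sentences are padded to the longest spectrogram in the batch, so the saved files may carry some trailing silence.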

Inference on GPU

To speed up processing, you can run inference on the GPU by adding run_opts={"device": "cuda"} when calling the from_hparams method.
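For example, to load both models onto the GPU (the only change from the earlier snippets is the extra run_opts argument):

fastspeech2 = FastSpeech2.from_hparams(
    source="speechbrain/tts-fastspeech2-ljspeech",
    savedir="pretrained_models/tts-fastspeech2-ljspeech",
    run_opts={"device": "cuda"},  # place the acoustic model on the GPU
)
hifi_gan = HIFIGAN.from_hparams(
    source="speechbrain/tts-hifigan-ljspeech",
    savedir="pretrained_models/tts-hifigan-ljspeech",
    run_opts={"device": "cuda"},  # place the vocoder on the GPU as well
)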

Training Your Own Model

If you wish to train FastSpeech2 from scratch, here’s a brief rundown:

  1. Clone SpeechBrain:
     git clone https://github.com/speechbrain/speechbrain
  2. Install it:
     cd speechbrain && pip install -r requirements.txt && pip install -e .
  3. Run training:
     cd recipes/LJSpeech/TTS/fastspeech2 && python train.py --device=cuda:0 --max_grad_norm=1.0 --data_folder=your_folder/LJSpeech-1.1 hparams/train.yaml

You can find our training results, including models and logs, here.

Troubleshooting Tips

If you encounter any issues, check the following:

  • Ensure all required packages are installed correctly.
  • Check if your input text or phonemes are accurate and formatted properly.
  • If you’re using GPU and encounter issues, verify that your CUDA drivers are installed and configured correctly.
  • For performance inquiries, test your setup on the default datasets provided before scaling up to your own datasets.

For more insights, updates, or to collaborate on AI development projects, stay connected with [fxis.ai](https://fxis.ai).

Understanding the Code: An Analogy

Think of the TTS model, FastSpeech2, as an expert chef in a kitchen (your code environment). When you input text or phonemes (ingredients), the chef uses a specific recipe (the model’s architecture) to prepare a dish (the spectrogram). Then, to present a gourmet meal (the final audio waveform), a skilled sous-chef (the vocoder) takes the dish from the chef and plates it perfectly. Just as a great meal needs both an expert chef and a capable helper, our TTS system requires both FastSpeech2 and the HiFi-GAN vocoder to serve up high-quality speech!

Conclusion

Congratulations! You now have a solid understanding of how to implement Text-to-Speech using FastSpeech2 and SpeechBrain. At [fxis.ai](https://fxis.ai), we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
