In the ever-evolving world of artificial intelligence, transcribing speech across numerous languages poses a challenge. However, with the Massively Multilingual Speech (MMS) project, that challenge transforms into an opportunity! This blog will guide you step-by-step on how to use the MMS project, leveraging its advanced capabilities in automatic speech recognition (ASR) across over 1,000 languages.
Understanding the MMS Model
The MMS project offers a fine-tuned model built on the Wav2Vec2 architecture, scaled to a staggering 1 billion parameters. Think of it as a well-stocked library: whichever culture a spoken "book" comes from, the model can interpret it and transcribe it into text.
Example
Let’s dive right into how you can utilize this incredible model! Below are the steps you need to follow:
1. Install Required Libraries
First, you’ll need to install the necessary libraries:
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
**Note:** Ensure that the Transformers library is at least version 4.30. If that version is not yet available on PyPI, install it from source using:
pip install git+https://github.com/huggingface/transformers.git
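Before going further, a quick sanity check on the installed version can save you some debugging later. This is a minimal sketch; `packaging` ships as a dependency of Transformers, so no extra install is needed:

```python
from packaging import version
import transformers

# MMS support requires Transformers 4.30 or newer; fail fast on older installs
assert version.parse(transformers.__version__) >= version.parse("4.30.0"), transformers.__version__
```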
2. Load Your Audio Samples
We’ll then proceed to load some audio samples, ensuring they are sampled at 16,000 Hz (16 kHz):
from datasets import load_dataset, Audio
# Load an English sample
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]
# Load a French sample
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
fr_sample = next(iter(stream_data))["audio"]["array"]
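If you want to transcribe your own recordings instead of a streamed dataset, you can load and resample a local file with torchaudio. This is a sketch, and `my_audio.wav` is a placeholder path:

```python
import torchaudio

# "my_audio.wav" is a placeholder; replace with your own file
waveform, sr = torchaudio.load("my_audio.wav")
if sr != 16_000:
    # Resample to the 16 kHz rate the MMS checkpoints expect
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16_000)
# Downmix to mono and convert to the 1-D NumPy array the processor takes
local_sample = waveform.mean(dim=0).numpy()
```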
3. Load the Model and Processor
Now it’s time to load the actual model and processor:
from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch
model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
4. Process and Transcribe Audio Data
Here you will process the audio data and transcribe it:
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)  # Outputs transcription for English
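The French sample we loaded earlier works the same way. MMS ships a small adapter per language, and the model card shows switching languages with `set_target_lang` on the tokenizer plus `load_adapter` on the model, using ISO 639-3 codes such as `fra` for French:

```python
# Swap in the French vocabulary and adapter weights ("fra" is the ISO 639-3 code)
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)  # Outputs transcription for French
```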
Supported Languages
The MMS model supports transcription for a staggering 1,162 languages. You can check the comprehensive list of supported languages, including their ISO 639-3 codes, in the MMS Language Coverage Overview.
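You can also inspect coverage from Python. This sketch assumes the MMS tokenizer exposes its per-language vocabularies through `vocab`, keyed by ISO 639-3 code, as shown in the Transformers MMS documentation:

```python
# The MMS tokenizer stores one vocabulary per language, keyed by ISO 639-3 code
langs = processor.tokenizer.vocab.keys()
print(len(langs))                       # number of supported languages
print("eng" in langs, "fra" in langs)   # check specific codes
```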
Model Details
- Developed by: Vineel Pratap et al.
- Model type: Multilingual automatic speech recognition (ASR) model
- License: CC-BY-NC 4.0
- Number of parameters: 1 billion
- Audio sampling rate: 16,000 Hz (16 kHz)
Additional Links
- Blog post
- Transformers documentation
- Paper
- GitHub Repository
- Other MMS checkpoints
- MMS Base Checkpoints
- Official Space
Troubleshooting
If you encounter issues while using the MMS model, consider the following troubleshooting ideas:
- Ensure that you have met all the installation requirements, especially the correct version of Transformers.
- Verify that your audio samples are sampled at 16,000 Hz (16 kHz); a quick check is sketched after this list.
- Check the model loading and processor initialization to confirm they’re correctly set up.
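As a minimal sanity check for the sampling rate, you can inspect one streamed sample directly. After the `cast_column` call from step 2, the reported rate should already be 16,000 Hz:

```python
# Pull one sample and confirm the rate and shape the model expects
sample = next(iter(stream_data))["audio"]
print(sample["sampling_rate"])  # expect 16000
print(sample["array"].ndim)     # expect 1 (mono waveform)
```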
For additional insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

