Welcome to your guide on using the powerful Canary-1B model, a multilingual automatic speech recognition (ASR) and speech translation model built with NVIDIA's NeMo toolkit. Canary-1B supports four languages (English, German, French, and Spanish) and handles tasks such as transcribing audio and translating speech between English and the other three languages. Let's dive into how to effectively use and troubleshoot this model!
Understanding the Model Architecture
Imagine Canary-1B as a highly skilled translator at a bustling international conference. The FastConformer encoder acts as the interpreter, converting spoken audio into an internal representation, while the Transformer decoder is like the translator who delivers the final text in the target language. This encoder-decoder collaboration is what lets a single model handle both transcription and translation.
How to Use the Canary-1B Model
Using the Canary-1B model is straightforward once you have the required installations. Follow these instructions:
- Make sure you have Cython and the latest version of PyTorch installed first, as NVIDIA NeMo depends on them.
- Then install the NeMo toolkit with ASR support using the following command:
pip install git+https://github.com/NVIDIA/NeMo.git@r1.23.0#egg=nemo_toolkit[asr]
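After installation, a quick sanity check can save debugging time later. This minimal sketch (assuming a Python environment with the packages above) simply confirms that NeMo and PyTorch import cleanly and reports whether a CUDA GPU is visible:

import torch
import nemo

# Report installed versions and GPU availability
print("NeMo version:", nemo.__version__)
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())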
Loading the Model
To load the model, execute the following Python commands:
from nemo.collections.asr.models import EncDecMultiTaskModel
# load model
canary_model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")
# update decoding params
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1  # a beam size of 1 is equivalent to greedy decoding
canary_model.change_decoding_strategy(decode_cfg)
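To connect this back to the conference-interpreter analogy, you can print the loaded model to see the encoder and decoder submodules for yourself. A minimal sketch (the exact module names in the printout depend on your NeMo version):

# Printing the model shows the full PyTorch module tree,
# including the FastConformer encoder and the Transformer decoder
print(canary_model)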
Input Format
The model accepts either a list of audio file paths or a JSONL manifest file (one JSON object per line). Here's how to specify the input formats:
- For straightforward English audio transcription:
predicted_text = canary_model.transcribe(
    paths2audio_files=["path1.wav", "path2.wav"],  # list of paths to your audio files
    batch_size=16  # batch size to run the inference with
)
# Example line in input_manifest.json (comments are for illustration only; real JSONL must not contain them)
{
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "duration": 1000,  # duration of the audio in seconds
    "taskname": "asr",  # or "s2t_translation" for speech-to-text translation
    "source_lang": "en",  # language of the audio input: en, de, es, or fr
    "target_lang": "en",  # language of the text output; set equal to source_lang for ASR
    "pnc": "yes"  # punctuation and capitalization, "yes" or "no"
}
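Because JSON does not actually allow inline comments, it is often easiest to generate the manifest programmatically. Here is a minimal sketch (the audio path is a placeholder) that writes one JSON object per line, as the JSONL format requires:

import json

entries = [
    {
        "audio_filepath": "/path/to/audio.wav",
        "duration": 1000,
        "taskname": "asr",
        "source_lang": "en",
        "target_lang": "en",  # same as source_lang for plain transcription
        "pnc": "yes",
    },
]

# Write one JSON object per line (JSONL)
with open("input_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")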
Transcribing and Translating
Use the model to transcribe audio or translate speech by passing the path to your manifest file:
predicted_text = canary_model.transcribe(
    "<path to input manifest file>",
    batch_size=16  # batch size to run the inference with
)
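The call returns the predicted text for each manifest entry; in NeMo r1.23 this is a list of strings, one per input. A quick way to inspect the output (assuming the manifest above):

# Print each transcription/translation alongside its index
for i, text in enumerate(predicted_text):
    print(f"[{i}] {text}")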
Troubleshooting Tips
If you encounter issues while using the Canary-1B model, consider the following suggestions:
- Model Not Loading: Ensure that you have installed all dependencies correctly. Double-check your PyTorch and NeMo installations.
- Input Format Errors: Verify that your audio files are mono-channel and sampled at 16 kHz, and check the JSONL manifest structure for correctness (a conversion sketch follows this list).
- Slow Performance: Consider reducing the batch size if you’re experiencing long processing times during transcriptions.
- For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
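As noted in the input-format tip above, Canary-1B expects mono 16 kHz audio. Here is a minimal resampling sketch using the third-party librosa and soundfile packages (an assumption here; any resampler will do, and the file names are placeholders):

import librosa
import soundfile as sf

# Load the audio as mono and resample to 16 kHz in one step
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)

# Write the result as a 16 kHz mono WAV file ready for Canary-1B
sf.write("input_audio_16k.wav", audio, sr)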
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

