How to Use the SenseVoice Speech Model

Oct 24, 2023 | Educational

Welcome to a quick guide to using the powerful SenseVoice model, a speech foundation model with cutting-edge capabilities: Automatic Speech Recognition (ASR), Spoken Language Identification (LID), Speech Emotion Recognition (SER), and Audio Event Detection (AED).

Highlights of SenseVoice

  • Multilingual Speech Recognition: Trained on over 400,000 hours of diverse audio, it supports more than 50 languages and outperforms the Whisper models.
  • Emotion Recognition: Identifies emotions with accuracy that matches or exceeds leading emotion recognition models.
  • Fast Inference: Processes audio more than 15 times faster than Whisper-Large.
  • Service Deployment: Ships with service deployment pipelines and clients in multiple programming languages, including Python, C++, and Java.

Setting up SenseVoice

To get started with SenseVoice, clone the project repository and install its dependencies. From inside the cloned repository, open your terminal and run:

pip install -r requirements.txt
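
If you would rather fetch the checkpoint programmatically than manage a local directory by hand, here is a minimal sketch using the ModelScope hub, assuming the upstream "iic/SenseVoiceSmall" model id:

from modelscope import snapshot_download

# Downloads the checkpoint to the local ModelScope cache and returns its path;
# the returned directory can be passed straight to AutoModel as model_dir below.
model_dir = snapshot_download("iic/SenseVoiceSmall")
print(model_dir)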

Using SenseVoice for Inference

Using the model for inference is straightforward. Here’s how:

Imagine SenseVoice as a multilingual tour guide in a bustling city. Just as the guide can interpret questions in various languages and respond with the right information almost instantly, SenseVoice can process audio inputs in multiple languages and extract meanings with remarkable speed.

Example Code:

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "path_to_your_model_directory"
model = AutoModel(
    model=model_dir,
    trust_remote_code=True,
    remote_code="model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

res = model.generate(
    input="path_to_audio_file.mp3",
    language="auto",  # or one of: zh, en, yue, ja, ko, nospeech
    use_itn=True,     # add punctuation and apply inverse text normalization
    batch_size_s=60,  # dynamic batching by total seconds of audio per batch
    merge_vad=True,   # merge short VAD segments before decoding
)

text = rich_transcription_postprocess(res[0]["text"])
print(text)

In this code, “model_dir” is like the storage room for your guide’s resources: it points AutoModel at the downloaded weights and configuration the model needs to interpret each audio input accurately.
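
If auto-detection ever picks the wrong language, you can pin it explicitly. The short sketch below reuses the model and import from the example above and only swaps the language keyword (the audio path is still a placeholder):

# Force English decoding instead of letting the model detect the language
res = model.generate(
    input="path_to_audio_file.mp3",
    language="en",
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
)

# generate() returns a list of result dicts; res[0]["text"] holds the raw
# transcription, which rich_transcription_postprocess cleans up as before.
print(rich_transcription_postprocess(res[0]["text"]))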

Exporting and Testing

For more advanced users, exporting the model to ONNX or Libtorch is also possible. First install the export packages:

pip install -U funasr funasr-onnx

Then run inference through the exported ONNX model:

from funasr_onnx import SenseVoiceSmall
from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "path_to_your_model_directory"
model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)  # quantize=True uses the int8 export
wav_or_scp = ["path_to_audio_file.mp3"]

res = model(wav_or_scp, language="auto", use_itn=True)
print([rich_transcription_postprocess(i) for i in res])
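
The Libtorch route is parallel. Here is a minimal sketch, assuming the funasr-torch package (installed with pip install -U funasr funasr-torch) mirrors the funasr-onnx interface as described in the upstream FunASR docs:

from funasr_torch import SenseVoiceSmall
from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "path_to_your_model_directory"
# Assumed to mirror the ONNX class; device takes a torch-style device string
model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0")
wav_or_scp = ["path_to_audio_file.mp3"]

res = model(wav_or_scp, language="auto", use_itn=True)
print([rich_transcription_postprocess(i) for i in res])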

Troubleshooting Tips

Should you encounter issues while using SenseVoice, here are some steps to consider:

  • Ensure that your model directory is correctly specified.
  • Check your audio file format; SenseVoice handles common formats such as WAV and MP3.
  • If the model fails to run, verify that your dependencies are correctly installed (a quick sanity check appears after this list).
  • For assistance or community support, search the existing issues on the GitHub page or join the discussions in the DingTalk group.
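
As that quick sanity check, the snippet below uses only the standard library and torch, nothing SenseVoice-specific, to confirm the key packages are installed and whether a GPU is visible:

from importlib.metadata import version

import torch

# Confirm the core packages resolve and report whether CUDA is usable;
# if an import or version lookup fails here, reinstall the requirements first.
print("funasr:", version("funasr"))
print("torch:", version("torch"))
print("CUDA available:", torch.cuda.is_available())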

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With its advanced features and user-friendly setup, SenseVoice represents a tremendous leap in multilingual speech recognition and processing. It not only saves valuable time but also enhances the array of interactions we can have through technology.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
