Transformers.js is a powerful library that lets developers run machine learning models directly in the browser. In this guide, we will explore how to use the Pyannote segmentation model, specifically version 3.0, with ONNX weights in Transformers.js. Not only will we cover the setup, but we'll also take a deep dive into the code behind it using a fun analogy, so let's get started!
Setting Up Transformers.js
To use the Pyannote segmentation model in your web application, follow these steps:
- Install Transformers.js by adding it to your project.
- Load the model and its processor using JavaScript.
- Read and preprocess audio data.
- Run the model to get the segmentation results.
Step-by-Step Breakdown of the Code
Here’s the complete code snippet you’d be using:
```javascript
import { AutoProcessor, AutoModelForAudioFrameClassification, read_audio } from '@xenova/transformers';

// Load model and processor
const model_id = 'onnx-community/pyannote-segmentation-3.0';
const model = await AutoModelForAudioFrameClassification.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);

// Read and preprocess audio
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav';
const audio = await read_audio(url, processor.feature_extractor.config.sampling_rate);
const inputs = await processor(audio);

// Run model with inputs
const { logits } = await model(inputs);

// Process results
const result = processor.post_process_speaker_diarization(logits, audio.length);

// Display results
console.table(result[0], ['start', 'end', 'id', 'confidence']);
```
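The table printed above can also be summarized in code. Here is a small, hypothetical helper (not part of Transformers.js) that totals speaking time per speaker, assuming each row of `result[0]` has the `start`, `end`, and `id` fields shown in the table:

```javascript
// Sum up total speaking time (in seconds) per speaker id.
// Each segment is assumed to look like: { id, start, end, confidence }.
function speakingTimePerSpeaker(segments) {
  const totals = {};
  for (const { id, start, end } of segments) {
    totals[id] = (totals[id] ?? 0) + (end - start);
  }
  return totals;
}

// Example with hypothetical segment values:
console.log(speakingTimePerSpeaker([
  { id: 0, start: 0.0, end: 2.0, confidence: 0.9 },
  { id: 1, start: 2.0, end: 2.5, confidence: 0.8 },
  { id: 0, start: 2.5, end: 4.0, confidence: 0.85 },
]));
// → { '0': 3.5, '1': 0.5 }
```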
Understanding the Code: The Audio Wizard Analogy
Imagine you are a wizard in an audio realm, and you have magical tools to help you understand who is speaking in a conversation:
- **Casting Spells (Loading Model and Processor):** Your first spell is to bring forth the necessary tools to decode the mysteries of audio; this is achieved by loading the model and processor.
- **Collecting Ingredients (Reading Audio):** Just like a wizard collects ingredients for a potion, you gather audio samples from the URL and prepare them for processing.
- **Performing the Enchantment (Running the Model):** With everything set, you cast your spell (run the model), which conjures up magical `logits`, revealing insights about who spoke when.
- **Interpreting the Results (Displaying Results):** Finally, like reading a crystal ball, you interpret the output to see who spoke during which moments and how confident you are about it.
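To make the crystal ball a little easier to read, here is a hypothetical post-processing sketch. It assumes the rows of `result[0]` look like the columns printed by `console.table` (`start`, `end`, `id`, `confidence`) and merges consecutive segments from the same speaker; the helper name and the gap threshold are illustrative, not part of the library:

```javascript
// Merge consecutive diarization segments that belong to the same speaker.
// Assumes segments are sorted by start time and shaped like:
// { id, start, end, confidence }.
function mergeSegments(segments, maxGap = 0.5) {
  const merged = [];
  for (const seg of segments) {
    const last = merged[merged.length - 1];
    if (last && last.id === seg.id && seg.start - last.end <= maxGap) {
      // Extend the previous segment; keep the lower confidence of the two
      last.end = seg.end;
      last.confidence = Math.min(last.confidence, seg.confidence);
    } else {
      merged.push({ ...seg });
    }
  }
  return merged;
}

// Example with hypothetical values:
const demo = [
  { id: 0, start: 0.0, end: 1.0, confidence: 0.9 },
  { id: 0, start: 1.2, end: 2.0, confidence: 0.8 },
  { id: 1, start: 2.5, end: 3.0, confidence: 0.95 },
];
console.log(mergeSegments(demo));
// → two segments: speaker 0 from 0.0–2.0, speaker 1 from 2.5–3.0
```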
Converting Torch Model to ONNX
If you need to convert the PyTorch model to ONNX yourself, here's the code snippet you can use (note that `pyannote.audio` must be installed alongside `torch` and `onnx`, since the model is loaded through it):

```python
# pip install torch onnx pyannote.audio
import torch
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="hf_...",  # <-- Set your HF token here
).eval()

dummy_input = torch.zeros(2, 1, 160000)
torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    do_constant_folding=True,
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={
        "input_values": {0: "batch_size", 1: "num_channels", 2: "num_samples"},
        "logits": {0: "batch_size", 1: "num_frames"},
    },
)
```
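As a sanity check on the shape of `dummy_input`: the segmentation model takes mono audio, and assuming pyannote's usual 16 kHz sampling rate, the 160,000-sample dummy clip corresponds to 10 seconds. A tiny helper (hypothetical, purely for illustration) makes the arithmetic explicit:

```javascript
// Pick a dummy-input length for export: seconds → sample count.
// The 16 kHz default is an assumption based on pyannote's typical config.
const samplesFor = (seconds, samplingRate = 16000) => seconds * samplingRate;

console.log(samplesFor(10)); // 160000 — matches torch.zeros(2, 1, 160000) above
```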
Troubleshooting Tips
While using Transformers.js, you may encounter a few hiccups. Here are some troubleshooting tips to help you out:
- No Model Found: Make sure you have the correct model ID. Check if the model is available on the Hugging Face Hub.
- Audio File Issues: Ensure that the audio file URL is valid and accessible. You can test the URL in a browser to confirm.
- Performance Problems: If audio processing is slow, consider trimming the input audio to a shorter duration or downsampling it before running the model.
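For the performance tip above, here is a minimal sketch of trimming decoded audio before handing it to the processor. It assumes the audio is a `Float32Array` of mono samples, as returned by `read_audio`; the helper name is illustrative:

```javascript
// Trim a decoded audio buffer to its first `seconds` seconds.
// `audio` is a Float32Array of mono samples; `samplingRate` is samples/second.
function trimAudio(audio, seconds, samplingRate) {
  const maxSamples = Math.min(audio.length, Math.floor(seconds * samplingRate));
  return audio.slice(0, maxSamples);
}

// Example with a hypothetical 30-second clip at 16 kHz:
const clip = new Float32Array(30 * 16000);
const shortClip = trimAudio(clip, 10, 16000);
console.log(shortClip.length); // 160000 (10 seconds at 16 kHz)
```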
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Using Transformers.js with the Pyannote segmentation model opens up exciting possibilities for web-based audio processing. You now have the tools to identify speakers in audio recordings effectively. As you explore this intersection of web technology and AI, remember that your development practices can shape the future of audio applications.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

