In this guide, we will explore how to evaluate the results of the WhisperKit Automatic Speech Recognition (ASR) system using the LibriSpeech dataset. With various model formats and optimizations available, this evaluation process helps developers select the best version for their projects.
Understanding the Evaluation Metrics
To understand the WhisperKit evaluation results, think of it as a fitness competition where various athletes (models) compete in different categories. Just as athletes are judged on performance, consistency, and weight class, WhisperKit reports comparable metrics for each model:
- WER (Word Error Rate): This is akin to the athlete’s performance score, indicating how many word-level errors were made when transcribing speech. A lower score is better (a short computation sketch follows this list).
- QoI (Quality of Inference): Think of this as a consistency rating: the percentage of test examples on which an optimized model performs at least as well as the reference model. A higher QoI indicates better reliability.
- File Size: This represents the weight class of our athlete. Lighter (compressed) models download and load faster and use less memory, but may sacrifice some accuracy.
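To make the WER metric concrete, here is a minimal, self-contained sketch of how a word error rate can be computed from a reference transcript and a model’s hypothesis. The word_error_rate helper below is illustrative only and is not part of WhisperKit; real evaluations typically also normalize text (casing, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    # Word-level Levenshtein edit distance via dynamic programming.
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref_words)][len(hyp_words)] / max(len(ref_words), 1)


# Example: one substitution ("see" -> "sea") out of six reference words ~= 0.167
print(word_error_rate("i want to see the ocean", "i want to sea the ocean"))
```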
Evaluation Results Overview
The WhisperKit evaluation against the LibriSpeech dataset reports these metrics for several model variants. Here is a summary of the results obtained against the reference OpenAI Whisper setup:
| Model                                 | WER  | QoI (%) | File Size (MB) |
|---------------------------------------|------|---------|----------------|
| openai_whisper-large-v3               | 2.44 | 100     | 3100           |
| openai_whisper-large-v3_turbo         | 2.41 | 99.8    | 3100           |
| openai_whisper-large-v3_turbo_1307MB  | 2.6  | 97.7    | 1307           |
| openai_whisper-large-v3_turbo_1049MB  | 4.81 | 91      | 1049           |
| openai_whisper-large-v3_1053MB        | 4.65 | 90.8    | 1053           |
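One practical way to read this table is as a size/accuracy trade-off. The sketch below simply encodes the rows above and picks the lowest-WER variant that fits a given on-disk budget; the best_model_under helper is a hypothetical illustration, not part of WhisperKit.

```python
# Results from the table above: (model name, WER, QoI %, file size in MB).
RESULTS = [
    ("openai_whisper-large-v3",              2.44, 100.0, 3100),
    ("openai_whisper-large-v3_turbo",        2.41,  99.8, 3100),
    ("openai_whisper-large-v3_turbo_1307MB", 2.60,  97.7, 1307),
    ("openai_whisper-large-v3_turbo_1049MB", 4.81,  91.0, 1049),
    ("openai_whisper-large-v3_1053MB",       4.65,  90.8, 1053),
]

def best_model_under(size_budget_mb: int):
    """Return the lowest-WER model whose file size fits the budget, or None."""
    candidates = [row for row in RESULTS if row[3] <= size_budget_mb]
    return min(candidates, key=lambda row: row[1]) if candidates else None

# With roughly 1.5 GB of disk to spare, the 1307 MB turbo variant is the best fit.
print(best_model_under(1500))
```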
Insights from Different Projects
Let’s compare how WhisperKit stacks up against alternative implementations:
| Project    | WER  | Commit Hash | Model Format   |
|------------|------|-------------|----------------|
| WhisperKit | 2.44 | 0f8b4fe     | Core ML        |
| WhisperCpp | 2.36 | e72e415     | Core ML + GGUF |
| WhisperMLX | 2.69 | 614de66     | MLX (Numpy)    |
Quality of Inference (QoI) Certification
Measuring the QoI is essential for understanding how various model optimizations might affect speech recognition quality during production. Just like an athlete must focus on maintaining performance under varying conditions, developers must ensure their models do not regress qualitatively while running optimized code.
A structured assessment of no-regression scenarios is implemented through the following pseudocode:
```python
# Pseudocode for the QoI calculation. `wer`, `optimized_model`, and
# `reference_model` are assumed helpers: each model returns a transcript for
# the example's audio, and `wer` scores that transcript against the ground truth.
qoi = []
for example in dataset:
    # No regression: the optimized model is at least as accurate as the reference.
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100
```
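For example, if an optimized model shows no regression on 977 out of 1,000 test examples, its QoI is 97.7%. This is how a compressed variant such as openai_whisper-large-v3_turbo_1307MB can report a QoI below 100% even though its aggregate WER stays close to the full-size model’s.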
Reproducing Results
To reproduce the WhisperKit results, run the evaluation jobs on any Apple Silicon Mac. Our M2 Ultra devices can complete an evaluation in under an hour, while older Apple Silicon machines may take up to a day. Rerunning these benchmarks regularly is crucial for maintaining your competitive edge, much like an athlete’s ongoing training program.
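As a rough illustration of what such an evaluation job does, the sketch below loops over a LibriSpeech-style test set, transcribes each example, and accumulates WER along with wall-clock time. It reuses the word_error_rate helper from the earlier sketch; the transcribe and load_dataset callables are hypothetical placeholders for your WhisperKit (or other ASR) invocation and your dataset loader, not actual WhisperKit APIs.

```python
import time

def evaluate(transcribe, load_dataset):
    """Run a simple ASR evaluation: average per-example WER plus wall-clock time.

    `transcribe(audio_path) -> str` and `load_dataset() -> list[(audio_path, text)]`
    are hypothetical callables supplied by the caller.
    """
    dataset = load_dataset()
    start = time.time()
    total_wer = 0.0
    for audio_path, reference_text in dataset:
        hypothesis = transcribe(audio_path)
        total_wer += word_error_rate(reference_text, hypothesis)  # helper defined earlier
    elapsed = time.time() - start
    return total_wer / len(dataset), elapsed

# Usage (with your own loader and ASR call):
# avg_wer, seconds = evaluate(transcribe=my_asr_call, load_dataset=my_librispeech_loader)
# print(f"Average WER: {avg_wer:.2%} in {seconds / 3600:.1f} h")
```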
Troubleshooting
If you run into issues while trying to replicate the results or when using WhisperKit in your projects, consider the following troubleshooting tips:
- Ensure your hardware meets the minimum requirements, specifically for running Apple Silicon-based evaluations.
- Check for updates in the WhisperKit repository to avoid issues related to outdated model versions.
- If results are inconsistent, look into your dataset quality and ensure it’s aligned with WhisperKit’s testing parameters.
- For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

