In this guide, we will explore how to evaluate the results of the WhisperKit Automatic Speech Recognition (ASR) system using the LibriSpeech dataset. With various model formats and optimizations available, this evaluation process helps developers select the best version for their projects.
Understanding the Evaluation Metrics
To understand the WhisperKit evaluation results, think of it as a fitness competition where various athletes (models) compete in different categories. Just as athletes are judged on performance, consistency, and weight class, WhisperKit reports comparable metrics for each model:
- WER (Word Error Rate): This is akin to the athlete’s performance score, indicating how many word-level errors were made when transcribing speech. A lower score is better (a short computation sketch follows this list).
- QoI (Quality of Inference): Think of this as a consistency rating: the percentage of test examples on which an optimized model performs at least as well as the reference model. A higher QoI indicates better reliability.
- File Size: This represents the weight class of our athlete. Lighter (compressed) models download and load faster and use less memory, but may sacrifice some accuracy.
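To make the WER metric concrete, here is a minimal, self-contained sketch of how a word error rate can be computed from a reference transcript and a model’s hypothesis. The word_error_rate helper below is illustrative only and is not part of WhisperKit; real evaluations typically also normalize text (casing, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    # Word-level Levenshtein edit distance via dynamic programming.
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref_words)][len(hyp_words)] / max(len(ref_words), 1)


# Example: one substitution ("see" -> "sea") out of six reference words ~= 0.167
print(word_error_rate("i want to see the ocean", "i want to sea the ocean"))
```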
Evaluation Results Overview
The WhisperKit evaluation against the LibriSpeech dataset reports these metrics for several model variants. Here is a summary of the results obtained against the reference OpenAI Whisper setup:
| Model                                 | WER  | QoI (%) | File Size (MB) |
|---------------------------------------|------|---------|----------------|
| openai_whisper-large-v3               | 2.44 | 100     | 3100           |
| openai_whisper-large-v3_turbo         | 2.41 | 99.8    | 3100           |
| openai_whisper-large-v3_turbo_1307MB  | 2.6  | 97.7    | 1307           |
| openai_whisper-large-v3_turbo_1049MB  | 4.81 | 91      | 1049           |
| openai_whisper-large-v3_1053MB        | 4.65 | 90.8    | 1053           |
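One practical way to read this table is as a size/accuracy trade-off. The sketch below simply encodes the rows above and picks the lowest-WER variant that fits a given on-disk budget; the best_model_under helper is a hypothetical illustration, not part of WhisperKit.

```python
# Results from the table above: (model name, WER, QoI %, file size in MB).
RESULTS = [
    ("openai_whisper-large-v3",              2.44, 100.0, 3100),
    ("openai_whisper-large-v3_turbo",        2.41,  99.8, 3100),
    ("openai_whisper-large-v3_turbo_1307MB", 2.60,  97.7, 1307),
    ("openai_whisper-large-v3_turbo_1049MB", 4.81,  91.0, 1049),
    ("openai_whisper-large-v3_1053MB",       4.65,  90.8, 1053),
]

def best_model_under(size_budget_mb: int):
    """Return the lowest-WER model whose file size fits the budget, or None."""
    candidates = [row for row in RESULTS if row[3] <= size_budget_mb]
    return min(candidates, key=lambda row: row[1]) if candidates else None

# With roughly 1.5 GB of disk to spare, the 1307 MB turbo variant is the best fit.
print(best_model_under(1500))
```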
Insights from Different Projects
Let’s compare how WhisperKit stacks up against alternative implementations:
| Project    | WER  | Commit Hash | Model Format   |
|------------|------|-------------|----------------|
| WhisperKit | 2.44 | 0f8b4fe     | Core ML        |
| WhisperCpp | 2.36 | e72e415     | Core ML + GGUF |
| WhisperMLX | 2.69 | 614de66     | MLX (Numpy)    |
Quality of Inference (QoI) Certification
Measuring the QoI is essential for understanding how various model optimizations might affect speech recognition quality during production. Just like an athlete must focus on maintaining performance under varying conditions, developers must ensure their models do not regress qualitatively while running optimized code.
A structured assessment of no-regression scenarios is implemented through the following pseudocode:
```python
# Pseudocode for the QoI calculation. `wer`, `optimized_model`, and
# `reference_model` are assumed helpers: each model returns a transcript for
# the example's audio, and `wer` scores that transcript against the ground truth.
qoi = []
for example in dataset:
    # No regression: the optimized model is at least as accurate as the reference.
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100
```
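For example, if an optimized model shows no regression on 977 out of 1,000 test examples, its QoI is 97.7%. This is how a compressed variant such as openai_whisper-large-v3_turbo_1307MB can report a QoI below 100% even though its aggregate WER stays close to the full-size model’s.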
Reproducing Results
To reproduce the WhisperKit results, run the evaluation jobs on any Apple Silicon Mac. Our M2 Ultra devices can complete an evaluation in under an hour, while older Apple Silicon machines may take up to a day. Rerunning these benchmarks regularly is crucial for maintaining your competitive edge, much like an athlete’s ongoing training program.
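As a rough illustration of what such an evaluation job does, the sketch below loops over a LibriSpeech-style test set, transcribes each example, and accumulates WER along with wall-clock time. It reuses the word_error_rate helper from the earlier sketch; the transcribe and load_dataset callables are hypothetical placeholders for your WhisperKit (or other ASR) invocation and your dataset loader, not actual WhisperKit APIs.

```python
import time

def evaluate(transcribe, load_dataset):
    """Run a simple ASR evaluation: average per-example WER plus wall-clock time.

    `transcribe(audio_path) -> str` and `load_dataset() -> list[(audio_path, text)]`
    are hypothetical callables supplied by the caller.
    """
    dataset = load_dataset()
    start = time.time()
    total_wer = 0.0
    for audio_path, reference_text in dataset:
        hypothesis = transcribe(audio_path)
        total_wer += word_error_rate(reference_text, hypothesis)  # helper defined earlier
    elapsed = time.time() - start
    return total_wer / len(dataset), elapsed

# Usage (with your own loader and ASR call):
# avg_wer, seconds = evaluate(transcribe=my_asr_call, load_dataset=my_librispeech_loader)
# print(f"Average WER: {avg_wer:.2%} in {seconds / 3600:.1f} h")
```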
Troubleshooting
If you run into issues while trying to replicate the results or when using WhisperKit in your projects, consider the following troubleshooting tips:
- Ensure your hardware meets the minimum requirements, specifically for running Apple Silicon-based evaluations.
- Check for updates in the WhisperKit repository to avoid issues related to outdated model versions.
- If results are inconsistent, look into your dataset quality and ensure it’s aligned with WhisperKit’s testing parameters.
- For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

