Live Transcription with Whisper PoC in a Server-Client Setup

Mar 8, 2021 | Educational

Welcome to our guide on setting up a live transcription Proof of Concept (PoC) using the Whisper model in a server-client configuration! Multiple clients can connect to a single server that exposes a REST API, so the model is hosted once and shared by every client.

Understanding the Whisper Live Transcription Setup

Imagine hosting a dinner party where you are responsible for writing down everything your guests say so that everyone can follow along. In this scenario, the server acts as your note-taking assistant, listening to guests (clients) as they speak and producing a written record (the transcription) that everyone can read, all in real time. In the same way, our setup takes in audio as it arrives and returns text right away.

Setting Up the Environment

  • Ensure you have Python installed on your machine.
  • Install the required packages:
$ pip install -r requirements.txt
  • Create a directory for the models:
$ mkdir models

Running the Server and Client

Now that your environment is set up, let’s get the server and client running.

  1. Before starting, adjust the parameters in server.py as needed.
  2. Start the server's REST API:
$ python server.py
  3. Launch the Gradio client on localhost:
$ python ui_client.py
  4. To use HTTPS without supplying certificates, run:
$ SHARE=1 python ui_client.py
  5. To use HTTPS with your own certificates, run:
$ SSL_CERT_PATH=PATH SSL_KEY_PATH=PATH python ui_client.py
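
The guide does not show how ui_client.py reads these variables. As a rough sketch, assuming the script simply forwards them to Gradio's launch() (which accepts share, ssl_certfile, and ssl_keyfile), the mapping could look something like this. The helper name launch_kwargs_from_env is hypothetical and not taken from the project.

```python
import os


def launch_kwargs_from_env(env=None):
    """Build keyword arguments for Gradio's launch() from environment variables.

    Assumption: SHARE=1 requests a share link, and SSL_CERT_PATH / SSL_KEY_PATH
    point at a certificate and private key. The real ui_client.py may differ.
    """
    env = os.environ if env is None else env
    kwargs = {}

    if env.get("SHARE") == "1":
        kwargs["share"] = True

    cert = env.get("SSL_CERT_PATH")
    key = env.get("SSL_KEY_PATH")
    if cert and key:
        kwargs["ssl_certfile"] = cert
        kwargs["ssl_keyfile"] = key
    elif cert or key:
        # A certificate without its key (or the reverse) cannot serve HTTPS.
        raise ValueError("set both SSL_CERT_PATH and SSL_KEY_PATH, or neither")

    return kwargs


if __name__ == "__main__":
    print(launch_kwargs_from_env({"SHARE": "1"}))
```

With Gradio installed, the result could be passed along as demo.launch(**launch_kwargs_from_env()), where demo stands for whatever interface object the client builds.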

How It Works

The transcription process depends on timing. The server does not wait for the speaker to finish: as each second of audio arrives, it transcribes what has been captured so far, so the text grows along with the audio. To illustrate, here is how the output unfolds over time:

  • 1st second: The system starts capturing audio and outputs “Hi”.
  • 2nd second: It continues, enriching the transcription with “Hi I am”.
  • As the audio progresses, it captures longer sentences like “Hi I am the one and only Gabor”.

When a new stretch of audio begins, the process starts over with a fresh transcription, just like our attentive assistant at the dinner party!
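
The growing-transcript behaviour can be sketched with a rolling buffer that is re-transcribed every time a new chunk arrives. This is only an illustration under stated assumptions: RollingTranscriber and fake_transcribe are hypothetical names, the stand-in transcriber replaces a real Whisper call, and the "start over once the window is full" rule is a guess at the restart policy rather than the project's actual logic. The 16 kHz sample rate is what Whisper models expect as input.

```python
from typing import Callable, List

SAMPLE_RATE = 16_000  # Whisper models take 16 kHz mono audio


class RollingTranscriber:
    """Accumulate audio chunks and re-transcribe the whole buffer on each push."""

    def __init__(self, transcribe: Callable[[List[float]], str], max_seconds: int = 30):
        self.transcribe = transcribe
        self.max_samples = max_seconds * SAMPLE_RATE
        self.buffer: List[float] = []

    def push(self, chunk: List[float]) -> str:
        # Assumed restart policy: if the new chunk would overflow the window,
        # drop the old audio and begin again with this chunk.
        if len(self.buffer) + len(chunk) > self.max_samples:
            self.buffer = []
        self.buffer.extend(chunk)
        return self.transcribe(self.buffer)


WORDS = "Hi I am the one and only Gabor".split()


def fake_transcribe(samples: List[float]) -> str:
    """Stand-in for a model call: pretend one word is spoken per second."""
    seconds = len(samples) // SAMPLE_RATE
    return " ".join(WORDS[:seconds])


if __name__ == "__main__":
    transcriber = RollingTranscriber(fake_transcribe, max_seconds=30)
    one_second = [0.0] * SAMPLE_RATE
    for _ in range(3):
        print(transcriber.push(one_second))
```

Re-transcribing the full buffer costs more compute per step than transcribing only the newest chunk, but it lets later audio revise earlier words, which is the trade-off the timeline above implies.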

Enhancements and Improvements

To take this setup to the next level, consider implementing the following improvements:

  • Use Voice Activity Detection (VAD) on the client side so that audio is sent for transcription only once a prolonged silence (e.g., one second) marks the end of an utterance.
  • Transcribe shorter timeframes to produce results sooner, and correct the text afterwards as more audio arrives.
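
As a rough illustration of the VAD idea, the sketch below uses a simple energy threshold to cut a stream of frames into utterances, emitting one whenever about a second of silence follows speech. Everything here is an assumption for demonstration purposes: the function names, the 0.1-second frame length, and the threshold value are hypothetical, and a real client would more likely rely on a trained VAD model than on raw energy.

```python
import math
from typing import Iterable, Iterator, List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def segment_on_silence(
    frames: Iterable[Sequence[float]],
    frame_seconds: float = 0.1,
    silence_seconds: float = 1.0,
    threshold: float = 0.01,
) -> Iterator[List[float]]:
    """Yield one list of samples per utterance.

    An utterance ends once `silence_seconds` of consecutive quiet frames
    follow some speech. Quiet frames themselves are not kept.
    """
    frames_needed = round(silence_seconds / frame_seconds)
    speech: List[float] = []
    quiet_run = 0

    for frame in frames:
        if rms(frame) >= threshold:
            speech.extend(frame)
            quiet_run = 0
        elif speech:
            quiet_run += 1
            if quiet_run >= frames_needed:
                yield speech  # long enough pause: hand the utterance over
                speech, quiet_run = [], 0

    if speech:
        yield speech  # flush whatever is left when the stream ends


if __name__ == "__main__":
    loud, quiet = [0.5] * 4, [0.0] * 4
    stream = [loud, loud] + [quiet] * 10 + [loud]
    print([len(u) for u in segment_on_silence(stream)])
```

Each yielded utterance is what the client would send to the server, which keeps silent stretches off the network and away from the model.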

Troubleshooting

While setting up your Whisper live transcription PoC, you may encounter some challenges:

  • If you find that the server is not starting as expected, ensure you have the correct versions of Python and the required packages installed.
  • For any connection issues between the server and client, recheck your network settings and ensure that the necessary ports are open.
  • If your audio quality is poor, consider using a high-quality microphone to improve input clarity.
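
For the connection issue above, a quick way to see whether the server's port is reachable is a plain TCP connection attempt. This is a minimal sketch: port_is_open is a hypothetical helper, and the host and port in the usage line are placeholders, since the guide does not state which port server.py listens on.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder values: replace with the address configured in server.py.
    print(port_is_open("127.0.0.1", 8000))
```

A False result from the client machine while the server is running usually points at a firewall rule or at the server being bound to a different interface.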

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
