Welcome to the world of cutting-edge audio language models! In this article, we’ll explore how to utilize the revolutionary Qwen-Audio and Qwen-Audio-Chat, both developed by Alibaba Cloud. By the end of our journey, you’ll have a solid understanding of how to implement these models and troubleshoot any issues that arise along the way.
Introduction to Qwen-Audio
Qwen-Audio is a multimodal large language model designed to process a variety of audio inputs, including human speech, natural sounds, and music. The model extends the Qwen series and serves as a universal audio understanding platform: it is trained across a wide range of audio tasks within a single multi-task learning framework, making it a powerful tool for developers and researchers alike.
System Requirements
- Python 3.8 or above
- PyTorch 1.12 or higher (2.0 recommended)
- CUDA 11.4 or above (for GPU users)
- FFmpeg
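Before moving on, it helps to sanity-check your machine against this list. Below is a minimal sketch, assuming PyTorch is already installed; note that a GPU is optional, since the model can also run (slowly) on CPU:

import shutil
import sys
import torch

# Python 3.8 or above
assert sys.version_info >= (3, 8), "Python 3.8+ is required"

# PyTorch 1.12 or above (2.0 recommended)
print("PyTorch version:", torch.__version__)

# CUDA is only needed for GPU inference
print("CUDA available:", torch.cuda.is_available())

# FFmpeg must be on PATH for audio processing
print("FFmpeg on PATH:", shutil.which("ffmpeg") is not None)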
Quickstart Guide
To begin using Qwen-Audio, you need to set up your environment and install the required packages from the official Qwen-Audio repository. Here’s how to get started:
Install Dependencies
pip install -r requirements.txt
Using Transformers with Qwen-Audio-Chat
Once your environment is ready, you can run Qwen-Audio-Chat with just a few lines of code. Think of the process as following a recipe: you have your ingredients (audio and text inputs) ready, and now you just need to mix them correctly to bake a delicious dialogue experience!
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch

torch.manual_seed(1234)

# Note: the Hugging Face model ID is "Qwen/Qwen-Audio-Chat" (slash, not hyphen)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio-Chat", device_map="cuda", trust_remote_code=True).eval()
# Load the generation settings that ship with the checkpoint
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)

# Build a query from an audio clip (URL or local path) and a text prompt;
# audio and text go in separate entries of the list
query = tokenizer.from_list_format([
    {"audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac"},
    {"text": "what does the person say?"},
])

# First dialogue turn: no prior history yet
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
This snippet starts a dialogue with Qwen-Audio-Chat from an audio clip and a matching text prompt.
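If you plan to query several clips, you can wrap these steps in a small helper. The ask_about_audio function below is a hypothetical convenience wrapper around the same from_list_format and chat calls; the audio field accepts either a URL or a local file path:

def ask_about_audio(audio_source, prompt, history=None):
    """Hypothetical helper: ask one question about one audio clip."""
    query = tokenizer.from_list_format([
        {"audio": audio_source},  # URL or local path
        {"text": prompt},
    ])
    return model.chat(tokenizer, query=query, history=history)

# Example usage (the file path is illustrative):
# response, history = ask_about_audio("samples/meeting.flac", "Summarize what is said.")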
Handling Multiple Turns in Dialogue
Continuing the analogy of baking, if our first batch of cookies was a success, we can ask a follow-up question within the same conversation rather than starting from scratch.
# Second dialogue turn: a plain-text follow-up, no new audio needed
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
Passing history back in lets the model refer to the previous turn and maintain context, akin to a chef checking their notes on earlier batches to improve the next.
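Because model.chat returns the updated history, you can keep threading it through later calls to hold a longer conversation. A minimal sketch, with illustrative follow-up questions:

follow_ups = [
    "How many speakers are there?",
    "Describe any background noise.",
]

for question in follow_ups:
    # Each call appends the new turn to the running history
    response, history = model.chat(tokenizer, question, history=history)
    print(question, "->", response)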
Troubleshooting Tips
While using Qwen-Audio and Qwen-Audio-Chat, you may encounter some issues. Here are a few troubleshooting ideas:
- Ensure that all dependencies are installed correctly. Double-check the list of requirements.
- Make sure your libraries are up to date; upgrading them via pip (for example, pip install --upgrade transformers) often resolves version mismatches.
- If you get an error regarding the device being used (CPU or GPU), check that your environment settings match your hardware; you can also select the device explicitly, as in the sketch below.
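For the device issue in particular, a minimal sketch of explicit device selection, reusing the model ID from above; falling back to CPU avoids CUDA errors on GPU-less machines at the cost of speed:

import torch
from transformers import AutoModelForCausalLM

# Pick a device map based on the hardware that is actually present
device_map = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat",
    device_map=device_map,  # "cpu" works, but generation will be slow
    trust_remote_code=True,
).eval()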
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.