Music Genre Classification using Wav2Vec 2.0

Jul 6, 2021 | Educational

In audio classification, identifying the genre of a piece of music can be as hard as deciphering a foreign language. Wav2Vec 2.0, a model originally built for speech, makes the task far more approachable. This blog post will guide you through classifying music genres with a pretrained Wav2Vec 2.0 classifier, just like tuning a radio to catch the perfect frequency!

How to Use

Before we dive into the intricate details, let’s ensure you have everything in place.

Requirements

First, install the required packages. The leading `!` runs a shell command from a notebook cell (for example, Jupyter or Colab); in a regular terminal, omit it:

!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
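
After installing, it can help to confirm that the packages are visible to the Python interpreter you are actually running. The snippet below is a minimal sketch that uses only the standard library; the helper name `missing_packages` is made up for this post, and the list assumes the import names match the package names above.

```python
import importlib.util

# Import names of the packages installed above (assumed to match the pip names).
REQUIRED = ["datasets", "transformers", "torchaudio", "librosa"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported missing, re-run the matching install command in the same environment.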

Prediction

Now that you have the necessary tools, it’s time to get your hands dirty with some code. We’ll walk through the process of predicting music genres using Wav2Vec 2.0.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor

# Initialize the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the model configuration and feature extractor
model_name_or_path = "m3hrdadfi/wav2vec2-base-100k-voxpopuli-gtzan-music"
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate

# NOTE: Wav2Vec2ForSpeechClassification is not imported above and is not a class shipped with transformers.
# It is a custom model class that must be defined or imported before this line runs.
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

# Load an audio file and resample it to the rate the model expects
def speech_file_to_array_fn(path, sampling_rate):
    speech_array, _sampling_rate = torchaudio.load(path)
    # Pass the target rate explicitly instead of relying on Resample's default
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech

# Prediction function
def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors='pt', padding=True)
    inputs = {key: inputs[key].to(device) for key in inputs}

    with torch.no_grad():
        logits = model(**inputs).logits
    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{'Label': config.id2label[i], 'Score': f"{round(score * 100, 3):.1f}%" } for i, score in enumerate(scores)]
    return outputs

path = "genres_original/disco/disco.00067.wav"
outputs = predict(path, sampling_rate)
outputs
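
Because `predict` returns one entry per genre, you will often want only the strongest few. Here is a minimal sketch of a ranking helper. The name `top_k` and the sample scores are made up for illustration; the sketch only assumes the `Label` and `Score` keys and the percentage string format produced by the code above.

```python
def top_k(outputs, k=3):
    """Return the k highest-scoring entries from a predict()-style list."""
    def score_value(entry):
        # 'Score' is a string such as "87.3%"; strip the sign before parsing.
        return float(entry["Score"].rstrip("%"))
    return sorted(outputs, key=score_value, reverse=True)[:k]

# Made-up scores, for illustration only (not real model output).
example_outputs = [
    {"Label": "jazz", "Score": "3.5%"},
    {"Label": "disco", "Score": "71.2%"},
    {"Label": "rock", "Score": "9.9%"},
    {"Label": "pop", "Score": "15.4%"},
]

for entry in top_k(example_outputs, k=2):
    print(entry["Label"], entry["Score"])
```

With real predictions, pass the `outputs` list from above in place of `example_outputs`.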

Understanding the Code: An Analogy

Imagine you’re hosting a party where different genres of music are played, and you’re tasked with identifying each one. In our code, we first prepare the audio tracks just like setting up each song in a playlist. The `model` acts as your discerning ear, trained to recognize various music genres.

The `speech_file_to_array_fn` function is like the DJ who cues up each track: it loads the audio file and resamples it to the rate the model expects.
The `predict` function is like asking that discerning ear for a verdict: it passes the prepared audio through the model and converts the raw logits into percentages with a softmax.
Finally, `outputs` lists the verdict for every genre, with a score showing how confident the model is in each one.
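
To make the scoring step concrete, here is a small sketch of how raw logits become percentage scores. It reimplements softmax in plain Python instead of calling `F.softmax`, and the three-genre label map and logit values are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label map and logits, for illustration only.
id2label = {0: "blues", 1: "disco", 2: "rock"}
logits = [0.5, 2.0, -1.0]

scores = softmax(logits)
for i, score in enumerate(scores):
    print(f"{id2label[i]}: {score * 100:.1f}%")
```

The largest logit always receives the largest share, and the shares sum to 100%, which is why the scores can be read as relative confidence across genres.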

Evaluation

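Now let’s look at how well the model performed. The record below lists precision, recall, and F1-score for each genre on a test set with 20 samples per genre, along with an overall accuracy of 77.5%: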

record = {
     'label': ['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae', 'rock'],
     'precision': [0.792, 0.864, 0.812, 0.778, 0.933, 1.000, 0.783, 0.917, 0.543, 0.611],
     'recall': [0.950, 0.950, 0.650, 0.700, 0.700, 0.850, 0.900, 0.550, 0.950, 0.550],
     'f1-score': [0.864, 0.905, 0.722, 0.737, 0.800, 0.919, 0.837, 0.687, 0.691, 0.579],
     'support': [20]*10,
     'accuracy': 0.775,
     'macro avg': {'precision':0.803,'recall':0.775,'f1-score':0.774},
     'weighted avg': {'precision':0.803,'recall':0.775,'f1-score':0.774}
}
print(record)
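
As a sanity check on these numbers, the sketch below recomputes each F1-score from the listed precision and recall using the standard formula F1 = 2PR / (P + R), and recomputes the macro-averaged recall. The values are copied from the record above; small differences in the third decimal are expected because the inputs are already rounded.

```python
# Values copied from the record above, in the same genre order.
precision = [0.792, 0.864, 0.812, 0.778, 0.933, 1.000, 0.783, 0.917, 0.543, 0.611]
recall = [0.950, 0.950, 0.650, 0.700, 0.700, 0.850, 0.900, 0.550, 0.950, 0.550]
reported_f1 = [0.864, 0.905, 0.722, 0.737, 0.800, 0.919, 0.837, 0.687, 0.691, 0.579]

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

recomputed_f1 = [f1(p, r) for p, r in zip(precision, recall)]
max_gap = max(abs(a - b) for a, b in zip(recomputed_f1, reported_f1))
macro_recall = sum(recall) / len(recall)

print(f"largest F1 gap vs. reported: {max_gap:.4f}")
print(f"macro-averaged recall: {macro_recall:.3f}")
```

Since every genre has the same support, the macro and weighted averages coincide, and overall accuracy equals the macro-averaged recall.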

Troubleshooting

While following this guide, you might encounter some hurdles. Here are a few common issues and their solutions:

Installation Failures: Ensure you have Python and pip installed. Sometimes, reinstalling the packages resolves hidden issues.
Model Loading Errors: Check your internet connection; the model configuration and weights are downloaded the first time they are loaded.
Performance Issues: If running on a CPU, the process may be slow. Consider using a GPU if available.

If you still face issues, feel free to share your problem or ask questions by opening a GitHub issue.


Conclusion

Music genre classification using Wav2Vec 2.0 exemplifies how AI can bring clarity to complex tasks, transforming them into systematic operations. At [fxis.ai](https://fxis.ai), we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
