If you are venturing into the realm of audio processing and machine learning, the Free Spoken Digit Dataset (FSDD) is an invaluable resource. This post will guide you through using the dataset effectively, troubleshooting common issues, and sparking some creativity in your projects!
What is the Free Spoken Digit Dataset?
The FSDD consists of audio recordings of spoken digits captured at a sample rate of 8kHz. The dataset contains recordings from six speakers, for a total of 3,000 recordings (50 recordings of each digit per speaker). Each audio file is trimmed to minimize silence at the beginning and end.
Structure of the Dataset
- Files are named in the format: digitLabel_speakerName_index.wav (e.g., 7_jackson_32.wav); a quick parsing example follows below.
- Across its 6 speakers and 3,000 recordings, the dataset covers clear English pronunciations of the digits 0 through 9.
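As a small, purely illustrative sketch of that naming convention (the filename below is just the example from above), you can split a name into its three parts:

filename = '7_jackson_32.wav'
stem = filename.rsplit('.', 1)[0]        # drop the .wav extension
digit, speaker, index = stem.split('_')  # -> '7', 'jackson', '32'
print(digit, speaker, index)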
Getting Started with the Dataset
To make the most out of the FSDD, you’ll want to integrate it with Activeloop’s Python package called Hub. Follow these steps for a seamless experience:
Step 1: Install Hub
pip install hub
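If you want to confirm the installation before moving on, a quick import check and pip's own metadata listing are enough:

python -c "import hub"
pip show hub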
Step 2: Load the Dataset
import hub
ds = hub.load('hub://activeloop/spoken_mnist')
Step 3: Visualize Spectrogram
Now, let’s visualize the first spectrogram in the dataset:
import matplotlib.pyplot as plt
plt.imshow(ds.spectrograms[0].numpy())
plt.title(f'{ds.speakers[0].data()} spoke {ds.labels[0].numpy()}')
plt.show()
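The dataset also exposes the raw waveforms through its audio tensor (listed in Step 5), so you can plot the signal itself as well; this is a minimal sketch that assumes the same indexing pattern as above:

waveform = ds.audio[0].numpy()   # raw 8kHz samples of the first recording
plt.plot(waveform)
plt.title(f'Waveform of digit {ds.labels[0].numpy()}')
plt.xlabel('Sample index (8kHz)')
plt.ylabel('Amplitude')
plt.show()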
Step 4: Train a Model
Whether you are using PyTorch or TensorFlow, you’re covered!
- For PyTorch:
for sample in ds.pytorch():
    # ... model code here ...
- For TensorFlow:
for sample in ds.tensorflow():
    # ... model code here ...
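The placeholders above are where your training logic goes. As a minimal PyTorch sketch, here is one possible loop; the tiny linear model, the sample keys ('spectrograms', 'labels'), and the reshaping are assumptions you should adapt to the actual tensor shapes in your copy of the dataset:

import torch
from torch import nn

# Hypothetical classifier: flatten each spectrogram and map it to 10 digit classes.
model = nn.Sequential(nn.Flatten(), nn.LazyLinear(10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for sample in ds.pytorch():
    spectrograms = sample['spectrograms'].float()  # assumed key name, matching the tensor name
    labels = sample['labels'].long().flatten()     # assumed key name, matching the tensor name
    optimizer.zero_grad()
    loss = loss_fn(model(spectrograms), labels)
    loss.backward()
    optimizer.step()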
Step 5: Understand the Dataset
Curious about what tensors are available? Simply print the dataset:
print(ds)
This command will show you details such as:
Dataset(path=hub://activeloop/spoken_mnist, tensors=[spectrograms, labels, audio, speakers])
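You can also inspect individual tensors for their shapes; the exact attributes may vary slightly between Hub versions, so treat this as a sketch:

print(ds.spectrograms.shape)
print(ds.audio.shape)
print(ds.labels.shape)
print(ds.speakers.shape)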
Using the Dataset Effectively
The official test set consists of the first 10% of each speaker's recordings, giving a consistent split between training and test data (a sketch for reproducing it from the raw files follows the list):
- Test set: Recordings numbered 0-4 (inclusive)
- Training set: Recordings numbered 5-49
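If you are working from the raw .wav files rather than the Hub copy, the split can be reproduced directly from the filename convention; this sketch assumes the recordings sit in a local recordings/ directory:

import os

recordings_dir = 'recordings'  # assumed local path to the FSDD wav files
train_files, test_files = [], []

for name in os.listdir(recordings_dir):
    if not name.endswith('.wav'):
        continue
    index = int(name.rsplit('_', 1)[1].split('.')[0])  # trailing index in digit_speaker_index.wav
    (test_files if index <= 4 else train_files).append(name)

print(len(train_files), 'training files,', len(test_files), 'test files')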
Troubleshooting Tips
As you dive into this exciting project, it’s natural to encounter some bumps along the way. Here are a few common issues and solutions:
- Installation Issues: Ensure that you are using a compatible version of Python. It is best to use Python 3.6 or later.
- Loading Errors: If the dataset doesn’t load as expected, double-check the dataset path and your network connection.
- Visualizations Not Displaying: Make sure your environment supports interactive plotting, or use a Jupyter notebook.
Contributions and Metadata
If you want to contribute your own recordings to the FSDD, make sure to adhere to the following guidelines (a conversion example follows the list):
- Use mono 8kHz wav files.
- Trim any silence before and after your recordings.
- Update the metadata in metadata.py with details related to speaker gender and accents.
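If your source recording is not already mono 8kHz, one way to convert it is with librosa and soundfile (shown purely as an illustration; these libraries are not part of the FSDD tooling, and the filenames are placeholders):

import librosa
import soundfile as sf

# Load as mono audio resampled to 8kHz, then write it back out as a wav file.
audio, sr = librosa.load('my_recording.wav', sr=8000, mono=True)
sf.write('3_myname_0.wav', audio, sr)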
Utilities Included
The package also includes some handy utilities:
- trimmer.py: Trims silences in your audio files.
- fsdd.py: Provides a user-friendly API for accessing the data.
- spectogramer.py: Creates spectrograms of the audio data for preprocessing (an illustrative sketch follows below).
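As an illustration of what that spectrogram preprocessing looks like, here is a sketch using scipy and matplotlib directly (not the repo's spectogramer.py; the filename is just the example from the naming convention):

import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

sample_rate, samples = wavfile.read('7_jackson_32.wav')
frequencies, times, spectrogram = signal.spectrogram(samples, sample_rate)

plt.pcolormesh(times, frequencies, spectrogram)
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [s]')
plt.show()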
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Set forth on your journey with the FSDD and keep crafting your own intelligent audio applications!