If you are venturing into the realm of audio processing and machine learning, the Free Spoken Digit Dataset (FSDD) is an invaluable resource. This post will guide you through using the dataset effectively, troubleshooting common issues, and sparking some creativity in your projects!
What is the Free Spoken Digit Dataset?
The FSDD consists of audio recordings of spoken digits captured at a sample rate of 8kHz. The dataset contains recordings from six speakers, for a total of 3,000 recordings (50 recordings of each digit per speaker). Each audio file is trimmed to minimize silence at the beginning and end.
Structure of the Dataset
- Files are named in the format: digitLabel_speakerName_index.wav (e.g., 7_jackson_32.wav); a quick parsing example follows below.
- Across its 6 speakers and 3,000 recordings, the dataset covers clear English pronunciations of the digits 0 through 9.
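As a small, purely illustrative sketch of that naming convention (the filename below is just the example from above), you can split a name into its three parts:

filename = '7_jackson_32.wav'
stem = filename.rsplit('.', 1)[0]        # drop the .wav extension
digit, speaker, index = stem.split('_')  # -> '7', 'jackson', '32'
print(digit, speaker, index)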
Getting Started with the Dataset
To make the most out of the FSDD, you’ll want to integrate it with Activeloop’s Python package called Hub. Follow these steps for a seamless experience:
Step 1: Install Hub
pip install hub
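If you want to confirm the installation before moving on, a quick import check and pip's own metadata listing are enough:

python -c "import hub"
pip show hub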
Step 2: Load the Dataset
import hub
ds = hub.load('hub://activeloop/spoken_mnist')
Step 3: Visualize Spectrogram
Now, let’s visualize the first spectrogram in the dataset:
import matplotlib.pyplot as plt
plt.imshow(ds.spectrograms[0].numpy())
plt.title(f'{ds.speakers[0].data()} spoke {ds.labels[0].numpy()}')
plt.show()
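The dataset also exposes the raw waveforms through its audio tensor (listed in Step 5), so you can plot the signal itself as well; this is a minimal sketch that assumes the same indexing pattern as above:

waveform = ds.audio[0].numpy()   # raw 8kHz samples of the first recording
plt.plot(waveform)
plt.title(f'Waveform of digit {ds.labels[0].numpy()}')
plt.xlabel('Sample index (8kHz)')
plt.ylabel('Amplitude')
plt.show()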
Step 4: Train a Model
Whether you are using PyTorch or TensorFlow, you’re covered!
- For PyTorch:
for sample in ds.pytorch():
    # ... model code here ...
- For TensorFlow:
for sample in ds.tensorflow():
    # ... model code here ...
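The placeholders above are where your training logic goes. As a minimal PyTorch sketch, here is one possible loop; the tiny linear model, the sample keys ('spectrograms', 'labels'), and the reshaping are assumptions you should adapt to the actual tensor shapes in your copy of the dataset:

import torch
from torch import nn

# Hypothetical classifier: flatten each spectrogram and map it to 10 digit classes.
model = nn.Sequential(nn.Flatten(), nn.LazyLinear(10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for sample in ds.pytorch():
    spectrograms = sample['spectrograms'].float()  # assumed key name, matching the tensor name
    labels = sample['labels'].long().flatten()     # assumed key name, matching the tensor name
    optimizer.zero_grad()
    loss = loss_fn(model(spectrograms), labels)
    loss.backward()
    optimizer.step()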
Step 5: Understand the Dataset
Curious about what tensors are available? Simply print the dataset:
print(ds)
This command will show you details such as:
Dataset(path=hub://activeloop/spoken_mnist, tensors=[spectrograms, labels, audio, speakers])
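You can also inspect individual tensors for their shapes; the exact attributes may vary slightly between Hub versions, so treat this as a sketch:

print(ds.spectrograms.shape)
print(ds.audio.shape)
print(ds.labels.shape)
print(ds.speakers.shape)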
Using the Dataset Effectively
The official test set consists of the first 10% of each speaker's recordings, giving a consistent split between training and test data (a sketch for reproducing it from the raw files follows the list):
- Test set: Recordings numbered 0-4 (inclusive)
- Training set: Recordings numbered 5-49
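If you are working from the raw .wav files rather than the Hub copy, the split can be reproduced directly from the filename convention; this sketch assumes the recordings sit in a local recordings/ directory:

import os

recordings_dir = 'recordings'  # assumed local path to the FSDD wav files
train_files, test_files = [], []

for name in os.listdir(recordings_dir):
    if not name.endswith('.wav'):
        continue
    index = int(name.rsplit('_', 1)[1].split('.')[0])  # trailing index in digit_speaker_index.wav
    (test_files if index <= 4 else train_files).append(name)

print(len(train_files), 'training files,', len(test_files), 'test files')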
Troubleshooting Tips
As you dive into this exciting project, it’s natural to encounter some bumps along the way. Here are a few common issues and solutions:
- Installation Issues: Ensure that you are using a compatible version of Python. It is best to use Python 3.6 or later.
- Loading Errors: If the dataset doesn’t load as expected, double-check the dataset path and your network connection.
- Visualizations Not Displaying: Make sure your environment supports interactive plotting, or use a Jupyter notebook.
Contributions and Metadata
If you want to contribute your own recordings to the FSDD, make sure to adhere to the following guidelines (a conversion example follows the list):
- Use mono 8kHz wav files.
- Trim any silence before and after your recordings.
- Update the metadata in metadata.py with details related to speaker gender and accents.
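If your source recording is not already mono 8kHz, one way to convert it is with librosa and soundfile (shown purely as an illustration; these libraries are not part of the FSDD tooling, and the filenames are placeholders):

import librosa
import soundfile as sf

# Load as mono audio resampled to 8kHz, then write it back out as a wav file.
audio, sr = librosa.load('my_recording.wav', sr=8000, mono=True)
sf.write('3_myname_0.wav', audio, sr)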
Utilities Included
The package also includes some handy utilities:
- trimmer.py: Trims silences in your audio files.
- fsdd.py: Provides a user-friendly API for accessing the data.
- spectogramer.py: Creates spectrograms of the audio data for preprocessing (an illustrative sketch follows below).
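As an illustration of what that spectrogram preprocessing looks like, here is a sketch using scipy and matplotlib directly (not the repo's spectogramer.py; the filename is just the example from the naming convention):

import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

sample_rate, samples = wavfile.read('7_jackson_32.wav')
frequencies, times, spectrogram = signal.spectrogram(samples, sample_rate)

plt.pcolormesh(times, frequencies, spectrogram)
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [s]')
plt.show()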
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Set forth on your journey with the FSDD and keep crafting your own intelligent audio applications!