How to Use SNAC: A Multi-Scale Neural Audio Codec

Apr 4, 2024 | Educational

Welcome to the world of sound compression! Today, we’ll explore how to use SNAC (Multi-Scale Neural Audio Codec), an innovative tool that compresses audio into discrete codes at remarkably low bitrates. Whether you’re into music or sound effects (SFX) generation, SNAC has got you covered. So let’s dive in!

Overview of SNAC

SNAC encodes audio into hierarchical tokens, much like other neural audio codecs such as SoundStream and EnCodec. The twist is that SNAC samples its coarse tokens less frequently, so each coarse token covers a broader span of time.

To put it simply, think of SNAC as a sophisticated “translator” that converts your musical compositions into a compact language of codes, making them easier to transmit and store. The 44 kHz model compresses audio to just 2.6 kbps using four Residual Vector Quantization (RVQ) levels at different token rates (roughly 14, 29, 57, and 115 Hz).
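
As a quick sanity check on that 2.6 kbps figure, the bitrate is simply the sum of the token rates multiplied by the bits per token. The sketch below assumes 4096-entry codebooks (12 bits per token), which is an assumption on our part rather than something stated above:

# Back-of-the-envelope bitrate estimate (assumes 4096-entry codebooks, i.e. 12 bits per token)
token_rates_hz = [14, 29, 57, 115]  # approximate token rates of the four RVQ levels
bits_per_token = 12                 # log2(4096); the codebook size is an assumption here
bitrate_kbps = sum(token_rates_hz) * bits_per_token / 1000
print(f"{bitrate_kbps:.1f} kbps")   # ~2.6 kbps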

Pretrained Models

Currently, SNAC only supports a single audio channel (mono). Below is a list of the pretrained models available:

  • hubertsiuzdak/snac_24khz – 0.98 kbps, 24 kHz, 19.8M Parameters, 🗣️ Speech
  • hubertsiuzdak/snac_32khz – 1.9 kbps, 32 kHz, 54.5M Parameters, 🎸 Music, Sound Effects
  • hubertsiuzdak/snac_44khz (the model used in the examples below) – 2.6 kbps, 44 kHz, 54.5M Parameters, 🎸 Music, Sound Effects

How to Install and Use SNAC

Follow these steps to get started with SNAC:

Installation

First, you need to install the SNAC library. Open your terminal and execute the following command:

pip install snac
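
To confirm the installation succeeded, try importing the SNAC class from the command line; any error here points to an installation problem rather than a usage one:

python -c "from snac import SNAC; print('SNAC imported successfully')"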

Encoding and Decoding Audio

Once installed, you can begin encoding and decoding audio with SNAC. Below is a sample code to help you get started:

import torch
from snac import SNAC

model = SNAC.from_pretrained('hubertsiuzdak/snac_44khz').eval().cuda()
audio = torch.randn(1, 1, 44100).cuda()  # (B, 1, T): one second of placeholder mono audio

with torch.inference_mode():
    codes = model.encode(audio)      # list of token tensors, one per RVQ level
    audio_hat = model.decode(codes)  # reconstructed waveform with the same shape as audio

Alternatively, you can encode and reconstruct audio in one seamless operation:

with torch.inference_mode():
    audio_hat, codes = model(audio)

Note that the output codes is a list of token sequences of varying lengths, each corresponding to a different temporal resolution. For example, you might see shapes like:

[code.shape[1] for code in codes]  # [16, 32, 64, 128]
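
The examples above use random noise as a stand-in for real audio. In practice you would load a file, mix it down to mono, and resample it to the model’s 44.1 kHz sample rate before encoding. Here is a minimal sketch that assumes torchaudio is installed and uses input.wav as a placeholder path:

import torchaudio

# Load a file, mix down to mono, and resample to the 44 kHz model's sample rate
wav, sr = torchaudio.load('input.wav')                 # (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                    # mono: (1, samples)
wav = torchaudio.functional.resample(wav, sr, 44100)   # match the model's sample rate
audio = wav.unsqueeze(0).cuda()                        # (1, 1, samples), as SNAC expects

with torch.inference_mode():
    audio_hat, codes = model(audio)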

Troubleshooting Tips

If you run into issues while using SNAC, here are some common troubleshooting steps:

  • Installation Issues: Make sure you have the latest version of pip. You can upgrade it using pip install --upgrade pip.
  • CUDA Errors: If your model isn’t recognizing CUDA, check that your GPU is properly configured and that the necessary NVIDIA drivers are installed. If no GPU is available, see the CPU-only sketch after this list.
  • Audio Quality Problems: Each pretrained checkpoint has a fixed RVQ configuration, so rather than tuning levels or token rates directly, try a different pretrained model (for example, the 32 kHz music model) whose sample rate and bitrate better match your audio.
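
If no GPU is available, SNAC should also run on CPU (more slowly, but fine for experimenting). Here is a minimal sketch of the same round trip with the .cuda() calls omitted:

import torch
from snac import SNAC

# CPU-only variant: simply omit the .cuda() calls
model = SNAC.from_pretrained('hubertsiuzdak/snac_44khz').eval()
audio = torch.randn(1, 1, 44100)  # (B, 1, T): one second of placeholder mono audio

with torch.inference_mode():
    audio_hat, codes = model(audio)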

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With SNAC, compressing music and sound effects has never been easier or more efficient. By utilizing its innovative encoding methodology, you can preserve the quality of audio while minimizing storage needs.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
