Welcome to the fascinating world of SoftVC VITS Singing Voice Conversion! This project lets you take your favorite anime characters and have them sing the tunes of your choice. It focuses purely on Singing Voice Conversion (SVC) and deliberately steers clear of traditional Text-to-Speech (TTS) functionality. In this article, we will guide you through getting started, setting up the necessary components, and troubleshooting common issues.
Getting Started with SoftVC VITS SVC
Before diving in, ensure you have Python 3.8.9 installed on your machine. The following steps will help you set up the environment:
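A quick way to verify the version and work in an isolated environment (assuming a Unix-like shell; on Windows, activate with venv\Scripts\activate instead):
python --version
python -m venv venv
source venv/bin/activate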
Step 1: Selecting Your Encoder
Select one encoder from the following options:
- ContentVec: Recommended – hubert_base.pt
- HubertSoft: hubert-soft-0d54a1f4.pt
- Whisper-PPG: medium.pt
- WavLM: WavLM-Base+
Step 2: Download Pre-trained Model Files
Download the checkpoint for the encoder you selected in Step 1 and place it in the designated pretrain directory.
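For example, if you selected ContentVec and your browser saved hubert_base.pt to the Downloads folder (the download location is an assumption; adjust the path to match your system):
mkdir -p pretrain
mv ~/Downloads/hubert_base.pt pretrain/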
Step 3: Preparing Your Dataset
Format your dataset within the dataset_raw directory as follows:
dataset_raw
├── speaker0
│   ├── xxx1-xxx1.wav
│   └── Lxx-0xx8.wav
└── speaker1
    └── xx2-0xxx2.wav
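Before moving on, it can help to verify the layout programmatically. The following minimal Python sketch (the directory name and duration bounds reflect the suggestions in this article; adjust as needed) walks dataset_raw and flags files that are not WAV or fall outside the suggested 5–15 second clip length:
import wave
from pathlib import Path

DATASET = Path("dataset_raw")

for speaker_dir in sorted(DATASET.iterdir()):
    if not speaker_dir.is_dir():
        continue  # skip stray files at the top level
    for clip in sorted(speaker_dir.iterdir()):
        if clip.suffix.lower() != ".wav":
            print(f"Not a WAV file: {clip}")
            continue
        # Read the WAV header to compute the clip duration in seconds
        with wave.open(str(clip), "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
        if not 5.0 <= duration <= 15.0:
            print(f"{clip}: {duration:.1f}s is outside the suggested 5-15s range")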
Understanding the Code Functionality
SoftVC VITS combines several models, each with a specific role in the Singing Voice Conversion pipeline, much like an orchestra in which each musician plays an individual part but all collaborate to produce a harmonious sound. Here is a closer breakdown of how this works:
- The content encoder (ContentVec, HubertSoft, Whisper-PPG, or WavLM) extracts speaker-independent features from the source recording, capturing what is being sung while discarding who is singing it.
- The fundamental frequency (F0), i.e. the pitch contour of the source recording, is extracted separately, so the melody survives the conversion intact.
- The VITS-based decoder then resynthesizes audio from the content features and the pitch contour, conditioned on the target speaker, much as an arrangement brings the individual parts together into a finished performance.
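Conceptually, the whole pipeline can be summarized in a few lines of Python. Every function name below (load_content_encoder, extract_f0, and so on) is an illustrative placeholder rather than the project's actual API; the sketch only shows how the pieces connect:
# Conceptual sketch only; all helper names are hypothetical, not real so-vits-svc API
content_encoder = load_content_encoder("pretrain/hubert_base.pt")      # hypothetical loader
decoder = load_decoder("logs/44k/G_30400.pth", "configs/config.json")  # hypothetical loader

source = load_wav("raw/source-src.wav")          # the song to convert
content = content_encoder(source)                # what is sung (speaker-independent)
f0 = extract_f0(source)                          # the melody (pitch contour)
converted = decoder(content, f0, speaker="nen")  # resynthesize in the target voice
save_wav("results/converted.wav", converted)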
Training the Model
Once your dataset is prepared, it’s time to train the model. Use the following command:
python train.py -c configs/config.json -m 44k
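Training writes checkpoints and event logs under logs/44k, so you can watch the loss curves in TensorBoard while it runs (assuming TensorBoard is installed in your environment):
tensorboard --logdir logs/44k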
Inference Process
After training completes, place the source audio you want to convert in the raw directory, then run inference. Here -m points at a generator checkpoint, -c at the config file, -n names the source WAV, -t sets the pitch transpose in semitones, and -s selects the target speaker:
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"
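If you have several songs to convert, a small Python wrapper can loop over them with the same settings (the *-src.wav naming pattern and the checkpoint path are carried over from the example above and may differ in your setup):
import subprocess
from pathlib import Path

# Run inference_main.py once per *-src.wav file in the raw directory
for src in sorted(Path("raw").glob("*-src.wav")):
    subprocess.run([
        "python", "inference_main.py",
        "-m", "logs/44k/G_30400.pth",
        "-c", "configs/config.json",
        "-n", src.name,
        "-t", "0",
        "-s", "nen",
    ], check=True)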
Troubleshooting Common Issues
If you encounter issues during any of these stages, here are some troubleshooting suggestions:
- Model Not Training: Double-check your Python version. It should be 3.8.9.
- Audio Quality Problems: This may be due to dataset inconsistencies. Ensure all your audio files are in WAV format and appropriately sliced (suggested length: 5s – 15s); see the ffmpeg commands after this list for one way to slice long recordings.
- Performance Issues: Inconsistent loudness across clips can hurt results, so during preprocessing consider using professional sound processing tools to match loudness and keep quality consistent.
- General Questions: Check the project’s issues section on its GitHub page for community troubleshooting.
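As mentioned above, ffmpeg is one convenient tool for slicing long recordings and evening out loudness (assuming ffmpeg is installed; the 10-second segment length and the bare loudnorm filter are illustrative starting points, not project requirements):
mkdir -p slices
ffmpeg -i long_take.wav -f segment -segment_time 10 -c copy slices/slice_%03d.wav
ffmpeg -i slices/slice_000.wav -af loudnorm slices/slice_000_norm.wav
The first ffmpeg call splits long_take.wav into 10-second WAV segments; the second normalizes the loudness of a single clip with the loudnorm filter.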
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.