A few articles back I introduced a tool under the title “Human Sound Separation”; this article moves on to the tool actually used for voice cloning.
So-Vits-Svc is short for “SoftVC VITS Singing Voice Conversion”. It grew out of VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech). So-Vits-Svc is mainly used for cloning singing voices. Since it is the only such tool I have used, I won’t give much background on it; I chose it mainly because there are plenty of online resources available.
The so-vits-svc GitHub page lists the required models, but the instructions are rather messy; the integrated package mentioned there was enough for my needs.
I have watched a few YouTube videos that cover this topic. Here is one for reference.
- Zero Degrees: AI voice cloning! A powerful text-to-speech and voice-conversion tool. Learn how to use So-Vits-Svc in detail and try it yourself, billed as the most detailed So-Vits-Svc guide on the whole internet!
Training Method
The YouTube tutorials train directly on the presenter’s own voice, which means you would first have to record about half an hour of speech, and that is a bit of a waste of time. My approach is to find two storytelling (narration) videos on YouTube, each about 30 minutes long, merging clips if they are too short. I then train on the voice extracted from the first video and apply it to the audio of the second video. This saves time.
Extracting YouTube Video Sound
To extract the audio you can simply use the YouTube download website mentioned in a previous article. If you need to merge multiple audio files, you can use the ffmpeg command below.
ffmpeg -i "concat:a1.mp3|a2.mp3|a3.mp3" -c copy a4.mp3
You can use any audio file format, whether it’s mp3 or wav.
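One caveat: the concat: protocol joins files at the byte level, which works for MP3 but tends to break WAV files because each WAV carries its own header. For WAV input, ffmpeg’s concat demuxer is the safer route; a minimal sketch, where list.txt and the file names are placeholders:

# list.txt contains one line per file:
#   file 'a1.wav'
#   file 'a2.wav'
#   file 'a3.wav'
ffmpeg -f concat -safe 0 -i list.txt -c copy a4.wav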
You can select videos with no music or use UVR to remove background music.
Basic Environment Installation
Some basic environment settings (such as Anaconda and the shared scripts) were already covered in the【Collaborative Operation】article, so please go through it first to make sure all of the following commands work correctly.
Creating Conda Environment
Since the dependencies of each project are different, we will create an environment for each project.
conda create -n vits python=3.8
conda activate vits
Downloading so-vits-svc and installing packages (failed attempt)
git clone https://github.com/svc-develop-team/so-vits-svc
cd so-vits-svc
pip install -r requirements.txt
echo "conda activate vits" > env
The official WebUI turned out to be too bare-bones to use comfortably, so for the following steps I use an integrated package found online; perhaps the official WebUI will become more user-friendly in the future.
Downloading the so-vits-svc Models
Comparing the official repository with the integrated package, only two models need to be downloaded.
- vec768l12 encoder: ContentVec, checkpoint_best_legacy_500.pt
- Pre-trained NSF-HIFIGAN Vocoder: nsf_hifigan_20221211.zip
After downloading, place the vec768l12 checkpoint in the pretrain directory, and unzip NSF-HIFIGAN into the pretrain directory as well. The resulting directory structure is roughly as follows.
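A sketch of the expected layout, assuming the standard so-vits-svc paths (the exact file names inside the NSF-HIFIGAN archive may vary by version):

so-vits-svc/pretrain/
  checkpoint_best_legacy_500.pt
  nsf_hifigan/
    config.json
    model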
Downloading the audio-slicer Slicing Tool
git clone https://github.com/openvpi/audio-slicer
cd audio-slicer
pip install -r requirements.txt
Before the audio is processed by so-vits-svc, it needs to be cut into segments with the silent parts removed. Usage is as follows.
python slicer2.py VOICE_FILE --out wav
This command slices the audio file and writes the resulting segments into a wav/ directory.
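If you kept your source recordings as separate files instead of merging them with ffmpeg, each one can be sliced into the same output folder; a minimal sketch (the file names here are placeholders):

# slice every source recording into the same wav/ folder
for f in a1.wav a2.wav; do
    python slicer2.py "$f" --out wav
done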
Then move the wav folder into so-vits-svc/dataset_raw for subsequent processing.
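For reference, so-vits-svc treats each sub-folder of dataset_raw as one speaker, so after the move the speaker simply ends up named wav. A sketch, with the move path and clip names as placeholders:

# adjust the target path to wherever you cloned so-vits-svc
mv wav /path/to/so-vits-svc/dataset_raw/

so-vits-svc/dataset_raw/
  wav/
    segment_0.wav
    segment_1.wav
    ...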
Pre-training preparation
Before training, we need to perform a few preprocessing steps. Run the commands below.
python resample.py
python preprocess_flist_config.py --speech_encoder vec768l12
python preprocess_hubert_f0.py --f0_predictor crepe --use_diff
python cluster/train_cluster.py
The third step may throw an error, for example that the pre-built PyTorch package and the installed CUDA version don’t match. Refer to the error message and update PyTorch accordingly; the command below installs the CUDA 11.8 build. If there are no errors, nothing else needs to be done.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
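To confirm that the installed PyTorch build can actually see the GPU before re-running the preprocessing, a quick read-only check:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"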
Start training
Lastly, execute the following command to start training
python train.py -c configs/config.json -m 44k
If you want to stop training midway, just press Ctrl+C. If you want to reset the whole process, delete all the files in the logs/44k folder and run the command again. Online videos say that training for half a day produces good results, but of course this depends on your graphics card. With my RTX 4090 I had to train to around 100,000 steps before it sounded good to me, which took roughly 24 hours.
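Checkpoints and training logs are written to logs/44k, so rather than judging purely by step count you can watch the loss curves with TensorBoard (assuming tensorboard is installed in this environment; run it from the so-vits-svc directory):

tensorboard --logdir logs/44k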
Since GPU usage does not stay at full load during training, there is still room for optimization. The training parameters are stored in configs/config.json:
- epochs: the total number of training epochs; training stops once this is reached. If you think the result is not good enough, you can increase this value.
- eval_interval: how many steps between each saved model checkpoint.
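A quick way to inspect these values without opening an editor, assuming the usual layout where they sit under the train section of configs/config.json:

python -c "import json; print(json.dumps(json.load(open('configs/config.json'))['train'], indent=2))"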
Start the Web Server
Execute the following command to start the service at port 7860.
python app.py
Inference
Open the page in your browser (port 7860) and you will see the main screen.
For the model, select the latest one; there should be only one configuration file and only one clustering model to choose from.
Then click the load button below to load the model.
After the model is loaded, you can upload an audio file for cloning or convert text directly with TTS. This time we need to tick “automatic f0 prediction”: if the source is not singing, it has to be enabled, otherwise the output will sound like electronic noise.
Conclusion
Like other AI models, voice cloning is still under development, and the UI side is still very early. I believe it will keep improving steadily as the tools mature.