Fine Tuning XTTS v2 with forked Coqui | Coqui AI is dead; Long live Coqui!

NanoNomad

มุมมอง 7 804

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 9 ม.ค. 2025
Fine tuning XTTS v2 with the forked Coqui project. Coqui AI shut down earlier this year, so what does that mean for us?
Here I go over adjusting the Coqui XTTS v2 training recipe, creating a dataset using Audacity and faster-whisper, and training a single speaker and multispeaker XTTS v2 english model
There are code snippets linked here along with most of the commands to copy-paste:
nanonomad.com/2...

ความคิดเห็น • 34

@m4rc1_n4ch0s 4 หลายเดือนก่อน ⁺²
Thanks for the tips and maintaining such an interesting channel. I managed to do the fine tuning, I used a very small dataset just for this first test, something around 5 minutes of speech. I couldn't feel such a big difference so... I belive it has improved a little bit in some parts. I'll try again with an bigger dataset.
Edit: When I put the "speaker_wav" as the exact same audio that I used for the fine tuning, the TTS gets much better, even with only 5 minutes of audio.
@anmolkapoormail 2 วันที่ผ่านมา
thanks for the video, hope your membership increases.
@petertaylor4954 4 หลายเดือนก่อน
At about the 8:27 mark I had the epiphany that perhaps this video was created using a TTS clone of your own voice? "One three epochs" rather than "Thirteen epochs" or maybe "One to three epochs" ... #amirite? Great video, thanks for sharing your process and insights.
@jimmyjam77 หลายเดือนก่อน
Hi there, thanks for the great video. Have you found any other open source projects that produce more realistic cloned voices than XTTS V2?
@johnwellington269 24 วันที่ผ่านมา
hi thankks! but it has 400 token and 250 character limts, how to fix this
@ViviqueAITales 4 หลายเดือนก่อน
hi Nanonomad thanks such great video as i have fine tune the model. i got good result as u mention u have kept last to column identical as in my dataset i given speaker name in last column should i keep the last 2 column identical does it affect the result ?
@Nono29121 3 หลายเดือนก่อน
Hey, Im trying to include this in a school project of mine. Would you be able to post the full code to get this to work?
@Musabkurdish4 5 หลายเดือนก่อน
Can you make a long vedio to explain stip by stip at the data set to the end and make like sample Web for your own tts
@alanturing5737 5 หลายเดือนก่อน
just commenting to help your channel grow! I do also have a question, does xttsv2 support emotion detection? im using it but i feel like my output always sounds monotone/emotionless, is there a fix for that?
@TheBestgoku 4 หลายเดือนก่อน
is there any place where we can download some pre-trained xtts v2 models? i cant find any
@agenticmark 4 หลายเดือนก่อน
licensing is pretty restrictive for public links. there are some discord servers that pass them around but I am not on any of those anymore.
@MrMoonCraft 6 หลายเดือนก่อน
Hey there, I had a question I was hoping you could answer. It's not clear to me what the difference is between fine-tuning vs adding a speaker audio file passed to the speaker_wav argument when doing the simpler examples provided by coqui.
I had some meh results just passing a speaker_wav in the simple example, even after gather about an hour of high-quality audio. I came across your video and skimmed through it just to validate that you were going over information that was pertinent to me. I am going to watch your video in more detail soon, but I was really hoping you could illuminate the difference I am asking about. I am about to go through the effort of fine-tuning my own xtts model, but before I even go there I wanted to know if maybe I am still missing something with the normal method that could give me improved performance. Any thoughts?
@nanonomad 6 หลายเดือนก่อน ⁺¹
With XTTS you'll still need to provide speaker samples or computed latents to a fine tuned model. Some other types of models support speaker embedding, but I dont think XTTS can do that.
Finetuning aligns the model weighs more closely to your dataset samples (in theory) than they would be otherwise. The goal of training is (usually) a model that can generalize well. Ideally you could provide any audio samples and have it clone a voice with them. But, in reality, the model doesn't have (or perhaps cant have) an approximate representation of every voice, so it fails to perform sometimes. Maybe the input audio sample has recording noise, or the speaker is speaking very differently than what the model was trained on initially.
Or you want to try coaxing out new capabilities. Like taking a trained model and re-training it (finetuning) on a different dataset. A TTS model that can do French may be able to be finetuned on say, German, and it may reduce the time needed to train.
With the XTTS finetuning, you'd just be trying to align the weights to be closer sounding to your dataset samples.
With XTTS, there is a way to use multiple audio files or computed latents to generate audio samples. Youd have to look it up in the docs though, because I dont have any ready code snippets for that. Using multiple samples would average the voice qualities, so you might get better results with that. Removing background noise and any other sounds and trimming extra leading and trailing silence will improve the generated audio. Also, just trying different samples from the same speaker can help. One other thing to try would be normalizing the audio levels to -16dB or so.
@MrMoonCraft 5 หลายเดือนก่อน
@@nanonomad Hey man, I finally got to the stage where I can start fine-tuning the model. I've been working on getting the text transcribed and stuff. When I start the training I'm hitting an issue with max recursion. The terminal output flies off the screen, but before it does it says something about text exceeding 250 characters. Is this related to the length of the audio clips in the training data?
@MrMoonCraft 5 หลายเดือนก่อน
Update: People online are saying it has to do with the length of strings for the test sentences. The thing is that I tried doing shorter sentences to fix that, and that didn't work. If it makes a difference, I'm using an m1 mac. I'll be trying on WSL soon, but I'll need to see about trying in linux if that doesn't work.
edit, error in question:
The text length exceeds the character limit of 250 for language 'en', this might cause truncated audio.
@nanonomad 5 หลายเดือนก่อน ⁺¹
I cant recall exactly where you can find it, but i think theres a tokenizer.py in the coqui package somewhere. You can find the max length for english and raise it above 250 to say 512 without issues Itll clip your input otherwise, and you'll get worse finetuning outcomes
@MrMoonCraft 5 หลายเดือนก่อน
@@nanonomad I think there might be some bug with dependencies or something. I have tried doing some fine tuning on WSL and it's still complaining about the text lenght. I left only one test sentence that's about 19 characters long.
I'm not sure what the exact problem is, but the fine tuning on WSL at least seems to be running without erroring out. I'm still unsure about how I'm going to test the results. From what I understand in the video, tensorboard is a required dependency to be able to check output?
@kamaleon204 6 หลายเดือนก่อน ⁺²
I've followed you for a while. I'm happy you are still doing things like this.
Is what you shown in this video able to perform real time TTS? I was working on making a home server AI chatbot with Whisper + LLM + Coqui that I can speak to but I'm not really a programmer and have had probs getting the Wisper + LLM section to work properly. If this is another option for Coqui that I was using on my own devices with somewhat real time results, I might try to do this as another option for the TTS part.
Thanks for your content, it helps give me ideas on things I need to look at or try myself. Thank you!
@nanonomad 6 หลายเดือนก่อน
Take a look at AllTalk, it might do a lot of what you need if you use it as a plugin for the text generation webui: github.com/erew123/alltalk_tts
Theres Whisper integration though another plugin IIRC, but I havent tried it
XTTSv2 is faster than realtime, but idk how much faster. For some reason the logger isnt giving me correct values and im too tired to figure out why. Its twice as fast as realtime, at least, on an rtx3060.
@kamaleon204 6 หลายเดือนก่อน ⁺¹
@@nanonomad OMG thank you thank you for this! I"ve been busy with other projects lately and just cleared a few out of my backlog and was looking for something to keep learning on and I saw your video today. I think I found my next project to complete now! BTW I have a rtx3060 1st GEN 12GB on the computer i would be using this on. nothing like an A100, B100 or H100 but its still something to learn on.
@nanonomad 6 หลายเดือนก่อน
That's all I have too, and i cant afford colab or anything anymore. I think theres some room for optimization with training xtts though. Should be able to apply PEFT to it somehow and do LoRAs
I was playing around with all talk a bit last night, and theres a bit of a delay with the gens, but it excels at long speech. I haven't figured out how to easily swap models though. I think if I fiddle around with how and where things are loaded it'll be a bit faster. Not sure which is a better tradeoff, offloading the llm to cpu or running xtts on cpu.
If you want to deal with the frustrating ecosystem, Kaggle will give you 30 gpu hours for free every week. Just look in the settings menu for the tiny tiny text link to verify your phone number, or else your Kaggle notebook cant get internet access. It's way more stable than Colab, with longer sessions.
@kamaleon204 6 หลายเดือนก่อน
@@nanonomad i still have a 3080 12 or 16GB in a box ready for a computer build i was intending to get to during covid that I was planning to just run UE and/or an AI workflow. I prob will do both since I'm not planning to do both at the same time. Over the weekend I plan to re-watch your previous videos, play/experiment with AllTalk like you suggested, and a few other tutorials from others along with playing a game I was suggested called "The Farmer was Replaced" or something like that.
I'll be honest. This resurgence to go down this route again for myself was b/c I saw this video this past week of yours. I can't say thank you enough for helping me get back into this.
@kamaleon204 5 หลายเดือนก่อน
@@nanonomad I finally got around to installing AllTalk. I think this is what I've been looking for. I've been using v1 coqui on a docker instance but this is running on a conda env and I'm so happy with it so far (only been playing around for about 1hr) Now I have to figure out how to use it as a replacement that I had been using coqui v1 as.
Might have to rewatch some of your xtts training/refining videos now since I have ALOT to learn now.
@DeathMasterofhell15 5 หลายเดือนก่อน
its not working and i keep getting erorr , able to support me
@DurgaNagababuMolleti 6 หลายเดือนก่อน
Thank you so much for this vedio .
i have doubt , is it possible to train xtts on new language ?in my case i want to train on hindi . please provide a solution if it is possible
@Ohio7speedrun 5 หลายเดือนก่อน
Got a sub outta me, thanks for the info!
@cleverestidiot4636 5 หลายเดือนก่อน
Hey bro can i have your email , i have some questions for you about TTS2
@Arthur-dl1jj 4 หลายเดือนก่อน
nice content, thanks
@TerryThera-g5s 4 หลายเดือนก่อน
Hall Patricia Rodriguez Sarah Martinez Jose
@HakanPıtır 4 หลายเดือนก่อน
Anderson Laura Rodriguez Mark Thompson Jeffrey
@MartinKirk-y5v 3 หลายเดือนก่อน
Rodriguez Joseph Perez Barbara Anderson Karen
@포화-f8e 4 หลายเดือนก่อน
Jones Barbara Jackson Eric Young Elizabeth

ต่อไป

เล่นอัตโนมัติ

Using AI To Detect AI Music (and other music industry data-porn)