Training Any Language in AI Voice Cloning - Tortoise TTS
- Published Sep 16, 2024
- Links referenced in the video:
Previous Video - • Training Tortoise TTS ...
Watch Before This Video (Tortoise TTS Installation) - • Local AI Voice Cloning...
Tokenizers HuggingFace - huggingface.co...
Hardware for my PC:
Graphics Card - amzn.to/3pcREux
CPU - amzn.to/43O66Ir
Cooler - amzn.to/3p98TwX
RAM - amzn.to/3NBAsIq
SSD Storage - amzn.to/42NgMFR
Power Supply (PSU) - amzn.to/430bIhy
PC Case - amzn.to/447499T
Mother Board - amzn.to/3CziMXI
Alternative prebuilds to my PC:
Corsair Vengeance i7400 - amzn.to/3p64r22
MSI MPG Velox - amzn.to/42MnJHl
Cheapest recommended PC:
Cyberpower 3060 - amzn.to/3XjtZoP
Come join The Learning Journey!
Discord - / discord
Github - github.com/Jar...
TikTok - / jarodsjourney
If you found anything helpful, please consider supporting me and the content I am trying to produce!
www.buymeacoff...
You're amazing, dude. I've been following you for almost a year and I have learned a lot from your channel. Keep it up :)
Will Tortoise be able to work with Cyrillic characters if I make a tokenizer with Cyrillic characters?
Are you going to share your japanese models at some point?
I am working on a script that uses LLMs to generate sentences, which I turn into infinite comprehensible input by scraping Google Images for the words and using ffmpeg to combine the audio and images into a video that, for every sentence, displays an image representing the words in that sentence.
I generally don't share my models so that'll be the same in this case. As for Infinite comprehensible input, that is a good one! I'd love to see a demo of that when you complete it.
Hi Jarod! I am glad to see your new video! Thank you!
In fact, the most interesting thing I wanted to know is how you prepared the dataset for training. I asked about this under the previous video 😅. Well, I hope you will tell us about this soon)))
Regarding those clumps of red and green that you're talking about at 19:36, I've also come across this. This effect appears only when training is resumed. I noticed that if I saved the results at epoch 10 and training was interrupted at epoch 11, then when resuming from epoch 10, the points from the previous run up to epoch 11 are kept on the graph. While my training continues from epoch 10 to epoch 11, the points for each iteration are duplicated by the current and previous runs, which is why those clumps of red and green appear. (I don't think this affects anything other than the visual readability of the graph for the current training run.)
Thanks! Might make a followup video on how I prepared my dataset here. I still generally follow the same way as I've discussed in other videos too though so if you've seen those, it's not far off.
@@Jarods_Journey Yes, I certainly adhere to the approaches you used earlier, but I'm still not sure about some of the steps I'm taking and how much they may affect the final result.
Is that 800h dataset only one speaker? If I'm going to collect that much data, it would take me 100 years to transcribe it manually lol. I have no way to use a transcriber to do it automatically...
Oooh, this is great!! :D I want to try training a Spanish language voice! I'll watch this video asap! (I'm working now XD) Thank you very much for sharing it! :D
I want to hear the voice training of Charlie you have there ahahhaha
Hey Jarod! Been watching all your videos and I think I might have a unique challenge. I'd like to remove a tremor in someone's voice. Since it's possible to voice clone in other languages, this doesn't seem impossible. I'm wondering how you would approach it?
As always, great video. Thanks!
Appreciate it :)!
I don't understand where I went wrong. I'm training the Vietnamese language. I used about 1 hour of my voice for training and created a tokenizer with your Python file for the Vietnamese language "vi". Then I tested it with a sentence that was already in the audio samples. It produced a sound that was my voice; however, the output was meaningless, not Vietnamese at all. Please tell me where I went wrong?
Just need that 4070 Ti Super, then I'm going in...
Hi, after training my model, I try to load its .pth file into Okada's AI voice changer, but it says that the .pth file is missing a "config" parameter or something. How do I fix that?
Yo Jarod, thanks for the guide! Could you please make another guide using a tokenizer for English voices?
Oh, so if it's a Latin-alphabet language, for example Swedish, Spanish, or German, could I just use the Whisper-transcribed Swedish text to train the model, or how would that come out?
I would say yes, as the text cleaners will normalize the accented characters. However, for proper accenting a custom tokenizer might come in handy.
I would say you could try it, but you might run into some problems. For example, I saw a nanonomad video where he trained a Tortoise TTS model in Spanish, and although it worked, the accent was not a correct Spanish accent and the model had difficulties with some words. In the Spanish example Jarod shows in this video (I am Spanish), I would say it has not been tokenized correctly, since certain letters that should not be separated are (although I may be wrong; I don't know exactly how the tokenizer works).
@@Jarods_Journey What do you mean by accented characters and stuff like that? I also wonder, will the model learn to speak the language that I'm training it on?
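A minimal sketch of the kind of accent normalization the text cleaners perform, assuming standard Unicode decomposition (the actual cleaner code in the repo may differ):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks,
    e.g. 'n' with a tilde becomes 'n'. This mirrors what typical TTS
    text cleaners do; the exact cleaner in the repo may behave
    differently."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("señor grün"))  # -> senor grun
```

This is why Latin-alphabet languages can often be trained without a custom tokenizer: the accented variants collapse to their base letters, at the cost of losing the distinction the accent encoded.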
Jarod, thank you for the great tutorial! I really appreciate that your content is unique ✨️. By the way, I have a question concerning tokenizers: in many Turkic languages, including Turkish, there are letters such as "s" and "ş" that both tokenize the same way into "s". Won't that confuse the model? Since those two are different letters, written and spelled differently, but tokenized into one letter, I think there's a chance the model will misspell them because of the tokenizer. What do you think?🤔
Hi Jarod, nice channel you got. Can you train a TTS tokenizer that can sing out lyrics of any song? Have you got a video on that? Cheers
great video!!!👍
Amazing video! I have a few questions:
How much file size was approximately the 840 hours of audio you used?
Do you know where I could find a tortoise-tts model in Spanish to fine-tuning it with the voice I want to train?
Or maybe I could train my own model in Spanish and then fine-tuning it but doing it all inside the free version of google colab?
I used mp3 files at 22050 Hz and it came out to around 19 GB of data
Not too sure; you would probably have to train one up yourself to fine-tune at the moment. As for Colab, the repository isn't set up for Colab. Some people have gotten it running, though, but I'm not too sure.
13:45 haha, Charlie
There is so much documentation missing for Tortoise.
For example, how to install new models. I have a model that is fully trained in Tortoise, but I just can't install it. It's a few files, but no documentation says where you have to put them.
They are for French, but it sounds nothing like French when installed, plus the interface doesn't seem to recognize them properly.
And why do I have hundreds of autoregressive models to pick from after training a new model myself? Which one is the correct one to use?
Well, I have a question: if I trained a model that spoke in English but I wrote the text in Portuguese, would that work? I would love to know because I want to dub games using TTS.
Hi, how do I fix this: during the training process, cmd always shows me the message "ai-voice-cloning>pause"?
What does this mean in the RVC client on Colab, please?
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chex 0.1.85 requires numpy>=1.24.1, but you have numpy 1.23.5 which is incompatible.
torchdata 0.7.0 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.
torchtext 0.16.0 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.
torchvision 0.16.0+cu121 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.
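One possible way out of version conflicts like that one (assuming the notebook tolerates the newer torch; the pins below are taken from the error message itself, not from an official fix):

```shell
# Reinstall the versions pip's resolver asked for. These pins come
# straight from the error message and may need adjusting for your setup.
pip install "torch==2.1.0" "numpy>=1.24.1"
```

Note that upgrading torch on Colab can break other preinstalled packages, so treat this as a starting point rather than a guaranteed fix.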
Could you help me, please? I installed TTS recently but my interface looks NOTHING like yours. Yours only has 5 tabs while mine has 15, and the settings don't look alike at all. There's no Generate, History, Utility, or Training. Mine has Generation (Bark), Bark Voice Clone, MusicGen + AudioGen, RVC Beta Demo, Demucs Demo, Seamless M4Tv2 Demo, Magnet, Vocos, Tortoise TTS, Outputs, Favorites, Collections, Voices. It's not RVC, but it's not simply Voice Generation either. What should I download?
Hey Jarod, can I follow your installation on Mac? I know it says Windows, but I'm wondering if it will also work on Mac.
Do you think this could be better in pronunciation than XTTSv2? I'm interested in making a German model; I attempted one with Tortoise a few months back but it wasn't great, so I'm not sure if there's been a big change since.
The pronunciation isn't horrendous for the Japanese one as far as I can tell, but there are a bunch of quirks with it. It seems to work well finetuning on some voices but not others, so I can't really tell if it's going to be overall better than XTTS. However, I can say that so far it sounds better than the finetune I did for Japanese on XTTS, but that's just my initial impression.
@@Jarods_Journey cool thanks
Great video!
Is converting to Latin really all you need?
Even if the language you want to train contains a lot of special characters that are part of the International Phonetic Alphabet (like "ɖ, Ƒ, ɣ, ọ, ʋ") and is tonal? Leading to an actual voiced labiodental approximant when "ʋ" from the Latin IPA is written?
I'm starting to find out that it may not be as easy as that, but I have a follow-up video on some things I've been finding out.
@@Jarods_Journey Alright. Thank you again. Your work is very valuable!
Can we get everything together for a German model? I can run the training, but I didn't quite understand where to get the tokenizer from. And I can probably gather 100 hours of training data, but not 800+. So if anyone wants to collaborate...
Is it possible to train Bengali language with Tortoise TTS?
If I upload a model that is trained in Portuguese, will the generated output be in Portuguese? Will I be able to use an RVC audio file in English with text in Portuguese?
How can I voice a large text?
So how low-spec can you go to use this and RVC (not real-time)? Could you do this on a laptop with a 4060 with 8 GB VRAM?
The link didn't work in Tortoise. Any solution?
Hi Jarod, thanks so much for this demo! I learn so much from your videos. Keep up the great work! I followed your tutorial here and managed to train a Spanish model using a multi-speaker dataset. The training job took about 12 hours to complete successfully. Afterwards, I tried generating a voice from the finetuned model. However, due to the volume of my training data, the generation process failed with an OOM error, which indicated that it ran out of memory in the compute_latent process. I have about 25 hours of voice data in my training folder. Do you have any suggestions on how to overcome this? I am using an A10 GPU with 24 GB VRAM. Thanks in advance!
I don't show it in this video (it's in several others), but you usually move all your audio data and training files into a backup folder and infer from only 2 samples so you don't get OOM.
Another thing you can do is make a new voice folder and infer from there
Thanks for your guidance! @@Jarods_Journey !
Hi, love the video! This is exactly what I was looking for. Could you please provide the training scripts, as I also want to train TTS in my native language?
The training scripts are inside of the AI Voice Training repo, so make sure you understand that process
Thanks for the reply. So all I have to do is prepare the dataset as shown in the train.txt file and the training folder, and then run "generate configuration" to get a valid training dataset?
How long did it take?
Hi, i have one question:
If I have the word "čau" and the tokenizer outputs ['ca', 'u'], will it work?
My intuition says yes as the model will learn to associate ča with ca. Now if there are two versions, č and c in your language, a custom tokenizer with both characters in it might be valuable if you find that it's not doing it correctly.
@@Jarods_Journey Thanks for the reply. If I play "cau" and "čau", they sound the same. Now I'm going to try training again with my own tokenizer.
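To see why the vocabulary choice matters here, a toy greedy longest-match tokenizer (this is not Tortoise's actual tokenizer, just an illustration of how a missing character changes the split):

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer: at each position, take the longest
    vocabulary entry that matches. Only an illustration -- real BPE
    tokenizers are trained on data and handle merges differently."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Unknown character: emit it as-is. A real tokenizer would
            # map it to an [UNK] token or normalize it away first.
            tokens.append(text[i])
            i += 1
    return tokens

# Without 'ča' in the vocab, 'č' falls through as an unknown piece:
print(greedy_tokenize("čau", {"ca", "c", "a", "u"}))   # ['č', 'a', 'u']
# With 'ča' in the vocab, the word splits as intended:
print(greedy_tokenize("čau", {"ča", "c", "a", "u"}))   # ['ča', 'u']
```

So ['ca', 'u'] suggests the cleaner normalized "č" to "c" before tokenizing; adding "ča" (or at least "č") to a custom tokenizer preserves the distinction if the model turns out to need it.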
Thank you. What about the Google Colab case?
No one has adapted it well for Colab yet, so it's not available unfortunately.
Is 10 GB of VRAM enough for training?
It seems like the minimum is 6 GB of VRAM, but people manage to run it even with 4 GB.
Yes, you should be fine here