AI Voice Cloning for Singing using so-vits-svc-4.0 with Google Collab/Nvidia Card

Jarods Journey

Views: 32,232


Published on 16 Oct 2024

Comments • 121

@MrDanINSANE 1 year ago ⁺⁵⁹
Here are some of my tips (especially for TALKING models; for singing it's easier).
I found that many people talk about the number of STEPS, but the correct way to get better results is:
1 - The more SAMPLES (lots of wav files) the better: the longer the total length, the more accurately your model will be trained.
A longer TOTAL LENGTH is better because training will have more "different colors" of emotions and pronunciations to capture to create the "ULTIMATE" model... of course, it will take much more time to train.
2 - The MORE epochs you train = the more Accurate + High Quality results.
3 - Obviously, based on more epochs, the longer you train the better, since you give the model a chance to LEARN from your dataset how to align closer to the source.
4 - At the moment (unless a new way is out) use ONLY "Crepe"; do not use anything else.
5 - Don't waste your time on calculations / formulas, the default 10,000 epochs and 16 epochs is fine. If you want, just reduce the 10,000 epochs to something lower so it won't take DAYS to complete.
6 - If your dataset is VERY CLEAN (good quality recording), after around 2,500 - 3,500 epochs trained (or higher) you will start hearing pretty good results. I highly recommend that you keep training to 5,000 epochs (or MORE!) if possible, but from time to time check your current model and test whether you find it satisfying instead of just training forever.
7 - If your dataset isn't SUPER CLEAN but good enough, the numbers I mentioned above go HIGHER: the more epochs, the more accurate and the closer in quality to your source.
8 - If your dataset is LOW QUALITY... don't waste your time. Your results will match the source quality: if it's low quality, your model will eventually sound exactly like it. For example: Phone or Radio quality will sound like Phone or Radio... training won't improve the quality, that's not part of the training model.
- Reverb or Echo will be BAD for training, DO NOT use such sources... it's a waste of your time.
EXAMPLE: if you train somebody with a VOCODER EFFECT in the source dataset, the result will always sound like a Vocoder.
9 - It's already said in the Colab, but just in case: the reason you want 10 seconds or less per sample is the amount of RAM it can handle while re-sampling with Crepe (the heaviest of the methods), so I recommend splitting into 4 sec - 9 sec or less.
DON'T worry if you have many, MANY samples in your directory; if their length is less than 10 seconds there is no reason for any problem.
10 - If you're training for SINGING, make sure your source dataset is made of SINGING samples.
11 - If you're training for TALKING, use talking samples as your dataset. (It's sometimes also good enough for singing, but only if your dataset is very dynamic in range: talking loud, sad, happy, questions, and emotional talking in general. Record yourself talking in different ways and not like a monotonic robot.)
12 - You want a SUPER MODEL ? (Talk + Sing)
Make 2 directories on your dataset:
Sing = put inside your dataset of singing
Talk = put inside your dataset of talking
You don't have to split it like that, I found it EASIER to organize your dataset in case you want to get back to it and see how you made stuff for comparison in future models you'll train.
13 - DO NOT mix different people / voices on the same training...
you will get a mutant result or something weird (unless that's what you want)
14 - Remember to get REALLY GOOD RESULTS: you'll need a lot of high quality samples.
That means Quantity + Quality (not just one of them based on my experiments at least)
15 - Testing your model:
- Try it on Singing
- Try it on Talking
- Try it on DIFFERENT LANGUAGES! (that's magic I can't even describe)
⚠Warning!
One last thing I'm still exploring and not sure about:
I'm not sure yet whether there is such a thing in so-vits-svc as an "OVER TRAINED" model that gets ruined.
When I tried locally, one of my models just went from amazing to... "BLEEPS". It hasn't happened to me while training on Colab yet, so I'm not sure if the reason was over-training or just that it was in the middle of a version update.
My advice is to keep TESTING your model every 500 - 1,000 epochs, so you won't lose your progress in case it gets OVER trained (if such a thing happens, of course).
---
Sorry about my bad English, I hope it helps, have fun!
much love 💙
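Tip 9 above (keep every sample under 10 seconds) can be sketched roughly as follows. This is a minimal illustration, not the tool's own preprocessing: it assumes plain PCM WAV input, uses only Python's standard `wave` module, and the function name `split_wav` and the 9-second default are made up for the example. It cuts at fixed positions, so a chunk boundary can land in the middle of a word; splitting on silence would need an audio library.

```python
import wave
from pathlib import Path


def split_wav(src: Path, out_dir: Path, max_seconds: float = 9.0) -> list[Path]:
    """Cut one PCM WAV file into consecutive chunks of at most max_seconds."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(reader.getframerate() * max_seconds)
        if frames_per_chunk <= 0:
            raise ValueError("max_seconds is too small for this sample rate")
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break  # end of the source file
            target = out_dir / f"{src.stem}_{index:04d}.wav"
            with wave.open(str(target), "wb") as writer:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                writer.setparams(params)
                writer.writeframes(frames)
            written.append(target)
            index += 1
    return written
```

For example, a 25-second recording split with the default limit produces three files of 9, 9 and 7 seconds.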
@Jarods_Journey 1 year ago ⁺⁴
Appreciate all of these additional details! All super valuable information here as there aren't many tips out there yet other than from those of us who have been trying it out 🤟🤟!
@MrDanINSANE 1 year ago ⁺³
@Jarods_Journey I'm glad you find it helpful. I'm still experimenting and learning, but so far the above tips are what lead me to great results.
Feel free to share these tips in your next video and keep up the good work! 💙
@darkfield1952 1 year ago
"13 - DO NOT mix different people / voices on the same training..."
So I should train a separate model for each voice?
@MrDanINSANE 1 year ago ⁺²
@darkfield1952 Yes, each training run = 1 model... unless you want some mutant combination, give it a try ;)
@4stacks67long 1 year ago
How many recordings?
@ياسمينعبدالله-ف9ق 1 year ago ⁺²²
This is the only video on YouTube that actually explained how to do it.
Other YouTubers don't want us to learn.
@Jarods_Journey 1 year ago ⁺⁴
Appreciate it! 🤟
@wimdenherder 1 year ago ⁺²
I think that once people understand something, they are so happy that they forget to share it with others. Real teachers like to share their journey, it's a rare talent indeed.
@tuapuikia 1 year ago ⁺⁶
Tips to get a good audio sample:
1. Record a short audio clip without background noise. Use a high-end microphone or clean the audio using RTX audio.
2. Use the audio clip to generate a good 30-minute speaking sample using TTS with your voice sample.
3. Train the model using the TTS voice sample (clean, perfect sample) (less than 2000 epochs with crepe).
4. Based on my multiple attempts to get a clean audio source with studio recording quality, the song output from this approach has the best quality.
@ThisIsntmyrealnameGoogle 1 year ago ⁺¹
Love these tutorials! Never stop 'em coming so long as you aren't burned out! Hey There Delilah was such a great throwback haha
@Jarods_Journey 1 year ago
Haha appreciate it! It's all stuff I enjoy atm so I'm glad that I'm able to share these things lol
@jaylicator 1 year ago
Thank you for creating this video. It's very easy to understand for noobs and it's very helpful. My brain has difficulty remembering information, so I've watched this video multiple times, but each time I've watched, I've been able to follow along easily.
@Jarods_Journey 1 year ago
Glad I could be of help 🙏
@kuerst 1 year ago
Hey Jarod, thank you for your very clear and straightforward tutorial! I was having trouble yesterday and made a comment, but things started working, so I deleted my comment. And then I was having trouble again today.. and I made another comment.. but it just started working again.
So yeah. Sorry if you've been getting notifications and then been thinking wtf.. where'd the comment go?
Your tutorial is great and I'm looking forward to hearing my results once training is completed and I've done the inference. Keep doin' good things bro. You're the man!
@Jarods_Journey 1 year ago
Haha appreciate it! I never got the notifications so all good lol. That does happen sometimes and it looks like someone deletes their comment xD.
@stevengn7245 1 year ago
Thanks, your video was really clear and helped me with my project on So-Vits
@badrinarayanans355 26 days ago
Thanks a lot man, great demonstration!
@spoonie1972 1 year ago ⁺¹
If this was done manually in DAW land, it sounds a bit like singing your vocal (likely poorly), but picking out all your pitches in the Little AlterBoy plugin. Thanks for the great vid. Pretty wild tech and it's just in its infancy.
@Jarods_Journey 1 year ago ⁺²
Definitely! The tech is pretty wild with it being so new. Excited to see where it goes (and a little spooked)
@amixam 1 year ago ⁺⁴
that song at the end caught me off guard lol
@Jarods_Journey 1 year ago
Lool which end, the end at the beginning of the video or Hey There Delilah 😅?
@amixam 1 year ago
@Jarods_Journey hey there delilah lel
@alonsonadeau 1 year ago ⁺²
Hello, good morning. This is the best tutorial I have ever seen, you explained the batches very well! I have a question for you: if Google Colab tells me that my time to use the GPU has expired, is there an option to resume the training I already started?
@Jarods_Journey 1 year ago ⁺²
Appreciate it! You can resume training, it will just restart at your last checkpoint, which is accounted for at each eval interval (I believe).
@alonsonadeau 1 year ago ⁺¹
@Jarods_Journey I understand. Thank you very much for your answer. It would be great if you could make a tutorial on how to restore from a checkpoint, because when I look for information on the internet there is nothing that covers that.
@Jarods_Journey 1 year ago ⁺²
@alonsonadeau It's deceptively simple actually! All you do is rerun everything again and it starts from where you previously left off. If I do a more in-depth tutorial on sovits, it will be covered there.
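To see which checkpoint a rerun would likely pick up, a small helper like the one below can list the newest generator checkpoint in a log folder. It is only a sketch under assumptions: the `G_<number>.pth` naming is inferred from the `G_0.pth` file mentioned elsewhere in this thread, the function name `latest_checkpoint` is hypothetical, and the folder you pass in depends on your own setup. The tool itself does the resuming; this only inspects the files.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(log_dir: Path, prefix: str = "G") -> Optional[Path]:
    """Return the checkpoint named <prefix>_<number>.pth with the highest number.

    Assumes checkpoint files follow the G_<number>.pth / D_<number>.pth
    pattern; files that do not match (e.g. G_final.pth) are ignored.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)\.pth$")
    best: Optional[Path] = None
    best_step = -1
    for path in log_dir.glob(f"{prefix}_*.pth"):
        match = pattern.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best = path
    return best
```

If the function returns `None`, no numbered checkpoint was saved yet, so a rerun would presumably start from scratch.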
@wadewoods2793 1 year ago
I'm a little confused: you went from Colab to a different program, but were the Colab steps done for the training? Are the steps the same for both Colab and the other program to get the song?
@Jarods_Journey 1 year ago ⁺¹
You might wanna check out a newer tutorial on this software: th-cam.com/video/xgvT7UnUTng/w-d-xo.html
I narrow it down to just Colab here
@PatandKring 1 year ago ⁺¹
comprehensive tutorial please! 🎉
@mentalo4038 1 year ago
TensorBoard appears each time but the training does not start, even though I followed your instructions exactly?
@nepaliitlessons4136 1 year ago
My Train step gets done within a few seconds and never shows Epoch 1/10000 .. etc. What could be wrong? I have tried doing `!svc pre-hubert -fm {F0_METHOD} -n 2` in the previous step and still no luck
@kingover-all9966 7 months ago
Please, how much would you charge me to train a specific type of voice I'd give you?
@candyman3537 1 year ago ⁺¹
Did you subscribe to Google? My free Colab is only able to run a few hours (maybe 4-5 hours), then stops.
@Jarods_Journey 1 year ago ⁺¹
Not for this video, but I've read it should run for 12 hours.
@elidelia2653 1 year ago
Interesting. So if you were to call this from a chatbot script, you could avoid the use of another API and reduce lag time?
@Jarods_Journey 1 year ago ⁺¹
That is the idea. In this case, it would be limited by your own hardware speed for the production of it. As for chatbots, to my dismay, so-vits-svc cannot be used for TTS as it's limited to only converting audio from previous audio. However, VITS does, so I'm looking in that direction as well
@aichau3593 1 year ago
Hello Jarod. I want to ask about training data. Do I need to "speak" or "sing" in the training data? Which one is better?
@Jarods_Journey 1 year ago
I can't comment as I haven't done one on singing data, but I've heard people say having singing data helps more
@crestlefloyd4271 1 year ago ⁺¹
Cool tuto. I use Google Colab, but you skipped the last step on Colab, so I didn't get the last step to have the final trained file. Any tips would be great :)
@Jarods_Journey 1 year ago ⁺²
I'm assuming you mean the inference step where you actually use the trained voice model. I assume that you download it and place it in a folder to run locally, but I will go over that in a future tutorial
@sergiofernandez-t7k 1 year ago
I was following the Colab tutorial perfectly, but I don't understand the ending. I've already clicked on 'train' and it completed successfully, so what do I do now? From what I understand, you start using the actual machine from this point, but do we need to do that if we don't have 10GB of VRAM? How do the rest of the steps continue in the Colab version? I'm confused.
@Jarods_Journey 1 year ago
That video was made before the latest ones, you might wanna check out the Colab video I did on SVC here: th-cam.com/video/xgvT7UnUTng/w-d-xo.html
Since you already trained, if you skip to 30:00, this is where I start doing the audio inference.
@PatandKring 1 year ago
Hey Jarod, would this work on a Mac using an ARM-based processor?
@Jarods_Journey 1 year ago ⁺¹
I'm not sure, you'd have to check if ARM can run torch CPU
@MyTubeAIFlaskApp 1 year ago ⁺¹
I love the power of AI and am a Python freak. I definitely will be giving this a try.
@Jarods_Journey 1 year ago ⁺¹
I'm quite glad Python is one of the languages of choice for ML and AI 🤟. C++ is cool too but...
@MyTubeAIFlaskApp 1 year ago ⁺²
@Jarods_Journey I am seventy-five years old. Python is an easy learn, but C++ is too rough on the ol' brain.
@Jarods_Journey 1 year ago
Commendable! Python is so much easier to read, I am totally in line with you there. Keep it going :D!
@007-e2q 1 year ago
Thank you so much for this. Btw I chose 10,000 epochs and I think it will take quite a while, like 30 hrs or something. What I want to know is how long can Colab keep working? Is there a cool-off period? Or will my progress be lost if something happens in between?
@Jarods_Journey 1 year ago ⁺¹
The free Colab can only run for a maximum of 12 hours at a time. As long as you left your log intervals at the default value, it saves these as checkpoints. You'll just have to restart all the cells after your runtime ends and it'll continue from the checkpoint
@007-e2q 1 year ago
thanks man
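The numbers in this exchange (about 30 hours of training, at most 12 hours per free Colab session) work out to roughly three sessions. A minimal sketch of that arithmetic, assuming training speed stays constant and ignoring any time lost re-running setup cells; both function names are made up for illustration:

```python
import math


def sessions_needed(total_hours: float, session_limit_hours: float = 12.0) -> int:
    """Number of runtime sessions needed to cover total_hours of training."""
    return math.ceil(total_hours / session_limit_hours)


def epochs_per_session(total_epochs: int, total_hours: float,
                       session_limit_hours: float = 12.0) -> int:
    """Epochs that fit in one session if every epoch takes the same time."""
    return int(total_epochs * session_limit_hours / total_hours)


print(sessions_needed(30))             # 3
print(epochs_per_session(10_000, 30))  # 4000
```

In practice each restart resumes from the last saved checkpoint rather than the exact point of disconnection, so the real total may be a little higher than this estimate.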
@candyman3537 1 year ago
My training only lasts 20s without any error message. Do you know what the possible reason is? I noticed you encountered the same issue in one of your videos. I had no problem previously when I ran it on free Colab, but I got this issue when I paid.
@Jarods_Journey 1 year ago
Make sure the pre-hubert stage finishes all the way through; if not, it'll stall on training
@stevengn7245 1 year ago
Do I need to run the #!rm -r "dataset_raw" line of code?
@trustcsgo2710 1 year ago
In the automatic preprocessing phase I get a "TF-TRT Warning: Could not find TensorRT" message in the log. Why is that?
@vstman-pd5ml 1 year ago
TensorRT error here too...
@trustcsgo2710 1 year ago
@vstman-pd5ml I just closed the error message and it worked after
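One detail worth knowing when reading these logs: TensorFlow prefixes its native log lines with a severity letter (I, W, E or F), and the TF-TRT line quoted in this thread starts with `W`, i.e. a warning rather than an error. The sketch below parses that prefix; the function name `classify` and the regular expression are illustrative assumptions, and a warning-level line does not by itself explain why a training run stops.

```python
import re
from typing import Optional

SEVERITY = {"I": "info", "W": "warning", "E": "error", "F": "fatal"}

# Matches "<letter> <source file>:<line>] <message>", with or without a
# leading timestamp in front of the severity letter.
LOG_LINE = re.compile(r"(?:^|\s)([IWEF]) (\S+?):(\d+)\] (.*)$")


def classify(line: str) -> Optional[tuple[str, str, str]]:
    """Return (severity, source file, message), or None if the line does not match."""
    match = LOG_LINE.search(line.strip())
    if match is None:
        return None
    return SEVERITY[match.group(1)], match.group(2), match.group(4)


line = ("W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] "
        "TF-TRT Warning: Could not find TensorRT")
print(classify(line)[0])  # warning
```

If training still fails after this message, the lines tagged `error` or `fatal`, or a Python traceback further down, are the more likely place to look.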
@AITESTSANDPROJECTS-bm5ji 1 year ago
Can we train without a GPU? And if a GPU is way faster, what card do you use? Thanks
@Jarods_Journey 1 year ago ⁺¹
As of right now, only GPU training is supported as it's MUCH faster. I use a 4090, but I just tested a 3060 and it's relatively good too. If you don't have a GPU, I recommend Google Colab
@AITESTSANDPROJECTS-bm5ji 1 year ago
Thanks a lot, I'll go for it! @Jarods_Journey
@Qcof 1 year ago
Does it only support the English language?
@stevemata8350 1 year ago
Sweet thanks
@不想看大海 1 year ago
Does this support the NVIDIA A4500?
@CalvinSaxena 1 year ago
Hey man, please help me. I'm trying to install Anaconda on my PC but it says FAILED TO INITIALIZE CONDA DIRECTORIES during the installation process. I can't install it. I'm a Music Producer and I don't know anything about code, but I'm still trying to make a model of an INDIAN SINGER who passed away recently. Please help me. I have my DATASET ready
@Jarods_Journey 1 year ago
Try this: github.com/ContinuumIO/anaconda-issues/issues/6589
If that doesn't work, I'm not too sure as I don't use conda. It's not necessarily needed to get this working either
@CalvinSaxena 1 year ago
@Jarods_Journey If I want to train the model locally on my PC, then what should I download? Anaconda is not working, so is there any alternative?
@Jarods_Journey 1 year ago ⁺¹
@CalvinSaxena Download Python, which is the core of conda. th-cam.com/video/Xk-u7tTqwwY/w-d-xo.html
@snakezo4218 1 year ago
simple size i don't understand ?
@andriodenavarrete4495 1 year ago
I was running the program and was on epoch 132, and then got disconnected. Does anyone know how to resume where it left off?
@Jarods_Journey 1 year ago ⁺¹
I believe sovitssvc allows you to continue at a later time, so you have to run all the cells again like you did to set it up
@andriodenavarrete4495 1 year ago
@Jarods_Journey Thank you, and yeah I did it and it worked. It creates the G_0.pth again and then it continues where it should
@andrejsshewchenko7190 1 year ago ⁺¹
good job :)
@hkDesigner 1 year ago
When I click training, TensorBoard gives a 403 error
@Jarods_Journey 1 year ago
Not sure what's happening here, but 403 appears when you don't have access to the website, in this case, the TensorBoard. Most likely some type of wifi issue.
@bhuddhaswisdom 1 year ago
@Jarods_Journey cp: cannot stat '/content/drive/MyDrive/so-vits-svc-fork/dataset/kiritan/': No such file or directory what is this bro
??
@towakona 1 year ago ⁺¹
Can you make a part 2 please?
@Jarods_Journey 1 year ago ⁺²
In the pipeline 🤟
@KshitizMagar-y3y 1 year ago
Hello, when will you provide the waifu code?
@iamioo 1 year ago ⁺¹
How do I get the config?
@Jarods_Journey 1 year ago
Config is made when you run pre-config
@vstman-pd5ml 1 year ago
It does not appear here @Jarods_Journey
@TheExtremeOne 1 year ago
my pre-resample never gets past 35 percent :/
@Jarods_Journey 1 year ago
Audio files could be too long. If they're all under 10 seconds, you could possibly have too many samples, though this usually isn't an issue; idk if it is on Colab
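A quick way to rule out the "files too long" cause is to scan the dataset for clips over the 10-second mark before preprocessing. The sketch below is only an illustration: it assumes PCM WAV files readable by Python's standard `wave` module, and the names `wav_seconds` and `overlong_files` are invented for the example.

```python
import wave
from pathlib import Path


def wav_seconds(path: Path) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as reader:
        return reader.getnframes() / reader.getframerate()


def overlong_files(dataset_dir: Path, limit_seconds: float = 10.0) -> list[tuple[Path, float]]:
    """List every .wav under dataset_dir that is longer than limit_seconds."""
    results = []
    for path in sorted(dataset_dir.rglob("*.wav")):
        seconds = wav_seconds(path)
        if seconds > limit_seconds:
            results.append((path, seconds))
    return results
```

An empty result means every clip is within the limit, so the stall probably has another cause.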
@utterbollocks3107 1 year ago
**Colab is broken**
The building dependencies step is riddled with error warnings. No configs folder nor .json file is created; even manually creating one and placing a .json file in it doesn't work, because then the Training step fails. Tensorflow fails, "could not find TensorRT".
Haven't used this for months, but last time it was so easy and now it doesn't work at all. Need this tool! If anyone can fix the Colab you're #HEROIC
@KiNiNom 1 year ago
Can I do this on a MacBook?
@Jarods_Journey 1 year ago ⁺¹
Colab you can do on the MacBook since it runs on Google servers. I'm not sure about locally though
@SuperDao 1 year ago ⁺²
Can we use this to transform our voice in real time, like for Discord?
@Jarods_Journey 1 year ago ⁺¹
That is 100% possible as there is an option to do that in SVC, but I have no knowledge of how that works as I haven't tried 😅
@kureizekkukarasu4336 1 year ago
I just want to make an AI voice parody, why do I have to be a programmer?
@Jarods_Journey 1 year ago
A little bit of programming just gets you access to these cool things first, it's what motivates me to keep going. Easier solutions are out there and will start to become more available as technology grows though, so just gotta be patient
@3iu0wmxxpxyn8 1 year ago
Nice voice you have, but scary technology. Everything in this video will, in a few years, take like 10 seconds. We are not so far off, who knows ahah
@Jarods_Journey 1 year ago ⁺¹
Future is moving quick definitely xd, just gotta try and keep up is all we can do
@yashironene9380 1 year ago ⁺¹
this tutorial is confusing
@Jarods_Journey 1 year ago
Let me know what parts are confusing to you, I might be able to clarify any of the parts you need help on
@yashironene9380 1 year ago
I figured it out on my own, but e.g. from 15:26 the tutorial gets confusing!!
@Jarods_Journey 1 year ago ⁺¹
@yashironene9380 Well glad you were able to figure it out! 🤟 Might have to clarify some things, but this is an older tutorial so that's maybe why!
@KurtStaInes 1 year ago
Lol they still copyrighted this even though it is fair use
@Jarods_Journey 1 year ago ⁺¹
😂 forreals, I didn't know you guys could see the copyright part of it, but yeah, the Hey There Delilah part. If I didn't include the instrumental it would've been fine lol
@KurtStaInes 1 year ago
I guess it is fine, since the algorithm can also promote your video, but… if you change the chord progression and timing of the instrumental, it won't be copyrighted. Also, this is an educational video so it's under Fair Use in the YouTube guidelines
@Jarods_Journey 1 year ago
@KurtStaInes luckily though it's a harmless copyright claim as there aren't any punishments that came with this one. Definitely will keep an eye out in the future though lol, even though it should be educational
@zap0p3rr0tr3inta 1 year ago
18:32
@rudritarahman9719 3 months ago
#@title Pre-configure the setup
!svc pre-config
Output: 0% 0/5 [00:00
@Mike-xe8ql 1 year ago
Please go ahead stop saying "go ahead". It will make your speech sounds. Cheers.
@Ppppppppppppppp__ 11 days ago
I invite you to Islam and perform pray
@lalithperera4865 5 months ago
Too much waste of time, nobody can understand perfectly, sorry
@mediumgentium 1 year ago
Who asked...
@wildworld534 1 year ago
Hi, I wanted to train a voice and I got this warning message:
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I tried it in different ways but I cannot train. Maybe somebody knows what the problem could be?
