Training Any Language in AI Voice Cloning - Tortoise TTS

  • Published on 16 Sep 2024
  • Links referenced in the video:
    Previous Video - • Training Tortoise TTS ...
    Watch Before This Video (Tortoise TTS Installation) - • Local AI Voice Cloning...
    Tokenizers HuggingFace - huggingface.co...
    Hardware for my PC:
    Graphics Card - amzn.to/3pcREux
    CPU - amzn.to/43O66Ir
    Cooler - amzn.to/3p98TwX
    RAM - amzn.to/3NBAsIq
    SSD Storage - amzn.to/42NgMFR
    Power Supply (PSU) - amzn.to/430bIhy
    PC Case - amzn.to/447499T
    Motherboard - amzn.to/3CziMXI
    Alternative prebuilds to my PC:
    Corsair Vengeance i7400 - amzn.to/3p64r22
    MSI MPG Velox - amzn.to/42MnJHl
    Cheapest recommended PC:
    Cyberpower 3060 - amzn.to/3XjtZoP
    Come join The Learning Journey!
    Discord - / discord
    Github - github.com/Jar...
    TikTok - / jarodsjourney
    If you found anything helpful, please consider supporting me and the content I am trying to produce!
    www.buymeacoff...

Comments • 64

  • @bomar920
    @bomar920 7 months ago +1

    You're amazing, dude. I've been following you for almost a year and have learned a lot from your channel. Keep it up :)

  • @lunch69
    @lunch69 7 months ago +3

    Will Tortoise be able to work with Cyrillic characters if I make a tokenizer with Cyrillic characters?

  • @Vantaz
    @Vantaz 7 months ago +3

    Are you going to share your Japanese models at some point?
    I am working on a script that uses LLMs to generate sentences, which I turn into infinite comprehensible input: it scrapes Google Images for the words and uses ffmpeg to combine the audio and images into a video that, for every sentence, displays an image representing the words in that sentence.

    • @Jarods_Journey
      @Jarods_Journey  7 months ago

      I generally don't share my models, so that'll be the same in this case. As for infinite comprehensible input, that is a good one! I'd love to see a demo of that when you complete it.

  • @SAnsAN091190
    @SAnsAN091190 7 months ago +1

    Hi Jarod! I am glad to see your new video! Thank you!
    In fact, the most interesting thing I wanted to know is how you prepared the dataset for training. I asked about this under the previous video 😅. Well, I hope you will tell us about this soon)))

    • @SAnsAN091190
      @SAnsAN091190 7 months ago

      Regarding those clumps of red and green that you're talking about at 19:36, I've come across this too. The effect appears only when training is resumed. If I saved the results at epoch 10 and training was interrupted at epoch 11, then on resuming from epoch 10 the points from the previous run up to epoch 11 are still kept on the graph. While training continues from epoch 10 to epoch 11, each iteration therefore has two points, one from the current run and one from the previous run, and these show up as clumps of red and green. (I don't think this affects anything other than the readability of the graph, which becomes difficult to read for the current training period.)

    • @Jarods_Journey
      @Jarods_Journey  7 months ago

      Thanks! I might make a follow-up video on how I prepared my dataset here. I still generally follow the same approach I've discussed in other videos, though, so if you've seen those, it's not far off.

    • @SAnsAN091190
      @SAnsAN091190 7 months ago

      @Jarods_Journey Yes, I certainly follow the approaches you used earlier, but I'm still unsure about some of the things I'm doing and how much they may affect the final result.
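
The duplicated points described in this thread could be cleaned up before plotting. A minimal sketch, assuming the loss log is available as (iteration, loss) pairs in the order they were written; that list format and the function name `dedupe_by_step` are hypothetical and not taken from the training repo. It keeps only the most recent value for each iteration, so points from the interrupted run are overwritten by the resumed run.

```python
def dedupe_by_step(points):
    """Keep the last logged loss for each iteration.

    `points` is a list of (iteration, loss) pairs in logging order,
    so entries from a resumed run come after the interrupted run.
    """
    latest = {}
    for step, loss in points:
        latest[step] = loss  # later entries overwrite earlier ones
    return sorted(latest.items())


# Hypothetical log: iterations 2 and 3 were logged twice (before and after resume).
log = [(1, 0.90), (2, 0.80), (3, 0.70), (2, 0.85), (3, 0.72), (4, 0.60)]
for step, loss in dedupe_by_step(log):
    print(step, loss)
```

As the comment above notes, the duplication only affects how readable the graph is, so this is purely a display fix.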

  • @Oqalualaat
    @Oqalualaat 5 months ago +1

    Is that 800h dataset only one speaker? If I'm going to collect that much data, it would take me 100 years to transcribe it manually lol. I have no way to use a transcriber to do it automatically...

  • @juanjesusligero391
    @juanjesusligero391 7 months ago

    Oooh, this is great!! :D I want to try training a Spanish language voice! I'll watch this video asap! (I'm working now XD) Thank you very much for sharing it! :D

  • @ahmetalpergultekin
    @ahmetalpergultekin a month ago

    I want to hear the voice training of Charlie you have there ahahhaha

  • @mitchelljams
    @mitchelljams 7 months ago +1

    Hey Jarod! I've been watching all your videos and I think I might have a unique challenge. I'd like to remove a tremor in someone's voice. Since it's possible to voice clone in other languages, this doesn't seem impossible. I'm wondering how you would approach this?

  • @schakuun1995
    @schakuun1995 7 months ago

    As always, great video. Thanks!

  •  a month ago

    I don't understand where I went wrong. I'm training the Vietnamese language. I used about 1 hour of my voice for training and created a tokenizer with your Python file for the Vietnamese language "vi". Then I tested it with a sentence that was already in the audio sample. It produced a sound that was my voice. However, the sound produced was meaningless, not Vietnamese at all. Please tell me where I went wrong??

  • @3k3k3
    @3k3k3 7 months ago +1

    Just need that 4070 Ti Super, then I am going in..

  • @caesq_r
    @caesq_r 14 days ago

    Hi, after training my model, I tried to load its .pth file into Okada's AI voice changer, but it says that the .pth file is missing a "config" parameter or something. How do I fix that!?#@?!#@

  • @ChasingStars7111
    @ChasingStars7111 2 months ago

    Yo Jarod, thanks for the guide! Could you please make another guide on using a tokenizer for English voices?

  • @adamrastrand9409
    @adamrastrand9409 7 months ago +2

    Oh, so if it's a Latin-alphabet language, for example Swedish, Spanish, or German, could I just use the Whisper-transcribed Swedish text to train the model, or how will that come out?

    • @Jarods_Journey
      @Jarods_Journey  7 months ago

      I would say yes, as the text cleaners will normalize the accented characters. However, keeping them might be needed for proper accenting, and that is where a custom tokenizer might come in handy.

    • @ElmorenohWTF
      @ElmorenohWTF 7 months ago

      I would say you could try it, but you would run into some problems. For example, I saw a nanonomad video where he trained a Tortoise TTS model in Spanish, and although it worked, the accent was not a correct Spanish accent and the model had difficulties with some words. In the Spanish example Jarod shows in this video (I am Spanish), I would say the text has not been tokenized correctly, since there are certain letters that should not be separated (although I may be wrong; I do not know exactly how the tokenizer works).

    • @adamrastrand9409
      @adamrastrand9409 7 months ago

      @Jarods_Journey What do you mean by accented characters and stuff like that? I also wonder, will the model learn to speak the language that I'm training it on?
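
To make "normalizing accented characters" concrete, here is a minimal sketch using only Python's standard library. It is an assumption that this mirrors what the text cleaners in the training pipeline do; the function name `fold_to_ascii` is made up for illustration. The idea is to decompose each character into a base letter plus combining marks and then drop the marks.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Strip diacritics: decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Example words from Swedish, Spanish, and German.
for word in ["smörgåsbord", "niño", "schön"]:
    print(word, "->", fold_to_ascii(word))
```

After folding, "ö" and "o" look identical to the model, which is why a custom tokenizer that keeps the accented letters can matter for pronunciation.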

  • @allan59796
    @allan59796 6 months ago

    Jarod, thank you for the great tutorial! I really appreciate that your content is unique ✨️. I've got a question about tokenizers, by the way: many Turkic languages, including Turkish, have letters such as "s" and "ş" that both tokenize the same way, to "s". Won't that confuse the model? Those are two different letters, written and pronounced differently, but they are tokenized into one letter, so I think there's a chance the model will mispronounce them. What do you think about it? 🤔
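
One way to check whether a language is affected by this is to list which of its letters collapse onto the same base letter once diacritics are stripped. A rough sketch, assuming diacritic stripping via the standard library approximates the cleaner's behaviour; `fold` and `find_collisions` are hypothetical helper names.

```python
import unicodedata
from collections import defaultdict


def fold(ch: str) -> str:
    """Strip diacritics from a single character."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def find_collisions(alphabet: str) -> dict:
    """Group letters by folded form and keep groups with more than one letter."""
    groups = defaultdict(list)
    for ch in alphabet:
        groups[fold(ch)].append(ch)
    return {base: letters for base, letters in groups.items() if len(letters) > 1}


turkish_lower = "abcçdefgğhıijklmnoöprsştuüvyz"
for base, letters in find_collisions(turkish_lower).items():
    print(base, "<-", letters)
```

Any pair reported here is a candidate for distinct entries in a custom tokenizer vocabulary.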

  • @EfeSteve-on6gd
    @EfeSteve-on6gd 5 months ago

    Hi Jarod, nice channel you've got. Can you train a TTS tokenizer that can sing the lyrics of any song? Have you got a video on that? Cheers

  • @charleswang2515
    @charleswang2515 6 months ago

    great video!!!👍

  • @ElmorenohWTF
    @ElmorenohWTF 7 months ago

    Amazing video! I have a few questions:
    Approximately how large were the 840 hours of audio you used?
    Do you know where I could find a Tortoise TTS model in Spanish to fine-tune with the voice I want to train?
    Or maybe I could train my own model in Spanish and then fine-tune it, all inside the free version of Google Colab?

    • @Jarods_Journey
      @Jarods_Journey  7 months ago

      I used mp3 files at 22050 Hz and it came out to around 19 GB of data.
      Not too sure. You would probably have to train one up to fine-tune at the moment. As for Colab, the repository isn't set up for Colab. Some people have gotten it running, though, but I am not too sure.
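
The two figures in this thread (840 hours, about 19 GB) imply an average mp3 bitrate, which is handy for estimating the disk space a smaller dataset would need. A back-of-the-envelope sketch, assuming 1 GB means 10^9 bytes and that both figures are approximate; the function names are illustrative only.

```python
def average_bitrate_kbps(total_bytes: float, hours: float) -> float:
    """Average bitrate in kilobits per second for a dataset of the given size."""
    seconds = hours * 3600
    return total_bytes * 8 / seconds / 1000


def estimate_size_gb(hours: float, kbps: float) -> float:
    """Estimated size in GB (10^9 bytes) for `hours` of audio at `kbps`."""
    return kbps * 1000 / 8 * hours * 3600 / 1e9


kbps = average_bitrate_kbps(19e9, 840)
print(f"average bitrate: {kbps:.1f} kbps")
print(f"100 hours at that rate: {estimate_size_gb(100, kbps):.2f} GB")
```

This works out to roughly 50 kbps on average, so 100 hours of similarly encoded audio would need a little over 2 GB.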

  • @siamsurf
    @siamsurf 4 months ago

    13:45 haha, Charlie

  • @derBenIsPlaying
    @derBenIsPlaying 5 months ago

    There is so much documentation missing for Tortoise.
    For example, how to install new models. I have a model that is fully trained in Tortoise, but I just can't install it. It's a few files, but no documentation lists where you have to put them.
    They are for French, but it sounds nothing like French when installed, plus the interface doesn't seem to recognize them properly.
    And why do I have hundreds of autoregressive models to pick from after training a new model myself? Which one is the correct one to use???

  • @gorizon9802
    @gorizon9802 7 months ago

    Well, I have a question: for example, if I trained a model that spoke in English but wrote in Portuguese, would that work? I would love to know, because I want to dub games using TTS.

  • @agris350
    @agris350 6 months ago

    Hi, how do I fix this: during the training process, cmd always shows me the message "ai-voice-cloning>pause"?

  • @petals-gg7bc
    @petals-gg7bc 7 months ago

    What does this mean in the RVC client on Colab, please?
    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    chex 0.1.85 requires numpy>=1.24.1, but you have numpy 1.23.5 which is incompatible.
    torchdata 0.7.0 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.
    torchtext 0.16.0 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.
    torchvision 0.16.0+cu121 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.

  • @VGHOST008
    @VGHOST008 6 months ago

    Could you help me, please? I installed TTS recently, but my interface looks NOTHING like yours. Yours only has 5 tabs while mine has 15, and the settings don't look alike at all. There's no Generate, History, Utility, Training. It has Generation (Bark), Bark Voice Clone, MusicGen + AudioGen, RVC Beta Demo, Demucs Demo, Seamless M4Tv2 Demo, Magnet, Vocos, Tortoise TTS, Outputs, Favorites, Collections, Voices. It's not RVC, but it's not simply Voice Generation either. What should I download?

  • @creativenets2
    @creativenets2 7 months ago

    Hey Jarod, can I follow your installation on Mac? I know it says Windows, but I'm wondering if it will also work on Mac.

  • @SyntheticVoices
    @SyntheticVoices 7 months ago

    Do you think this could be better at pronunciation than XTTSv2? Making a German model would be interesting; I attempted one on Tortoise a few months back, but it wasn't great. So I'm not sure if there has been a big change since.

    • @Jarods_Journey
      @Jarods_Journey  7 months ago +1

      The pronunciation isn't horrendous for the Japanese one as far as I can tell, but there are a bunch of quirks with it. It seems to work well when fine-tuning on some voices but not others, so I can't really tell if it's going to be better than XTTS overall. However, I can say that so far it sounds better than the fine-tune I did of XTTS for Japanese, but that's just my initial impression.

    • @SyntheticVoices
      @SyntheticVoices 7 months ago

      @Jarods_Journey cool thanks

  • @WorldYuteChronicles
    @WorldYuteChronicles 6 months ago

    Great video!
    Converting to Latin is all you need, really?
    Even if the language you want to train contains a lot of special characters that are part of the International Phonetic Alphabet (like "ɖ, Ƒ, ɣ, ọ, ʋ") and is tonal? Leading to an actual voiced labiodental approximant when "ʋ" from the Latin IPA is written?

    • @Jarods_Journey
      @Jarods_Journey  6 months ago +1

      I'm starting to find out that it may not be as easy as that, but I have a follow-up video on some of the things I've been finding out.

    • @WorldYuteChronicles
      @WorldYuteChronicles 6 months ago

      @Jarods_Journey Alright. Thank you again. Your work is very valuable!

  • @dthSinthoras
    @dthSinthoras 7 months ago

    Can we get everything together for a German model? I can run the training, but I didn't quite understand where to get the tokenizer. And I could gather maybe 100 hours of training data, but not 800+. So if anyone wants to collaborate...

  • @rudritarahman9719
    @rudritarahman9719 2 months ago

    Is it possible to train the Bengali language with Tortoise TTS?

  • @danielkuperstein1835
    @danielkuperstein1835 7 months ago

    If I upload a model that is trained in Portuguese, will the generated output be in Portuguese? Will I be able to use an RVC audio file in English with text in Portuguese?

  • @lowskillpanda
    @lowskillpanda 29 days ago

    How can I voice a large text?

  • @daibaogoh5487
    @daibaogoh5487 7 months ago

    So how low-spec can you go to use this and RVC (not real time)? Could you do this on a laptop with a 4060 with 8 GB of VRAM?

  • @guycq
    @guycq 7 months ago

    The link didn't work in Tortoise. Any solution?

  • @iweiteh
    @iweiteh 7 months ago

    Hi Jarod, thanks so much for this demo! I learn so much from your videos. Keep up the great work! I followed your tutorial here and managed to train a Spanish model using a multi-speaker dataset. The training job took about 12 hours to complete successfully. After the training job, I tried generating a voice from the fine-tuned model. However, due to the volume of my training data, the generation process failed with an OOM error. The error indicated that it ran out of memory in the compute_latent process. I have about 25 hours of voice data in my training folder. I wonder if you have any suggestions on how to overcome this issue? I am using an A10 GPU with 24 GB of VRAM. Thanks in advance!

    • @Jarods_Journey
      @Jarods_Journey  7 months ago

      I don't show it in this video (it's in several others), but you usually move all your audio data and training files into a backup folder and run inference from only 2 samples so you don't get OOM.
      Another thing you can do is make a new voice folder and infer from there.

    • @iweiteh
      @iweiteh 6 months ago

      Thanks for your guidance, @Jarods_Journey!
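
The workaround from this thread (keep only a couple of samples in the voice folder and park everything else in a backup folder) can be scripted. A minimal sketch under stated assumptions: the folder layout, the extension list, and the function name `keep_few_samples` are all hypothetical, and the demo runs on a temporary directory so no real data is touched.

```python
import shutil
import tempfile
from pathlib import Path

AUDIO_EXTS = {".wav", ".mp3", ".flac"}  # assumed audio extensions


def keep_few_samples(voice_dir, backup_dir, keep=2):
    """Leave `keep` audio files in voice_dir and move everything else to backup_dir."""
    voice_dir, backup_dir = Path(voice_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    entries = sorted(voice_dir.iterdir())
    audio = [p for p in entries if p.is_file() and p.suffix.lower() in AUDIO_EXTS]
    kept = set(audio[:keep])
    moved = []
    for path in entries:
        if path not in kept:
            shutil.move(str(path), str(backup_dir / path.name))
            moved.append(path.name)
    return moved


# Demo on a throwaway directory with five clips and a training file.
with tempfile.TemporaryDirectory() as tmp:
    voice = Path(tmp) / "voices" / "my_voice"    # hypothetical layout
    backup = Path(tmp) / "backup" / "my_voice"   # must live outside the voice folder
    voice.mkdir(parents=True)
    for i in range(5):
        (voice / f"clip_{i}.wav").touch()
    (voice / "train.txt").touch()
    moved = keep_few_samples(voice, backup)
    remaining = sorted(p.name for p in voice.iterdir())
    print("kept:", remaining)
    print("moved:", moved)
```

Which two samples to keep is a judgment call; this sketch simply keeps the first two in alphabetical order.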

  • @glowstorm334
    @glowstorm334 7 months ago

    Hi, love the video. This is exactly what I was looking for. Could you please provide the training scripts for these, as I also want to train TTS in my native language?

    • @Jarods_Journey
      @Jarods_Journey  7 months ago

      The training scripts are inside of the AI Voice Training repo, so make sure you understand that process

    • @glowstorm334
      @glowstorm334 7 months ago

      Thanks for the reply. So all I have to do is prepare the dataset as shown in the train.txt file and training folder, and then run generate configuration to get a valid training dataset.

  • @elviskent9104
    @elviskent9104 3 months ago

    How long did it take?

  • @vitmine
    @vitmine 7 months ago

    Hi, I have one question:
    If I have the word čau and the tokenizer says ['ca', 'u'], will it work?

    • @Jarods_Journey
      @Jarods_Journey  7 months ago

      My intuition says yes as the model will learn to associate ča with ca. Now if there are two versions, č and c in your language, a custom tokenizer with both characters in it might be valuable if you find that it's not doing it correctly.

    • @vitmine
      @vitmine 7 months ago

      @Jarods_Journey Thanks for the reply. When I played cau and čau, they sounded the same. Now I'm going to try training again with my own tokenizer.

  • @Mosen_xd
    @Mosen_xd 7 months ago

    Thank you. What about the Google Colab case?

    • @Jarods_Journey
      @Jarods_Journey  7 months ago

      No one has adapted it well for Colab yet, so it's not available, unfortunately.

  • @ORDER-yl6qs
    @ORDER-yl6qs 7 months ago

    Is 10 GB of VRAM enough for training?

    • @SAnsAN091190
      @SAnsAN091190 7 months ago

      It seems like the minimum is 6 GB of VRAM. But people manage to run it even with 4 GB of VRAM)

    • @Jarods_Journey
      @Jarods_Journey  7 months ago

      Yes, you should be fine here

  • @soorenapars
    @soorenapars 7 months ago

  • @beysachpromax
    @beysachpromax 7 months ago

    Just now
