F5 Text to Speech Tutorial | Hit "Refresh" on Your AI Voice!

  • Published on 9 Jan 2025

Comments • 69

  • @stefankargl
    @stefankargl several months ago +4

    Hi, Thorsten, the community thrives because of people like you - thanks for your work!

    • @ThorstenMueller
      @ThorstenMueller several months ago

      Thank you for your very kind words 🥰

  • @LearnOpsViet
    @LearnOpsViet 8 days ago

    Thank you so much for your video; it’s truly valuable to me. Oh, and I really like your teaching style in the video as well. 😁😁

    • @ThorstenMueller
      @ThorstenMueller 7 days ago +1

      Thanks a lot, dear LearnOpsViet, for your kind feedback - happy you like it 😊.

  • @MarcRitzMD
    @MarcRitzMD 6 days ago

    Thorsten, installing it all with Pinokio is so much better. Installing everything with Pinokio in the AI world is so much better.

  • @kardiokode-g8v
    @kardiokode-g8v several months ago +1

    Hey @ThorstenMueller, great work as always! One thing that caught my eye: you mention that the code is released under the MIT licence, which is right. But I think it's also important to note that inference code and models usually have different licences (which you covered in other videos!). Here the model itself has a different licence: at 3:13 you can see at the top middle, and in the text under it, that the model files are licensed under CC-BY-NC-4.0, which means no commercial use. This means you cannot use generated voices for anything commercial, like voice-overs for YouTube channels or in companies. It would be great to have this information in your videos as well, since people using this in any commercial environment, or even on a simple monetized YouTube channel, can get into trouble if the owner enforces the licence. It would also be great if you could make a video with an overview of fully open and free TTS/cloning models that allow commercial use. I haven't seen such a list anywhere and I'm sure lots of people would be interested.

    • @ThorstenMueller
      @ThorstenMueller several months ago +1

      Thanks for the clarification 😊. I've seen another comment asking about usage as a voice-over - did you reply to this? I added your hint to the video description and linked you - hope that's okay with you.
      I added your video topic suggestion to my list, as I think it is a great idea 👍.

  • @magenta6
    @magenta6 2 months ago

    That was great!! Thanks for your content! I've got this running now and it is amazing!!

    • @ThorstenMueller
      @ThorstenMueller 2 months ago

      Thanks for your nice feedback 😊.

  • @fabiano8888
    @fabiano8888 5 days ago

    I couldn't resist. 😂

  • @beneadie3202
    @beneadie3202 several months ago

    that's really really good quality for open source

  • @GeorgAubele
    @GeorgAubele 2 months ago +2

    Thanks for your video. F5 TTS is absolutely stunning!
    Let's hope they will include other languages (GERMAN) soon. ;)

    • @GeorgAubele
      @GeorgAubele 2 months ago

      Additional question: Does the model "re-learn" the voice every time I want it to generate a sentence? Is there a way to learn the voice once and then use the trained model over and over again?

    • @ThorstenMueller
      @ThorstenMueller several months ago +1

      According to their community, they are working on additional languages, including German 😊

    • @lichtundliebe999
      @lichtundliebe999 several months ago +1

      huggingface "marduk-ra/F5-TTS-German"

  • @vijisrangoli
    @vijisrangoli several months ago

    OMG, you are a lifesaver for me!! Awesome!!

    • @ThorstenMueller
      @ThorstenMueller several months ago

      Wow, thanks for your kind feedback 😊.

  • @adamrastrand9409
    @adamrastrand9409 several months ago

    Hello Thorsten, I have heard that some languages in Piper TTS sound pretty bad, for example the Swedish model: when you train a new voice by fine-tuning from the existing checkpoint, it sounds quite bad. Is that true? The default Swedish NST voice sounds very monotone, but when you fine-tune from it, will it sound like me, or will it sound different, just with the pronunciation errors? And when you train from scratch, how many hours of speech do you need? I have an RTX 4060 16 GB card, so is that good for AI training? Also, do I need to set up Linux and Windows at the same time and fiddle around with complicated stuff? It's just easier to have a Windows setup and not worry about Windows Subsystem for Linux, so can I just do it with a command?

    • @ThorstenMueller
      @ThorstenMueller 20 days ago

      Hello, I only trained my German "Thorsten-Voice" Piper TTS voice, so I have no experience with other languages, their quality, or their need for training material. I used multiple hours of audio (around 10 for fine-tuning my Piper model), but I also played around with just 1000 phrases and these worked too. It's a bit of trial and error.

  • @BBZ101
    @BBZ101 3 days ago

    What are the system requirements to run it locally?

  • @loyd1298
    @loyd1298 5 days ago

    Hello, is there any way to use the model directly in Python, like XTTS v2? Gratefully

  • @-bret
    @-bret several months ago

    I tried this out on an RTX 3600 12 GB model and it's fast. Quicker than speaking, maybe 2x faster to process than to listen to. Sounds really good to me.

    • @ThorstenMueller
      @ThorstenMueller several months ago +1

      Thanks for your helpful comment and performance indicator on a 3600 👍🏻.

    • @-bret
      @-bret several months ago

      @ThorstenMueller I should have said it's paired with a Ryzen 2700. It's a pretty cheap rig now; I think you could buy both parts used for about 300 pounds on eBay: 30 pounds for the CPU and 270 for the GPU.
      Or wait a year and pick up a 3090 24 GB for the same price; they are currently sitting around 500. I did pick up a 24 GB Tesla (I forget the model number) from China for 245, which is good for really large LLMs.
      Thank you for showing me this, I have a project I can purposefully upgrade now.

    • @dtesta
      @dtesta 25 days ago

      Where can I buy a 3600? I only have a 3060...

    • @-bret
      @-bret 25 days ago

      @dtesta I'm sorry, I meant the RTX 3060 16 GB version

    • @dtesta
      @dtesta 25 days ago

      @-bret Cool! Where can I find that 16GB version? I only have 12GB.

  • @charlenechen2507
    @charlenechen2507 several months ago

    Hello Thorsten, could you check out and review PopPop AI text to speech?

    • @ThorstenMueller
      @ThorstenMueller several months ago

      Thanks for your topic suggestion 😊. I've added it to my todo list.

  • @FrankGraffagnino
    @FrankGraffagnino 2 months ago

    great stuff!

  • @ei23de
    @ei23de 2 months ago

    Haha, the F5 joke 😂.
    The progress is amazing, right?
    Still waiting for German support for F5...
    Anyway, in English it is already easy to create synthetic voice datasets, for Piper for example. Just an idea 😊

    • @ThorstenMueller
      @ThorstenMueller 2 months ago +1

      H(ei) 👋,
      thanks for your nice comment 😊 and yes, progress is really impressive.

  • @ernieprevost6555
    @ernieprevost6555 several months ago

    Hi Thorsten, thank you for another excellent tutorial. I have installed F5 on a Raspberry Pi 5 and it generates very good quality output but, as expected, it is very slow. I am trying to understand how F5 works: does it take a standard model and modify it in some way using the ref_text and audio before generating the desired output (gen_text)? Is there an intermediate stage that could be executed separately? Thanks, Ernie

    • @ThorstenMueller
      @ThorstenMueller several months ago

      Thanks for your nice feedback 😊. As I can't answer your question, you might want to ask it on their GitHub repo to get (useful) responses.

  • @mercuryin1
    @mercuryin1 several months ago

    I tried this morning and the cloned voices are the best I have ever used. I wonder if I can use the cloned voices in some way with Home Assistant, through, I don't know, Piper maybe? I can't find out whether this is possible with this software. Is it only TTS? Is it possible to synthesise a dataset with this? Thanks

    • @ThorstenMueller
      @ThorstenMueller several months ago

      AFAIK you can use Piper TTS voices in Home Assistant. But for this you have to record way more audio data to train/fine-tune a Piper TTS model. Do you know my video about Piper TTS voice cloning? th-cam.com/video/b_we_jma220/w-d-xo.html

  • @Marty72
    @Marty72 2 months ago

    I enjoyed the intro; it made me laugh.

    • @ThorstenMueller
      @ThorstenMueller 2 months ago

      I'm happy you liked it 😊.

  • @ikarosound2504
    @ikarosound2504 several months ago

    Thanks! Is it feasible to do all of that through scripted Python code?

    • @ThorstenMueller
      @ThorstenMueller several months ago +1

      Good point 👍🏻. I took a quick look but did not see an obvious solution for native Python integration.
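
One possible workaround, sketched here under assumptions rather than as a confirmed integration: wrap the project's command-line tool from Python with `subprocess`. The command name `f5-tts_infer-cli` and the flags `--ref_audio`, `--ref_text` and `--gen_text` are assumptions that should be checked against the installed version (for example with `--help`); the file name and texts are hypothetical.

```python
import subprocess
from typing import List


def build_f5_command(ref_audio: str, ref_text: str, gen_text: str) -> List[str]:
    """Build the argument list for the F5-TTS command-line tool.

    Assumption: the CLI is named `f5-tts_infer-cli` and accepts these flags.
    Verify this on your own installation before relying on it.
    """
    return [
        "f5-tts_infer-cli",
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


def synthesize(ref_audio: str, ref_text: str, gen_text: str) -> None:
    """Run the CLI; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_f5_command(ref_audio, ref_text, gen_text), check=True)


# Hypothetical file name and texts, for illustration only.
# Nothing is executed here; the command is only assembled and printed.
command = build_f5_command(
    "my_voice.wav",
    "This is my reference sentence.",
    "Hello from Python.",
)
print(" ".join(command))
```

Calling `synthesize(...)` with the same arguments would actually start the tool. Passing the arguments as a list avoids shell quoting problems with spaces in the texts.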

  • @magenta6
    @magenta6 2 months ago

    That whisper at the beginning really sounded like Stephan Molyneux?!!!

  • @dontmindbeingblindd
    @dontmindbeingblindd 2 months ago +1

    May I ask what gpu you are using, or if it is using a gpu?

    • @RaminAssadollahi
      @RaminAssadollahi several months ago

      When you start Gradio the first time and the model is downloading, it shows PyTorch loading the models onto the CPU. I'll investigate that.

    • @RaminAssadollahi
      @RaminAssadollahi several months ago

      Correction: I'm running it on a 1080 Ti, and it takes 16 sec to synthesise 4 sec of speech. I don't know whether it's always re-analysing the reference as well.

    • @RaminAssadollahi
      @RaminAssadollahi several months ago

      Okay, further investigation: I left the output text the same but uploaded a longer reference, and it then also takes longer to synthesise. So the total time comprises processing the reference as well as synthesis. It would be interesting to see how much time synthesis alone would take...
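
The timings reported in this thread can be compared with a real-time factor (processing time divided by the length of the generated audio). This is only a small arithmetic sketch; the function name is made up for illustration, and the figures are the ones quoted in the comments (16 sec for 4 sec of speech on the 1080 Ti, and "maybe 2x faster to process than to listen to" on the other card).

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio length.

    Values below 1.0 mean synthesis is faster than playback.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# 16 seconds of processing for 4 seconds of speech (1080 Ti report).
rtf_1080ti = real_time_factor(16.0, 4.0)

# "2x faster to process than to listen to": 1 second per 2 seconds of audio.
rtf_fast_card = real_time_factor(1.0, 2.0)

print(rtf_1080ti, rtf_fast_card)
```

Since a longer reference clip also increased the total time, the factor measured this way includes reference processing, not synthesis alone.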

    • @ThorstenMueller
      @ThorstenMueller several months ago

      If you use F5 on Hugging Face, it will use a random GPU that is available at that moment. If you use it locally without CUDA (NVIDIA GPU), it will use the CPU.
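
To see which of the two a local setup would end up on, a quick check like the following can help. It is a minimal sketch, assuming a PyTorch-based installation; it only reports what PyTorch can see and is not how F5-TTS itself selects its device.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees an NVIDIA GPU, else "cpu"."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Inference would run on: {device}")
```

If this prints `cpu` on a machine with an NVIDIA card, the installed PyTorch build was most likely built without CUDA support.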

  • @SimpleTechAI
    @SimpleTechAI 2 months ago +1

    I tried it and it works, but it did not sound like me. Nothing close to what you did. Not a fan at this time; it really should have done better. Thanks for sharing, you got my thumbs up...

    • @ThorstenMueller
      @ThorstenMueller 2 months ago +1

      Thanks for your "thumb up" and sorry to hear it didn't work for you as expected.

    • @SimpleTechAI
      @SimpleTechAI 2 months ago

      @ThorstenMueller Not your fault, you laid it out perfectly. It's probably the quality of my samples.
      Thanks again

  • @SyamsQbattar
    @SyamsQbattar several months ago

    Is online Huggingface better than local?

    • @ThorstenMueller
      @ThorstenMueller several months ago

      The TTS model is the same. It's just a question of your locally available compute power. In my case Hugging Face was faster.

  • @RaminAssadollahi
    @RaminAssadollahi several months ago

    What GPU do you have on your computer?

    • @ThorstenMueller
      @ThorstenMueller several months ago +1

      An NVIDIA 1050 Ti in this case.

    • @dtesta
      @dtesta 25 days ago

      @ThorstenMueller You need an RTX card for this kind of thing. Anything else would be dogshit :)

  • @HendersonHood
    @HendersonHood 2 months ago

    You made a reference to your computer speed. Care to elaborate on its GPU, CPU and RAM?

    • @ThorstenMueller
      @ThorstenMueller 2 months ago +1

      You're absolutely right. I forgot to add it to the description. Thanks to your hint, my computer specs are now in the description 😊.

  • @mogbattlesapp
    @mogbattlesapp 2 months ago

    can this be deployed and hosted on a server?

  • @AmaymonF
    @AmaymonF 2 months ago

    Great

    • @ThorstenMueller
      @ThorstenMueller 2 months ago +1

      Thank you 😊, i'm impressed by f5 too.

  • @christoph9620
    @christoph9620 several months ago +1

    Hello Thorsten, thanks for your great channel. I came across these videos, which show how one can train F5 with different languages: th-cam.com/video/UO4usaOojys/w-d-xo.html th-cam.com/video/RQXHKO5F9hg/w-d-xo.html As you are experienced with training speech models, I am wondering how many hours of material would be required to train a German language model in good quality, and what should be considered regarding the training data. In the referenced YouTube video the creator simply takes audiobooks. Can one expect to get a good quality model in this way?

    • @ThorstenMueller
      @ThorstenMueller several months ago

      Hello Christoph, thanks for your nice feedback on my channel 😊.
      Currently F5 TTS can't be trained in German, but they are working on it. github.com/SWivid/F5-TTS/issues/87#issuecomment-2418043522
      For my German "Thorsten-Voice" datasets I recorded over 30k audio files, but this should not be required now.

  • @JubayerAhmed-f5i
    @JubayerAhmed-f5i several months ago

    Can we use it for making YouTube videos and monetize them? I mean, is it legal?

    • @PatrickAngwin
      @PatrickAngwin several months ago

      I'm no expert, but from what I understand, no, because although the f5 model itself is open source and available to use commercially, the license for the dataset on which it was trained is restricted and does not allow commercial use. I would love someone to tell me I'm wrong about this as I was getting really excited about f5 until I found this out...

    • @ThorstenMueller
      @ThorstenMueller several months ago

      I cannot give any legal advice. Here (huggingface.co/SWivid/F5-TTS) it says:
      "2024/10/14. We change the License of this ckpt repo to CC-BY-NC-4.0 following the used training set Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause. Our codebase remains under the MIT license."
      So I guess @PatrickAngwin is right.