This is nice and saves a lot of time. Even so, while most audio segments are split nicely, some are not; for example, some may be cut off at the end. In such cases, it may be beneficial to implement a secondary process specifically designed to identify and eliminate those 'rejects' from the dataset. I'm finding it's a delicate balance of adjusting the parameters for a particular dataset, but rejects may still occur and are best removed, if possible, for improved training.
Thanks for your feedback 😊. Yes, it's a bit of trial and error to adjust the parameters so that sentences are split correctly. In most cases it will still require manual checking and adjusting afterwards, but it's way better than doing the whole process manually 🙃.
Thanks for explaining. It's definitely way better than doing it manually. Also, I don't know if my thought about removing rejects was wise, as in some cases those chunks may form part of a sentence. I'm still trying to understand the how and why.
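A minimal sketch of such a secondary reject-detection pass, assuming the clips are WAV files in a `wavs/` folder (folder name and duration thresholds are illustrative, not from the video): flag clips whose duration is far outside the expected sentence length, since those are the ones most likely cut off or merged.

```python
import wave
from pathlib import Path

def clip_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()

def is_reject(duration, min_s=1.0, max_s=15.0):
    """Heuristic: clips far outside the expected sentence length
    are likely cut off or merged and worth a manual check."""
    return duration < min_s or duration > max_s

def find_rejects(wav_dir="wavs", min_s=1.0, max_s=15.0):
    """List candidate rejects instead of deleting them outright."""
    return [p for p in Path(wav_dir).glob("*.wav")
            if is_reject(clip_duration(p), min_s, max_s)]
```

As noted above, a short chunk may still be a valid part of a sentence, so it is safer to review the flagged files manually than to delete them automatically.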
Cheers, that is a brilliant video! I have a question: could I prepare a dataset for a singing voice too?
Thanks for your kind feedback 😊. A singing voice dataset sounds like an interesting use case 👍, but I don't have any experience with that (yet).
Please focus and zoom in on the area that you are talking about; it doesn't need to be fancy. Thank you.
So, in principle, I can record German and English sentences, since Whisper will recognise both. How does Piper handle two languages at once? Will it be able to learn German and English phonetics together?
For Whisper: yes.
For Piper: IMHO this will not work perfectly right now. Since everyday German speech uses lots of English words, switching the phoneme language is important, but IMHO this does not work perfectly out of the box. Maybe you can preprocess the text before running TTS.
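One hedged way to do that preprocessing: before handing mixed German/English text to a German-only phonemizer, replace known English loanwords with rough German phonetic spellings so they are pronounced acceptably. The word list below is purely illustrative and would need to be extended per dataset.

```python
import re

# Illustrative mapping of common English loanwords to rough
# German-style phonetic spellings; extend as needed.
LOANWORDS = {
    "computer": "kompjuter",
    "download": "daunlohd",
    "meeting": "mieting",
}

def germanize(text):
    """Replace known English loanwords (case-insensitively) with
    German-style spellings before running a German TTS phonemizer."""
    pattern = re.compile(r"\b(" + "|".join(LOANWORDS) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: LOANWORDS[m.group(0).lower()], text)
```

This is a workaround, not a real mixed-language solution; proper code-switching support would have to come from the phonemizer itself.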
Thx for sharing. Like
Hi, how can I invoke CUDA usage?
IMHO, CUDA should be automatically detected and used by Whisper. Is it installed in your (venv) environment?
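A quick way to check, inside the same (venv) environment, whether PyTorch (which Whisper runs on) can actually see a GPU; the import guard just lets the snippet fall back to CPU when torch is not installed:

```python
# Detect whether CUDA is available to PyTorch; Whisper picks the
# GPU automatically when it is, but an explicit check helps debugging.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    # torch is not installed in this environment
    device = "cpu"

print(f"Whisper would run on: {device}")
```

With openai-whisper you can also pass the device explicitly, e.g. `whisper.load_model("base", device=device)`.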
5:16 This size is good enough for me.
Thanks for your feedback on the font size/scale 😊.
Thanks. Can I run it on the CPU? I don't have a GPU.
Yes, that's possible. It's just slower than with a GPU.
Cool! Does it work with other languages than English?
Yes, Whisper automatically detects the spoken language. It works for all languages supported by Whisper. I tried it with German too and it really worked very well 😊.
@ThorstenMueller OK, thanks for the info.
Does it work on recordings other than English? Like Arabic, for example?
Hi, this should work for all languages that are supported by Whisper STT.
Also, I'm curious about the most recently added "3rd column with cleaned/lowered text". What do you have planned?
According to the original LJSpeech dataset (keithito.com/LJ-Speech-Dataset/), the 3rd column is the "Normalized Transcription", which is required by some TTS projects. Normally you would replace strings like "mr." with "mister" and "2" with "two". I just made it lowercase and am thinking about how I can integrate text cleaners that work for multiple languages.
Ahh, I see. I recall different datasets having, e.g., "2" in one case and "two" in another. Interesting, much appreciated.
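A minimal sketch of such a normalizer for the third LJSpeech column, along the lines described above. The abbreviation table and single-digit handling here are illustrative only; a real multi-language cleaner would need one table per language and a proper number-to-words library.

```python
import re

# Illustrative English expansions for the "Normalized Transcription"
# column; a multi-language cleaner would need one table per language.
ABBREVIATIONS = {"mr.": "mister", "mrs.": "misses", "dr.": "doctor"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text):
    """Lowercase, expand abbreviations, and spell out standalone
    single digits, as in the LJSpeech third column."""
    text = text.lower()
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Standalone single digits only; multi-digit numbers need a
    # real number-to-words library.
    return re.sub(r"\b(\d)\b", lambda m: DIGITS[m.group(1)], text)
```

For example, `normalize("Mr. Smith has 2 cats")` yields `"mister smith has two cats"`.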
Thank you so much for the video! It helps so much to automate the process and saves lots of time. As you are running on a Mac... do you have any video planned on how to use the dataset on a Mac to create the voice as well? Or any updated tutorials on how to use it in an updated Google Colab or Lightning Studio? (That would be amazing, as Google Colab is a pain in the butt these days :).)
Thanks for your comment 😊. I mostly use Linux to train a TTS voice model on a voice dataset. I didn't know about Lightning Studio, but it looks promising at first glance. Thanks for pointing it out 👍.
Informative ❤
The video is exactly what I need :) Thank you so much!
Happy to hear that, you're welcome 😊.
I really love your surprise ❤
Thank you. Since I am not a native English speaker, I am sometimes surprised at what I say 😅.
A follow-up video on what exactly can be done with the generated dataset and how to proceed would be great. If one already exists somewhere, please link it.
With your own LJSpeech voice dataset you can clone your voice, either with Coqui TTS or (preferably) with Piper TTS:
* Coqui: th-cam.com/video/4YT8WZT_x48/w-d-xo.html
* Piper: th-cam.com/video/b_we_jma220/w-d-xo.html