Many thanks, Jarod! I am also arm wrestling this for my voice bot. Appreciate all your insights!
"No you dummy :)" so cute 😂
Thanks for the update
I'm super excited to see this improve
Fish Speech with multilanguage support has been released. Can you make a new fork with a web interface?
Can I get access to the git, please? I'm working on it and maybe I can help with it too 🙏
Can we get the code for the streaming version you made?
I agree, the chunks are noticeable, but there are many ways to fix this, and I will look into it. It's interesting; maybe consider the 100-token size as option B. The real-time speech is impressive, but I do question whether it's really worth trying to match the speed of something like XTTS. There is a cross-fade system in F5-TTS. About the concern with samples coming out in chunks: a speculative thought is that it's not just the quality of the character's speech style and intonation, but also which session of audio was used as the reference. The quick fix would be RVC, but that may not be necessary. This task, specifically, is a question of consistency.
I'm moving towards a zero-crossing algorithm where I can take the nearest zero crossing and use that instead. Dunno how I'm gonna implement it yet, but some testing around tells me it's possible and the cut is no longer heard.
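A minimal sketch of that idea, purely illustrative and not the implementation from the video; `audio`, the window size, and the helper name are all made up for the example:

```python
# Snap a cut point to the nearest zero crossing so the chunk boundary does not
# land mid-wave and produce an audible click. Purely illustrative.
import numpy as np

def nearest_zero_crossing(audio: np.ndarray, cut: int, window: int = 2048) -> int:
    """Return the sample index of the zero crossing closest to `cut`."""
    lo, hi = max(cut - window, 0), min(cut + window, len(audio))
    segment = audio[lo:hi]
    # A zero crossing sits between two consecutive samples with different signs.
    crossings = np.where(np.diff(np.signbit(segment)))[0] + lo
    if crossings.size == 0:
        return cut  # no crossing nearby; keep the original cut point
    return int(crossings[np.argmin(np.abs(crossings - cut))])

# e.g. end a generated chunk on a zero crossing instead of an arbitrary sample:
# chunk = chunk[: nearest_zero_crossing(chunk, len(chunk) - 1)]
```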
GPT-SoVITS does better on dialogue than anything I've heard with XTTS/Tortoise, and handles things like laughter and sighing so much better for speech imo
I do not know exactly what is happening with what you are currently using. I know in the past, TTS was technically an image generator. A spectrogram was diffused and a synthesizer would translate the image into audio. If that is still what is happening, perhaps using something like a panoramic image stitching program (algorithm) would be able to align and blur the overlapped portion. I do not have the skill to implement this, it is just a thought.
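For what it's worth, when the overlapping pieces are already audio, "align and blur the overlapped portion" boils down to a cross-fade. A minimal sketch under that assumption (not what F5-TTS does internally; the overlap length is arbitrary):

```python
# Blend the tail of one chunk into the head of the next over a shared overlap
# region, so the seam is smeared out instead of being an abrupt jump.
import numpy as np

def crossfade(a: np.ndarray, b: np.ndarray, overlap: int = 1024) -> np.ndarray:
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return np.concatenate([a, b])
    fade_out = np.linspace(1.0, 0.0, overlap)  # tail of the first chunk fades out
    fade_in = 1.0 - fade_out                   # head of the second chunk fades in
    blended = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return np.concatenate([a[:-overlap], blended, b[overlap:]])
```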
I wonder what Vedal uses. Neuro's latency is bonkers.
Edit: oh sorry I thought this was speech to text lol, my bad. I know Vedal uses play ht for his speech to text.
Could you please make a video comparing all the voice models released to date, saying which is best and giving rankings (ElevenLabs, F5-TTS, etc.)?
Are you working with these guys to develop this?
Is there any way to adjust the chunk size so it ends the generation on the low side of the wave form?
This is definitely the right thinking. I've finished some testing on a zero-crossing algorithm for merging chunks and it works beautifully, I just have to figure out how to implement it into the TTS sequence
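As an illustration of what that merge step could look like (again just a sketch, not the code from the video): trim the end of one chunk back to its last zero crossing and the start of the next chunk forward to its first, then concatenate.

```python
# Join two generated chunks so the seam lands on zero crossings on both sides.
import numpy as np

def _sign_changes(x: np.ndarray) -> np.ndarray:
    return np.where(np.diff(np.signbit(x)))[0]

def merge_at_zero_crossings(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    ca, cb = _sign_changes(a), _sign_changes(b)
    end_a = int(ca[-1]) + 1 if ca.size else len(a)   # cut after the last crossing in a
    start_b = int(cb[0]) + 1 if cb.size else 0       # cut after the first crossing in b
    return np.concatenate([a[:end_a], b[start_b:]])
```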
will this work with 1660ti laptop?
Jarod, you should try using a fast version of Whisper to convert your voice into text for the TTS, and see how it is. Also, about the Neuro-sama thing, I like how you left out the possibility that Neuro-sama is not entirely locally hosted, which is probably true. The only thing he probably self-hosts is a real-time Whisper, which sends data to a server with powerful GPUs so it'd be processed wayyyy faster.
I genuinely think Vedal self-hosts everything nowadays. Once upon a time I think he used OpenAI, but I don't think that's the case anymore.
Whisper turbo is nice, it's blazing quick!
@Jarods_Journey hmm he might just have an entirely different rig just for these tasks which would have the same effect but locally. It's like using a different computer just for recording your gameplay or stream lol.
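For anyone who wants to try the turbo model mentioned above on the speech-to-text side, here is a minimal file-based sketch with the openai-whisper package (the audio path is a placeholder, and this is offline transcription rather than real-time streaming):

```python
# Quick test of Whisper's turbo checkpoint for transcription.
import whisper

model = whisper.load_model("turbo")            # requires a recent openai-whisper release
result = model.transcribe("mic_capture.wav")   # placeholder path to a recording
print(result["text"])
```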
Start with a small chunk, then generate larger and larger chunks?
Thought about this, the same issue will be present just at different points unfortunately 😅
It's not about the accumulation amount; the latest token is likely not complete. So instead of waiting for a longer chunk, you should wait for a shorter chunk like you first did, and just wait for one more token. E.g., wait for the next token but only process up to the latest-minus-one token, if that makes sense. Thank you for sharing, you drove me to finally seriously look at SoVITS
I think tokens are a little imprecise, I've done little experiments to confirm some of my suspicions.
I'm moving towards developing a little zero-crossing algorithm that can do a better job at concatenation, though the details will have to be explained in a video
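For reference, a tiny sketch of the "only process up to the latest-minus-one token" suggestion from a couple of comments up, with a made-up iterable standing in for the model's token stream:

```python
# Hold back the newest token because it may still be incomplete; release the
# previous one as soon as the next token arrives, and flush the last at the end.
def stream_stable_tokens(token_stream):
    pending = None
    for tok in token_stream:
        if pending is not None:
            yield pending      # the earlier token is now confirmed complete
        pending = tok
    if pending is not None:
        yield pending          # final token once generation has finished

# for tok in stream_stable_tokens(model_tokens()):   # model_tokens() is hypothetical
#     synthesize(tok)                                # hand only confirmed tokens to the TTS
```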
Can you summarize this video? I don't speak Japanese
Sir does this work on CPU also?
Yes but slow 🦥
Can you tell me: after using F5-TTS I shut it down, and the next day I want to open it again. What should I do? When I open it, I have to run everything again from the beginning and it has lost my previous training data
How can I get this software?
Probably once I finish I'll push an update to GitHub
@ I appreciate it