Realtime Text-to-Speech with GPT-SoVITS

Jarods Journey

3,125 views

Published on Dec 3, 2024

Comments

@gr8tbigtreehugger 4 days ago ⁺²
Many thanks, Jarod! I am also arm wrestling this for my voice bot. Appreciate all your insights!
@sinayagubi8805 3 days ago ⁺²
"No you dummy :)" so cute 😂
@fulldivemedia 2 days ago
Thanks for the update
@sinayagubi8805 3 days ago
I'm super excited to see this improve
@mikhailv4686 16 hours ago
Fish Speech with multi-language support has been released. Can you make a new fork with a web interface?
@samsoum1999 3 days ago ⁺¹
Can I get access to the git, please? I'm working on it and maybe I can help with it too 🙏
@soraygoularssm8669 3 days ago ⁺¹
Can we get the code for the streaming version you made?
@megamayo2500 4 days ago ⁺¹
I agree, the chunks are noticeable, but there are many ways to fix this, and I will look into it. It's interesting; maybe consider the 100-token setting as option B. The real-time speech is impressive, but I do question whether it's really worth matching the speed of something like XTTS. There is a cross-fade system in F5-TTS. As for the concern about sampling the output in chunks: a speculative thought is that it relates not just to the quality of the character's speech style and intonation, but also to the session of audio that was used for reference. The quick fix would be RVC, but this may not be necessary. This task, specifically, is a question of consistency.
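A minimal sketch of the cross-fade idea mentioned above, in plain Python. This is not F5-TTS's implementation; the function name `crossfade`, the linear ramp, and the assumption that consecutive chunks share `overlap` samples of the same audio are all illustrative choices.

```python
def crossfade(prev, nxt, overlap):
    """Join two lists of float samples with a linear cross-fade.

    Assumes (hypothetically) that the last `overlap` samples of `prev`
    and the first `overlap` samples of `nxt` cover the same stretch of audio.
    """
    overlap = min(overlap, len(prev), len(nxt))
    if overlap == 0:
        # Nothing to blend, fall back to plain concatenation.
        return list(prev) + list(nxt)

    head = list(prev[:len(prev) - overlap])
    tail = prev[len(prev) - overlap:]

    blended = []
    for i in range(overlap):
        # Weight of the incoming chunk rises linearly across the overlap.
        w = (i + 1) / (overlap + 1)
        blended.append((1.0 - w) * tail[i] + w * nxt[i])

    return head + blended + list(nxt[overlap:])
```

If the chunks do not actually overlap in time, a cross-fade like this smears two different pieces of audio together, so it only helps when the generator can be made to produce overlapping chunks.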
@Jarods_Journey 4 days ago
I'm moving towards a zero-crossing algorithm where I can take the nearest zero crossing and use that instead. Dunno how I'm gonna implement it yet, but some testing around tells me it's possible and the cut is no longer heard.
GPT-SoVITS does better on dialogue than what I've heard from XTTS/Tortoise, and handles things like laughter and sighing so much better for speech, imo
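A rough sketch of what "take the nearest zero crossing" could look like, in plain Python. This is not Jarod's code (the comment says the implementation is not decided yet); the function names and the "sign change or exact zero" definition of a crossing are assumptions made for illustration.

```python
def zero_crossings(samples):
    """Indices i where the signal changes sign (or touches zero)
    between samples[i - 1] and samples[i]."""
    return [i for i in range(1, len(samples))
            if samples[i - 1] * samples[i] <= 0.0]


def nearest_zero_crossing(samples, target):
    """Zero-crossing index closest to `target`, or None if there is none."""
    crossings = zero_crossings(samples)
    if not crossings:
        return None
    return min(crossings, key=lambda i: abs(i - target))


def split_at_zero_crossing(samples, target):
    """Split `samples` at the zero crossing nearest to `target`.

    Falls back to cutting exactly at `target` when no crossing exists.
    """
    cut = nearest_zero_crossing(samples, target)
    if cut is None:
        cut = target
    return samples[:cut], samples[cut:]
```

Cutting where the waveform passes through zero keeps both sides of the seam near zero amplitude, which is why the click from a hard cut tends to disappear.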
@isaacinnis 3 days ago
I do not know exactly what is happening with what you are currently using. I know in the past, TTS was technically an image generator. A spectrogram was diffused and a synthesizer would translate the image into audio. If that is still what is happening, perhaps using something like a panoramic image stitching program (algorithm) would be able to align and blur the overlapped portion. I do not have the skill to implement this, it is just a thought.
@SahilP2648 3 days ago
I wonder what Vedal uses. Neuro's latency is bonkers.
Edit: oh sorry I thought this was speech to text lol, my bad. I know Vedal uses play ht for his speech to text.
@bgmspot7242 4 days ago
Could you please make a video comparing all the voice models released to date, showing which is best and giving rankings (ElevenLabs, F5-TTS, etc.)?
@supermandem 4 days ago
Are you working with these guys to develop this?
@DooryardGarage 3 days ago
Is there any way to adjust the chunk size so it ends the generation on the low side of the waveform?
@Jarods_Journey 3 days ago
This is definitely the right thinking. I've finished some testing on a zero-crossing algorithm for merging chunks and it works beautifully; I just have to figure out how to implement it into the TTS sequence.
@16comic 3 days ago
Will this work with a 1660 Ti laptop?
@MahdeenSky 4 days ago
Jarod, you should try using a fast version of Whisper to convert your voice into TTS and see how it goes. Also, about the Neuro-sama thing: I like how you left out the possibility that Neuro-sama is not entirely locally hosted, which is probably true. The only thing he probably self-hosts is a real-time Whisper, which sends data to a server with powerful GPUs so it gets processed wayyyy faster.
@Jarods_Journey 4 days ago ⁺¹
I genuinely think Vedal self-hosts everything nowadays. Once upon a time I think he used OpenAI, but I don't think that's the case anymore.
Whisper turbo is nice, it's blazing quick!
@MahdeenSky 4 days ago
@Jarods_Journey hmm, he might just have an entirely different rig just for these tasks, which would have the same effect but locally. It's like using a different computer just for recording your gameplay or stream lol.
@dement242 4 days ago
Start with a small chunk, then generate larger and larger chunks?
@Jarods_Journey 4 days ago
Thought about this, the same issue will be present just at different points unfortunately 😅
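For reference, the "small first chunk, then larger ones" schedule from this exchange could be sketched as below. The name `growing_chunks` and the parameters `first`, `factor`, and `cap` are hypothetical, not GPT-SoVITS settings. As the reply notes, this only moves the seams; it does not remove them.

```python
def growing_chunks(tokens, first=10, factor=2, cap=160):
    """Yield slices of `tokens` whose size starts at `first` and is
    multiplied by `factor` after each chunk, up to `cap`."""
    size = first
    start = 0
    while start < len(tokens):
        yield tokens[start:start + size]
        start += size
        size = min(size * factor, cap)
```

A small first chunk keeps time-to-first-audio low, while the later, larger chunks reduce the number of seams per second of speech.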
@roygatz 4 days ago
It's not about the accumulation amount. The latest token is likely not complete, so instead of waiting for a longer chunk, you should wait for a shorter chunk like you did at first and just wait for one more token. E.g., wait for the next token but only process up to the latest -1 token, if that makes sense. Thank you for sharing, you drove me to finally take a serious look at SoVITS.
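One way to read the "latest -1 token" suggestion is as a hold-back buffer: a chunk is only handed to the decoder once at least one newer token has arrived. The sketch below is an interpretation, not code from the video; `stream_with_holdback`, `chunk_size`, and `holdback` are made-up names.

```python
def stream_with_holdback(token_iter, chunk_size, holdback=1):
    """Group a token stream into chunks of `chunk_size`, but emit a chunk
    only after `holdback` newer tokens exist, so the newest token is
    never decoded until its successor has arrived."""
    buffer = []
    for tok in token_iter:
        buffer.append(tok)
        if len(buffer) >= chunk_size + holdback:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:
        # The stream has ended, so the held-back tokens are final.
        yield buffer
```

The cost is a latency of `holdback` extra tokens per chunk, which is small compared with waiting for a much longer chunk.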
@Jarods_Journey 4 days ago ⁺¹
I think tokens are a little imprecise; I've done little experiments to confirm some of my suspicions.
I'm moving towards developing a little zero-crossing algorithm that can do a better job at concatenation, though the details will have to be explained in a video.
@讙 3 days ago
can you summarize this video, i dont speak japanese
@lokeshart3340 3 days ago
Sir does this work on CPU also?
@Da-Bolt 3 days ago
Yes but slow 🦥
@dathuynh-l4k 4 days ago
CAN YOU TELL ME. AFTER USING IT, F5 TTS SHUT IT DOWN. THE NEXT DAY I WANT TO OPEN IT, WHAT SHOULD I DO? WHEN I OPEN IT, I RUN AGAIN FROM THE BEGINNING AND IT LOST MY PREVIOUS TRAINING DATA
@KarenBrown-w9d 4 days ago
How can I get this software
@Jarods_Journey 4 days ago ⁺⁴
Probably once I finish I'll push an update to GitHub
@KarenBrown-w9d 4 days ago ⁺¹
@ i appreciate
