Local Low Latency Speech to Speech - Mistral 7B + OpenVoice / Whisper | Open Source AI

  • Published Jul 3, 2024
    👊 Become a member and get access to GitHub:
    / allaboutai
    🤖 AI Engineer Course:
    scrimba.com/?ref=allabtai
    Get a FREE 45+ ChatGPT Prompts PDF here:
    📧 Join the newsletter:
    www.allabtai.com/newsletter/
    🌐 My website:
    www.allabtai.com
    Openvoice:
    github.com/myshell-ai/OpenVoice
    LM Studio:
    lmstudio.ai
    I created a local low latency speech to speech system with LM Studio, Mistral 7B, OpenVoice and Whisper. This works 100% offline, uncensored and without dependencies like APIs etc. Still working on optimizing the latency. Running on a 4080.
    00:00 Intro
    00:31 Local Low Latency Speech to Speech Flowchart
    01:32 Setup / Python Code
    05:13 Local Speech to Speech Test 1
    07:06 Local Speech to Speech Test 2
    10:06 Local Speech to Speech Simulation
    12:37 Conclusion
  • Science & Technology

Comments • 172

  • @JohnSmith762A11B
    @JohnSmith762A11B 5 months ago +50

    More suggestions: add a "thought completed" detection layer that decides when the user has finished speaking based on the STT input so far (based upon context and natural pauses and such). It will auto-submit the text to the AI backend. Then have the app immediately begin listening to the microphone at the conclusion of playback of the AI's TTS-converted response. Yes, sometimes the AI will interrupt the speaker if they hadn't entirely finished what they wanted to say, but that is how real human conversations work when one person perceives the other has finished their thought and chooses to respond. Also, if the user says "What?" or "(Could you) Repeat that" or "Please repeat?" or "Say again?" or "Sorry, I missed that." the system should simply play the last WAV file again without going for another round trip to the AI inference server and doing another TTS conversion of the text. Reserve the Ctrl-C for stopping and starting this continuous auto-voice recording and response process instead. This will shave many precious milliseconds off the latency and make the conversation much more natural and less like using a walkie-talkie.
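The replay idea in this comment can be sketched in a few lines. This is a hypothetical illustration, not the video's actual code: `play_wav` and `query_llm_and_tts` stand in for whatever audio-playback and LLM + TTS round-trip functions the pipeline uses, and the repeat-request patterns are an assumption.

```python
import re

# Phrases that should replay the cached response instead of hitting the LLM.
# (Assumed list, matching the suggestions in the comment above.)
REPEAT_PATTERNS = re.compile(
    r"^(what\??|say again\??|(please )?repeat( that)?\??|sorry,? i missed that\.?)$",
    re.IGNORECASE,
)

class ReplayCache:
    def __init__(self):
        self.last_wav_path = None  # path of the last TTS output we played

    def handle_transcript(self, text, play_wav, query_llm_and_tts):
        """Replay the cached WAV for repeat requests; otherwise do the
        normal LLM -> TTS round trip and cache the new WAV path."""
        if self.last_wav_path and REPEAT_PATTERNS.match(text.strip()):
            play_wav(self.last_wav_path)  # no STT->LLM->TTS round trip
            return self.last_wav_path
        self.last_wav_path = query_llm_and_tts(text)  # returns a WAV path
        play_wav(self.last_wav_path)
        return self.last_wav_path
```

With this, "Say again?" costs only a file playback, which is exactly the latency saving the comment describes.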

    • @SaiyD
      @SaiyD 5 months ago +1

      Nice, let me give one suggestion on top of your suggestion: add a random choice with a 50% chance to replay the audio or send your input to the backend.

    • @ChrizzeeB
      @ChrizzeeB 5 months ago

      So it'd be sending the STT input again and again with every new word detected, rather than just at the end of a sentence or message?

    • @deltaxcd
      @deltaxcd 4 months ago +3

      I have a better idea: feed it a partial prompt without waiting for the user to finish, and have it start generating a response on the slightest pause. If the user continues talking, more text is added to the prompt and the output is regenerated. If the user talks over the speaking AI, the AI terminates its response and continues listening.
      This improves things two-fold, because the model gets a chance to process the partial prompt, which reduces the time required to process the prompt later.
      If we combine that with not waiting for the full reply, conversation will be completely natural.
      There is no need for any of that "say again" handling, because the AI will do that by itself if asked.
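The speculative-generation scheme above can be captured in a tiny state machine. This is a sketch under stated assumptions: `generate` stands in for any local LLM call (e.g. an LM Studio endpoint), the 0.4 s pause threshold is an arbitrary tunable, and real audio/silence detection is out of scope.

```python
class SpeculativeResponder:
    """Start generating a reply during brief pauses; discard and
    regenerate the draft whenever the user adds more words."""

    def __init__(self, generate, pause_sec=0.4):
        self.generate = generate    # prompt -> reply (LLM stand-in)
        self.pause_sec = pause_sec  # "slightest pause" threshold
        self.parts = []             # transcript chunks heard so far
        self.draft = None           # speculative reply, if any

    def on_words(self, text):
        """Called when the STT layer emits new words."""
        self.parts.append(text)
        self.draft = None           # prompt grew: old draft is stale

    def on_silence(self, seconds):
        """Called on silence ticks; returns a reply once one is ready."""
        if self.parts and seconds >= self.pause_sec and self.draft is None:
            self.draft = self.generate(" ".join(self.parts))
        return self.draft
```

Because the draft is computed during the pause, the reply is already available the moment the system decides the user is done, which is the two-fold saving the comment describes.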

  • @Canna_Science_and_Technology
    @Canna_Science_and_Technology 5 months ago +19

    Awesome! Time to replace my slow speech to speech code using OpenAI. Also added ElevenLabs for a bit of a comedic touch. Thanks for putting this together.

  • @ales240
    @ales240 5 months ago +1

    Just subscribed! can't wait to get my hands on it, looks super cool!

  • @williamjustus2654
    @williamjustus2654 5 months ago +11

    Some of the best work and fun that I have seen so far. Can't wait to try on my own. Keep up the great work!!

  • @tommoves3385
    @tommoves3385 5 months ago +1

    Hey Kris - that is awesome. I like it very much. Great that you do this open source stuff. Very cool 😎.

  • @ryanjames3907
    @ryanjames3907 5 months ago +1

    very cool, low latency voice, thanks for sharing, i watch all your videos, and i look forward to the next one,

  • @BruceWayne15325
    @BruceWayne15325 5 months ago +16

    very impressive! I'd love to see them implement this in smartphones for real-time translation when visiting foreign countries / restaurants.

    • @optimyse
      @optimyse 5 months ago +1

      S24 Ultra?

    • @deltaxcd
      @deltaxcd 4 months ago

      there are models that do speech-to-speech translation

  • @codygaudet8071
    @codygaudet8071 3 months ago

    Just earned yourself a sub sir!

  • @DihelsonMendonca
    @DihelsonMendonca 6 hours ago

    That's wonderful. I wish I had the knowledge to implement that on my LLMs in LM Studio.

  • @aladinmovies
    @aladinmovies 5 months ago

    Good job. Interesting video

  • @researchforumonline
    @researchforumonline 5 months ago

    wow very cool! Thanks

  • @cmcdonough2
    @cmcdonough2 1 month ago

    This was great 😃👍

  • @swannschilling474
    @swannschilling474 5 months ago +3

    I am still using Tortoise but Open Voice seems to be promising! 😊 Thanks for this video!! 🎉🎉🎉

  • @nyny
    @nyny 5 months ago +13

    That's supah cool, I actually built something almost exactly like this yesterday, and I get about the same performance. The hard part is figuring out threading/process pools/asyncio to get that latency down. I used small instead of base. I think I get about the same response or better.

    • @user-rz6pp5my4t
      @user-rz6pp5my4t 5 months ago +7

      Hi ! Very impressive !! Do you have a github to share your code ?

    • @CognitiveComputations
      @CognitiveComputations 4 months ago

      can we see your code please

    • @limebulls
      @limebulls 4 months ago

      I'm interested in it as well

  • @arvsito
    @arvsito 5 months ago +1

    It will be very interesting to see this in a web application

  • @avi7278
    @avi7278 5 months ago +5

    In the US we have this concept: if you watch a football game, which is notorious for having a shizload of commercials (i.e. latency), and you start watching the game 30 minutes late but from the beginning, you can skip most of the commercials. If you just shift the latency to the beginning, 15 seconds of "loading" would probably be sufficient for a 5-10 minute conversation between the two chatbots. You could also avoid loops by having a third-party observer who reviews the last 5 messages, determines if the conversation has gone "stale", and interjects a new idea into one of the interlocutors.
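The "observer" part of this comment is easy to prototype. A minimal sketch, assuming staleness can be approximated by low lexical variety over the last few messages (the window size and threshold are made-up tunables, and a real implementation might ask an LLM to judge instead):

```python
def is_stale(messages, window=5, min_unique_ratio=0.5):
    """Flag a conversation as stale when the last `window` messages
    share too much vocabulary (low ratio of unique words)."""
    recent = messages[-window:]
    if len(recent) < window:
        return False  # not enough history to judge yet
    words = [w.lower() for m in recent for w in m.split()]
    return len(set(words)) / max(len(words), 1) < min_unique_ratio
```

When `is_stale` fires, the hub could inject a fresh topic into one bot's next prompt, breaking the loop without any human intervention.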

  • @MelindaGreen
    @MelindaGreen 4 months ago +4

    I'm daunted by the idea of setting up these development systems just to use a model. Any chance people can bundle them into one big executable for Windows and iOS? I sure would love to just load-and-go.

  • @zyxwvutsrqponmlkh
    @zyxwvutsrqponmlkh 4 months ago +2

    I have tried open voice and bark, but VITS by far makes the most natural sounding voices.

  • @denisblack9897
    @denisblack9897 5 months ago +1

    I've known about this for more than a year now and it still blows my mind. wtf

  • @user-bd8jb7ln5g
    @user-bd8jb7ln5g 5 months ago

    This is great. But personally I think a speech recognition with push to talk or push to toggle talk is most useful.

  • @LFPGaming
    @LFPGaming 5 months ago +2

    Do you know of any offline/local way to do translations? I've been searching but haven't found a way to do local translations of video or audio using Large Language Models.

    • @deltaxcd
      @deltaxcd 4 months ago +1

      there is a program "subtitle edit" which can do that

  • @zedboiii
    @zedboiii 1 month ago +1

    that's some Bethesda level of conversation

  • @josephtilly258
    @josephtilly258 2 months ago

    Really interesting. A lot of it I can't understand because I don't know coding, but speech to speech could be a big thing within a few years.

  • @yoagcur
    @yoagcur 5 months ago +1

    Fascinating. Any chance you could upgrade it so that specific voices could be used and a recording made automatically? Could make for some interesting Biden v Trump debates.

  • @deeplearningdummy
    @deeplearningdummy 4 months ago +3

    I've been trying to figure out how to do this. Great job. I want to support your work and get this up and running for myself, but is YouTube membership the only option?

  • @PhillipThomas87
    @PhillipThomas87 5 months ago +7

    I mean, this is dependent on your hardware... Are the specs anywhere for this "inference server"?

  • @darcwader
    @darcwader 1 month ago

    This was more comedy show than tech, lol. Such hilarious responses from Johnny.

  • @Embassy_of_Jupiter
    @Embassy_of_Jupiter 5 months ago +7

    This gave me an interesting idea. One could build streaming LLMs that at least partially build thoughts one word at a time (I mean the input, not the output).
    Basically precomputing most of the final embedding with an unfinished sentence, so that when it has the full sentence and it's time to answer, it only has to go through just a few very low latency, very cheap layers.
    Different but related idea: similarly, you could actually feed unfinished sentences into Mistral with a prompt that says "this is an unfinished sentence, say INTERRUPTION if you think it is an appropriate time to interrupt the speaker", to make the voice bot interrupt you. Like a normal person would. Would make it feel much more natural.
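The second idea in this comment needs almost no machinery. A hedged sketch: the prompt wording is invented here, and `complete` stands in for any local completion call (e.g. against an LM Studio server), not the video's actual code.

```python
# Assumed prompt template for the "INTERRUPTION" trick described above.
INTERRUPT_PROMPT = (
    "Below is an unfinished sentence from a live transcript. "
    "Reply with the single word INTERRUPTION if now is an appropriate "
    "moment to interrupt the speaker, otherwise reply WAIT.\n\n"
    "Transcript so far: {partial}"
)

def should_interrupt(partial_sentence, complete):
    """Ask the LLM whether to barge in, given a partial transcript.
    `complete` is a stand-in for the model call (prompt -> text)."""
    answer = complete(INTERRUPT_PROMPT.format(partial=partial_sentence))
    return answer.strip().upper().startswith("INTERRUPTION")
```

Run this on every STT chunk; when it returns True, cut the user's recording short and submit the prompt, giving the "interrupting like a person" behavior.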

    • @deltaxcd
      @deltaxcd 4 months ago

      Actually AI can do that: you can feed it a partial prompt, let it process it, then add more and continue from where you left off. That's a huge speedup.
      But prompt processing is pretty fast anyway.
      To make it respond faster you need to let it speak before it finishes "thinking".

  • @gabrielsandstedt
    @gabrielsandstedt 5 months ago +8

    If you are fine venturing into C# or C++ then I know how you can improve the latency and create a single .exe that includes all of your different parts here, including using local models for the Whisper voice recognition. I have done this myself using LLamaSharp for running the GGUF file, and then embedding all external Python into a batch process which it calls.

    • @matthewfuller9760
      @matthewfuller9760 2 months ago +1

      code on github?

    • @gabrielsandstedt
      @gabrielsandstedt 2 months ago +2

      @@matthewfuller9760 I should put it there, actually. I have been jumping between projects lately without sharing much. Will send a link when it is up.

    • @matthewfuller9760
      @matthewfuller9760 2 months ago

      @@gabrielsandstedt cool

  • @arkdirfe
    @arkdirfe 4 months ago

    Interesting, this is similar to a small project I made for myself. But instead of a chatbot conversation, the Whisper output is fed into SAM (yes, the funny robot voice) and sent to an audio output. Basically makes SAM say whatever I say with a slight delay. I'm chopping up the speech into small segments so it can start transcribing while I speak for longer, which introduces occasional weirdness, but I'm fine with that.

  • @normanalc
    @normanalc 26 days ago

    I'd like to get a copy of the script please, this one is really cool! thanks for sharing this.

  • @JohnSmith762A11B
    @JohnSmith762A11B 5 months ago +4

    I wonder if you are (or can, if not) caching the processed .mp3 voice model after the speech engine processes it and turns it into partials. That would cut out a lot of latency if it didn't need to process those 20 seconds of recorded voice audio every time. Right now it's pretty fast but the latency still sounds more like they are using walkie talkies than speaking on a phone.

    • @levieux1137
      @levieux1137 5 months ago +3

      it could go way further by using the native libs and dropping all the python-based wrappers that pass data between stages using files and that copy, copy, copy and recopy data all the time. For example llama.cpp is clearly recognizable in the lower layers, all the tunable parameters match it. I don't know for openvoice for example however, but the state the presenter arrived at shows that we're pretty close to reaching a DIY conversational robot, which is pretty cool.

    • @JohnSmith762A11B
      @JohnSmith762A11B 5 months ago

      @@levieux1137 By native libs, you mean the system tts speech on say Windows and macOS?

    • @levieux1137
      @levieux1137 5 months ago +2

      @@JohnSmith762A11B not necessarily that, but I'm speaking about the underlying components that are used here. In fact if you look, this is essentially Python code built as a wrapper on top of other parts that already run natively. The llama.cpp server for example is used here apparently. And once wrapped into layers and layers, you see that it becomes heavy to transport contents from one layer to another (particularly when passing via files, but even memcpy is expensive). It might even be possible that some elements are re-loaded from scratch and re-initialized after each sentence. The Python script here appears to be mostly a wrapper around all such components, working like a shell script recording input from the microphone to a file then sending it to OpenVoice, then sending that output to a file, then loading another component with that file, etc... This is just like a shell script working with files and heavy initialization at every step. Dropping all that layer and directly using the native APIs of the various libs and components would be way more efficient. And it's very possible that past a point the author will discover that Python is not needed at all, which could suddenly offer more possibilities for lighter embedded processing.

  • @squiddymute
    @squiddymute 5 months ago +1

    no api = pure genius

  • @MegaMijit
    @MegaMijit 4 months ago

    This is awesome, but the voice could use some fine-tuning to sound more realistic.

  • @ProjCRys
    @ProjCRys 5 months ago +1

    Nice! I was about to create something like this for myself, but I still couldn't use OpenVoice because I keep failing to run it in my venv instead of conda.

    • @Zvezdan88
      @Zvezdan88 5 months ago

      How do you even install OpenVoice?

  • @SaveTheHuman5
    @SaveTheHuman5 5 months ago +5

    Hello, can you please tell us what your CPU, GPU, RAM etc. are?

  • @duffy666
    @duffy666 2 months ago

    I really like it! Is this already on GitHub for members (could not find it)?

  • @LadyTink
    @LadyTink 4 months ago

    Kinda feels like something the "rabbit R1" does
    with the whole fast speech to speech thing

  • @jacoballessio5706
    @jacoballessio5706 5 months ago

    I wonder if you could directly convert embeddings to speech to skip text inference

  • @mastershake2782
    @mastershake2782 5 months ago

    I am trying to clone a voice from a reference audio file, but despite following the standard process, the output doesn't seem to change according to the reference. When I change the reference audio to a different file, there's no noticeable change in the voice characteristics of the output. The script successfully extracts the tone color embeddings, but the conversion process doesn't seem to reflect these in the final output. I'm using the demo reference audio provided by OpenVoice (male voice), but the output synthesized speech remains in a female voice, typical of the base speaker model. I've double-checked the script, model checkpoints, and audio file paths, but the issue persists. If anyone has encountered a similar problem or has suggestions on what might be going wrong, I would greatly appreciate your insights. Thank you in advance!

  • @JG27Korny
    @JG27Korny 5 months ago

    I run Oobabooga with Silero plus Whisper, but those take forever to make voice from text, especially Silero.

  • @skullseason1
    @skullseason1 4 months ago

    How can I do this with the Apple M1? This is soooo awesome, I need to figure it out!

  • @fatsacktony1
    @fatsacktony1 4 months ago

    Could you get it to read information and context from a video game, like X4: Foundations, so that you could ask it like a personal assistant to help you manage your space empire?

  • @Ms.Robot.
    @Ms.Robot. 4 months ago

    ❤❤❤🎉 nice

  • @weisland2807
    @weisland2807 4 months ago

    Would be funny if you had this in games, like the people on the streets of GTA having convos fueled by something like this. Maybe it's already happening though, I'm not in the know. Awesomesauce!

  • @TanvirsTechTalk
    @TanvirsTechTalk 26 days ago

    How did you actually set it up?

  • @kleber1983
    @kleber1983 3 months ago +1

    Hi, I'd like to know the computer specs required to run your speech to speech system. I'm quite interested, but I first need to know if my computer can handle it. Thanks.

  • @inLofiLife
    @inLofiLife 5 months ago

    looks interesting but where is this community link you mentioned? :)

  • @irraz1
    @irraz1 2 months ago +1

    wow! I would love to have such an assistant to practice languages. The “python hub” code, do you plan to share it at some point?

  • @googlenutzer3384
    @googlenutzer3384 4 months ago

    Is it also possible to adjust to different languages?

  • @SonGoku-pc7jl
    @SonGoku-pc7jl 5 months ago

    Thanks, good project. Can Whisper translate my Spanish to English and back to Spanish directly with little change in the code? And do I need to change something in the TTS as well? Thanks!

  • @binthem7997
    @binthem7997 5 months ago

    Great tutorial but I wish you could share gists or share your code

  • @Yossisinterests-hq2qq
    @Yossisinterests-hq2qq 5 months ago +1

    Hi, I don't have talk.py; is there another way of running it I'm missing?

  • @microponics2695
    @microponics2695 5 months ago +1

    I have the same uncensored model, and when I ask it to list curse words it says it can't do that. ???

    • @jungen1093
      @jungen1093 4 months ago

      Lmao that’s annoying

  • @tijendersingh5363
    @tijendersingh5363 5 months ago

    Just wao

  • @fire17102
    @fire17102 5 months ago +2

    Would love to see some realtime animations to go with the voice, could be a face, but also can be minimalistic (like the R1 rabbit).

    • @wurstelei1356
      @wurstelei1356 5 months ago

      You need a second GPU for this. Let's say you put Stable Diffusion on it. Displaying a robot face with emotions would be nice.

    • @leucome
      @leucome 5 months ago

      Try Amica AI. It has a VRM 3D/VTuber character and multiple options for the voice and the LLM backend.

    • @fire17102
      @fire17102 3 months ago

      @@leucome does it work locally in real time?

    • @fire17102
      @fire17102 3 months ago

      @@wurstelei1356 Again, I think a minimalistic animation would also do the trick, or prerendering the images once and using them in the appropriate sequence in realtime.

    • @leucome
      @leucome 3 months ago +1

      @@fire17102 Yes, it can work in real-time locally as long as the GPU is fast and has enough VRAM to run the AI + voice. It can also connect to an online service if required. I uploaded a video where I play Minecraft and talk to the AI at the same time with all the components running on a single GPU.

  • @JohnGallie
    @JohnGallie 4 months ago +1

    Is there any way you can give the Python process 90% of system resources so it would be faster?

  • @khajask8113
    @khajask8113 1 month ago

    Hindi and Telugu language support?

  • @alexander191297
    @alexander191297 5 months ago +1

    I swear on my mother’s grave lol… this AI is hilarious! 😂😂😂

  • @mertgundogdu211
    @mertgundogdu211 2 months ago +1

    How can I try this on my computer? I couldn't find talk.py in the GitHub code.

  • @ExploreTogetherYT
    @ExploreTogetherYT 4 months ago

    How much RAM do you need to run Mistral 7B locally? Using GPU or CPU?

  • @MrScoffins
    @MrScoffins 5 months ago +2

    So if you disconnect your computer from the Internet, will it still work?

    • @jephbennett
      @jephbennett 4 months ago +1

      Yes, this code package is not calling APIs (which is why the latency is low), so it doesn't need an internet connection. The downside is it cannot access info outside of its core dataset, so no current events or anything like that.

  • @darik31
    @darik31 1 month ago

    Thanks for sharing this mate! I wonder if the code is available somewhere? If so, could you please provide a link? Thanks

  • @Stockholm_Syndrome
    @Stockholm_Syndrome 5 months ago

    BRUTAL! hahaha

  • @ayatawan123
    @ayatawan123 4 months ago

    This made me laugh so hard!

  • @matthewfuller9760
    @matthewfuller9760 3 months ago

    I think at even 1/3rd the speed with my rtx titan it would run just fine to learn a new language. Waiting 3 seconds is perfectly acceptable as a novice language learner.

  • @aboudezoa
    @aboudezoa 5 months ago

    Running on 4080 🤣 makes sense the damn thing is very fast

  • @NirmalEleQtra
    @NirmalEleQtra 1 month ago

    Where can I find the whole GitHub repo?

  • @kumar.jayanti9700
    @kumar.jayanti9700 1 month ago

    Hi Kris, where is the GitHub code for this one? I could not locate it in the member GitHub.

  • @_-JR01
    @_-JR01 4 months ago

    Does OpenVoice perform better than Whisper's TTS?

  • @OdikisOdikis
    @OdikisOdikis 5 months ago

    The predefined answer timing is what makes it not a real conversation. It should answer questions at random timings, like any human who thinks of something and only then answers. Randomizing timings would create more realistic conversations.

  • @64jcl
    @64jcl 4 months ago

    Surely the response time is a function of what rig you are doing this on - an RTX 4080 as you have is no doubt a major contributor here, and I would guess you have a beast of a CPU and high speed memory on a newer motherboard.

  • @aestendrela
    @aestendrela 5 months ago +2

    It would be interesting to make a real-time translator. I think it could be very useful. The language barrier would end.

    • @deltaxcd
      @deltaxcd 4 months ago

      Meta did it already; they created a speech-to-speech translation model.

  • @suminlee6576
    @suminlee6576 4 months ago

    Do you have a video showing how to do this step by step? I was going to become a paid member, but I couldn't see the how-to video in your paid channel.

  • @TheDailyMemesShow
    @TheDailyMemesShow 2 days ago

    OMG, I just noticed I've watched a gazillion videos of yours.
    Why haven't I subscribed, though?
    I swear I thought I had done it before?
    Something's not adding up here...

  • @mickelodiansurname9578
    @mickelodiansurname9578 5 months ago

    Can the LLM handle being told in a system prompt that it will be taking in the sentences in small chunks, say cut up into 2-second audio chunks per transcript? Can the Mistral model do that? Anyway, if so you might even be able to get it to 'butt in' to your prompt. Now that's low latency!

    • @deltaxcd
      @deltaxcd 4 months ago

      No, it can't be told that, but it is not necessary.
      Just feed it the chunk, and if the user speaks before it manages to reply, restart and feed it more.

  • @musumo1908
    @musumo1908 5 months ago

    Hey cool… any way to run this self-hosted for an online speech to speech setup? I want to drop this into a chatbot project… what membership level gives access to the code? Thanks.

  • @deltaxcd
    @deltaxcd 4 months ago

    I think to decrease latency more you need to make it speak before the AI finishes its sentence.
    Unfortunately there is no obvious way to feed it a partial prompt, but waiting until it finishes generating the reply takes way too long.

  • @TheDailyMemesShow
    @TheDailyMemesShow 2 days ago

    Would this work on the cloud? If so, how?

  • @JohnGallie
    @JohnGallie 4 months ago

    you need to get out more man lol. that was toooo much!

  • @VitorioMiguel
    @VitorioMiguel 5 months ago

    Try faster-whisper. Open source and faster.

  • @tag_of_frank
    @tag_of_frank 4 months ago

    Why LM Studio over Oobabooga? What are the pros/cons of them? I have been using Ooba, but wondering why one might switch.

  • @BrutalStrike2
    @BrutalStrike2 5 months ago +1

    Jumanji Alan

  • @ajayjasperj
    @ajayjasperj 5 months ago

    We can make YouTube content with those conversations between bots 😂❤

  • @ArnaudMEURET
    @ArnaudMEURET 5 months ago

    Just to paraphrase your models: “Dude ! Are you actually grabbing the gorram scrollbars to scroll down an effing window !? What is this? 1996 ? Ever heard of a mouse wheel? You know it’s even emulated by double drag on track pads, right?” 🤘

  • @Edward_ZS
    @Edward_ZS 5 months ago

    I don't see Dan.mp3

  • @MetaphoricMinds
    @MetaphoricMinds 4 months ago +1

    What GPU are you running?

    • @AllAboutAI
      @AllAboutAI 4 months ago +1

      4080 RTX!

  • @TheRottweiler_Gemii
    @TheRottweiler_Gemii 1 month ago

    Anybody done with this and have code or a link they can share, please?

  • @picricket712
    @picricket712 2 months ago

    Can someone please give me the source code?

  • @Nursultan_karazhigit
    @Nursultan_karazhigit 5 months ago +1

    Thanks. Is the Whisper API free?

    • @m0nxt3r
      @m0nxt3r 2 months ago

      it's open source

  • @witext
    @witext 3 months ago

    I look forward to an actual speech-to-speech LLM, without any speech-to-text translation layers; pure speech in and speech out. It would be revolutionary imo.

  • @MetaphoricMinds
    @MetaphoricMinds 4 months ago

    Dude just made a JARVIS embryo.

  • @jeffsmith9384
    @jeffsmith9384 5 months ago

    I would like to see how a chat room full of different models would problem solve... ChatGPT + Claude + * 7B + Grok + Bard... all in a room, trying to decide what you should have for lunch

  • @DihelsonMendonca
    @DihelsonMendonca 6 hours ago

    Too complex for the average guy. We need a ready LLM with easy voice options on LM Studio.

  • @mickelodiansurname9578
    @mickelodiansurname9578 5 months ago +1

    AI: "We got some rich investors on board dude, and they're willing to back us up!"
    I think this script just announced the games commencing in the 2024 US election... [not in the US so reaches for popcorn]

  • @artisalva
    @artisalva 4 months ago

    Haha, AI conversations could have their own channels

  • @wurstelei1356
    @wurstelei1356 5 months ago +4

    Sadly this video has fewer hits than it should have. I am looking forward to a more automated version of this. Hopefully the low view count won't hinder it.

  • @jerryqueen6755
    @jerryqueen6755 3 months ago +1

    How can I install this on my PC? I am a member of the channel

    • @AllAboutAI
      @AllAboutAI 3 months ago

      did you get the gh invite?

    • @jerryqueen6755
      @jerryqueen6755 3 months ago

      @@AllAboutAI yes, thanks

    • @miaohf
      @miaohf 2 months ago

      @@AllAboutAI I am a member of the channel too; how do I get the gh invite?

  • @robertgoldbornatyout
    @robertgoldbornatyout 3 months ago

    Could make for some interesting Biden v Trump debates

  • @calvinwayne3017
    @calvinwayne3017 5 months ago

    Now add a MetaHuman and Audio2Face :)