My word, I can't tell you how much I now look forward to your videos! Keep up the great work!
@TestTalk Thank you so much for the kind words, hopefully many more coming in the weeks and months ahead :)
Windows user here. I'm not sure if you mentioned it in your article, but I had to download SoX and then edit my environment variables. Not sure if that helps or not, but figured I would share and help the YT algorithm for you. @ralfelfving
Immense value bro thanks for the informative videos!
Glad it helps, thanks for the comment! ♥️
Just tried this, works great, thanks and I liked it too!
Pricing:
Google
Transcription: $0.024 / minute
TTS: $0.016 / 1K characters
OpenAI
Whisper: $0.006 / minute
TTS: $0.015 / 1K characters
TTS HD: $0.030 / 1K characters
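A rough back-of-the-envelope comparison using those per-unit prices, assuming a one-minute spoken question and a roughly 500-character spoken answer (GPT token costs not included):

```js
// Back-of-the-envelope cost per exchange, using the per-unit prices listed above.
// Assumes a 1-minute question and a ~500-character answer; GPT tokens are extra.
const openaiCost = 1 * 0.006 + (500 / 1000) * 0.015; // Whisper + TTS ≈ $0.0135
const googleCost = 1 * 0.024 + (500 / 1000) * 0.016; // STT + TTS ≈ $0.032
console.log({ openaiCost, googleCost });
```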
Or you could use one of the non-GPT alternatives and run it locally or on your own server.
Good idea, which alternatives? Whisper for speech-to-text and Llama to answer? @dorg9502
@dorg9502 I don't have an Nvidia GPU, and I'm not planning to buy one.
Fantastic! Thanks so much for sharing, this is exactly what I was looking to do.
Great, that's what my tutorials are for! :)
@ralfelfving Love it! Open source, baby, yeah!!
Hi, great video, well above my level, but I have a quick question: could you actually have a 'meaningful' conversation with it as you would with ChatGPT?
Yes, it's OpenAI's GPT models under the hood of both, so they'd be very similar.
OK great, thanks for the information. I'm trying to work out how to put this tech into an app, so this could be the way. Many thanks and good luck with the channel.
As I understand it, this is connected to the general GPT-3.5 model, not to a customized API Assistant? It would be cool to create the same voice-input/voice-output flow but with your own customized assistant, similar to what they did during the DevDay presentation :)
The GPT model you choose to use is just an API call; you can switch it out for whichever model you prefer by changing the API call -- GPT-4, the Assistants API, a custom model running locally, and so on.
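To make that concrete, here is a minimal sketch assuming the official openai Node SDK: swapping models is just a matter of changing the model string in the chat completions call.

```js
// Minimal sketch (assuming the official openai Node SDK): the model you talk to
// is just one field in the chat completions call, so swapping it is a one-line change.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function askModel(transcript, model = "gpt-3.5-turbo") {
  const completion = await openai.chat.completions.create({
    model, // e.g. "gpt-4" -- or any other chat model your account can use
    messages: [{ role: "user", content: transcript }],
  });
  return completion.choices[0].message.content;
}
```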
@ralfelfving Yep, but calling the Assistants API seems trickier as it does not support streaming as of now.
Can I use my own data for TTS?
Great work, thank you.
I have a question: if I want, for example, 3999 characters of text recited and saved to MP3 in a given language, how does that work?
Awesome, can't wait to try it. Too bad GPT is all jacked up lately. How would one do this using a wake word or some other trigger to get the program's attention?
I'm not sure about wake words, because you'd need a process to listen at all times and recognize a word. A simpler approach would probably be a keyboard shortcut, which you could do if you packaged it with e.g. Electron.
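For the keyboard-shortcut route, here is a rough sketch assuming an Electron wrapper; startVoiceInteraction() is a hypothetical function standing in for the record-transcribe-respond loop from the tutorial.

```js
// Rough sketch, assuming the app is packaged with Electron: a global keyboard
// shortcut triggers one voice interaction instead of an always-listening wake word.
// startVoiceInteraction() is a hypothetical wrapper around the tutorial's flow.
const { app, globalShortcut } = require("electron");

app.whenReady().then(() => {
  globalShortcut.register("CommandOrControl+Shift+Space", () => {
    startVoiceInteraction(); // hypothetical: record mic, Whisper, GPT, TTS playback
  });
});

app.on("will-quit", () => {
  globalShortcut.unregisterAll(); // clean up the shortcut on exit
});
```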
@ralfelfving I guess leaving the mic open would work, but you would be paying for the API for everything it processes. Maybe an open model running locally just to listen for the wake word, and then it's passed to your OpenAI API call?
I just found your content and am glad you are making tutorials on this. Have you been able to mitigate the latency?
Which latency are you thinking of?
For example, when someone responds, it takes generation time for the API requests to get the proper info, generate the text and then the speech, so there is a 5-10 second lag in response time. I'm trying to figure out a way to make it respond faster.
If I remember correctly, the way I set it up in this tutorial is the fastest currently possible with OpenAI. You have these processing components:
1. The person speaks for 10 seconds.
2. Send the audio to Whisper.
3. Whisper processes the audio and responds with a transcript.
4. Send the transcript to GPTx (I used 3.5 turbo).
5. GPTx processes it and returns a response.
6. Send the response to TTS.
7. TTS responds with audio, which is played back to the user.
In steps 1 and 2 you could technically stream chunks of audio and get them transcribed as the user speaks, such that much of the transcription is done by the time the user has stopped talking, and then join that all together for step 4.
Step 4 has to happen after all of steps 1-3 have completed. For GPTx to give you a useful answer, it needs to receive the full question from the user.
Step 5 supports streaming output, but IIRC step 6 doesn't support streaming input (yet). That means that as of today, you have to wait for GPTx to give you the entire output before you can process the TTS response. You could look into something similar to what I mentioned above: chunk up GPTx responses into sentences and get TTS to generate the audio piece by piece (see the rough sketch below). The TTS response itself is streamed in my script, so it will start playing when it has the first few words.
The only clear handover point where the full information is needed is between steps 3 and 4; the rest is solvable -- and OpenAI will make it better over time.
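A rough sketch of that sentence-chunking idea, assuming the official openai Node SDK with streaming chat completions; speakWithTTS() is a hypothetical wrapper around the TTS-and-playback step, not something from the actual tutorial script.

```js
// Rough sketch, not the tutorial's code: stream GPT output and hand each
// completed sentence to TTS so audio can start before the full answer exists.
// speakWithTTS() is a hypothetical helper for the TTS + playback step.
import OpenAI from "openai";

const openai = new OpenAI();

async function streamAnswerToSpeech(transcript) {
  const stream = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: transcript }],
    stream: true,
  });

  let buffer = "";
  for await (const chunk of stream) {
    buffer += chunk.choices[0]?.delta?.content ?? "";
    // Flush whenever a full sentence has arrived.
    const sentence = buffer.match(/[^.!?]*[.!?]\s*/);
    if (sentence) {
      await speakWithTTS(sentence[0]);
      buffer = buffer.slice(sentence[0].length);
    }
  }
  if (buffer.trim()) await speakWithTTS(buffer); // speak whatever is left over
}
```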
@ralfelfving Yeah, I've considered chunking it in bits, but it's possible the responses would be inaccurate without the full scope and context of what is being said.
It's helpful that you've mentioned this about step 4.
This is a wildly helpful answer. I so appreciate it!
Can we do the speech-to-text part with Whisper from OpenAI but get the actual response from some other model, like Gemini or any other local model endpoint other than ChatGPT?
Yeah, just chain in another API call.
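As a rough sketch of that chaining, assuming a local server that exposes an OpenAI-compatible chat endpoint (the URL and model name below are placeholders); Whisper would still handle speech-to-text through the regular OpenAI client.

```js
// Rough sketch: keep Whisper for transcription, but send the transcript to a
// different backend. Assumes a local OpenAI-compatible server; the URL and
// model name are placeholders -- adjust for whatever you actually run.
import OpenAI from "openai";

const localClient = new OpenAI({
  baseURL: "http://localhost:11434/v1", // assumed local OpenAI-compatible endpoint
  apiKey: "not-needed-locally",
});

async function answerLocally(transcript) {
  const completion = await localClient.chat.completions.create({
    model: "llama3", // whatever model the local server actually serves
    messages: [{ role: "user", content: transcript }],
  });
  return completion.choices[0].message.content;
}
```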
Hello, how can I attach the audio to an assistant using thread messages? Thank you.
Excellent content... I'm also having an issue with 'npm install speaker'. Rosetta didn't seem to help. Any other ideas? Without speaker, the app otherwise seems to work but fails after hitting 'enter'.
Thanks. I think I forgot to mention it in the blog post because it's not an npm package -- but did you get prompted to install SoX (Sound eXchange)? It would be done using Homebrew.
@ralfelfving SoX is installed but doesn't seem to make a difference (gyp is not happy, lol). It seems to be a common problem, but it also appears unfixed in the community. I tried to edit 'node-gypi' with the proper MACOSX version, to no avail. Here is the log if you are interested: drive.google.com/file/d/1_aNOfPjiAfIBqf2KvUHUVx-Hd9JJu6lJ/view?usp=share_link
Hey, great vid. Any way to add TTS as a function to the new GPT-4 preview OpenAI assistant? Thx
I don't understand your question, can you describe it in an example?
@ralfelfving Hey, my reply seems to have gone missing? Let me rephrase. I was hoping to use TTS with my OpenAI assistant that uses the new GPT-4 preview (the Assistants post-06/11/23). What's the best way to integrate this? So basically I want a talking OpenAI assistant…
Are there ways to tweak the output in terms of pacing and vocal intensity?
No, not with OpenAI TTS right now/yet. The only option with that API is the speed of the audio in the file, but that's not pacing/vocal intensity.
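For reference, a minimal sketch of that one knob, assuming the official openai Node SDK; speed only changes playback rate, not pacing or emphasis.

```js
// Minimal sketch (assuming the official openai Node SDK): `speed` is the only
// delivery control OpenAI TTS exposes today; it changes rate, not intonation.
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI();

async function speak(text) {
  const response = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: text,
    speed: 1.25, // 0.25-4.0; faster/slower playback only
  });
  fs.writeFileSync("speech.mp3", Buffer.from(await response.arrayBuffer()));
}
```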
Is there a way to customise the voice?
Hey, can I add this to a UI, and how can I do that? Can you advise me, please? Thank you.
Unfortunately I got stuck with an error:
Press Enter when you're ready to start speaking.
Recording... Press Enter to stop
Recording stopped, processing audio...
Error: 400 - Bad Request
Console log the API inputs before the call, and the errors from the API call, in the terminal to find out what's causing the 400. I suspect the root cause is that you're not appending an audio file because the app doesn't have access to the microphone, or that the microphone source is incorrect and you're sending a silent file.
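A minimal debugging sketch for that, assuming the recording is saved as output.wav and the Whisper call uses the official openai Node SDK; the file-size check catches the empty-recording case mentioned below.

```js
// Minimal debugging sketch: inspect the recorded file before the Whisper call
// and log the full API error. A 0-byte (or tiny) file means nothing was recorded.
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI();

async function transcribe(path = "output.wav") {
  console.log("Audio file size (bytes):", fs.statSync(path).size);
  try {
    const transcription = await openai.audio.transcriptions.create({
      file: fs.createReadStream(path),
      model: "whisper-1",
    });
    return transcription.text;
  } catch (err) {
    console.error("Whisper API error:", err); // the 400 body usually says what's wrong
    throw err;
  }
}
```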
I had the same issue. It was because nothing was getting recorded and the output.wav file was empty. On my Linux system I had to set the device to 'default' by changing the new Microphone line to: mic = new Microphone({device:'default'});
You need more subscribers, mate. 2.5k is a shame, to be honest, given the knowledge you are sharing. What is the YT algo up to?
Hi there, thanks for this great job.
Can you tell us how we can make this 2-in-1, meaning it also gives audio responses when users type their questions, not only when they speak them?
Thank you!
Edit:
Never mind, ChatGPT updated the code, and now it works via messages. Thanks.
Can I do the same using Python?
Yes, of course, you can find all the details on the OpenAI platform.
Absolutely. The OpenAI community has a lot of people building with Python, and sharing examples.
The question is: how do you install the speaker package?
Try running Terminal with Rosetta.
Same problem... Terminal with Rosetta didn't seem to help.
@ralfelfving I cannot understand your answer. What is the relation between installing the speaker package and running Terminal with Rosetta?!
@kamalkamals Some packages may only work / be compatible when running Terminal with Rosetta.
That's not a best practice, forcing the use of a specific terminal; you probably need to update your code :) @ralfelfving
Could someone pleaaaase tell me if they have successfully run this on Windows? I use VS Community 2022 and I constantly get dependency errors, e.g. for node-microphone.
I have the .js + .env files in the project, Node.js installed and configured for VS, and the ffmpeg path listed in the Windows environment variables.
Feels so stupid to be stuck at such a simple thing 😭
Someone commented on the linked Medium article that they got it working on Windows. Did you install the dependencies, like the microphone Node package?
@ralfelfving I ran them all and it said successful, like 25 dependencies. But when I ran app.js, it gave an error for the microphone. And when I ran the npm install for microphone, it gave like tons of errors 😕
You'd need to resolve the errors for the microphone npm install.
Love this! Thank you! How would I swap out OpenAI TTS for the ElevenLabs TTS model?
You'd just change the OpenAI TTS call to an ElevenLabs API call instead.
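Roughly, that swap could look like the sketch below. The endpoint path, header name, and body shape are assumptions based on ElevenLabs' public REST API, so double-check their docs; the voice ID and env var name are placeholders.

```js
// Rough sketch, not verified against current ElevenLabs docs: the endpoint,
// headers, and body fields are assumptions. Replace the OpenAI TTS call with
// this and save/play the returned audio bytes the same way as before.
import fs from "fs";

async function elevenLabsTTS(text, voiceId = "YOUR_VOICE_ID") {
  const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`, {
    method: "POST",
    headers: {
      "xi-api-key": process.env.ELEVENLABS_API_KEY, // placeholder env var name
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`ElevenLabs error: ${res.status}`);
  fs.writeFileSync("speech.mp3", Buffer.from(await res.arrayBuffer())); // MP3 bytes
}
```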
Can it be run on a Raspberry Pi 5?
I don't know
this is amazing...
Perfect