I keep reading people crapping on the new advanced voice, but I am just absolutely blown away by it. The fact that it was able to solve any of those problems just from being told them verbally, and even threw in some comedy by doing it in a New York accent... seriously, I still can't believe we actually have this technology right now. When you take a step back and truly try to grasp what this thing is doing, at least for me, my jaw just drops to the ground. Great videos, by the way.
Agreed
The NY accent was OK. I'm a New Yorker and it didn't quite nail it, but I'm just being picky; what really matters is that it actually solved it! Through verbal instruction! Amazing! People who crap on it will keep pushing the goalposts further and further as it comes closer and closer to AGI; that just shows the level of fear and threat they feel from this technology.
I understand that people have high expectations because of the demos, but those were mostly for stuff without real value (for me at least), things like moaning, or telling stories with very specific or invented accents or languages.
Humans are just endlessly ungrateful spoiled brats. That's never going to change, unfortunately. We may have god-level tech, but we'll always be greedy for more.
100% agree.
All the people who complain about anything about Advanced Voice Mode should stop everything they're doing right now and watch Louis CK's bit: "Everything is amazing and nobody is happy."
Finally a demo different from "count from 0 to 100 as fast as you can". Thanks
Hey Kyle. Fun video. Just wanted to comment again that the advanced model is not actually transcribing to text, but is taking the audio as input directly. Transcribing to text is what the old version was doing.
In a way this actually makes it even more impressive in my opinion, and it would be interesting to try to figure out whether there is any distinction between its knowledge and understanding in the audio modality vs. the text modality. The best case is that it has been able to generalize and internally project both to the same semantic representation, but in practice I would guess it ends up not being as good at things like math (which seems to be the case). With the text modality there is a discrete and unambiguous representation available to it in its context, whereas with audio this is not the case.
Thanks so much!
I think OpenAI needs to hire you as an employee to create your own dataset to train GPT and improve accuracy on astronomy tasks.
You are doing a great job, keep going 💯
Such fun. I had no problem getting her to speak like an angry Australian, but when I asked it to sound like an angry Italian, she said she won't do stereotypes. As someone half Italian, I think I resent that!
That is interesting... perhaps in a few more weeks they'll nerf it to stop producing the angry Australian tone. It's a shame big businesses always have to conform to political correctness/wokeness.
Watching you interrupt it politely and then progress to frustration is 😂
Regarding the issue you were having with the integration problem: as far as I know, advanced voice mode is using 4o as a basis (not o1). This means advanced voice mode suffers from the same historical encoder-related issues, i.e., the joke about LLMs not knowing how many R's are in "strawberry" still applies. This is a well-understood problem that is solved in o1. Give it time :)
Yep
I meant to say tokenizer, not encoder.
For what it's worth, o1 still uses a similar tokeniser (as far as we know) and still has frequent issues with Strawberry r's...
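To make the tokenizer point above concrete, here is a minimal sketch using the open-source tiktoken library. Assuming the GPT-4-era "cl100k_base" encoding purely for illustration (the tokenizer actually behind advanced voice mode isn't public), it shows why letter-counting questions are awkward for these models:

```python
# Minimal sketch: why counting letters trips up LLMs.
# The choice of "cl100k_base" is an assumption for illustration only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")

# The model sees a few multi-character chunks, not ten individual letters,
# so "how many r's?" has to be inferred rather than read off directly.
print(tokens)                              # a short list of integer token ids
print([enc.decode([t]) for t in tokens])   # e.g. chunks like ['str', 'aw', 'berry']
```

With audio input the representation is even further removed from individual characters, which fits the thread's point that the text modality at least gives the model a discrete, unambiguous view of the problem.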
A couple of points to consider:
1. Voice input for these types of problems becomes a nuisance for humans and a source of noise for LLMs.
2. Some people are quick to suggest that certain problems might have been present in the training data. While this may be true, I always think that the training process is geared more towards creating accurate representations than simply memorizing answers.
However, you provided an example where the steps toward the solution were inaccurate, and the model still arrived at the correct answer. In another example, it read a bunch of fractions that weren't actually there, but got the rest of the problem right. This makes me think that there might be some level of memorization at play here, especially since, as far as I know, these models process audio input in an end-to-end manner.
Holy shit. That relativity problem is handled superbly well. It's not just about multiplication and division, but also exponents.
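For context, the kind of special-relativity arithmetic being praised here typically runs through the Lorentz factor, gamma = 1/sqrt(1 - v^2/c^2). The exact problem from the video isn't quoted in the thread, so the numbers below are purely illustrative:

```python
import math

# Hypothetical example of the arithmetic involved: time dilation at 0.8c.
# The specific values are illustrative, not the problem from the video.
v_over_c = 0.8
gamma = 1.0 / math.sqrt(1.0 - v_over_c ** 2)   # Lorentz factor, ~1.667

proper_time = 10.0                   # seconds elapsed in the moving frame
dilated_time = gamma * proper_time   # ~16.67 seconds in the lab frame
print(f"gamma = {gamma:.3f}, dilated time = {dilated_time:.2f} s")
```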
Here are a few different tests the user could try with OpenAI's advanced voice mode:
Differential Equations: Present a basic first-order or second-order differential equation and ask for the general solution or a particular solution given initial conditions.
Optimization Problem: Pose a multivariable calculus problem, such as finding the local minima or maxima of a function using partial derivatives or Lagrange multipliers.
Physics Kinematics: Give a scenario involving an object under projectile motion with initial velocity, angle, and gravitational force, and ask for time of flight or maximum height.
Logic Puzzles: Present a complex logic problem (e.g., involving truth-tellers and liars) and ask the AI to reason through and provide a solution.
Chemistry Stoichiometry: Give a balanced chemical equation and ask the AI to calculate the number of moles or mass of a product formed from given reactants.
These tests would assess a broader range of STEM and reasoning capabilities. (A quick way to spot-check the model's spoken answers programmatically is sketched below.)
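As referenced above, one way to verify the answers it gives verbally is to recompute them symbolically. A minimal sketch for the differential-equation suggestion, assuming Python with sympy installed and using y'' + y = 0 as an arbitrary illustrative ODE:

```python
# Minimal sketch: check a spoken answer to the differential-equation test
# against sympy's symbolic solver. The ODE chosen is illustrative only.
import sympy as sp

t = sp.symbols('t')
y = sp.Function('y')

ode = sp.Eq(y(t).diff(t, 2) + y(t), 0)
general_solution = sp.dsolve(ode, y(t))
print(general_solution)   # Eq(y(t), C1*sin(t) + C2*cos(t))
```

The same pattern (state the problem verbally, then recompute with sympy or a CAS) works for the optimization, kinematics, and stoichiometry suggestions too.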
So this is GPT-4o, not o1, so it will be very interesting to see where this goes when we have access to o1.
We are only at the very beginning... just imagine what AI will be like 10 years from now.
Trivial insight
10 years with double exponential growth will be crazy
Just think, this is the worst it will ever be, and it will just improve from here.
Can it take a photo upload of the circuit diagram via vision and solve it properly?
It can; however, OpenAI hasn't released that feature yet.
@@pigeon_official ok
6940 lol why is it getting stuck there?
That’s what I wanted to know
I think there is some problem where the speech is converted to tokens
o1-mini was doing the same thing to me - ignoring my instructions after I corrected it
Interesting. Still more work to do.
he starts to think that humans are stupid and ignores them.
@@programmingpillars6805 'He'? I don't think it thinks it's a he.
@@lanceguilin In today's world it's just a matter of time till you must say "he" or "she" to AI LLMs.
Very good!
The fact that it missed the two resistors being in series rather than parallel may have been a programmed error to make it seem more human.
So this is using GPT-4o, right? So no chain of thought?
Right, advanced voice mode so far only works with 4o.
@@mgscheue Next step for Advanced Voice Mode: to learn to pause for thinking while routing the advanced query to o1-mini (or even the full o1), then integrate the result from that call into its response.
Remember that the new voice mode only uses GPT-4o, so it won't be as smart as o1 outside the new voice mode.
Thanks
so coool
This was hilarious. 😂🤣
God damn!😮
And it is all matrix operations and algorithms run fast!
I'm impressed about that too!
🇧🇷🇧🇷🇧🇷🇧🇷👏🏻, This is amazing!
3:48
I was even more persistent, but with textual input, prompting it to program in Wolfram Language, and at least 80%-90% of my attempts were unsuccessful. Not sure how it can solve tasks at PhD-student level 😕
The model used in the video is 4o, not the one that solves PhD problems
The AI is hard of hearing
Try the integral problem again with a different number. I have a weird suspicion that the issue might actually be a result of them trying to stop it from engaging in both 69 and 420 jokes...
I tried it without the 69420 and it still did not do it, unfortunately.
Copilot had no problems with the integral. You can see that if they integrated the chat capabilities with collaborative software, it would be much more powerful. I imagine this is what Khan Academy is doing to create an online tutor.
th-cam.com/video/_nSmkyDNulk/w-d-xo.html
Their text-to-speech team deserves much more gratitude than their LLM scammers.
It's not text-to-speech.
The audio file, like the actual .wav file (not sure what kind of audio file), is being tokenized; the waveform itself is being tokenized, and the AI is generating audio tokens back. No text involved.
Bruh...😆
This is not text-to-speech; it's a natively multimodal model, so it's voice-to-voice. It hears what you're actually saying. It's not transcribing your voice to text.
Do you want me to take my gratitude back then?
I didn't find confirmed information that they are using a purely end-to-end approach to produce the sound wave.
They definitely have some sort of tokenization, and until I see the implementation, or at least a paper with a detailed model architecture, I will assume that they use text-to-speech, because they already have a tremendously large LLM. Why not just produce text with this model? You don't need a wave-to-wave model to generate speech of this quality.
I may be wrong, but I trust myself more than I trust OpenAI, sorry. In any case, my point was completely tangential to this discussion.
@@bashbarash1148 Go watch the GPT-4o launch live stream from 4 months ago. They talk about how the old voice mode used text-to-speech, which introduced a lot of latency, but 4o is multimodal and reasons natively in text, speech, and vision.
"They already have a tremendously large LLM. Why not just produce text?" Because 4o was actually trained on text, video, and audio; it's a fully multimodal model, it's just that up to this point text has been the only available input. Now with advanced voice mode straight audio is available as an input, and once vision gets integrated into advanced voice mode, straight images will be allowed as an input.
Yes, there is tokenization, but the audio is being tokenized, those tokens are being fed in, and audio tokens are being produced; there is no textual middleman. This is provably true from the fact that advanced voice mode can literally steal your voice. I've had it actually start to respond for me in my own voice, which is creepy btw, but that wouldn't be possible with speech-to-text; it's only possible if my voice, with its timbre, affectation, etc., is being tokenized.
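To make the "no textual middleman" point concrete, here is a purely conceptual sketch contrasting the old cascaded pipeline with an end-to-end audio-token model. Every function here is a stand-in stub invented for illustration, not OpenAI's actual API:

```python
# Conceptual sketch only: old cascaded voice pipeline vs. end-to-end audio tokens.
# All functions below are hypothetical stubs, not a real API.

def speech_to_text(wav):   return "transcribed text"          # stand-in ASR
def llm_generate(text):    return "text reply"                # stand-in text-only LLM
def text_to_speech(text):  return b"synthesized waveform"     # stand-in TTS

def audio_tokenizer(wav):      return [101, 102, 103]         # waveform -> discrete audio tokens
def multimodal_model(tokens):  return [201, 202, 203]         # audio tokens in, audio tokens out
def audio_detokenizer(tokens): return b"generated waveform"   # audio tokens -> waveform

def old_voice_mode(input_wav):
    # Cascade: ASR -> text LLM -> TTS. Adds latency, and tone/timbre are lost
    # at the transcription step.
    return text_to_speech(llm_generate(speech_to_text(input_wav)))

def advanced_voice_mode(input_wav):
    # End-to-end: the waveform itself is tokenized and the model emits audio
    # tokens directly, with no text in between.
    return audio_detokenizer(multimodal_model(audio_tokenizer(input_wav)))
```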
😆🤣😆🤣😆🤣😆🤣😆🤣😆😂
Been waiting for this mode and then the EU bans it. So dumb, Jesus.
I have to say this is painful at times to listen to - you are annoying her lol
That accent bro why
🤣🤣🤣
Might have to show this to some of my anti-AI copium frens. Can't wait to see AI take everything. Fingers crossed we're allowed to integrate into the AI god.
I've been doing a role play with it since yesterday by having it do a countdown with different emotions in different scenarios. I am blown away and often shocked at how realistic its portrayal is, especially the crying voice. It's incredible.