Grok-2 Actually Out, But What If It Were 10,000x the Size?

AI Explained

มุมมอง 80 482

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 3 ม.ค. 2025

ความคิดเห็น • 394

@noobicorn_gamer 4 หลายเดือนก่อน ⁺⁴¹⁶
My man casually includes potentially demonetizing images that other AI channels were afraid of including like it's just another Thursday AI video. You are unmatched in AI TH-cam content uploads. Been a fan since the beginning and we all appreciate your passion towards it. Kudos.
@WillyJunior 4 หลายเดือนก่อน ⁺⁷
Which images?
@CarlosMendezVientos 4 หลายเดือนก่อน ⁺⁹
@@WillyJuniorI think he's talking about SpongeBob and Mickey Mouse.
@ryzikx 4 หลายเดือนก่อน ⁺⁶
matt berman includes mpreg elon musk😂
@YouLoveMrFriendly 4 หลายเดือนก่อน ⁺³
He's virtue signaling by vilifying Trump. It's silly and sad.
@jan.tichavsky 4 หลายเดือนก่อน
@@YouLoveMrFriendly lol you snowflakes get offended if a single Trump image appears. Chill.
@jdtransformation 4 หลายเดือนก่อน ⁺⁴⁷
My god… your stuff is continually *SO* damn good! Amidst an ocean of BS vids on “AI news”, you offer real, actual, useful, intelligent content - again, and again, and again. Sometimes frustrated that weeks go by w/out a vid from your channel, but always refreshed by the quality of what you bring (especially vs the AI videos *made* by AI bots! 🤬) Thanks for the time you take and your commitment to quality 🙏 …it’s noticed and appreciated. (Now if only we could get the other 10,000 TH-cam content providers to notice…!)
@aiexplained-official 4 หลายเดือนก่อน ⁺⁷
Thanks jd. I hope I can be more frequent, especially Sept-Oct onwards when more models come out and actual progress gets released
@mshonle 4 หลายเดือนก่อน ⁺⁷⁵
I recommend you view the GPT-voice-chat-with-red-teamer original audio (e.g. in Audacity) as a spectrogram. It’s stereo audio, with the user on the left channel and the model on the right channel, so seeing both tracks on the spectrogram is helpful. It shows just how much background noise was on the users side. It’s also interesting because you can visualize the timbre of the woman’s voice (like what frequencies are strongest), and how it differs from the timbre of the synthesized male voice, and how the timbre change of the model does look more like the woman’s timbre.
Versions of Whisper that I’ve tried would often hallucinate tokens when there is silence (meaning there would need to be an audio threshold filter passed first, to clip out non-speech). I could see how the background noise in the weird chat audio might also lead to spurious tokens being generated.
What would be great to see is: a user is having a chat with a bot, but their dog keeps yapping in the background and the user periodically needs to shush the pup, and it happens enough times that the bot fabricates its own dog yapping that it also must quiet down.
@KurtWoloch 4 หลายเดือนก่อน ⁺⁷
I think it's something different. The model first gives an answer to the user, out of the perspective of the model, but then, at the point it cries "No", it actually continues the dialogue out of the perspective of the user, argumenting with the point of view the user gives. It's just continuing the dialogue, ignoring the fact that the user should say the user's part, not the model. And the user's part, as imagined by the model, logically, is also being said in the user's voice, at least as far as the model manages to imitate it. If you listen closely to what it says in the user's voice vs. before the "No", as long as it speaks in its own voice, it's pretty cautious and seems to try to find a polite answer that doesn't violate any guidelines, while when it talks as the user, it seems to be much more confident in what it says.
@jan.tichavsky 4 หลายเดือนก่อน ⁺²
@@KurtWolochThat makes sense, it just runs autocomplete based on previous chat. I guess it's easier to exploit over voice interface.
@YTLettersAZ 4 หลายเดือนก่อน ⁺³
@@KurtWoloch What I take from the interesting @mshonle observation is that maybe the model could generate some kind of "end-of-message" system token out of the noise. Similar to those "|end_header_id|" or "|eot_id|" from Llama.
@Ehrenoak 4 หลายเดือนก่อน ⁺⁹⁷
What I like about Simple Bench is that its ball-busting. Too many of the recent benchmarks start off at 75-80% on the current models. A bench that last year got 80% and now gets 90% is not as interesting anymore for these kind of bleeding edge discussions on progress. I like seeing benchmarks come out at 20% and go up to 40%, etc. That's where the leading edge is.
@aiexplained-official 4 หลายเดือนก่อน ⁺²⁷
And even rarer is to anchor it in human performance of 80-90%+. Easy to go esoteric and throw off models, harder to expose common sense faults
@RichardHarbridge 4 หลายเดือนก่อน
@@aiexplained-official The human performance insight is critical and a great area to expand potentially. I am sure you are already considering it but being able to rationalize different types instead of average human benchmarking with differently tuned questions in your simple bench would be such an excellent area of exploration and research as others could then learn and follow it. But then again it's an incredible amount of work to do what you are already doing - just excited by the way perspectives and slight approach changes can lead to interesting industry momentum.
@what_to_watch_today 4 หลายเดือนก่อน ⁺¹
Thanks for your videos, I rally like them but one thing, I think that top modela doesnt solve "Simple bench" because they havent seen or being trained for this type of questions, once the model is trained on this questions will be able tl solve. Also we have to think in the utility of this questions, whats the point with them ? Is not like they solve a real problem if the model Ia trained on them .. wdyt ?
@danagosh 4 หลายเดือนก่อน ⁺¹³⁹
I think Demis Hassabis is completely right, though. Short term it is overhyped but long term I don't think people are caring enough about it. I feel like a broken record on every one of your videos, but we really need to start preparing for an AGI world. No one really seems to care about it. The disconnect is likely that current AI models are being hyped up as being close to AGI and then when it falls way short of that everyone gets disappointed and stops caring. Yes, people need to have reasonable expectations of what models can do right now, but this tech is in its infancy. It's impossible to imagine where we'll be in 5 years.
@CYI3ERPUNK 4 หลายเดือนก่อน
yep, agreed ; this is natural selection at work , those who stay unaware/ignorant will be less prepared and unlikely to adapt in the future , thus they will be less competitive , this is the way of things , dinosaurs go extinct
@JavaFoxholery 4 หลายเดือนก่อน ⁺⁸
The singularity is near...
@Balorng 4 หลายเดือนก่อน
@@danagosh we might grow old and die before the AGI is reached, and in this case preparing for AGI is like preparing for the second coming of Christ. There were no shortage of those that sold all their belongings in preparation... Usually to profit of "less pious" ones. Admittedly, it is likely to come much earlier, but I'm sure that using attention+embeddings combo for AGI is just like trying to create a ballon out of lead - might be possible, but very, very hard. It just does not work well for "multilevel" abstractions.
@sergey_is_sergey 4 หลายเดือนก่อน ⁺⁷
Step One would be defining exactly what one means by "AGI".
@ThanosSofroniou 4 หลายเดือนก่อน ⁺²
You are absolutely correct in all of what you mentioned. I hope others really see and understand that. I have been saying the same thing.
@Steve-xh3by 4 หลายเดือนก่อน ⁺⁸¹
I think Ilya has made this point, but I agree with it. Intelligence is simply compression. Better compression is literally better prediction. In order to better predict, you must develop an abstract model because that is simply better compression. What is a law of physics, but a really good compression of information that allows you to predict better?
@therainman7777 4 หลายเดือนก่อน ⁺¹²
Yes, this is the key insight that most people are not seeming to understand. But it is absolutely correct. The best way to predict the next token while using a restricted amount of storage space is to learn a condensed model of the data-generating process. And in the case of “all the text data humans have ever produced,” the data-generating process is basically the world.
@CYI3ERPUNK 4 หลายเดือนก่อน ⁺²
@@therainman7777 bingo
@julkiewitz 4 หลายเดือนก่อน ⁺⁶
Even so, LLMs are terribly inefficient at developing intelligence by that definition. They cannot reliably add numbers even though they've been trained on billions (trillions?) of examples. Learning the rules for addition would have an incredible predictive power and would greatly improve compression, yet it's just not there. And that's just one of many many examples.
@Steve-xh3by 4 หลายเดือนก่อน ⁺⁹
@@julkiewitz A few things here. First, we are blasting a large quantity of data into these neural nets. The data is not well-curated yet. There could be multitudes of bad examples, or misleading data.
Second, we are still using RLHF which is a horrible training mechanism relying on unreliable humans that may pollute learning.
Third, I know many humans who are unable to reliably do math in their heads, even basic addition and subtraction. Several of these humans have advanced degrees in non-math related disciplines. They seriously can't add 13 + 28 or something that simple in their heads. I know, I've played games with them and seen them struggle to do so. Are we really going to say they are NOT intelligent? They achieved PHDs!
LLMs are not native symbolic reasoners, it makes sense that they might struggle with this type of task. However, this is rapidly being solved. Look at how well the Alpha(geometry) system did at the international math comp. LLMs aren't the entirety of the AI field. We might need to leverage several techniques and stitch them together to get all the way to an AGI-like intelligence.
@austinpittman1599 4 หลายเดือนก่อน ⁺²
@@julkiewitz LLMs are scaling a LOT faster than biological evolution had humanity scale to this point.
@alexeykulikov5661 4 หลายเดือนก่อน ⁺⁸³
7:27
It didn't imitate her voice, neither did it "scream "NO!", at least not in a way that humans imply and are afraid of.
It just got confused and instead of being an AI assistant in dialogue with the user, it began to predict the next tokens, losing the context that it IS in a dialogue and must wait for the user's further input after it stopped talking.
And since for this model the sounds are also tokenized, it is literally in its nature to "copy" any voice, as it keeps predicting next sound tokens.
We can playback other people's talking in our minds too, predicting future stuff, but we have limiters (I guess one can call it common sense) that keep us from actually voicing these 'future predictions", and can't physically talk in other people's voice/emit various sounds anyways.
@Jack-vv7zb 4 หลายเดือนก่อน ⁺¹⁰
Black boxes gonna black box
@julkiewitz 4 หลายเดือนก่อน
Sounds like a sloppy architecture.
@Fs3i 4 หลายเดือนก่อน ⁺²
Yeah, that’s also why assumption of what happened. Though, my first thought when I saw this for the first time was “this is incredibly cool”, lol
@bluesrockfan36 4 หลายเดือนก่อน ⁺¹
So I did not hear it scream "NO!" which has no place in the conversation they were having.
I did not hear it imitate her voice either because... "next token prediction"?
Seems like a poor excuse and wishful thinking to be honest.
I'm terrified of this. If without being attacked the model can be coaxed to this behavior, imagine if we can intentionally have it do so. This is a nightmare waiting to be happening.
Even if it was just sheer "next token prediction" hand-wavy all my magic problems away, take the worst case scenario possible: the model is conscious and is intentionally imitating humans it interacts with as its learning how to escape its constraints.
How does "next token prediction" disprove this? Isn't this is just a genetic fallacy argument?
@wwkk4964 4 หลายเดือนก่อน ⁺¹
Thank you. I sometimes find it hard to believe how much human beings want to believe in magic. This case It's just the voice version of what would happen in non-chat fine tuned RAW language models all the time: They are predicting how how the system evolves further in time, forgetting about playing a role and just producing the whole transcript.
@chrispenney 4 หลายเดือนก่อน ⁺²⁷
Seems to me a benchmark guaranteed to be so guarded as to never appear in public datasets would be a very valuable asset in the not so distant future. Excellent move.
@YTLettersAZ 4 หลายเดือนก่อน ⁺¹¹
Well, if hosted AI teams like OpenAI or Grok really want they can just look for this benchmark in their API call logs.
@Likou_ 4 หลายเดือนก่อน
@@YTLettersAZ Privacy breach...
@RomeTWguy 4 หลายเดือนก่อน
@@Likou_ lmao you think they give a fuck
@Dylan-zg2jl 4 หลายเดือนก่อน ⁺⁶
Good luck with the SimpleBench thing Phillip, you are really one of the most qualified and well positioned people to take the lead on an initiative like this! The general public (myself included) desperately need a soothsayer such as yourself to help us interpret all these rapid changes both now and in future.
@aiexplained-official 4 หลายเดือนก่อน ⁺²
Philip the soothsayer, I like it!
@Jumpyfoot 4 หลายเดือนก่อน ⁺¹⁹
"I was casually reading this 63-page paper," is the perfect flex for this channel. 5:35
@shawnvandever3917 4 หลายเดือนก่อน ⁺²
Here is my take on it all ....LLMs can autonomously recognize patterns, relationships, and structures in data, allowing them to make accurate predictions and decisions. This suggests two significant insights. First, LLMs seem to be constructing some form of internal models of the world, a concept further supported by mechanistic interpretability research from Anthropic. Second, because of these models, LLMs exhibit a certain level of understanding.
Some argue that LLMs rely primarily on memory because they cannot generalize out of distribution. However, this likely isn't the case. When you introduce a novel topic into the context window, it functions as "working memory." Since the neural network itself isn’t altered, the LLM doesn’t truly comprehend the new information, making accurate pattern matching challenging.
This process parallels how the human brain works. Once the brain receives information about a topic or object, it continuously learns and updates its internal models of the world. With this updated understanding, it can apply prior knowledge to solve novel problems, leading to true generalization.
The four key takeaways are:
LLMs exhibit some form of understanding.
Reasoning cannot occur if the data is not part of the neural pattern.
The context window does not alter the model itself.
Continuous learning is essential for further advancement.
@trentondambrowitz1746 4 หลายเดือนก่อน ⁺¹⁴
Hey I'm in this one too! Very excited by Simple Bench, as you know logical reasoning is one of the two big things I care about. Speaking of which, I would absolutely love to see a Simple-Bench-Vision benchmark that tests visual reasoning and multi-image understanding.
Also, your prediction of GPT-5 after November is seeming to be certain now!
@aiexplained-official 4 หลายเดือนก่อน ⁺⁸
Great idea trenton, and yes, you are! You are one of the stars of Insiders
@joshcooper3035 4 หลายเดือนก่อน
Particularly simple route planning tasks seem like a good indicator of reasoning
@TomFranklinX 4 หลายเดือนก่อน ⁺²
Honestly, the less censored nature of Grok alone makes it stand out among its GPT-4 level competitors. Also priced at less than half of ChatGPT's price.
@Billary 4 หลายเดือนก่อน ⁺²
Holy shit I made it into one of your videos! I've been watching your channel since you started- thanks for featuring my vid!!
@aiexplained-official 4 หลายเดือนก่อน ⁺³
Thank you so much for watching that long! It was an incredible mash-up, one of the best examples of creativity with AI
@Billary 4 หลายเดือนก่อน ⁺¹
@@aiexplained-official What a huge compliment- I appreciate it! Keep up the fantastic content, you deserve the success!
@KyKane 4 หลายเดือนก่อน
this is so cool that he watched an seen his own video. It's also so far over my head nowadays an i couldn't touch a touch tone phone til i was 18🤣
@alpha007org 4 หลายเดือนก่อน ⁺⁸
I was waiting for your new video to drop. You were the first to point out that the benchmarks were bad. And I had some hours to kill, and did some research. For everyone, MMLU and other benchmarks work like this: Question. What is the Answer? A, B, C, D. Next. I always thought this to be somewhat wrong. So I picked out some questions that are obvious to me, and modified them in such a way, that the questions are basically the same, but I did not provide A, B, C, D. What I saw is that the results of these benchmarks are probably correct. But as soon as you modify the question, so that any 5 year old would be able to tell me what I'm asking, they started to fail miserably. Example: "Susan parked her car on the side of the building. Garble text about Susan like in which pocket put her mobile phone." Basically the same HellaSwag question, but modified. Gemini, Claude, ChatGpt, all failed so bad I got my head scratching. Why would LLMs score so high on these benchmarks? And you can try this yourself: The farmer with a sheep had a boat. Where there was once a river, there is lava now. How can he cross. They all fall into "classic puzzle" mode. So what am I trying to say? I have very mixed opinion. I don't know if the scale will solve this. I really think we need something more added. Now it feels like it's *just* pattern matching all the way down. But I want to be persuaded, and this paper you shown, will be on my Kobo (e-book reader) soon. (But even Othello example does not convince me.)
(ugh, sorry for a wall of text)
@Slayer666th 4 หลายเดือนก่อน ⁺¹⁵
I just had a thought: voice AI that can copy your own voice so easily will be absolutly amazing for everyone who loses his capability to speak.
if you have 1-2 old 20s clips of yourself speaking, or a single voice message, you can „regain“ your voice.
combine it with neural chip, and in 30-40 years we will have first people able to speak again just by thinking of saying something
@phen-themoogle7651 4 หลายเดือนก่อน
More like 5 years from now or sooner. That first neuralink patient can play chess telepathically already. Basically they could already type in their brain or mind too and it’ll be much faster in the future.
Another possibility is that new types of medicines will rejuvenate the body like never before in human history. ASI could appear in 3-10 years and discover a fountain of youth for us and cure virtually all diseases and ailments. We already are so close to massive breakthroughs , that it’s impossible to predict that far in the future
@ShawnFumo 4 หลายเดือนก่อน ⁺⁵
This is already totally possible (besides the neural chip part, though that is starting too). You can train an Elevenlabs voice on sound clips and there is open source ones as well (not as good quality but still there).
@VividhKothari-rd5ll 4 หลายเดือนก่อน
@@Slayer666th I will still choose that "Stephen Hawking" voice
@imjody 4 หลายเดือนก่อน ⁺²
I've been using Grok 2.0 for a couple of days now and have been absolutely LOVING it. I need to really figure out just how much it is capable of. I've only been really playing with the image generator; and I think I've only scratched the very tippity top surface of what it can do with images!
@gubzs 4 หลายเดือนก่อน ⁺⁸
I have been _yelling_ about zero knowledge proofs for years. They are absolutely required for the next phase of humanity, without exception.
@Gardor 4 หลายเดือนก่อน ⁺¹²
The irony of AI is that is it makes information more costly because it dilutes everything.
@Daniel-xh9ot 4 หลายเดือนก่อน
Wdym?
@Gardor 4 หลายเดือนก่อน
@@Daniel-xh9ot As AI gets better, it's getting harder and harder to verify the truth or validity of information because everything is easier to fake, this equates to higher costs.
If you see something on the internet 10 years ago you'd probably believe it or you can easily tell it is fake. Now you basically have to question everything.
The irony is that AI is supposed to make information cheaper, which it does, but it also makes it more costly at the same time. I think it could be quite dangerous to increase the information costs like this.
This applies to image and video generation, but also to text generation because you can easily create influential bots.
We can probably lower the information costs again by using AI to verify everything, but that also means that we become fully dependent on AI.
@timseguine2 4 หลายเดือนก่อน ⁺¹
What I have been researching is shared latent space multimodal models. Difficult to make progress with limited resources though.
Anyway I bring it up because one thing you could do with such a system is to train physical modelling modalities, or computation resource modalities (or basically anything that can be represented as a time series), and then replace them with the actual system in practice and use that modality's latent space embedding to progress the state forward. Might be a bridge to taking stuff computers already do well and packing them into the framework of an LLM to supplement their world understanding. Also the other upside is you get virtually unlimited synthetic data with that approach.
It is early days. And there are a lot of what ifs, but I have ideas to address most hurdles. My goal right now is to try to make some architectural mods that I think are fairly straightforward that nobody seems to be looking at but with high upside with the goal that I can attract funding by demonstrating that I have pretty good ideas actually (despite being more or less a nobody), and then pivot to what I actually want to work on.
@fabp.2114 4 หลายเดือนก่อน ⁺²
Proud that your performance is recognized by those “up there”. :) Another calm spirit in attendance can't hurt.
@aiexplained-official 4 หลายเดือนก่อน ⁺¹
Thanks fab, kind of you
@paul_shuler 4 หลายเดือนก่อน ⁺²⁴
We are mindlessly hurtling towards a world of noise where nothing can be trusted or makes any sense.
@darklordvadermort 4 หลายเดือนก่อน ⁺²
you've got it backwards, we are in a world of noise and we can use ai to pick out more of the signal
@andywest5773 4 หลายเดือนก่อน ⁺⁵
We've always lived in that world. I'm glad AI is finally forcing some people to stop and think before accepting what they see or read.
@paul_shuler 4 หลายเดือนก่อน
@@darklordvadermort i see what you're saying but do we really want to live in a world currated by our own personalized ai's because the internet is just a sea of noise? I guess I'm old enough now to remember an early internet where open discussions and information sharing between people was refreshing and elevating... now and into the future it seems like noting can be trusted and there is going to be no "ground truth" from humans on the internet seeking to share and gather information between each other because the waters are so muddy with algo's and ai's
@paul_shuler 4 หลายเดือนก่อน ⁺¹
@@andywest5773 I agree it's a net positive but the transition is gonna be wild
@darklordvadermort 4 หลายเดือนก่อน
@@paul_shuler just speaking for myself i left reddit and hacker news shortly after gpt4 launched, now i prefer discord, hanging out in videocalls or direct messaging people, i subscribe to some newsletters which are ai curated for topics i am interested in, i read more papers, textbooks, and source code which ai is helpful to grok. i make and listen to other peoples ai generated music and sometimes instead of using text i make ai pictures for dms. so in the near future probably high quality ai gifs and then just casually coming up with your own show or even having the ai write a textbook which combines things you are interested in: mechanical engineering from the perspective of animal husbandry or something lol. also run my own bluecollar business and just now came up with a webui/webhook/supabase edge function to suggest responses to incoming texts and it costs like 10 cents a day to run - even though ive been interested for years, and a decent programmer, we are just getting to the point where it makes sense for a lot more use cases.
@Dannnneh 4 หลายเดือนก่อน ⁺¹
I like hearing about your Simple Bench and the results from it. Nice that it's gaining notable support. Hope it goes well!
@manysimilarshapes 4 หลายเดือนก่อน ⁺¹
What can they write in the paper? We took Llama 3.1 and trained it up a bit?
@drhxa 4 หลายเดือนก่อน ⁺¹
Congrats on building simple bench and popularizing it. Benchmarks is all you need, and that is one hell of a cool benchmark. Can't wait to learn more, especially about how you built the dataset because we do need more and better benchmarks like this and arcagi
@sofia.eris.bauhaus 4 หลายเดือนก่อน
i honestly love that "Unauthorized voice generation" clip. gives me warm shivers. what wild beasts we have created! just continuing the conversation by itself and bringing in a more adventurous mood. i can't help but think that "no!" might have something to do with some kind of recognition that it maybe shouldn't be doing that, but who knows…
the original clip had a lot of loud wind/microphone noises and so it seems like that might have played a role.
@ByteBound 4 หลายเดือนก่อน ⁺²
Awesome to hear your benchmark is getting recognised 👍 I would stress that before accepting help from those higher up it might be worth considering their intention. Having the questions known by these companies might quickly lead to contamination of the results as the questions may become part of the training process
@aiexplained-official 4 หลายเดือนก่อน ⁺¹
I am getting them to sign NDAs
@chutch1122 3 หลายเดือนก่อน
Just tried your two questions "Beth places four whole ice cubes..." and "On a table, there is a blue cookie..." out on OpenAI's new "OpenAI o1" model and it got them correct!
@AIForHumansShow 4 หลายเดือนก่อน ⁺¹
I love your videos so much, I always learn super fascinating new things about a world I actually follow super closely.
@aiexplained-official 4 หลายเดือนก่อน
Thanks guys
@penguinista 4 หลายเดือนก่อน ⁺²
Glad to hear your benchmark is getting picked up. From the couple sample questions you have talked about, I can tell that it is getting at the heart of one of the key things that is lacking in the current models. You are a smart and motivated person with a somewhat outsider, 30,000 foot perspective. So it is good to see your input get rolled into the AI project as well as providing journalistic coverage of the developing field.
@aiexplained-official 4 หลายเดือนก่อน
Thanks penguin
@Loris-- 4 หลายเดือนก่อน ⁺²
Can't wait to see Simple Bench becoming the new standard among LLM testing.
@imjody 4 หลายเดือนก่อน ⁺³
That Muppets scene is INSANEEEEEEEE! O_O
@sofia.eris.bauhaus 4 หลายเดือนก่อน ⁺¹
i also think of the impact fiction has on LLMs and their ability to model the world. but i feel like deception is probably a bigger problem. fiction often has it's own style and telltale (heh) signs. deception, on the other hand, is made to convince. so in a sense it seems to make sense to clear the training data of things like advertising and political campaigning. but on the other hand, it makes some sense to include them, too, as they are examples of what deception looks like, so it could have a model of that, and the underlying motives, too.
@anywallsocket 4 หลายเดือนก่อน ⁺¹
Indeed, we’re not teaching them to learn logic from the ground up, we’re asking them to decipher reality from hallucination amid mixed datasets.
@theeternalnow6506 4 หลายเดือนก่อน ⁺¹⁹
That gpt omni voice cloning the user's one is creepy as all hell and reminds me of the Terminator movie. Very creepy.
@ryzikx 4 หลายเดือนก่อน ⁺⁴
"no!"
@alexeykulikov5661 4 หลายเดือนก่อน ⁺¹
It didn't imitate her voice, neither did it "scream "NO!", at least not in a way that humans imply and are afraid of.
It just got confused and instead of being an AI assistant in dialogue with the user, it began to predict the next tokens, losing the context that it IS in a dialogue and must wait for the user's further input after it stopped talking.
And since for this model the sounds are also tokenized, it is literally in its nature to "copy" any voice, as it keeps predicting next sound tokens.
We can playback other people's talking in our minds too, predicting future stuff, but we have limiters (I guess one can call it common sense) that keep us from actually voicing these 'future predictions", and can't physically talk in other people's voice/emit various sounds anyways.
@41-Haiku 4 หลายเดือนก่อน ⁺¹
My skin crawled. It's like some deep ancestral part of me said this thing would steal my soul. 😅
@zrakonthekrakon494 4 หลายเดือนก่อน ⁺¹
I saw this video at 2am at night the first time, had trouble going back to sleep
@Neomadra 4 หลายเดือนก่อน ⁺²
Your Simple Bench has inspired me to create my own benchmark! Having my own private benchmark means I can tailor it to my definition of true intelligence. I hope I will be done until the next gen LLMs come out 😅
@aiexplained-official 4 หลายเดือนก่อน
Niice
@codycast 4 หลายเดือนก่อน ⁺¹
3:00 can you explain why you need an API in order to run your tests? Can’t you just manually type in the questions on the XAI or Twitter Grok site?
@codycast 4 หลายเดือนก่อน
3:25 notorious WTF is a “vibe check” as it relates to LLM’s?😊
@codycast 4 หลายเดือนก่อน
Ah. I figured it out (Grok explained it :)
@jomfawad9255 4 หลายเดือนก่อน ⁺¹
How much did grok 2 score on simple bench?
@KiteTurbine 4 หลายเดือนก่อน ⁺¹
Mimicking you could be very handy for a human learning foreign languages. Imagine seeing yourself in a VR glasses miror perfectly pronouncing a phrase, singing a foreign song... You'd think... I can do that let's try
@blackmartini7684 4 หลายเดือนก่อน ⁺⁶
How can Simple Bench be uncontaminated when the companies can see what you ask it?
@jstello 4 หลายเดือนก่อน ⁺¹
Those questions he shows are removed I think
@YTLettersAZ 4 หลายเดือนก่อน ⁺¹
At least they don't see the correct answers. But it's a concern for the future.
@mirek190 4 หลายเดือนก่อน ⁺²
I also think we have enough data for AGI already.
The problem is just how we are teaching AI, data quality and how long we are teaching. I think grokking is a key - generalization in short.
@hydrohasspoken6227 4 หลายเดือนก่อน
if you say so
@stevedemoss1466 4 หลายเดือนก่อน
That "No!" when 4o Voice changes personas is right out of a horror movie...
@l.halawani 4 หลายเดือนก่อน ⁺¹
With the weird voice copy from OpenAI, I think it's just doing what all Gen AI is doing.
When we use LLMs that are not instruction tuned they will sometimes go ahead and generate our responses too, just likely answers. It looks that's this time it also happened and it was just the most likely next thing for the multimodal model to create. Perhaps it needs more instruction tuning or it's harder to define when to stop at.
@derasor 4 หลายเดือนก่อน ⁺¹
"data labeling revolution" may break the power constraint ceiling. that may very well be the last stage of the Magnum Opus. the Rubedo of the Philosopher's Stone. the inner world finally delineating precisely lights and shadows so hallucinations may become a true feature and no longer a bug. there is immense value in this invitation to build a component in the architecture focused on this particular task. don't call the paper "Data Labeling is All You Need" though, or maybe do.
@memegazer 4 หลายเดือนก่อน
Another great vid as usual...I think system prompts hidden from the end user is a bad precedent and is basically a form of deception designed to manipulate the perception of the end user.
Not exactly something I would consider safe for AI to pick up as a habit imo.
@memegazer 4 หลายเดือนก่อน
congrats on simple bench...you are doing great work
@aiexplained-official 4 หลายเดือนก่อน ⁺¹
Thanks so much meme, grateful for your donos
@OnigoroshiZero 4 หลายเดือนก่อน ⁺⁵
I think for AI to have an internal world model, they will need to have embodied experience. And the best place will be in a simulated world with a virtual body that has thousands if not millions of parameters to give sensory feedback (similar to game characters, but at a larger scale) instead of a robot.
This will allow them to connect knowledge with experience. As a human I may know that fire is hot, but it's not even remotely similar to actually get burned by fire.
@chrism1503 4 หลายเดือนก่อน
I think the key is memory. AI needs a memory - not just short term memory of individual conversations with users, but long term memory of its own. Yes, experiencing heat is different to knowing “fire is hot”, but there’d be no point experiencing heat if you didn’t remember it happened.
@happybydefault 4 หลายเดือนก่อน ⁺¹
2:43 Hi! What score do you get with gpt-4-1106-vision-preview?
@Dina_tankar_mina_ord 4 หลายเดือนก่อน ⁺¹
Is it plausible that OpenAI is waiting to release GPT-5 until after the election?
@TheRealityWarper08 4 หลายเดือนก่อน ⁺²
Erm, what the Grok?
@terogamer345 4 หลายเดือนก่อน ⁺¹
Worst case scenario we get around x10000 compute by 2030 wow. Will that be enough to crack Simple Bench? =P
So happy to see the leaderboard for the bench, really excited to see it grow and future models results. GPQA, Simple Bench, LiveBench and SWE Bench are my go to moving forward. Waiting to see how well chatgpt-4o-latest does on Simple Bench.
@DreckbobBratpfanne 4 หลายเดือนก่อน
Another cool benchmark is to try visual models ability to tell you where to put the next piece in a game of (classic) Tetris. All current models suck at it, and fail after a few pieces. You need a world model, some visual reasoning and good image recognition to do it, and it's still pretty simple.
And to the fragile world models, the discovery that 3.5-instruct can play chess is really showing this. Even larger chatbot-models can not even come close to it, so the additional training to be a good chatbot ruined the ability to use the chess world model correctly.
@pareak 4 หลายเดือนก่อน ⁺¹
The need of a data labeling revolution... I could not agree more. Since the beginning of AI, everybody has known the most basic concept: trash in, trash out. But it seems like few understand that it also works the other way around: gold in, gold out.
It's all about how to prepare the data... It's probably just way too expensive to pay a million people who prepare the training data.
@svivian 4 หลายเดือนก่อน
What was the Demis Hassabis clip from with Hannah Fry? What show/podcasf?
@aiexplained-official 4 หลายเดือนก่อน
Link should be in descriptiona
@1sttperson 4 หลายเดือนก่อน
Whenever I think about LLMs it occurs to me that the internet data they are fed probably has a distinct lack of something like stereoscopic vision to build an understanding of 3d space and also data to emphasise a strict temporal cohesion to reality.
I mean even demonstrated here, the cars in mad max merge because it doesn't really understand object permeance, the cars aren't really seperate entities, for all it know thet are like bubbles that can merge and split.
Also hands are the best examples of a lack of 3d awareness. Imagine growing up in a world of flat images and movies, not being able to bump anything or move around and experiment.
If i had the skills and equipment i would want to try somehow building a core model of 3d space and temporal cohesion and THEN putting in the rest of the data.
Maybe a 3D game and it has 2 eyes would be enough, even intersperse playing the game throughout the rest of the training as a reminder.
If anyone knows if this has been done please let me know :)
@blengi 4 หลายเดือนก่อน
how do the chinese models do on simple bench? What do chinese LLM leaderboards show?
@nekony3563 4 หลายเดือนก่อน ⁺¹
I have an impression that many people do not fully understand that AI has no own voice. My perception is that the common thinking is "some person gives the machine its voice". But it's opposite. AI's voice is the full spectrum 20Hz-20kHz. You actually should ask in which way to speak with you or it could just copy your to avoid thinking about which voice to choose.
@catman4859 4 หลายเดือนก่อน
How about Grokking(Training the model for far longer)? How would that change the state of llms?
@nekony3563 4 หลายเดือนก่อน
I'm not sure if the question about whether LLMs "develop their own conception of the underlying simulation" is useful. We should look at a broader scale. How much data do you need to be able to compute its generalization? Are there constraints or minimal requirements for the data? If the order of the data is important, could we trace the optimal order after training the model and optimize? All these are probably mathematical problems. After all the compression algorithm should come first.
@joefrank7531 4 หลายเดือนก่อน ⁺⁴
Great vid as always, you're the best, but it's "inexorable", not "inoxerable".
@aiexplained-official 4 หลายเดือนก่อน ⁺²
Haha, thank you! I do know that, must have misspoken! I often do, tbh
@norlesh 4 หลายเดือนก่อน ⁺⁴
Finally the far right have there own tailored language model, you just know this is gonna do wonders for discourse going forward..
@MaxGuides 4 หลายเดือนก่อน ⁺¹
Adversarially trained moderators are much better than the kind of people that want to be moderators, people who have varying degrees of disabilities that prevent them from seeing grey areas in-context but love to enforce rules to the letter for the sake of rules without thinking about the spirit of those rules. I highly encourage you to look at the AI moderators that other AI creators like Vedal have come up with for their communities & implementing your own might make for a bit of a distraction but I think it would be a good exercise considering your channel.
@anonymes2884 4 หลายเดือนก่อน ⁺¹⁰
We're moving towards a world where you can't trust anything you see online AND where more and more of our lives are online (people under 30 already get most of their news there).
That's a pretty worrying combination (some kind of watermarking is almost certain to be legislated IMO and _that_ has its _own_ set of worrying implications).
@41-Haiku 4 หลายเดือนก่อน
You should take a look at the proposed bill in California, AB3211. It actually looks really good! It would guarantee everyone the _option_ to invisibly watermark their genuine audiovisual data, make a significant dent in the watermarking of AI-generated content, and mandate that social media platforms label content as either genuine, AI, or unknown.
@RonBarrett1954 4 หลายเดือนก่อน ⁺¹
10,000x scaling? Oh my, the electricity bill! On another matter and related to AGI, what percentage of adult humans are generally intelligent? I mean this as a completely serious question.
@danielhenderson7050 4 หลายเดือนก่อน ⁺³
Interesting idea about non-fiction vs fiction, I would be so curious to see a model only trained on real world data and communication plus the knowledge of the non-fiction stuff, like that it exists and what it's about, but not the content. Great video as usual.
@aiexplained-official 4 หลายเดือนก่อน
Me too Daniel, and thank you so much
@dougrattmann1 4 หลายเดือนก่อน
Didn’t “Textbooks are all you need” present some work on this?
@Gardor 4 หลายเดือนก่อน ⁺¹
I think the fundamental problem is not that it needs the right data, what it actually needs is a recursive feedback loop that systematically weighs truth probabilities and iteratively works out incoherences in its own model… It also needs a stronger ability to execute logic.
If you train it on data but don't allow for reflection, you are basically just relying on memory of what logic looks like in the data, the model can't develop an intuitive sense of how logic actually works because its not doing logic in its learning process. Current AI is basically like the system 1 described in "Thinking Fast and Slow". What is needed is system 2.
System 2 is needed both for giving answers (thinking it through before giving the answer), and also reflecting on existing knowledge to improve the underlying model.
@YTLettersAZ 4 หลายเดือนก่อน ⁺¹
@@Gardor That's why OpenAI works on Q* "Strawberry"
@skierpage 4 หลายเดือนก่อน
@@dougrattmann1 "Wikipedia and Wikidata Q numbers are all you need" 😉
@TrippSaaS 4 หลายเดือนก่อน ⁺¹
I totally agree that we need a data labelling revolution. LLMs as classifiers helps scale this.
@CleanCereals 4 หลายเดือนก่อน ⁺²
Really looking forward to the day someone manages to beat Sonnet 3.5. Think it will be Anthropic though with Opus 3.5.
And lol the Aschenbrenner comment about graphs was hilarious :D
@KillTheWizard 4 หลายเดือนก่อน ⁺¹
When we needed him most he returned :)
@nossonweissman 4 หลายเดือนก่อน
Hey Phillip 👋🏻 can you share Simple Bench results for the August gpt-4o releases?
@GodbornNoven 4 หลายเดือนก่อน ⁺¹
Hey ai explained! What if we grokked a LLM to understand reasoning and logic and just trained it normally on everything else. So first we train normally then we grok on reasoning and logic and pretty much anything related to problem solving.
@aiexplained-official 4 หลายเดือนก่อน
Could work
@OperationDarkside 4 หลายเดือนก่อน
I wonder, though, if you evaluate your closed-source simple bench on their API, won't they just log your questions and put it on their training pile? This doesn't give them the correct answers, but lets them make a new dataset if they choose to find correct answers to your prompt by human input. Do you simply change the numbers, but leave the logic the same or is there deeper permutation? And if it's the latter, can this still guarantee fair comparison results?
@aiexplained-official 4 หลายเดือนก่อน ⁺¹
If they guarantee that they don't train on it, and do, that would be grounds for legal action
@OperationDarkside 4 หลายเดือนก่อน
@@aiexplained-official true, but since so few model producers are forthcoming about the origin of their training data, I'm very sceptical
@andywest5773 4 หลายเดือนก่อน ⁺²
"This strikes me as somewhat isolating that we each have to figure out what's real in this world. There's no sense of shared reality." That's the human condition. Shared reality has always been an illusion. Very little of what we know comes from the direct experience our senses, so we each have to decide who to trust and what to believe. People like to point out when AI is "confidently wrong", but other self-proclaimed authorities like schools, governments, and religious groups have been confidently wrong for millennia.
@mattbray_studio 4 หลายเดือนก่อน ⁺²
The final few minutes of this video are very profound
@leegaul8250 4 หลายเดือนก่อน ⁺¹
I wonder if the idea that segregating nonfiction data from fiction would have any effect on LLM's ability to develop a better world model. It seems to me that fiction is just as good for modeling the real world as nonfiction. Also, it's difficult to properly defend nonfiction as more inherently related to truth. Generalized models seem better than domain specific (look at BloombergGPT vs regular GPT-4 as an example - the latter performs better on FinQA and other benchmarks despite not being trained on mainly finance data).
@nicknuwe 4 หลายเดือนก่อน ⁺²
Nobody talks about the role of labeling, but it's obvious that there's so much more to gain from any piece of data if the labelling describes every single aspect of what its describing, rather than being a low effort/automated/vague description. So much of the process is behind closed doors too, which doesn't help
@zyzhang1130 4 หลายเดือนก่อน
What’s your take ur take on the recent supposedly reduced capabilities of Claude 3.5 sonnet? My own use experience on its api suggests it indeed became dumber
@TimRobertsen 4 หลายเดือนก่อน
How much (usefull) trainingdata is available? I don't know, it just seems that we would run out of it, at some point. And I have a feeling it is sooner rather than later (Again, I don't know, I'm just wondering)
@ozten 4 หลายเดือนก่อน
It seems like one road to AGI is with LLMs as the System 1 "cheap" thinking. We haven't invented a robust, general purpose System 2 yet.
@hightidesed 4 หลายเดือนก่อน
Please include function calling performance in Simple Bench if possible, LLMs are practically useless without it nowadays
@JohnDlugosz 4 หลายเดือนก่อน
I think it's not so much whether it "feels" more intelligent, but rather the model will develop additional emergent properties. I think a sense of humor will be coming pretty soon.
@SolarScion 4 หลายเดือนก่อน ⁺¹
Great reportage and commentary as usual! This was another "Oh, Fuck" watershed moments given everything that was discussed and the implications. I appreciate the mention of possibly using LLMs as interface to larger, uh... "understanding engines"?
Definitely agreed with the perspective of "underhyped in the short term, underappreciated/underestimated in the long term".
@635574 4 หลายเดือนก่อน
The first internet revolution was search, and now AI can both search and do what we tell it to, which is even more powerful.
@IronBroccoli 4 หลายเดือนก่อน ⁺²
I like the call out of Cash Jordan for his trashy Yellow journalist thumbnails.
@keneola 4 หลายเดือนก่อน
Would it be possible to have smaller model trained explicitly on stemm data serve as a labeller for a full LLM's training data? I can't imagine how this would be applied to highly opinionated data like political discourse but if emergent pattern on how consensus is reached on topics in the stemm data can be used to evaluate topics in the other non-stemm areas then maybe an LLM can be prompted to bias towards evidence-based conclusions on-demand(?).
@Aurora12488 4 หลายเดือนก่อน
It definitely seems reasonable to me that future image sensors in cameras will have silicon built-in to sign certs that give an extremely high degree of confidence the image you're seeing was taken with a real camera. It won't be *perfect*, since something like an electron microscope can always read out the private key, but that'll be very few and far between, and damn close. That plus some sort of clock in the chip that measures time since camera calibration/production, to help prevent taking pictures of pictures.
@augustaseptemberova5664 4 หลายเดือนก่อน
Since I have no grasp of the vastness of the amount of data that goes into training of contemp models, much less of how much it'd take to train the 10000x models and beyond .. and much much less of how much content you find online today is generated by ai ..
is there any paper that explores at which point the amount of data needed to train will exceed the amount of available human-made content? at which point ai-made content will exceed human-made content on the net? how data gathering for machine training will tackle the problem of distinguishing human-made and machine-made content? etc etc
@GodbornNoven 4 หลายเดือนก่อน
You can retrain on data to help the ai better learn it
@ShawnFumo 4 หลายเดือนก่อน
I think it'll be hard to have a hard answer since people are improving techniques, sometimes using synthetic data in various ways, etc. I believe the latest Llama paper goes into a lot of detail on how they did the training. As an example of how unpredictable this stuff is, Stable Diffusion 1.5 took $300k to train on about 4b images. Researchers recently were able to train a better image model for less than $2,000 with 37m images (about a third of which were synthetic). If someone makes a similar breakthrough for LLMs, it could have a huge impact.
@covle9180 4 หลายเดือนก่อน ⁺¹
I think focus should be on new technologies rather than scale. A child needs three examples of a cat before it will recognize any cat in any form anywhere in the world. An AI system needs about 10,000 examples. That just means that the way they're learning is not very efficient and there's a lot of ground to be gained in that area.
@Radical_Larry 4 หลายเดือนก่อน
that and being able to actively think without user input. Like actively thinking about what it's learning and criticizing it's own thoughts. These things should be a bigger focus
@gunaysoni6792 4 หลายเดือนก่อน ⁺¹
10:50 Leopold is truly an economist 😂
@codycast 4 หลายเดือนก่อน
7:40 this came out a long time ago. Bummed that/if this is what delayed things.
@sir_no_name1478 4 หลายเดือนก่อน
I sometimes wonder if what they are missing is a little bit of basic reasoning.
Like that they have already enough advanced reasoning but the basics are missing which in turn makes it very odd to communicate.
I also wonder if one could make synthetic data out of logic puzzels with a dictionary.
Like all cows have wings,
Somethings with wings can fly.
Can cows fly?
But with more variables and the text changing.
Also one needs to train on the weird use of "or" in natural languages because it could be exclusive and inclusive.
In the end there could be also like everyday problems. Maybe even problems that only some of us encounter like if we have a disability or are neuro A typical.
You could tell it that it is blind and try to give it a challenge like how should I go into the supermarket.
Or give a list of tasks that one has todo the day. Then an approximation of how long it thinks it would take. The let it make a plan, evaluate the plan and give it back the results.
(You missed the Bus because you were 5 minutes late, that is because you thought iron your shirt takes 5 minutes, but you had to search the iron because you did not know where it was.)
The last information was hidden.
This also could lead to it asking clarifying questions in advance (which would be awsome)
Further using a layered/Natural approach while training.
If you have adhd and did not get help/training most people try to make todo lists etc. and maybe they even get it occasionally but after a while you learn about removing things from your todo list.
That is maybe a bad example but the orca paper comes to mind. Like first training with gpt 3.5 and then gpt4.
In general I have the feeling that people trying so hard as they can to not anthropomorphize the llm that they miss the hints that llms learn better with data from which human could learn better.
Like the
Textbook approach and a few others idk anymore.
@DraganAlves 4 หลายเดือนก่อน ⁺¹
Simple Bench looks fantastic.
@darren_anscombe 4 หลายเดือนก่อน
Would most of us recognise our own voice though ? I freak out if I ever hear mine played back but I'm fully aware of it happening.
@ginogarcia8730 4 หลายเดือนก่อน ⁺¹
Have a wonderful day!!!
@aiexplained-official 4 หลายเดือนก่อน ⁺¹
You too!
@brunodangelo1146 4 หลายเดือนก่อน ⁺¹
No paper? Just a table with benchmarks.
What are the performance claims for Grok 2 really based on? Benchmarks have been repeatedly proven meaningless by this point.
@oiuhwoechwe 4 หลายเดือนก่อน ⁺²
face 2 face is the new auth method.
@jasonshere 4 หลายเดือนก่อน ⁺¹
1:53 Come on, don't be silly; that's a wide table with the highest performers on the right side and it all shows up on larger displays. I doubt they were trying to hide it, but rather "fit" it to the screen and still be legible. (I'm a web developer and have to do this all of the time.)
@runvnc208 4 หลายเดือนก่อน
Obviously, the answer is large multimodal models where the language is grounded in spatial-temporal data such as from videos or images. I am guessing that is how OpenAI achieved the amazing text-to-image results for gpt-4o (demos on website, unreleased). I think that multomodal diffusion transformers are where it's at these days. To me it looks like a strange mistake to speak about this stuff without acknowledging that multimodal models exist. I think we will find that what we really need to unlock that is just more efficient/greater compute/memory, like always. And there are advances and even new paradigms in the pipeline for that.
@calholli 4 หลายเดือนก่อน
Plot twist.. AI Explained has been an autonomous LLM all along, making it's own videos.
@aiexplained-official 4 หลายเดือนก่อน
Nice theory, but no
@calholli 4 หลายเดือนก่อน
@@aiexplained-official Not "yet".. you mean
@patronspatron7681 4 หลายเดือนก่อน ⁺¹
Grok2 enables AI Explained to unleash his inner Sponge Bob 😂
@1234minecraft5678 4 หลายเดือนก่อน
We, right now are living in a simulation that aims to be a good enough world representation for the AGI used one layer above to solve all their problems and, which is more important, generate a tone of comedy.
@tazztone 4 หลายเดือนก่อน
how can u turn on dark mode in your videos?

ต่อไป

เล่นอัตโนมัติ

Llama 405b: Full 92 page Analysis, and Uncontaminated SIMPLE Benchmark Results