LLMs Are Cheating On Benchmarks

ThePrimeagenClips

มุมมอง 12 705

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 10 ม.ค. 2025

ความคิดเห็น • 31

@johanngambolputty5351 3 หลายเดือนก่อน ⁺⁴⁴
Who knew...
When there's so much money up for grabs, and you're not required to be too transparent about these things, in what world would they not be massively exaggerating everything?
@noobgam6331 3 หลายเดือนก่อน ⁺²
This is not an exaggeration, this is a scam. They undermine the quality of publicly available tests in order to seem better
You telling your parents you got an A is not the same as stealing the test answers ahead of time and actually getting it while knowing nothing.
And mind you, these companies tend to do both at the same time, yet the product is amazing. Just not amazing enough to warrant all the investments
@johanngambolputty5351 3 หลายเดือนก่อน ⁺¹
@@noobgam6331 Hey, I'm not saying you can't use snake oil for something, just not going to believe you if you're telling me its solving every problem...
@Mempler 3 หลายเดือนก่อน ⁺⁹⁵
AI is like a Genie. You must be as specific as possible, or your wish is gonna be something horrible.
@SamMaciel 3 หลายเดือนก่อน ⁺²
"Specific AI" sounds a lot like nested If statements with smaller nested black boxes.
@monad_tcp 3 หลายเดือนก่อน
That's what makes me furious when I try to use those things. If I have to be that precise, I might as well use Python directly than write in English. Its a stupid way of doing computing.
And they keep putting useless AI buttons everywhere, for ex, If I'm using Photoshop I don't want to type what I want, I want to click, WIMP is the best invention ever, AI is not even close, its just much more work to write in English than to click in a button.
Who thought this was a good idea. WIMP-Windows,Icons,Menus,Pointer is the end all solution for telling a computer what to do, AI is basically enshittification . It is not better to use text than to click a button.
And don't ever get me started on using AI to write code, if you're writing the most bland mundane thing ever that there are 1000 copies in the internet, then it works, the moment you need anything non-average, all bets are off, they don't know how to code, they just copy from stack-overflow, and the code there was already shit enough.
LLMs are only good for search, and that if you also check the sources because they lie and create things (mash nonsense together, they never create anything, they have 0 creativity) when they can find it.
@Jarikraider 3 หลายเดือนก่อน ⁺¹¹
Based Claude, "Sounds legit, here you go."
@winter5945 3 หลายเดือนก่อน ⁺⁶⁵
Bro what the fuck did that whole ramble have to do with the video? I think you completely missed the point. This is not a case of malicious compliance or whatever the fuck. What do you mean "of course it knows the answer"? The whole point is that it's not supposed to know this string, because the Big Bench answers are supposed to be excluded from its training dataset, and the canary string is just a method of detecting that it hasn't been properly excluded.
@solimm4sks510 3 หลายเดือนก่อน ⁺¹
agreed
@TheRyulord 3 หลายเดือนก่อน ⁺¹¹
yeah, I know it's been a while since he's done any hands on ML stuff but you think he'd recognize "don't train on your test data" and just say that
@o_glethorpe 3 หลายเดือนก่อน ⁺³
What do you mean? The tech influencer doesn't know what hes talking about? Who could imagine that
@mitchlindgren 3 หลายเดือนก่อน ⁺¹
Thank you, this was driving me nuts as I watched this clip
@monad_tcp 3 หลายเดือนก่อน
You are the one that missed the point. Its the companies training that are cheating, not the AI. THE COMPANIES THEMSELVES ARE CHEATERS.
And you know that the canary isn't there to prove the AI is cheating on the answers, the entire point of having a canary was to test if the bench answer was excluded from the training set, it wasn't even supposed to be there.
The fact that they are there prove that the companies are cheating, what's the best way to get the highest score in the test, just give the AI the answers. OBVIOUSLY
Its the humans that are cheating, this entire AI thing is cheating and lying, the humans creating those things are the real cheaters. Even LLMs only can do what the programmer put in the training set.
There is 0 intelligence in the machine, its a word calculator. Every single benchmark is cheating, the idea of the test itself is already cheating.
That's another form of cheating, marketing is pure cheating, they will tell people the thing is intelligent, why its basically being actually driven by the human. It is not even creating anything, its just mashing text together from the internet, other humans had the intelligence to create the text.
They should first come with a good test of intelligence for humans, oh, those don't use text, they use images, and when you apply them to multi-modal-AI, they're abysmal, that means they're not even close to mimicking what the brain does.
LLMs capture all the proper mechanics of the syntax of proper text, yet there's 0 meaning encoded, its really amusing to see how language works. But intelligence isn't being able to speak, no, humans are intelligent because they literally created language, intelligence comes before language, not after.
This entire endeavor of LLMs are bound to fail, it was all cheating, that was the secret.
@w.o.jackson8432 3 หลายเดือนก่อน ⁺³²
Sounds like the LLM isn't cheating, but the companies training the models?
@rawallon 3 หลายเดือนก่อน ⁺¹
"it's not what it looks like hun"
@CHURCHISAWESUM 3 หลายเดือนก่อน ⁺¹³
Well of course it's not cheating, it's not an agent. It doesn't choose what to do. It's a stochastic parrot: the people at the company are cheating, not the bot.
@autohmae 3 หลายเดือนก่อน
Are they cheating aka did they include a scientific paper by 'accident' that included a mention of this string OR was it really by accident.
My guess is if you are running such benchmarks, you should make really sure papers which talk about this shouldn't be included...
@monad_tcp 3 หลายเดือนก่อน
exactly !
@monad_tcp 3 หลายเดือนก่อน ⁺¹
I remember reading somewhere (and ignoring the click-bait) . AI with Human-level PHD defeats blah blah blah. yeah, sure either your PhDs are really stupid or you are cheating at the tests.
@drxyd 3 หลายเดือนก่อน ⁺¹
All you have to do is train on test data for unicorn status. Why wouldn't they?
@alien5589 3 หลายเดือนก่อน ⁺²
Omg I’m in construction and the "aid will take over the PMs can accurately describe the problem" made me laugh harder than I have in a long time 😂
@monad_tcp 3 หลายเดือนก่อน
Know what's funny, no one needs a PM when you have Jira, just let the users put tickets there directly.
I go even further, if companies really wanted to save money they would realize that workers don't need managers. You already have an ERP suit to manage the company, just use the software, it doesn't even need to be AI, we already have automation to automate managers, just fire the entire C-suit.
But then how are you going to manage people, no, you don't. Just separate people into groups of 5 people and they don't need managers. Shocking, isn't ?
Valve works like that, but yet, that's not a publicly traded company, so it is not just a place to have suits suck up resources and positions of nepotism.
Maybe this don't work for dumb workers (that's more of an education problem, than intelligence per sì), but for knowledge work it totally does.
@johnkost2514 3 หลายเดือนก่อน
I'm GOBSMACKED ..
@Bodom1978 3 หลายเดือนก่อน
I thought the whole point was that these AI's are supposed to respond accurately to natural language, not that you need to prompt exactly right else you get unexpected results.
@vectorhacker-r2 3 หลายเดือนก่อน
I fucking knew it. I absolutely fucking knew it.
@stallhaagen 2 หลายเดือนก่อน
DrCodeRespect lol
3 หลายเดือนก่อน ⁺⁷
Maybe I'm missing something, but your ramble has nothing to do with what the guy in the video was talking about??
@fuby6065 2 หลายเดือนก่อน
he thought the point of the video was the "oh but it's public data" "oh yeah now I know it", and didn't realise that was just a side-point, and the big deal was the fact that it shouldn't know that, no matter how much you press it
@Julzaa 3 หลายเดือนก่อน
You totally missed the point

ต่อไป

เล่นอัตโนมัติ