Can ChatGPT reason mathematically?
- Published 8 Feb 2025
- AI Researchers at Apple just released a new paper that breaks down limitations in the ability of Large Language Models (LLMs) like ChatGPT to reason mathematically.
The paper: arxiv.org/pdf/...
AI researchers analyze the mathematical reasoning ability of LLMs using a benchmark of math questions called GSM8K (Grade School Math 8K). The Apple researchers tweaked this benchmark by, first, changing names and numbers; second, making problems longer with more clauses at the same reasoning level; and finally, adding irrelevant clauses that should be ignored. Each of these caused some level of problems, but it was the irrelevant clauses that were most challenging, even for top models like OpenAI's o1-mini and o1-preview.
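To make those tweaks concrete, here is a rough Python sketch of templated perturbations in the spirit of what the paper describes. The template, names, numbers, and extra clause are all invented for illustration (not taken from the actual GSM-Symbolic benchmark), and only the first and third tweaks are shown.

```python
import random

# Hypothetical illustration of two of the three tweaks described above (new names/numbers,
# and an appended irrelevant clause). Template, names, numbers, and extra clause are invented.
TEMPLATE = ("{name} picks {n} kiwis on Friday and {m} kiwis on Saturday. "
            "How many kiwis does {name} have?")
NAMES = ["Oliver", "Maya", "Sam"]
IRRELEVANT = " Five of the kiwis were a bit smaller than average."

def vary_names_and_numbers(rng: random.Random) -> str:
    # Tweak 1: same underlying problem, different surface details.
    return TEMPLATE.format(name=rng.choice(NAMES), n=rng.randint(20, 60), m=rng.randint(20, 60))

def add_irrelevant_clause(problem: str) -> str:
    # Tweak 3: insert a clause that should not change the answer, just before the question.
    i = problem.rindex(" How many")
    return problem[:i] + IRRELEVANT + problem[i:]

rng = random.Random(0)
base = vary_names_and_numbers(rng)
print(base)
print(add_irrelevant_clause(base))
```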
The video of grade 8 students: • How Old Is The Shepherd?
Twitter thread from Mehrdad Farajtabar, one of the authors: x.com/MFarajta...
BECOME A MEMBER:►Join: / @drtrefor
MATH BOOKS I LOVE (affiliate link):
► www.amazon.com...
COURSE PLAYLISTS:
►DISCRETE MATH: • Discrete Math (Full Co...
►LINEAR ALGEBRA: • Linear Algebra (Full C...
►CALCULUS I: • Calculus I (Limits, De...
► CALCULUS II: • Calculus II (Integrati...
►MULTIVARIABLE CALCULUS (Calc III): • Calculus III: Multivar...
►VECTOR CALCULUS (Calc IV) • Calculus IV: Vector Ca...
►DIFFERENTIAL EQUATIONS: • Ordinary Differential ...
►LAPLACE TRANSFORM: • Laplace Transforms and...
►GAME THEORY: • Game Theory
OTHER PLAYLISTS:
► Learning Math Series
• 5 Tips To Make Math Pr...
►Cool Math Series:
• Cool Math Series
SOCIALS:
►X/Twitter: X.com/treforbazett
►TikTok: / drtrefor
►Instagram (photography based): / treforphotography
“They think that intelligence is about noticing things are relevant (detecting patterns); in a complex world, intelligence consists in ignoring things that are irrelevant (avoiding false patterns)”
-Nassim Taleb
Ooh I like that quote!
AI also does that
In unsupervised learning
@@kichelmoon6365 that’s a great perspective
They honestly remind me of friends I've had when they weren't taking meds for schizophrenia or bipolar and the patterns they'd hallucinate into their own realities.
@@hassanabdullah7569 Unsupervised learning is really just clustering, which still amounts to a choice function deciding whether a value belongs to a set.
This is more closely related to negation as failure than abductive reasoning.
One of my favorite stats professors enjoyed adding superfluous information to problems even after students asked him not to, and he always explained that identifying the pertinent information is as necessary a skill as your computational ability. I'd argue it's even more important in applied mathematics and statistics, given how heavily computational processes are offloaded to technology.
That’s cool. And in the “real world” deciding what information is relevant is absolutely an important skill
I would contend that kids make the same mistake on the shepherd question because they're being trained to pattern-match just like the LLMs are. K-12 math education, on the whole, is stifling their creativity and their reasoning ability.
Ya I think the same basic bias of presuming questions can be answered and all clauses are relevant is definitely something teachers arbitrarily impose over and over again in questions.
I don’t know about your school, but in our school, they teach reasoning, problem solving strategies and our kids need to explain their solutions, from 1st grade.
Tbf kids make tons of daily decisions in which they must judge extraneous vs necessary information that LLM coders could only dream of.
I’m using chat gpt with some undergrad math problems.
The number of times I've gotten blatantly contradictory answers, within the same question, is remarkably high.
I’ve also convinced it to change its answer and agree with me on many problems and I simply don’t know when I can or can’t trust it.
Ya I think one should never blindly trust an answer. It can generate an answer but we have to always evaluate it ourselves
Hey Trefor, for future videos: it's OpenAI o1 and OpenAI o1-mini. ChatGPT 4o is a separate line.
Not just that, but there's o1, then o1-preview (smaller than o1, and the biggest o1 variant with public, paid access), then o1-mini (even smaller than o1-preview, with 7x the usage limits but still paid access).
When ChatGPT came out publicly, I asked it to solve the Gaussian integral. That can't be solved in closed form, except for the special case from -infinity to +infinity. It went through the steps, which are not obvious, but then came up with 1/sqrt(pi). That's wrong, but close: it's just plain sqrt(pi). It seemed to be copying the steps from somewhere without actually understanding what it was doing. Kind of like some random student copying somebody else's answers to cheat! Makes sense for an LLM to do that.
Explanations of that integral are so entrenched in the training data, for sure, so it can partially pattern match.
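For reference, the value the commenter is pointing at comes from the standard squaring-and-polar-coordinates argument, which gives sqrt(pi) rather than 1/sqrt(pi):

```latex
I = \int_{-\infty}^{\infty} e^{-x^2}\,dx
\quad\Longrightarrow\quad
I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)}\,dx\,dy
    = \int_{0}^{2\pi}\!\int_{0}^{\infty} e^{-r^2}\, r\,dr\,d\theta
    = 2\pi\cdot\tfrac{1}{2} = \pi,
\qquad I = \sqrt{\pi}.
```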
I am surprised that the results didn't drop by more.
To really test the model we should probably add a bit to each question: "Show your working".
Tbh, adding in extra clauses like that would confuse most humans too, including me; it just gets more confusing and easier to misread something. They really did need to test humans too. They can't just claim that humans would likely be able to discard useless information without data to back it up.
Edit:
Also, what is reasoning? If I look at the question in my head, am I not pattern matching the words into a sentence and then calculating how it needs to be solved? If I have learnt how to solve these problems, am I not just applying an algorithm in my head to solve it? Really, for me AI needs to be able to learn to solve things it hasn't been taught how to solve and can't reference other things to solve.
Ya, and humans are really error prone. What I'd naively sort of imagine is that after an LLM got to a certain reasoning level it could apply that reasoning fairly accurately over and over, beating humans, but that is not yet the case.
It would confuse most humans in most situations, but probably not the humans that sit down to do reasoning.
@@DrTrefor If you try to write down your reasoning and then give that as an example to the model, it will mimic your logic and break down the complicated text in the same way. In other words: show how you want the model to solve the problem, then give a new problem.
On Wall Street the assumption is that true AGI is here, simply increasing the training data and the compute power will do the trick. Thus the justification for the unprecedented capital investments going on. This Apple study, among other things, makes you think the “singularity” is a long way off.
Excellent video! You have a wonderful style of presentation!
I don't get why this is even a question. LLMs aren't doing reasoning *AT ALL.* We already know this. They are predicting what word comes next.
What this paper is really doing is going down a level from that, asking what kinds of problem types it does well or poorly at. Like, is it good at ignoring irrelevant clauses? (Turns out not.)
I think the question being answered is "Does reasoning emerge from next token prediction in the same way that language production does?" The answer being no. Reasoning requires something extra.
What makes you think the human brain's reasoning is anything fundamentally deeper than predicting what word comes next though? What does our brain do that an AI does not (apart from it - for now - probably being a bit better at it, yet)? What is reasoning, apart from prediction?
That is what our brain is evolutionarily selected to do: predict what will happen so we can anticipate that in our actions, thus maximizing our fitness in our environment.
@@ajs1998 Does it? AFAIK, a sufficiently big AI is essentially Turing complete, and I don't buy into the idea that the brain is anything more than an advanced wet computer, so I would be interested to know what more goes into reasoning (and how you know an AI cannot do that in principle). A sufficiently advanced AI should be able to simulate a brain (not that we are close, but in principle) and thus do anything a brain can do.
Calling it just text prediction is a terrible oversimplification. Predicting the next words is really just the way we get information back out of the model. Internally, they operate on huge amounts of vectors encoding various concepts and ideas. These vectors are repeatedly combined. It's entirely unclear that this is incapable of performing what we consider reasoning or even fundamentally different from how our brains work, when done correctly and on the appropriate scale.
You can increase the score by asking o1 to assume there may be misleading language and to first determine which terms are irrelevant. As you know, hinting can have a significant impact and the irrelevancy hint can add extra reasoning steps. Even without a new algorithm for negative pattern recognition, OpenAI could make these steps into a layer pre-requirement, but it would have some impact on response time and token pricing. We may be seeing some "Layer scarcity" while OpenAI tries to figure out what to prioritize, because the model would become too slow and pricy if every possible layer is enabled right now.
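As a concrete (and purely hypothetical) illustration of the irrelevancy hint described above, one could wrap every problem in a system-level instruction like the sketch below. The hint wording, message structure, and example problem are assumptions for this sketch, not taken from the paper or from OpenAI's documentation.

```python
# Hypothetical prompt wrapper illustrating the "flag irrelevant clauses first" hint.
HINT = (
    "The problem may contain misleading or irrelevant clauses. "
    "Step 1: list any clauses that do not affect the answer. "
    "Step 2: solve the problem using only the remaining information."
)

def build_messages(problem: str) -> list[dict]:
    # Chat-style message list: a system-level hint followed by the user's problem.
    return [
        {"role": "system", "content": HINT},
        {"role": "user", "content": problem},
    ]

example = ("Oliver picks 44 kiwis on Friday and 58 on Saturday. "
           "Five of them were a bit smaller than average. How many kiwis does he have?")
print(build_messages(example))
```

The trade-off the comment describes is visible here: the extra listing step adds reasoning tokens on every request, which is exactly the response-time and pricing cost mentioned.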
Fantastic video as always.
Glad you liked it!
The issue with tweak 2 is that it is confusing and pushes the model's attention to its limit. Unless you set the temperature really low, like 0.2 or less, most models will be tanked by it, even 70B models. But if you distribute the info in a table or a list of premises/conditions, suddenly the model can perform just as well.
I saw some models gain performance just by ensuring no part of your prompt has a paragraph above 4 lines; line breaks make it more focused.
ChatGPT couldn't even solve a simple gcd(); it kept confidently getting the wrong answer, and even when I gave it the right answer it wrote the right answer but the wrong work to get there.
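For contrast, the computation the commenter is describing is tiny; a minimal Euclidean-algorithm sketch (checked against Python's built-in math.gcd) is a few lines, which is part of what makes confidently wrong LLM answers here so jarring.

```python
from math import gcd  # Python's built-in, used here as a cross-check

def euclid_gcd(a: int, b: int) -> int:
    # Euclidean algorithm: repeatedly replace (a, b) with (b, a mod b).
    while b:
        a, b = b, a % b
    return abs(a)

assert euclid_gcd(252, 105) == gcd(252, 105) == 21
print(euclid_gcd(252, 105))  # 21
```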
When I was learning myself, I always added these changes to any problem so that I can figure out how to solve them properly, not in the baby cases. It's too bad that with modern math you can't do that because they only work in special cases ;). Also, I tried to teach people like this and they thought I was messing with them, even though I know it helps to remember what's relevant and what's not.
I kind of disagree with the suggestion that this convo is pedantic. I think it's one of the most important problems with "AI" right now. We're already seeing lots of people put a lot of faith in this tech, and it seems really important to formally investigate what it's actually capable of intellectually.
Oh I agree, measuring what it is capable of is super important (kinda why I did the video), but whether we call it "reasoning" or not I care less about.
I can't wait until they stop referring to LLMs as AI and trying to sell the idea that they're useful for every single application under the sun.
This is actually very interesting, it sounds like the paper is saying these language models basically lack common sense. But I don't think this means that they're unable to mathematically reason.
To be less vague:
Imagine in another culture saying something like
"5 apples are smaller than average"
implies that you shouldn't include them in your count because everybody knows small apples aren't desirable and it would be like counting apple cores as apples.
To the large language model the above may be the only rational conclusion. Like Dr. Bazett said, these models may have been trained on problems that only included relevant data.
I think it's a more of a hallmark of reasoning to try to figure out WHY data was included. Like the children doing nonsensical computation, it's just that the AI hasn't learned we throw in irrelevant stuff on tests as part of the test (which is something we've all learned to watch for).
As another example: I'm learning how to use Linux, and when stuff inevitably doesn't work I have a hard time fixing it. This is because when I read through the logs I don't really know what information is pertinent or not. Some logs are very explicit and I can easily lookup problems or just experiment and figure it out. My point being that you could say I'm just as incapable of reasoning because purposefully adding irrelevant information in the logs dramatically decreases my problem-solving abilities (because they HAD to include it for a reason RIGHT?!)
I'm a junior double majoring in Math & CS because I want to be an AI researcher in the future. This is a topic I've been reflecting on.
When I was a freshman, I used ChatGPT for some coding HWs, but I haven't used it for math HWs since because it frequently gave me incorrect solutions.
Every time a new model is released, I test it with my HWs; still disappointing.
Usually, I use it to look up concepts or interpretations in Math & CS, but mostly for elective courses.
This is a limitation of LLMs that can't be helped, and only when AI acquires strong math reasoning skills at a Ph.D. level will we enter the era of AGI.
We need a new learning system beyond Deep Learning. What could that be?🤔 It’s intriguing!
Damn!! What are those homeworks? Can you list some?
@MrSur512 In fact, most of my Math and CS courses.
Yesterday, after I finished a Digital Design HW in CS, I asked ChatGPT and it gave me wrong solutions. 😆 Same for Fundamentals of Computing Theory, Computer Architecture, etc.
When it comes to math courses, it's worse in my experience: Probability & Statistics, Mathematical Models, Real Analysis, Complex Analysis, etc.
1:03 What is most surprising to me is that the o1 models still get those simplest problems wrong. We recently had our first general relativity homework, first time extensively using index notation, combined with Noether's theorem and a ton of identities to simplify things. It was quite hard and I got stuck a few times. And o1 always knew how to simplify all those crazy expressions, but then it can't do the simplest things when they're slightly changed.
Sometimes seemingly super advanced stuff is well covered in the training data and it appears really sophisticated at that
Speaking of which, 5:49 which one? The one you said or the one you displayed on screen?
Because o1-mini and 4o-mini are VASTLY different models
You can check the original paper for all of them, but in that case the screen was right
They could have also tested removing clauses needed to solve the problem, and see if the AI would identify that the problem cannot be solved the way it was stated.
Ya that’s a good idea actually, does it try to solve unsolvable problems
I do not fear the predictive text llm.
I fear the mass ontology based reasoning machine.
Relevance fallacies are also difficult because many are so-called "informal" fallacies, which means they can't be expressed in binary logic. LLMs are not really "doing logic" when they seem to understand them; but they do have experience with them, because such fallacies are used all the time by politicians and other professional liars. A more rigorous understanding of logic is something we all could use.
Drawing a comparison to humans without knowing how humans compare on the task seems pretty silly. (And any claim towards reasoning is doing that comparison, whatever happens internally in our brain we call the result reasoning)
The +2 clauses part is a bit misleading in my opinion (in the original paper). As iirc Yannic Kilcher pointed out in his video the second extra clause seems to be different from the other clauses. So while the difficulty might still be similar, it has to do 2 different things and not just the same thing twice/thrice.
That said, I really like the idea of their benchmark. If you have things in your text that don't really matter to the nature of the task (like names or specific values), it makes a lot of sense to measure not just how well a model answers a specific set of questions, but also how consistent it is over different variations. I don't remember if their sample size would've been big enough, but I'd have liked to see whether specific questions are consistently right/wrong with only a few contributing to the variance, or whether the models get a similar % of the variations right regardless of the underlying question.
It is funny how Apple is only now discovering LLMs and writing papers about things that others have discussed and published over the last few years. I guess better late than never.
The thing Dr. Bazett's summary ignores is that this deficiency makes LLMs both unreliable and unsafe. An LLM can help you find relevant information, but YOU must evaluate the quality of that information and apply it. The LLM can't! If you ask it to, it will eventually lead you astray.
There are various quadratics that LLMs seem physically incapable of solving; I discovered a couple a while back by accident.
No matter how much I corrected them, they got it wrong time after time.
So basically current LLMs are being trained to take tests that are standardized, *just like human students today,* especially those needing to take standardized tests like the SAT or ACT. Therefore these LLMs really are not so different from students.
There is surely an irony there
I think this is more due to a limited working memory than flawed reasoning - I'd struggle to solve the example given in my head, to say nothing about immediately spitting out the answer as LLMs are expected to.
Ya adding extra clauses is the type of things I'd expect humans to be error-prone on, particularly if having to work things out entirely in our heads given low working memory. But with LLMs at some level you'd expect very high "memory" and if they can do a certain mathematical reasoning level once the computer should have little trouble applying that over and over and over again. But not so, it seems.
ChatGPT can solve some of my econ PhD math problems (requiring close to honours undergrad real analysis skills), but is unable to solve a Lagrangian that I could do in second-year undergrad.
I think sometimes these things are a question of how good is the training data. A high level topic might be well covered in the internet so it can perform really well at it.
But is it irrelevant info, the bit about the five smaller kiwis? I mean, it is not saying "and my dog is brown"; it is info related to the question. It can be inferred that there is a reason for the info to be there. It can be inferred that it means "do not take those five into account". What I always find missing with AIs is that they don't interact. If I were asked that question, my reaction would be to ask back: "do you mean...?"
If GPTs can genuinely produce mathematical reasoning, would they start replacing professors?
I don’t want to think about that:D
I think it is inevitable.
Couple an AI to a proof assistant...
Teachers do a very important job. Teaching the next generation.
We would not trust it to a robot until it is smarter than a human
@@ccash3290And definitely not if it *is* smarter than a human.
The famous writer Flaubert devised such a nice and senseless math problem in 1841. Today, in French, "L'âge du capitaine" refers to one of these absurd questions to which even attempting to provide an answer would be pointless.
Here is the Wikipedia link in English: en.wikipedia.org/wiki/Age_of_the_captain
Once upon a time, when Gemini first came out, it accurately answered a factual question I always use to test any AI:
What is the equivalent date of JDN 171868? The answer, according to the US Naval Observatory, is Friday, 20 July 4243 BC, 12:00:00.00 (UT1).
To date, Copilot, Gemini, and Perplexity ALWAYS get this fact wrong or dismiss the question.
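The conversion itself is mechanical. Below is a minimal sketch using a standard Julian-Day-Number-to-proleptic-Julian-calendar formula (with the day of the week from JDN mod 7, taking JDN 0 as a Monday); it reproduces Friday, 20 July 4243 BC for JDN 171868, though for anything authoritative you would still check the USNO converter.

```python
def jdn_to_julian_calendar(jdn: int):
    # Standard integer-arithmetic conversion from a Julian Day Number to a date
    # in the proleptic Julian calendar (appropriate for a date this early).
    c = jdn + 32082
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = d - 4800 + m // 10          # astronomical year numbering (0 = 1 BC)
    return year, month, day

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

year, month, day = jdn_to_julian_calendar(171868)
print(year, month, day)                # -4242 7 20  ->  20 July 4243 BC
print(WEEKDAYS[171868 % 7])            # Friday (JDN 0 fell on a Monday)
```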
I kept saying this to my colleague... why divert an AI that excels at language into a math AI? We could just use the LLM's ability to call Python to solve the math problems it is given... then let users generate enough of that data to fine-tune future models on. Wouldn't that somehow work too, rather than wasting compute fine-tuning the model for something it isn't meant for?
Yeah, Claude does this a bunch; whenever I give it a tricky problem it just pythonises it. I think Perplexity also uses Wolfram for math. Trying to get one model to do everything is pointless, but hey, that's where the money is :)
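As a toy illustration of what "pythonising" a word problem looks like once the quantities have been extracted, the arithmetic can be handed to ordinary code. The numbers below are adapted from the oft-quoted kiwi example and are only meant as a sketch of the offloading idea, not of any particular model's tool use.

```python
# Toy sketch of offloading the arithmetic once the quantities have been pulled
# out of the word problem. Numbers adapted from the kiwi example, for illustration.
friday = 44
saturday = 58
sunday = 2 * friday        # "double the number he picked on Friday"
smaller_than_average = 5   # irrelevant clause: smaller kiwis still count, so it stays unused
total = friday + saturday + sunday
print(total)  # 190
```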
Language model doesn't mean "language" model, only useful for languages. It means next token prediction, and this task is called language modeling. This can be used for almost all sequential tasks
@@MrSur512 Yes, that is right. And the biggest mystery I've never been able to understand is how this kind of model (the transformer) somehow has emergent abilities. Based on that observation, most people think it is possible to fine-tune this model into one that excels at math... I don't know how big the model would have to get before it understands math as an emergent ability, the way it does with language.
What I know for sure is that when it's trained on a huge math dataset, yes, it is able to recall it. That's why I said: just let users ask what they need and solve it using Python or even the Wolfram API; once the problems users actually ask about get solved, we'll have a practical math dataset covering what most people commonly use.
(Thanks for reminding me the transformer is sequential... maybe the mathematical way of thinking isn't sequential? So maybe we need a parallel transformer with parallel attention??)
But wait, the transformer isn't sequential... it's attention heads... oh no, I'm lost.
I saw a statistician post that when you actually do stats on the performance (which Apple didn’t do apparently?), there’s no statistical difference between the two tasks.
Wow! Hearing this, I now think that mathematically LLMs are just lame! lol! Thanks for the info.
What if we paraphrase these problems?
Half the time AIs are just gaslighting you, but other times they can return very nice solutions. But I only use AI for problems where I can verify the solution.
And honestly sometimes it feels like a waste of time trying to use AI.
But they have become better at pointing to sources. I have been pointed to good statistic home pages that would have been hard to find using google or bing.
I wish that they would get much better at summarizing though. Try reading a Wikipedia page and let an AI summarize it... It does not go well every time.
And if you cannot trust the results, would you even use it in the first place?
Ya a bunch of applications it’s impressively good and then others it is just terrible
Matter of time. AI has been around in this form and availability for less than half a decade. You could not expect the first steam engine to have the efficiency that current engines have. Wait half a century and see how well they do...
I've seen this with programming. ChatGPT will correct a bug that I wrote in an instant. But whenever there is an issue between my system and a third party package I need, usually nothing it suggests works.
@@Rollerjb So true....
Sometimes it will provide a super simple solution on one line that is easy to read, easy to understand, and fast, and I will be doubting that I was ever able to program anything. Other times it will produce several lines of code and come up with a big story about how that code works, and when you copy the code it will either not run at all, or it will not do what ChatGPT said at all. Then you can explain what happens to ChatGPT, and after a few messages back and forth it will proudly explain that the code would never work and that you should do something else. And then you are back to the 50/50 chance that it will work this time. 🤦🏼♂
Makes sense. I've been of the opinion these new AIs are just better forms of search algorithms, and nothing more.
You have all the questions and all the answers in a box. 😅
Trefor, what might be good is if you made a video that sounded plausible but was based on non-existent research. Then at the end of the video, as always, you'd invite viewers to check the original paper (which would point at an irrelevant or bogus paper). Then count how many commenters discuss what you've said rather than saying: "wait, the paper doesn't talk about this at all!"
lol you’re a bit evil:D
It's enough to make the title not match the contents, many people will comment based only on the title, without watching the video.
@@DrTrefor 222 - 1/3 evil
If it's not getting 100% on high school maths problems that tells you all you need to know about ChatGPTs reasoning abilities.
The average human learner does worse. What does that tell you about the average human's reasoning abilities?
@@dlevi67 I'm not sure what your point is? Plenty of people can score 100% on high school maths. We should expect more, aim higher. When I'm using a CAD package to design something, I don't take a poll of average people with poor maths ability to figure out the angles.
@@luke.perkin.online I'm not sure what _your_ point is. If I need a tool to improve things, I want that tool to work better than an equivalent other. So far, the only tool we have had to address high school maths problems is a human population that has gone to high school (which is still a minority of humans on Earth).
I'm glad (and at the same time worried) that LLMs that haven't been specifically developed to address maths and logic problems do better than the average human on simple problems.
Are we at the point where we can use 'general purpose' models to solve advanced maths questions? No, not at all.
@@dlevi67 I really wish we could have a conversation with nuance and emotional intelligence, but do you actually want more than point scoring? I use LLMs, you use LLMs. We're on the same team wanting them to be better. I think perhaps the only disagreement is on expectations, the threshold for what 'useful' reasoning ability is.
@@luke.perkin.online "perhaps the only disagreement is on expectations, the threshold for what 'useful' reasoning ability is."
Possibly not even that. In this conversation, it seems to me you were looking at a half empty glass. I was looking at a half full one. That's all!