AI Agents: Why They're Not as Intelligent as You Think

  • Published 25 Jul 2024
  • I will be pushing AI agents to their absolute limits by testing the most powerful models available today against a computer chess model. The test reveals how effective LLM-powered AI agents are at planning and highlights some limitations that you must be aware of if you are building with AI agents.
    Need to develop some AI? Let's chat: www.brainqub3.com/book-online
    Register your interest in the AI Engineering Take-off course: www.data-centric-solutions.co...
    Hands-on project (build a basic RAG app): www.educative.io/projects/bui...
    Stay updated on AI, Data Science, and Large Language Models by following me on Medium: / johnadeojo
    GitHub repo: github.com/john-adeojo/chess_llm
    Mixture of agents paper: arxiv.org/pdf/2406.04692
    Chapters
    Introduction: 00:00
    Python Script Run Through: 01:07
    Single LLM Agent vs Chess Computer: 09:00
    Multi-LLM Agent vs Chess Computer: 25:04
    Mixture-of-Agents vs Chess Computer: 39:09
    The limitations of LLM Agents: 53:17
  • Science & Technology

Comments • 40

  • @WifeWantsAWizard · months ago · +12

    These videos are like attending class at Oxford. I love these things. Thank you.

    • @Data-Centric · months ago · +1

      Wow, thank you!

  • @RiversideInsight · 21 days ago · +2

    Just the fact that it can play chess at all is so much more impressive than the fact that it didn't win against a level-5 trained computer algorithm. To me, it shows these agents are perfectly capable of automating relatively simple tasks.

  • @jamesblack2719 · months ago · +2

    Recently I was thinking about chess, agents, and strategy games in general, and I realized that if I want to use an agent for chess, it should call a deep learning model trained on chess and just handle the response, so the LLM is only used for the user-facing input and output.
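
    A rough sketch of that split, assuming the python-chess library and a local Stockfish binary; the STOCKFISH_PATH constant and the ask_llm() stub are placeholder names for illustration, not anything from the video's repo:

```python
import chess
import chess.engine

STOCKFISH_PATH = "/usr/bin/stockfish"  # assumption: adjust to your local engine install

def ask_llm(prompt: str) -> str:
    # Stand-in for a real chat-completion call; swap in your LLM provider here.
    return f"[LLM reply based on]: {prompt}"

def engine_move(fen: str) -> str:
    """Delegate move selection to the trained chess engine, not the LLM."""
    board = chess.Board(fen)
    with chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH) as engine:
        result = engine.play(board, chess.engine.Limit(time=0.5))
    return result.move.uci()

def agent_turn(fen: str, user_message: str) -> str:
    move = engine_move(fen)  # the chess model does the strategy
    prompt = (f"Position (FEN): {fen}\nEngine move: {move}\n"
              f"User said: {user_message}\nExplain the move in plain English.")
    return ask_llm(prompt)   # the LLM only handles language in and out
```

    The trade-off, noted in a reply further down, is that the engine then does essentially all of the chess and the LLM adds little beyond language handling.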

  • @john_blues · 20 days ago

    Thanks for the information at the end about good and bad use cases. It helps cut through the hype.

  • @therobotocracy · months ago

    Great idea as a test!

  • @twobob · months ago · +6

    You are always clear, honest, and forthright. Lovable :)

    • @user-io4sr7vg1v · months ago

      Not with clickbait titles like this.

    • @twobob · months ago

      @user-io4sr7vg1v To be fair, it would take a very good video to truly answer that question, given the disparate audience. Perhaps "LLMs fight it out in a 64-square smackdown arena" would be more accurate ;)

    • @Data-Centric · months ago · +1

      Thanks for the support.

  • @nlarchive · months ago

    Good work! I love how you explain the code and provide the GitHub repo where it can be found.

  • @lawrencium_Lr103 · 24 days ago

    Curious to see the performance if the LLM had vision, and also a scratchpad and memory.

  • @Techtantra-ai · months ago

    Can you give me a review of the Codestral LLM on Ollama? I use AI to write code for building web applications. My RAM is a little low (32 GB) to run Codestral as smoothly as other models like Llama 3. How much potential does Codestral have, and can it at least beat GPT-3.5?

  • @Marik0 · months ago

    Hi! Thanks for the video and the code. Is there any reason you decided to separate the white and black moves in the prompt instead of using the "standard" format, e.g. 1. e4 e5 2. Nf3 Nf6, etc.? Since that format is more common in books and websites, it might be easier for the models to parse. Just speculation; I may try this later if I find some time.

    • @Data-Centric · months ago · +2

      Thanks for watching. No particular reason; I doubt there would be much of an uplift in performance from changing the representation of the board/moves. But let me know if you try it and do get an uplift.
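
      For anyone who wants to try that re-encoding, a minimal sketch of turning a flat list of SAN moves into the standard numbered format (plain Python; the function name is illustrative, not from the repo):

```python
def to_numbered_moves(san_moves: list[str]) -> str:
    """["e4", "e5", "Nf3", "Nc6"] -> "1. e4 e5 2. Nf3 Nc6"."""
    numbered = []
    for i in range(0, len(san_moves), 2):
        pair = " ".join(san_moves[i:i + 2])  # white move, plus black's reply if present
        numbered.append(f"{i // 2 + 1}. {pair}")
    return " ".join(numbered)

print(to_numbered_moves(["e4", "e5", "Nf3", "Nc6"]))  # 1. e4 e5 2. Nf3 Nc6
```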

  • @user-du6zo7zp2k · 25 days ago

    Or any research that is unusual; this can even include historical research where papers on very specific subjects are very limited and difficult to find. Also anything that basically falls into edge or outside cases. The same goes for code: when you are coding anything novel, the usefulness of LLM-based tools drops dramatically.

  • @ManjaroBlack · months ago · +1

    Hey sorry I’ve been absent lately. I’m traveling. Thanks for looking at my pull requests and being active with your community!

    • @Data-Centric · months ago

      Thanks for the support!

  • @dwitten392 · months ago

    Cool video, especially as someone who really enjoys chess. Obviously chess is not an LLM's strong suit, but I was surprised by just how poorly multiple agents did.

  • @anonymousaustralianhistory2081 · months ago

    It would be interesting to know how much of a boost the MoA LLM got to its Elo versus its Elo as a single LLM.

    • @Data-Centric · months ago

      I didn't measure it, but if I had to guess I would say it was negligible.

    • @anonymousaustralianhistory2081 · months ago

      @Data-Centric Fair enough. I think I understand your argument in this video. However, are most agent tasks like chess, or are they like Minecraft? Remember how they got GPT-4 to learn to play Minecraft by having it make its own tools and commands it could recall? That seemed to work. Maybe agents are more like that, since GPT-4 seemed able to manage Minecraft, or perhaps somewhere in between Minecraft and chess.

  • @frederic7511 · 29 days ago · +1

    If you think about it, I know very few chess players who could play a game without seeing the board after about 8-10 moves. Would you? I wouldn't at all, but I would never make the mistakes you demonstrated if I could see the board.

    • @nyx211 · 14 days ago

      Yeah, I probably wouldn't be able to remember the board state after a few moves of blindfolded chess (unless the previous moves were all book moves). I wonder how the bot would fare if there were a second agent that summarized the board state and included that in the context.
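
      A board-state summarizer along those lines could be as simple as expanding the FEN into a plain-English piece listing before each prompt. A minimal sketch using the python-chess library (the function name is illustrative, and whether the extra context actually helps is untested here):

```python
import chess

def summarize_board(fen: str) -> str:
    """Expand a FEN string into a plain-text piece listing for the LLM's context."""
    board = chess.Board(fen)
    white, black = [], []
    for square, piece in board.piece_map().items():
        entry = f"{piece.symbol().upper()} on {chess.square_name(square)}"
        (white if piece.color == chess.WHITE else black).append(entry)
    side = "White" if board.turn == chess.WHITE else "Black"
    return (f"{side} to move.\n"
            f"White pieces: {', '.join(sorted(white))}\n"
            f"Black pieces: {', '.join(sorted(black))}")

print(summarize_board(chess.STARTING_FEN))
```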

  • @gileneusz · months ago

    Is it possible to have a short Zoom call with you about this topic?

    • @Data-Centric · months ago

      I offer consultancy/development services. You can book them through the consulting link in the description of this video.

  • @karthage3637 · months ago

    Won't the LLM's explanation just be pure hallucination to justify whatever move was played? Shouldn't it re-analyse the board and its plan to make it useful?

    • @Data-Centric · months ago

      I don't think it is capable of this. I tried this with my approach, but I appreciate that my prompting is likely suboptimal.

    • @karthage3637 · months ago

      The end of the video convinced me that it would not work, because we would just be emulating a pseudo-search that could never compare with something like Monte Carlo tree search. But my question was mostly about thinking through what could or could not trigger hallucinations.

  • @CharlesZerner · months ago

    I love your content, and this video is no exception. That said, I think you are drawing overly broad conclusions about an LLM's ability to reason in the face of new circumstances/material (versus merely parroting back aspects of its training data) based on the very specific type of "reasoning" required for chess. There are lots of types of reasoning that LLMs are terrible at. Chess requires a very specific kind of thinking/planning that an autoregressive model is simply not well equipped to do: it must not only identify what seem to be the most promising next moves based on the current state and on what the model already knows (its training data, which informs its 'intuition'), but it must then explore all the possibilities from that hypothetical state, and then repeat the same exercise with another potential state. This is a highly systematic type of exploration that algorithms like MCTS are designed to perform and autoregressive GPTs are not. With an infinite context window and infinite max_tokens, the model could perhaps talk through the possibilities, but that's not how people do it, and it would be hopelessly inefficient. People visualize the configurations to think through the implications; they don't verbalize it. More fundamentally, adding chess-like methodical exploratory thinking (MCTS-like systematic exploration) would address a big deficit that LLMs have. But this is only one form of reasoning. I don't think we can generalize from this that LLMs don't reason.

    • @canerakca7915 · months ago

      In what areas do you think LLMs can shine at 'reasoning'? Your answer was on the spot, and I would appreciate it if you could elaborate more.

    • @Data-Centric · months ago · +1

      Thank you for your feedback. I found your thoughts engaging and I broadly agree with you. My aim with this video was to demonstrate how LLM capabilities break down when they are asked to reason. I believe that what LLMs currently do is not reasoning at all, though I admit I've used that word to describe agent behaviour (for convenience's sake).
      I chose chess specifically because I believe it's a good way to visualise this concept. The chess boards displayed alongside the agent's "reasoning" trace demonstrate this quite well.
      The game complexity of chess is so vast that we know many chess scenarios simply don't exist in the training data. If LLMs truly "understood" the chess scenarios they had been trained on, that understanding could be transferred to new board states. LLMs attempt this by predicting the next token based on what they've already encountered, and, as you quite rightly pointed out, this next-token prediction isn't sufficient to play chess competently.
      I find your point about infinite context interesting, but I still believe the model wouldn't "know" the best move to make even if it could walk through all chess scenarios from a given board state. Generating a set of possible moves is obviously within an LLM's capabilities, but knowing which is the best of that set would require an understanding of how each move brings you closer to the goal of checkmate. This isn't something that autoregressive next-token prediction is well suited for. Then again, if all possible outcomes were in the training data, it could predict the best move, but that still isn't reasoning... or is it?

  • @nedkelly3610 · months ago

    This is a good demonstration of how not to use agents. As there is a practically infinite number of possible chess continuations at any point, aren't we just asking the LLM for a random next move? Although LLMs can't really do random; they should just return the closest similar example from their training data.

    • @nedkelly3610 · months ago

      I'm looking forward to the arrival of a Dell RTX PC and testing your videos out locally.

    • @nedkelly3610 · months ago

      I think AI agents, like coders, should write a test for the solution before generating it. They could test a solution using, for example: a calculator; writing code and running it; a custom function tool (i.e. "is this a valid chess move?"); local RAG; web search from a quality source; simulation; Monte Carlo tree search (for chess, etc.); subdividing it and testing the parts; testing with a different LLM; or human verification.
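
      The "is this a valid chess move" function tool is straightforward with the python-chess library; a minimal sketch (it only verifies legality, not quality, which is the distinction raised in the reply below):

```python
import chess

def is_legal_move(fen: str, move_san: str) -> bool:
    """Function tool: check an LLM-proposed move against the actual rules of chess."""
    board = chess.Board(fen)
    try:
        board.parse_san(move_san)  # raises a ValueError subclass if illegal, ambiguous, or malformed
        return True
    except ValueError:
        return False

print(is_legal_move(chess.STARTING_FEN, "Nf3"))  # True
print(is_legal_move(chess.STARTING_FEN, "Qh5"))  # False: the queen is blocked on move one
```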

    • @Data-Centric · months ago

      Interesting solution regarding your chess approach; however, one might say there's no use for the LLM there at all, because the algorithm is doing 99% of the chess. I assume that by "valid" chess move you mean "good" (correct me if I'm wrong). I think in that case the LLM still wouldn't know what a valid chess move is.

    • @frederic7511 · 29 days ago

      Your video is truly shocking. I never would have imagined a major LLM could so quickly make such trivial and direct reasoning errors, worthy of a near-beginner.
      I actually think you just provided a clear demonstration that there is almost not an ounce of general reasoning in an LLM. We think there is because the language is logical and our prompts are recurring, but this is wrong.
      In fact, it doesn't seem capable of isolating key pieces in a position and analysing the impact of their movement. As soon as the game develops a little, it no longer understands anything.
      No chess player analyses the potential movement of every piece on the board. We know in a few seconds how to identify the main threats or opportunities, and we work out the few resulting options.
      Maybe training the model on good and bad moves starting from random positions would help it isolate the key pieces in a position, but I'm not even sure about that.

  • @yoyartube · months ago

    I think LLMs mainly know the semantic relationships between words and sentences, embeddings, etc. Chess isn't really that.

  • @TheBestgoku · months ago

    It's like using a wrench to write a book. It makes no sense. Now try getting Stockfish to produce a financial report from data you provide, then compare that with LLMs.

    • @Data-Centric · months ago

      The aim of the video was to show where the "reasoning" capabilities of LLMs break down.