The Simple $1,000,000 Problem AI Can't Solve

  • Published Sep 10, 2024

Comments • 116

  • @BryanLanders
    @BryanLanders 2 months ago +7

    This is 🔥! I’m on the ARC Prize team and this was a great rundown of everything. Thrilling to see my design work in the video, too. 😊 Hope this inspires people to jump in and participate. Thanks!

    • @renedworschak8670
      @renedworschak8670 2 months ago +1

      I think this type of benchmark will become more and more important. Neural networks and LLMs are trained with "infinite" test sets. The energy required to train the models will become ever greater; this benchmark shows how the amount of training could be reduced, and how inflexible the models are. I think it will be particularly crucial for small LLMs on the edge (IoT, smartphones).

    • @VoloBuilds
      @VoloBuilds 2 months ago +1

      🤩 awesome to hear from you Bryan! You all have done an amazing job with this prize! Hey, if you're up for it - would love it if you could share the video on X - could be a good way to introduce more people to the prize and keep engagement going!

  • @npc-aix-84
    @npc-aix-84 2 months ago +12

    I'm not 100% sure future LLMs won't solve this. I bet that from now on these types of puzzles will be aggressively put into the training data in enormous amounts.

    • @VoloBuilds
      @VoloBuilds 2 months ago +9

      Yeah that's the worry - Francois expressed this concern on Dwarkesh as well - that someone might create a successful but unsatisfactory solution of just synthesizing a ton of arc-like training data and solving it effectively through memorization. I hope some new ideas come about and take a different approach

    • @jonmichaelgalindo
      @jonmichaelgalindo 2 months ago

      You're 100% wrong. ARC is only a challenge because you're not allowed to use LLMs. You have to use a small, local program.
      GPT-4o is already at 50% accuracy without any training data. Just explain the concept in a prompt, convert the images to text, and it can not only solve them, it can invent computer programs to solve them. And it's the worst it's ever going to be. Claude 3.5 just surpassed GPT-4o's performance this week. Two years ago, it had 0% accuracy. Where will it be next year? Two years from now? Three?
      Last week, a user on Twitter / X converted these challenges into text-based-squares, built a prompt (less than 32k tokens), and had GPT-4o write python programs to solve them, then submitted those programs to the ARC challenge. GPT-4o's work scored over 50% accuracy, which is higher than MindsAI's 39% currently topping the ARCPrize leaderboard.

    • @tommiest3769
      @tommiest3769 2 months ago +3

      @@jonmichaelgalindo Still, the fact that I can sit with my coffee never having seen these puzzles before, and leisurely solve all the ones I have tried so far with relative ease, and yet it takes AI an enormous amount of energy and "compute" just to hit 50% accuracy shows that AGI is still elusive. That said, my mind is the product of 4 billion years of evolution whereas these Chatbots are just getting started. I expect that AGI will be reached within 10-20 years even though we aren't exactly sure what it will take to get there. After all, who predicted 5 years ago that AI would be where it is today in terms of being able to pass medical exams etc...

    • @jonmichaelgalindo
      @jonmichaelgalindo 2 months ago

      @@tommiest3769 I'm basically incompetent at these puzzles. Way lower than average. :-( And I'm not stupid. I play several instruments. I'm a lot better at coding than GPT-4. I've self published novels. I enjoy philosophy. I could go on. But these stupid squares never make sense. I get it right after someone tells me the trick and then it seems super obvious, but there's just something not quite right in my head.

    • @tommiest3769
      @tommiest3769 2 months ago

      Isn't the best way to test whether a system is an AGI to place it in a completely novel environment and ask it to figure out a puzzle for which it has no experience whatsoever? So in some ways, we might need embodiment before this can happen. An example might be an escape room, or placing it out in the middle of deep woods and seeing if it can figure out how to get from point A to point B (e.g. orienteering). Another test for AGI/ASI would be to set it loose on one of the Millennium Prize Problems in mathematics.

  • @3thinking
    @3thinking 2 months ago +5

    I would ask the GPT to use Python to solve the puzzles, and verify the answers are correct. In this way it would generate code, run the program, check the results and iterate the code until it is correct.
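The generate-check-iterate loop this comment proposes can be sketched in miniature. This is a hypothetical illustration, not a working ARC solver: `propose_fix` stands in for a real LLM call, and the candidate programs and toy "reverse each row" rule are invented for the example.

```python
def passes_all(solve, pairs):
    """Check a candidate solver against every demonstrated input/output pair."""
    return all(solve(inp) == out for inp, out in pairs)

pairs = [([[1, 2]], [[2, 1]])]  # toy rule: reverse each row

# Pretend these are successive LLM attempts; a real loop would call an API
# with the failure feedback and ask for a revised program.
attempts = iter([
    lambda g: g,                          # first try: identity (fails)
    lambda g: [row[::-1] for row in g],   # revised try: reverse rows
])

def propose_fix(feedback):  # hypothetical stand-in for the LLM call
    return next(attempts)

solve, feedback = None, "start"
for _ in range(5):  # bounded retries, as the comment suggests
    solve = propose_fix(feedback)
    if passes_all(solve, pairs):
        break
    feedback = "outputs did not match the examples"
```

Note the loop can only verify against the demonstrated example pairs; the hidden test output is never available to check against.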

    • @VoloBuilds
      @VoloBuilds 2 months ago +4

      This is an interesting approach though I wonder how well ChatGPT could describe what it wants implemented to the code interpreter. I like the Agentic approach though. You should play around with it and see if you can claim the prize!

    • @VoltLover00
      @VoltLover00 2 months ago +1

      Let me see you write the code to solve an arbitrary puzzle like that

    • @bladekiller2766
      @bladekiller2766 2 months ago

      You don't have the verification program for the private set, so it won't work.

    • @drhxa
      @drhxa 2 months ago

      Look up Ryan Greenblatt's solution, which he posted on his blog for ARC-AGI. He used exactly this method to get 50% on the public eval dataset.
      Specifically, what he did was have the LLM (GPT-4o) write ~8,000 candidate Python solutions per problem and then test their outputs against the examples of the given problem. The closest 2 or 3 are used to generate the final result.
      There are lots of interesting details included in his blog that I'm leaving out, such as test-time compute scaling curves, speculation on what it will take to get 85%, implementation details, etc.
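The sample-many-and-rank idea described in this comment can be illustrated with a toy version. This is a sketch, not Greenblatt's actual code: the "LLM proposals" are hard-coded strings, and the transpose rule and `score_program` helper are invented for the example.

```python
def score_program(program_src, train_pairs):
    """Exec a candidate program and count training pairs it solves exactly."""
    namespace = {}
    try:
        exec(program_src, namespace)
        solve = namespace["solve"]
        return sum(1 for inp, out in train_pairs if solve(inp) == out)
    except Exception:
        return -1  # broken candidates rank last

train_pairs = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]  # toy rule: transpose

# In the real approach, thousands of these would be sampled from GPT-4o.
candidates = [
    "def solve(g):\n    return g",                           # identity (wrong)
    "def solve(g):\n    return [list(r) for r in zip(*g)]",  # transpose (right)
]

# Keep the candidate that matches the demonstrated examples best.
best = max(candidates, key=lambda src: score_program(src, train_pairs))
```

The key point is that ranking only uses the public example pairs of each task, so no access to the private verification set is needed.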

    • @Billy4321able
      @Billy4321able 2 months ago +1

      Someone already tried that. After a lot of tweaking it achieved the highest score on the public test set. Nowhere near the level of the prize, and very hacky, but still the best anyone has come up with so far.

  • @phatster88
    @phatster88 21 days ago +1

    The ARC benchmark shows AI reaching about 0.25 of average human performance, and it hasn't budged for years.

    • @VoloBuilds
      @VoloBuilds 21 days ago

      Yeah it has definitely been one of the most difficult benchmarks for AI. Since they announced the prize it has gone from .34 to .46 but the best approach is still quite complex and expensive to run and there's still a ton of ground to cover

  • @duytdl
    @duytdl 2 months ago +3

    -But didn't IQ tests already have such pattern matching questions that AIs have passed to average human level? Or am I misinformed?- nvm, watched the full video and understood what I was missing. Fascinating insight!

    • @VoloBuilds
      @VoloBuilds 2 months ago

      Thanks for watching! I hope we will see a non-memorization-based solution for ARC!

  • @spencerfunk6697
    @spencerfunk6697 2 months ago +5

    What if they're color blind?

    • @VoloBuilds
      @VoloBuilds 2 months ago +2

      Haha true, but you can still represent the data symbolically like how we see it in the GPT output (colors replaced by numbers)

  • @linusandersen5608
    @linusandersen5608 2 months ago +3

    Very good vid, espec. considering that this seems to be a smaller channel, judging by the viewcount. I liked this

    • @linusandersen5608
      @linusandersen5608 2 months ago

      Ah okay one small question, because you seem very optimistic about AGI (not in the technical but in the moral sense) - I personally think that AGI will probably be catastrophic to the lives of 99.9% of people, do you agree? If not, how do you think such a concentration of power onto a small "elite" will pan out good for humanity? Especially considering how they will probably be pre-selected to ignore security concerns regarding AI safety

    • @VoloBuilds
      @VoloBuilds 2 months ago +2

      Thanks for watching and for your kind words :)
      I have a lot of thoughts re:AGI - too much to write in a comment - but here are some of my guiding thoughts:
      - AI is a technology and accelerator of value creation
      - We have seen an overwhelming historical trend of technology improving people's lives
      - Technology overwhelmingly creates jobs instead of destroying them. It's just that the jobs shift and new roles appear or old roles evolve significantly.
      - There are clear ways in which the tech adds value and I think this will expand as it matures. I'm most interested in AI being used to cure diseases, work risky/difficult manual jobs (robots), improve education, energy use, etc.
      - Some of the most advanced AI systems are literally free to use and open source is not far behind so I wouldn't call it concentrated.
      I have a video about Jevons Paradox which I think you might find interesting, where I talk about AI's impact on software jobs. Got lots of other thoughts I'll likely share in future videos!

  • @ckq
    @ckq 2 months ago +1

    I keep posting this on the Dwarkesh videos about this, I'll post it here too.
    LLMs are trained on language, of course they'll master that but not visual tasks. They suck at Sudoku (which has an easy solution in code).
    If you want to solve ARC, you'll need a convolutional neural network. I simply think the vision models are much "dumber" than the text models, since there's way more knowledge in text form.
    The training data for images doesn't necessarily correspond to intelligence but rather a basic understanding of light and physics which 5 year olds (and plenty of animals) probably have.

    • @VoloBuilds
      @VoloBuilds 2 months ago +1

      Would love to see a vision-based approach!

    • @drj92
      @drj92 2 months ago +7

      These tasks are not primarily visual -- they can easily be represented via json, or as flattened sequences. The LLMs have no problem memorizing arbitrary manipulations to those sequences, showing that they don't actually have any problem with the input data-type. You don't need CNNs for the network to figure out how to memorize the training set.
      What they can't do is come up with new, simple combinations of rules that they haven't seen before. The problem isn't that they can't see, it's that they can't think.

    • @bladekiller2766
      @bladekiller2766 2 months ago

      You can represent the grids as a 2D matrix of numbers that denote the colors; you don't need a CNN at all.
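The representation the comment describes is tiny: each ARC grid is a 2D list of small integers (0–9), one number per color. The grid values below are a made-up toy example, but the list-of-lists JSON shape matches how ARC task files store grids.

```python
import json

# A 3x3 toy grid; each integer stands for one of the ten ARC colors.
grid = [
    [0, 0, 3],
    [0, 3, 0],
    [3, 0, 0],
]

# The same grid serialized the way it appears in ARC's JSON task files.
serialized = json.dumps(grid)

# The "flattened sequence" view an LLM ultimately consumes as tokens.
flat = [cell for row in grid for cell in row]
```

This is why the input data type by itself isn't the obstacle: the whole puzzle fits in a few dozen tokens either way.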

  • @Gauchland
    @Gauchland 2 months ago +3

    Question. If this test is memorization resistant and tests general skills, would it be a good form of cognitive training for humans? If each question has high novelty, perhaps this is also a good way to train generalities of the human brain? I've been training myself on them. Even though most are easy, they still take a "moment" of effort, like reading a complex sentence, to reach the aha moment.

    • @VoloBuilds
      @VoloBuilds 2 months ago

      That's a great point - I think it is a great critical thinking exercise. I've found them very satisfying to solve, because as you said - while easy, they still take that moment of effort to think through.

    • @kev2582
      @kev2582 2 months ago +3

      ARC Prize tests don't require high-level cognitive abilities, so it probably won't help hone cognitive abilities for humans beyond, say, elementary/middle schoolers. There are some ARC puzzles that are challenging even for adults, but I think that's due to complexity in the visual information; the solutions are all very simple.

    • @Gauchland
      @Gauchland 2 months ago

      @@kev2582 I would hope that arc 2 has more challenging examples. A hard version of this would be the ultimate puzzle

    • @abhi
      @abhi 2 months ago

      We had exams on this in 9th 10th standard in India (late 90s). Subject was called mental ability and it was a big part of National/State Talent Search Examination

    • @liambailey5630
      @liambailey5630 2 months ago

      Work in psychology shows that "brain training" programmes do not improve generalised intelligence. They improve task-specific skills by developing cognitive routines that do not generalise well to new tasks. Thus winners of this test may be doing exactly that, rather than improving overall reasoning and planning, depending on how close the unrevealed test set is to the training data.
      His claim that intelligence is not down to "memory" is not true. Everything we know is from memory; we just have heuristics that are learned and easy to pull from due to the vast amount of experience we have. Without experience it would be impossible. For example, in the task that zooms into the rectangle, it is common for humans to think that a border is isolating/marking a spot, which is our abstract meaning in some regards (like circling an answer). I think it does require a large knowledge base with modules that build abstract schemata of situations, which act as quick heuristics more flexible than brute-force knowledge.

  • @hobrin4242
    @hobrin4242 2 months ago +1

    tbf to ChatGPT tho, maybe the method of inputting the data is the problem. We humans also couldn't read that JSON and see patterns like that; I think this would be a very hard challenge for us if we had to read a 1D JSON like GPT did. Also, spatial reasoning is kind of a ridiculous thing to ask of an LLM.

    • @VoltLover00
      @VoltLover00 2 months ago +2

      It points out that LLMs are not a path to AGI, as some lunatics think they are

    • @VoloBuilds
      @VoloBuilds 2 months ago +1

      What I love about this challenge is that the data structure is actually sooo simple. Computer vision isn't "looking" at anything like we are, it's analyzing huge arrays of numbers that represent the colors of each pixel. So you can think of this puzzle as a super simplified image.
      When we used GPT vision up till now, it used a vision model to understand the contents and then passed that to GPT-4. Now with GPT-4o it should be native, passing the image in as a compressed version (you can read more on their blogs), but I've only gotten very poor results from using it, unfortunately.
      Still interesting to see that a fine-tuned LLM is the current SOTA for ARC.

    • @hobrin4242
      @hobrin4242 2 months ago +1

      ​@@VoltLover00 speaking this number of languages and not to mention programming languages pretty fluently sounds pretty general to me. A shit ton of human problems get solved by thinking in a language as well.
      I think human eyes are pretty much like an API as well, considering how we only have a narrow focus point where we can actually see properly.

  • @Romahotmetytky
    @Romahotmetytky 2 months ago +2

    this is basically Raven's Progressive Matrices

    • @VoloBuilds
      @VoloBuilds 2 months ago

      Hadn't heard of those before! Thanks for sharing :)

    • @robosapiens-yd1hb
      @robosapiens-yd1hb 1 month ago

      @@VoloBuilds Look at Bongard problems. The first efforts I've seen of AI projects trying to solve pattern recognition problems were in 2006. Look up Harry Foundalis.

    • @robosapiens-yd1hb
      @robosapiens-yd1hb 1 month ago

      close

  • @whismerhillgaming
    @whismerhillgaming 2 months ago +2

    I wonder how GPT omni would fare at this task
    since GPT omni is capable of understanding all kinds of input directly and is much better at having a broader understanding of stuff

    • @VoloBuilds
      @VoloBuilds 2 months ago

      The model I used was GPT-4o but admittedly I only did text input, not visual. I believe others have tried visual based approaches and had similar results. Will be interesting to see if someone can create an effective solution on the public leaderboard using this approach!

  • @ideacharlie
    @ideacharlie 2 months ago

    You know it can see images right?

  • @ckq
    @ckq 2 months ago +3

    How to solve ARC in my opinion:
    Inputs: LLM (for thinking), Vision model + generator (fine tuned for grids as in ARC)
    Train on the example ARC puzzles. Convert the JSONs to images and create a tokenizer specialized for ARC (i.e. Tetris pieces could be a token). For each of the 400 (I think) public puzzles, give a detailed description of the solution in natural language. Fine-tune on this data.
    That method should reach 80% accuracy on unseen ARC (no one has done it yet probably because we have bigger problems)

    • @hobrin4242
      @hobrin4242 2 months ago

      go for the prize!

    • @stevenru4516
      @stevenru4516 2 months ago

      Which problems? Like half of nlp papers are about prompts or model evals

    • @VoltLover00
      @VoltLover00 2 months ago

      You have no reason to make such predictions

    • @bladekiller2766
      @bladekiller2766 2 months ago

      This has been tried, achieves less than 20% on the public set which is very bad

  • @tom_skip3523
    @tom_skip3523 2 months ago +1

    Have you tried feeding chatgpt with a screenshot of some examples? Maybe the logical thinking improves with vision

    • @VoloBuilds
      @VoloBuilds 2 months ago +1

      That's a great question - I have not but my experience with GPT-4o vision has been that it's not very accurate. Even if I roll a few dice and take a picture and ask it to sum for me, it still misreads numbers and makes stuff up. So if we apply that same model to a detailed grid of squares, I'm afraid it won't do very well. Would be interesting to see how people fare on the public leaderboard with that approach though!

  • @x111-c4f
    @x111-c4f 1 month ago

    because we need AGI !!
    AI is just a tool that cannot think !!

    • @VoloBuilds
      @VoloBuilds 1 month ago

      What do you think are the necessary components we still need to create to achieve AGI?

  • @spencerfunk6697
    @spencerfunk6697 2 months ago +1

    I don't agree with the French homie. Have you ever learned to do something incorrectly? Learning is definitely a skill. Time to learn to learn.

    • @npc-aix-84
      @npc-aix-84 2 months ago

      Learning is a meta-skill. The skill to learn any skill.

    • @VoloBuilds
      @VoloBuilds 2 months ago +1

      Agreed that learning is a (meta) skill. But current AIs focus on learning specific knowledge and reproducing it. I think Chollet wants to see us try to build AI that learns to learn natively so to speak, rather than inferring from a massive knowledge base.

    • @bladekiller2766
      @bladekiller2766 2 months ago

      It's a skill but without having some inherent capacity to represent abstractions, it's useless.

  • @Matlockization
    @Matlockization 2 months ago

    I can't see why AI could not be trained to solve these kinds of pattern puzzles. Given that specialised AIs can be combined, and given enough time, there won't be anything it can't do. If you have cold sores on your lips, then use animal fat on your lips for a few years and take a lysine pill once a week or every so often.

    • @VoloBuilds
      @VoloBuilds 2 months ago +1

      I encourage you to try creating a model that performs well at these and submitting it for the arc prize!

    • @Matlockization
      @Matlockization 2 months ago

      @@VoloBuilds I sincerely thank you for your suggestion !

  • @nobo6687
    @nobo6687 2 months ago

    Try to solve it hearing saxophone music

    • @VoloBuilds
      @VoloBuilds 2 months ago

      Could you elaborate? :)

  • @robrita
    @robrita 2 months ago +1

    I think you can improve your prompt by adding some examples to prime the model into the intended solution.

    • @VoloBuilds
      @VoloBuilds 2 months ago +1

      I did include the 3 samples shown in that example arc puzzle - do you mean adding some additional synthetically created examples?

    • @robrita
      @robrita 2 months ago

      @@VoloBuilds Yeah, I saw that you put everything in one prompt. Have you tried using the API or playground?
      Using the API, set up your instructions in the system prompt, add the 3 samples as exchanges between user and assistant, and then your test input comes as the last user message.
      Get familiar with how the API works.
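The few-shot layout this comment suggests looks like the standard chat-completions messages format (e.g. OpenAI's API). The grids below are abbreviated toy examples, not real ARC tasks; the instruction wording is likewise an assumption.

```python
# Few-shot setup: instructions in the system prompt, demonstration pairs
# as user/assistant exchanges, and the test input as the last user turn.
messages = [
    {"role": "system",
     "content": "You solve ARC grid puzzles. Reply with the output grid only."},
    # Demonstration pair 1 (toy rule: move the colored cell to the other corner)
    {"role": "user", "content": "[[1, 0], [0, 0]]"},
    {"role": "assistant", "content": "[[0, 0], [0, 1]]"},
    # Demonstration pair 2
    {"role": "user", "content": "[[0, 2], [0, 0]]"},
    {"role": "assistant", "content": "[[0, 0], [2, 0]]"},
    # The test input comes last; the model's reply is the predicted grid.
    {"role": "user", "content": "[[3, 0], [0, 0]]"},
]
```

In practice this `messages` list would be sent to a chat endpoint; structuring the samples as separate turns rather than one long prompt is the change the comment is advocating.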

    • @MavVRX
      @MavVRX 2 months ago

      That will make no difference, as LLMs aren't intelligent. And a prompt with example interactions can be achieved with a single prompt.

    • @VoltLover00
      @VoltLover00 2 months ago

      @@robrita No way on Earth 3 examples will work

  • @DivinesLegacy
    @DivinesLegacy 2 months ago

    lol this test will get crushed.

    • @VoloBuilds
      @VoloBuilds 2 months ago

      Hopefully in a way that is original and not memory based :)

  • @ideacharlie
    @ideacharlie 2 months ago

    I can guarantee that this is mostly just the way you are giving it inputs. Just send an image so it's not translating across inputs of what's supposed to be visual.

    • @VoltLover00
      @VoltLover00 2 months ago

      I guess you don't understand how LLMs work? You can't input an image as a prompt to an LLM

    • @VoloBuilds
      @VoloBuilds 2 months ago

      Given your confidence, you should create a solution and claim the prize :) but I assure you, plenty of smart folks have tried all sorts of LLM prompting tricks and vision models for this benchmark and none have worked well at all so far. That's what I find so interesting about it!

    • @VoloBuilds
      @VoloBuilds 2 months ago

      Additionally, consider that vision models don't "see" things - they accept huge arrays of numbers representing the colors of each pixel. In that sense, this puzzle's data should be 100x easier to interpret.

    • @drj92
      @drj92 2 months ago +2

      It's not a problem with the inputs. You can stick these in as json and the LLM will happily memorize rules. It'll even figure out how to apply the rules it's memorized to slight variations of the problem.

  • @exwi-zed533
    @exwi-zed533 2 months ago

    _Yet_. Interesting video thanks for sharing.

    • @VoloBuilds
      @VoloBuilds 2 months ago

      Thanks for watching! :)

    • @exwi-zed533
      @exwi-zed533 2 months ago

      I finally got a solve by hinting that 'first to last' is left to right and that the grid size (in my case 6x6) can be seen as a b c d e f. Before it solved correctly, I had to show it back an incorrect grid and give it one hint for a cell it had wrong, which was causing the entire stagger to be incorrect. For some reason those two changes made it finally able to get the first puzzle's stagger and color inputs correct. It kind of skipped the second puzzle entirely, however, which shows it knows it can just type what it predicts we want as the output while skipping the actual math.

  • @mfpears
    @mfpears 2 months ago +1

    I don't get why it's so hard for people to understand that AI can do anything with recursive transformations. Do people have no introspection ability? Or do they think that AI is supposed to solve this in a completely different way from the way humans solve it? Maybe it's harder than it looks, but on the other hand when researchers like Francois Chollet are throwing away brilliant ideas from the grad students because they are too human-like, I'm suspicious about stuff like this.

    • @mfpears
      @mfpears 2 months ago

      What I'm referring to is the ability to solve long arithmetic problems. One of his grad students had the idea to do it recursively, the way humans learn, and he threw it away because it wasn't reliable or fast like calculators or whatever.

    • @mfpears
      @mfpears 2 months ago

      Just analyze what's going on. Zoom. Why did you know what it was? The rectangle looked important. You cut the example inputs up and started matching against the output examples. It's an extremely incremental, recursive process that you just can't see in a single pass-through. But if you give it a series of transformations that it can perform, and then let it recursively apply them until it figures it out, it should be able to use the examples the way humans do and find the transformations, and then it's a matter of knowing what transformations are possible. I think these examples draw on human-centric perceptions. Alignment. Zooming. All of these things relate to how humans see the world. Taking things apart. We have hands, and we have done it millions of times. That's the only reason we try it out as a potential transformation. It just comes to our minds. When we see blocks, we see things to pick up and move.

    • @mfpears
      @mfpears 2 months ago

      So if I were trying to solve this problem, I would set up a recursive neural network and train it to treat the pixels as objects to manipulate. The output should be a transformation, not a full set of pixels. Reality renders the result of our actions. The thing it has to understand, though, is that objects are rigid. Or that they can merge. Or whatever. But that expectation in humans is the result of actual real-world experience. These puzzles rely on implicit understandings of how physics works.

    • @mfpears
      @mfpears 2 months ago

      What this means is that if there is a certain law of physics accounted for in puzzles that aren't made available for training but are in the actual test itself, it will be impossible to pass without full human intuition about how the world works. So it's going to take basically a humanoid robot to be able to solve these things.

    • @mfpears
      @mfpears 2 months ago

      AI researchers should listen to Jordan Peterson or learn how to think on their own.

  • @StevenAkinyemi
    @StevenAkinyemi 2 months ago +2

    This can and will be solved by AI pretty quickly

    • @StevenAkinyemi
      @StevenAkinyemi 2 months ago +4

      And when we have that, will that be considered AGI, or will the goalposts be shifted again?

    • @stevenru4516
      @stevenru4516 2 months ago +6

      Arc has been around since late 2019.

    • @VoltLover00
      @VoltLover00 2 months ago +1

      No LLM will solve this

    • @epajarjestys9981
      @epajarjestys9981 2 months ago

      lol

    • @npc-aix-84
      @npc-aix-84 2 months ago +3

      @@StevenAkinyemi Solving this test is not enough for AGI. This is just an incentive to step out from the current LLM paradigm.