Speculations on Test-Time Scaling (o1)

  • Published Dec 21, 2024

Comments • 38

  • @test-sc2iy
    @test-sc2iy a month ago +37

    We're in a spot where a serious person can seriously say "it's SIMPLY the model talking to itself until it solves the problem", and we enthusiasts shrug and move along. What a time to be alive.

    • @mossglow
      @mossglow a month ago

      But there is so much more to problem-solving than recursive iteration, isn't there? Humans solve problems using hypermodalities. Bodily sensations, sounds, smells, the gut biome, and emotional states all impact how we think. Then there are the more or less understood “a-ha!” moments or trial-and-error lucky guesses where intuitive judgment makes the call. We also have subconscious processing during sleep tackling the most difficult problems we are stuck on, accompanied by cerebrospinal fluid flushing over our brain tissue. Then there are hungover days when creativity takes the lead for some (e.g., Hemingway). Good luck trying to introduce a central nervous system depressant like alcohol into an LLM and then get the best out of it, lol. I can only imagine how difficult it is to capture all these nuances in current or future LLM architectures. It almost seems like we need something else to augment LLMs with.

  • @420_gunna
    @420_gunna 6 hours ago

    Coming back a month later, this is still the goat
    (We've gotten more of a consensus since then about what's happening, but who's to say that some of these strategies aren't (or couldn't be) used at train time [though, in usual OAI fashion, it's probably just the simple thing scaled up -- RL with ORMs].)

  • @JustSayin24
    @JustSayin24 18 days ago +2

    I love the fact that not only does this research exist, but someone went through the effort to distil it in such an intelligible way. Thank you!

  • @소금-v8z
    @소금-v8z 3 days ago

    Hey, great video! I've been trying to wrap my head around o1 for a while, and this really helped me put things into perspective. I'm surprised I haven't seen more discussion about using special tokens for reasoning. It seems like trying to generate these really long, abstract sequences for reasoning can be difficult and hard to evaluate. I have a strong feeling that we could make LLMs more stable by using special tokens as anchors to keep them from going down the wrong path during reasoning.

  • @DanielBonaker
    @DanielBonaker a month ago +5

    Very interesting summary, thanks a lot. My intuition is that evaluation/testing is where we can grow; that's the low-hanging fruit.

  • @familiabartolome9725
    @familiabartolome9725 a month ago +3

    such a good overview - thank you for the insights, quite instructive and accessible

  • @jaewooklee5844
    @jaewooklee5844 14 days ago

    Thank you so much for your detailed information. 🙏

  • @sanesanyo
    @sanesanyo a month ago +4

    Thank you so much for such an informative video 🙏🙏.

  • @openroomxyz
    @openroomxyz a month ago +4

    Thanks for creating this video

  • @drhxa
    @drhxa a month ago +4

    Stream of Search + Let's Verify Step by Step has looked the most likely to me. It might be that they just put their heads down and worked really hard to solve the collapse problems and optimized generalizability.
    Regardless, amazing overview, thanks a bunch for sharing

  • @theK594
    @theK594 a month ago +1

    This is fantastic work❤!

  • @HansKonrad-ln1cg
    @HansKonrad-ln1cg a month ago +1

    For search, it is important to search over ideas: not letters or tokens or words or sentences or paragraphs, but ideas. So an LLM needs to be able to output a token that says it has finished laying out an idea, so that a new idea can begin at that point. If an LLM is constantly interrupted at the lower levels, it can never fully finish the idea. That would also help battle the combinatorial explosion that makes search at the lower levels intractable. It's like a human chess player who only considers a few moves vs. a brute-force algorithm that considers millions of moves that lead nowhere. (A toy sketch of this appears after this thread.)

    • @srush_nlp
      @srush_nlp a month ago +1

      Agreed. Lots of choices though in how to actually build that. Need steps that cause tangible progress.
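
To make the idea-level search proposal above concrete, here is a minimal, self-contained sketch of beam search that branches only at an end-of-idea delimiter token. `sample_idea`, `verifier_score`, and the `<end_of_idea>` token are stand-ins invented for illustration; a real system would call an LLM and a learned or automatic verifier.

```python
import random

END_OF_IDEA = "<end_of_idea>"  # hypothetical delimiter the model is trained to emit

def sample_idea(prefix: str) -> str:
    # Stand-in for sampling tokens from an LLM until it emits END_OF_IDEA.
    step = random.choice(["try algebra", "check a small case", "use symmetry"])
    return f" {step} {END_OF_IDEA}"

def verifier_score(candidate: str) -> float:
    # Stand-in for a process verifier scoring a partial solution.
    return random.random()

def idea_beam_search(problem: str, beam_width: int = 3, max_ideas: int = 4) -> str:
    beams = [(problem, 0.0)]  # (partial solution, cumulative verifier score)
    for _ in range(max_ideas):
        candidates = []
        for prefix, total in beams:
            for _ in range(beam_width):
                nxt = prefix + sample_idea(prefix)
                candidates.append((nxt, total + verifier_score(nxt)))
        # Prune only at idea boundaries, not at every token, which is what
        # tames the combinatorial explosion described in the comment above.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

print(idea_beam_search("Problem: ..."))
```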

  • @mindhoc
    @mindhoc 29 days ago

    🎉❤ Terrific video, thank you

  • @wiktorm9858
    @wiktorm9858 a month ago

    Cool lecture, thanks!

  • @SLAM2977
    @SLAM2977 a month ago +2

    The o1 test-time compute plot's x-axis is on a log scale, which means you need exponentially more compute for each linear improvement, so it will grind to a halt. (See the worked equation after this thread.)

    • @francisco444
      @francisco444 a month ago +2

      Hence the $7 trillion bet.

    • @diophantine1598
      @diophantine1598 a month ago

      They apparently only just started scaling this. For example, there’s no reason that this couldn’t be applied to writing other than the fact that it is difficult to craft a reward signal for it. Saying that they’ll quickly hit a wall now would be like saying the same when we were at GPT-2. Sure, it’ll eventually happen, but we’re a ways off from it happening.
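
To make the scaling argument in this thread concrete, here is a minimal worked version, assuming (as the plot's log-scale x-axis suggests, though OpenAI has not published the fit) that accuracy is log-linear in test-time compute $C$ with fitted constants $a$ and $b$:

```latex
\mathrm{acc}(C) = a + b \log_{10} C
\quad\Longrightarrow\quad
C(\mathrm{acc}) = 10^{(\mathrm{acc} - a)/b}
```

So each additional accuracy increment $\delta$ multiplies the required compute by a constant factor $10^{\delta/b}$: linear gains cost exponential compute, which is the "grinding to a halt" worry, while the replies argue we are still early on that curve.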

  • @DistortedV12
    @DistortedV12 a month ago

    I think the claim that it doesn't follow from expert examples is a stretch. They could have helped fine-tune the CoT mechanism by having people write out their thought processes while solving problems, especially for math and coding. Edit: I see it addressed at 20:30.

    • @srush_nlp
      @srush_nlp a month ago +2

      Yeah I agree that there are expert examples somewhere in the training procedure. Wanted to emphasize that these play less of a role than I would have assumed before diving into this area (if you believe the OAI comments).

    • @tankieslayer6927
      @tankieslayer6927 a month ago +3

      @@DistortedV12 I think that to achieve scale, the data has to be generated by the model itself via a step-by-step prompt, and the correctness of the solution has to be easily verified. For example, AIME problems have an integer solution between 0 and 999. One can then use process and advantage rewards on such a dataset.
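
A minimal sketch of the generate-and-verify loop described in the reply above, in the spirit of rejection-sampling / STaR-style data collection. `sample_solution` and the answer format are made-up stand-ins; the point is that an AIME-style integer answer in [0, 999] makes verification a cheap exact match.

```python
import random
import re

def sample_solution(problem: str) -> str:
    # Stand-in for an LLM sampling a step-by-step solution; a narrow toy
    # answer range is used here so that some samples actually verify.
    return f"Step 1: ... Step 2: ... Final answer: {random.randint(0, 9)}"

def extract_answer(solution: str) -> int | None:
    # AIME answers are integers in [0, 999], so a final-answer regex suffices.
    m = re.search(r"Final answer:\s*(\d{1,3})\b", solution)
    return int(m.group(1)) if m else None

def collect_verified_traces(problems: dict[str, int], samples_per_problem: int = 16):
    """Keep only traces whose final integer matches the known answer;
    the surviving traces become fine-tuning / RL data."""
    dataset = []
    for problem, answer in problems.items():
        for _ in range(samples_per_problem):
            trace = sample_solution(problem)
            if extract_answer(trace) == answer:
                dataset.append((problem, trace))
    return dataset

print(len(collect_verified_traces({"Toy AIME-style problem": 4})))
```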

  • @420_gunna
    @420_gunna a month ago +1

    Goat

  • @vaioslaschos
    @vaioslaschos a month ago

    That is awesome. It saved me lots of time. I am trying to use some of these techniques for the AIMO Kaggle contest. If anyone is interested drop me a message.

  • @wwkk4964
    @wwkk4964 a month ago

    Brilliant!

  • @DistortedV12
    @DistortedV12 a month ago

    Thinking LLMs from Meta, LLM-Berry, the ARC-AGI paper from MIT on test-time training. Can someone (ideally Noam Brown, or an LLM) comment on how these relate to what is discussed here?

    • @srush_nlp
      @srush_nlp a month ago +1

      * Thinking LLMs is quite related. It uses an LLM as the verifier (I was emphasizing automatic verifiers in this talk).
      * LLM-Berry is an effort to do an MCTS-style search on existing Llama models without learning.
      * The ARC-AGI paper that came out today seems really neat! They do SGD at test time, so it's pretty different from these methods, which only do CoT at test time. (A toy sketch of test-time SGD follows this thread.)

    • @DistortedV12
      @DistortedV12 a month ago

      @@srush_nlp Thank you so much for responding to my questions! Great talk; I liked how you pointed out the core problem so other researchers can focus their efforts.
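
For contrast with CoT-only methods, here is a toy, self-contained sketch of the test-time SGD idea attributed to the ARC-AGI paper in the reply above: run a few gradient steps on a task's demonstration pairs before predicting on its test input. The tiny linear model and synthetic data are stand-ins, not the paper's actual setup.

```python
import torch
from torch import nn

model = nn.Linear(4, 4)  # toy stand-in for the real model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Demonstration pairs for one task (toy data standing in for ARC grids).
demo_x = torch.randn(8, 4)
demo_y = demo_x * 2.0

# SGD at test time, adapting to this one task alone.
for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(model(demo_x), demo_y)
    loss.backward()
    opt.step()

test_x = torch.randn(1, 4)
print(model(test_x))  # prediction after task-specific adaptation
```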

  • @DistortedV12
    @DistortedV12 a month ago +1

    Did he mention that they use reasoning tokens?

    • @srush_nlp
      @srush_nlp a month ago +2

      Oh no I forgot to mention that! In my notation the reasoning token is how you know to move from z to y. It's kind of implied by the color changing from green to red.
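
A toy illustration of that z-to-y handoff: a special token marks where the hidden reasoning ends and the visible answer begins. The token name and format here are assumptions for illustration; o1's actual special tokens are not public.

```python
REASONING_END = "<end_reasoning>"  # hypothetical delimiter token

def split_output(generation: str) -> tuple[str, str]:
    """Split a raw generation into hidden reasoning z and visible answer y."""
    z, _, y = generation.partition(REASONING_END)
    return z.strip(), y.strip()

raw = "Let me check small cases... 3 works. <end_reasoning> The answer is 3."
z, y = split_output(raw)
print("z (hidden):", z)
print("y (shown): ", y)
```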

  • @novantha1
    @novantha1 a month ago +1

    I find this ridiculous and remarkably improbable. Did you see the missed space in the example CoT from o1? That matches Sam Altman's laid-back writing style; he's clearly writing all the CoT at test time by hand.

  • @NerdCrusader
    @NerdCrusader a month ago +2

    Has to be process reward

    • @srush_nlp
      @srush_nlp a month ago +1

      Yeah, it definitely seems like that is part of the equation. The question is whether that is everything.

  • @tankieslayer6927
    @tankieslayer6927 a month ago +6

    Test-time compute capability is still constrained by the data used for RL training, which is harder to curate. You can give a D student an infinite amount of time on an exam, and he is certainly not going to get an A.

    • @wwkk4964
      @wwkk4964 a month ago +2

      Depends entirely on the verifier and the test.

    • @haiderameer9473
      @haiderameer9473 a month ago

      But synthetic data can remove this constraint: just have increasingly capable models create more synthetic data to allow further reinforcement learning, and so on.

    • @mossglow
      @mossglow a month ago

      @@haiderameer9473 No, it doesn't, as it's still combinatorics at work; D -> A remains a challenge. No amount of recursive repetition in one domain, over even a seemingly infinite window of time, will make you an expert in another that you know little about.

    • @Asmodeus.q
      @Asmodeus.q 5 days ago

      @@mossglow
      I really was wondering about this too, but it seems to be working and producing better results.
      So I came to the idea that maybe GPT-4-and-later models are not the best distillations of all the knowledge they've been trained on, and that getting the intended results requires further optimizing this distillation toward outlier reasoning that is better than the average reasoning: basically, distilling toward expert-level human language proficiency. This should exist within the LLM's corpus of knowledge; it's just lost in an ocean of data.
      I certainly don't have any idea what I'm talking about; I just follow AI news.