o3 (Part 1): Generating data from multiple sampling for self-improvement + Path Ahead

  • Published on 12 Jan 2025

Comments • 9

  • @m_ke
    @m_ke 20 days ago +1

    Great content, thanks so much for sharing all of your videos!

  • @drhxa
    @drhxa 16 days ago +1

    Where did OpenAI say they didn't use tree search? I think they do use tree search, specifically MCTS, to generate the synthetic data for o1, and then at inference time they don't use tree search. The magic is in creating the synthetic data: they take a variety of paths from the tree search, including some wrong ones, and chain them together with phrases like "but wait, the above is getting me stuck. Let's try this instead", then jump to another branch of the tree (one that frequently does lead to the correct answer).
    The key is MCTS + "let's verify step by step" in my opinion, so they linearize the MCTS thought chains and train on that. Somewhere in there they are also using RL as another key ingredient.
    Looking forward to hearing your thoughts.
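
The linearization idea described in this comment can be sketched in a few lines of Python. This is only an illustration of the commenter's speculation, not a known OpenAI method: the `Node` class, the `is_correct_leaf` flag, and the `linearize` function are hypothetical names, and a plain depth-first walk stands in for MCTS.

```python
from dataclasses import dataclass, field

# Connector phrase quoted from the comment above.
BACKTRACK = "But wait, the above is getting me stuck. Let's try this instead."


@dataclass
class Node:
    """One reasoning step in a hypothetical search tree."""
    thought: str
    children: list["Node"] = field(default_factory=list)
    is_correct_leaf: bool = False  # assumed to be known, e.g. from a verifier


def linearize(root: Node) -> list[str]:
    """Flatten a tree into one chain of thoughts.

    Thoughts are emitted in depth-first order. Whenever a branch fails and a
    sibling is tried next, the backtrack phrase is inserted. The walk stops at
    the first correct leaf.
    """
    out: list[str] = []

    def visit(node: Node) -> bool:
        out.append(node.thought)
        if node.is_correct_leaf:
            return True
        for i, child in enumerate(node.children):
            if i > 0:
                # The previous sibling dead-ended, so signal a change of plan.
                out.append(BACKTRACK)
            if visit(child):
                return True
        return False

    visit(root)
    return out


if __name__ == "__main__":
    tree = Node("Solve 2x + 3 = 7.", [
        Node("Guess x = 1: 2 + 3 = 5, which is not 7."),
        Node("Subtract 3 from both sides: 2x = 4.", [
            Node("So x = 2.", is_correct_leaf=True),
        ]),
    ])
    print("\n".join(linearize(tree)))
```

The resulting list keeps the wrong branch in the trace, which is the point of the proposal: the training text shows the recovery from a dead end rather than only the clean solution.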

    • @drhxa
      @drhxa 16 days ago +1

      One more thing to add: take a look at Sasha Rush's video "speculations on o1", where he describes 4 possible approaches and explains the stream of search approach. There are a number of problems with this approach, such as collapse and loss of generality (as you noted experiencing). But their "secret sauce" could really just be a lot of hard work to overcome these issues and scale the techniques.

    • @johntanchongmin
      @johntanchongmin 16 days ago +2

      Thanks for the insightful comments. I think tree search may be possible, but it is extremely hard to get the heuristic for the nodes right. For example, in AlphaZero the value network is very hard to train and often leads to system collapse if initialised wrongly (I've trained AlphaZero before).
      OpenAI members have repeatedly said the underlying algorithm is very simple. I think tree search is good but may be too complex for self-learning.

  • @johntanchongmin
    @johntanchongmin 14 days ago +1

    Part 2 here: th-cam.com/video/f5obaHiOog4/w-d-xo.html

  • @johntanchongmin
    @johntanchongmin 20 days ago +1

    Prompt that makes 4o behave like o1:
    ```
    [Problem]
    Do it out by the following format, taking care to reflect, verify, clarify all assumptions:
    ###Thoughts###
    ...
    ###Final Answer###
    ```
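
A minimal Python sketch of how this prompt format could be used programmatically: fill in the problem, then split a response on the two section markers. The helper names `build_prompt` and `split_response` are hypothetical, and the sketch assumes the model actually reproduces the `###Thoughts###` and `###Final Answer###` headers, which is not guaranteed.

```python
# Template mirroring the prompt above; {problem} replaces the [Problem] slot.
PROMPT_TEMPLATE = (
    "{problem}\n"
    "Do it out by the following format, taking care to reflect, verify, "
    "clarify all assumptions:\n"
    "###Thoughts###\n"
    "...\n"
    "###Final Answer###\n"
)

FINAL_MARKER = "###Final Answer###"
THOUGHTS_MARKER = "###Thoughts###"


def build_prompt(problem: str) -> str:
    """Insert the problem statement into the prompt template."""
    return PROMPT_TEMPLATE.format(problem=problem)


def split_response(text: str) -> tuple[str, str]:
    """Split a model response into (thoughts, final_answer).

    Raises ValueError if the final-answer marker is missing, so malformed
    responses are not silently treated as answers.
    """
    head, sep, tail = text.partition(FINAL_MARKER)
    if not sep:
        raise ValueError("response has no ###Final Answer### section")
    thoughts = head.replace(THOUGHTS_MARKER, "", 1).strip()
    return thoughts, tail.strip()


if __name__ == "__main__":
    print(build_prompt("What is 17 * 3?"))
    # A hand-written response standing in for real model output.
    reply = "###Thoughts###\n17 * 3 = 30 + 21 = 51\n###Final Answer###\n51"
    print(split_response(reply))
```

Separating the thoughts from the final answer makes it possible to compare only the answer against a reference while keeping the reasoning text for later use.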

  • @_PranavDesai
    @_PranavDesai 21 days ago

    What is the purpose of generating synthetic data from the model which would be used to improve itself? Wouldn't the synthetic data it produced contain the exact same biases as the model? How do you remove the inherent bias? More importantly, if it can produce expert data, why would it be used to fine-tune itself over it again considering the model was already able to produce the very same data?
    Does this feel like CoT or ReAct with extra steps?

    • @johntanchongmin
      @johntanchongmin 21 days ago +4

      @_PranavDesai You can actually use chain-of-thought prompting to get the model to output more detailed steps, which it may not do natively because web data is not in that format.
      This understanding of reasoning steps can be transferred across domains by fine-tuning on it, resulting in a model that can do reasoning/chain of thought natively without the prompt.
      In most cases, you have a ground truth dataset to check whether the answer obtained by reasoning is correct, so you can be more assured (though not 100%) that the model is generating the right reasoning traces.
      Btw I myself do not believe models can actually reason like humans, but these reasoning traces serve as a chain of thought that helps guide better generation, so they play an important role.
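
The ground-truth check described in this reply can be sketched as a simple filter over multiple samples. Everything here is an assumption made for illustration: `collect_verified_traces` and `toy_sampler` are hypothetical names, the sampler is a random stand-in for a real model call, and exact string match stands in for whatever answer checker a real pipeline would use.

```python
import random
from typing import Callable

# A sampler takes a problem and returns (reasoning_text, final_answer).
Sampler = Callable[[str], tuple[str, str]]


def collect_verified_traces(
    problem: str,
    ground_truth: str,
    sample_fn: Sampler,
    n_samples: int = 8,
) -> list[dict]:
    """Sample several reasoning traces and keep those with the right answer.

    A matching final answer does not prove the reasoning is sound; it only
    makes a correct trace more likely, as the reply above notes.
    """
    kept = []
    for _ in range(n_samples):
        reasoning, answer = sample_fn(problem)
        if answer.strip() == ground_truth.strip():
            kept.append({
                "prompt": problem,
                "completion": f"{reasoning}\n{answer.strip()}",
            })
    return kept


_rng = random.Random(0)  # seeded so the toy run is repeatable


def toy_sampler(problem: str) -> tuple[str, str]:
    """Stand-in for a model call: returns a right or wrong trace at random."""
    if _rng.random() < 0.5:
        return "17 * 3 = 30 + 21 = 51", "51"
    return "17 * 3 = 30 + 24 = 54", "54"


if __name__ == "__main__":
    data = collect_verified_traces("What is 17 * 3?", "51", toy_sampler)
    print(f"kept {len(data)} of 8 samples")
```

The kept records are in a prompt/completion shape that could feed a fine-tuning step, which is how the prompted behaviour would become native to the model.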

    • @francisco444
      @francisco444 18 days ago

      One important reason for producing synthetic data from the model itself is that it helps the model represent its own knowledge; otherwise you would be feeding it knowledge from another source that it doesn't know anything about. Since we want models to be honest, which means they should learn what they know and what they don't, this self-generated data is the best way to make them hallucinate less.