LLM - Reasoning SOLVED (new research)

  • Published on Jan 25, 2025

Comments • 46

  • @mulderbm
    @mulderbm 7 months ago +11

    Indeed very good. It was under our noses all the time. Not sure why this research is only now being picked up. The first papers on grokking are from 2021 and 2022, and partially earlier. Bringing this all together is very insightful. This series makes me want to set it up to play with 😂😂

  • @alexjensen990
    @alexjensen990 7 months ago +6

    Completely blown away by the test with the "old model"...

  • @luke.perkin.online
    @luke.perkin.online 7 months ago +5

    The causal tracing highlights how similar NNs are to just applying input-sensitive matrix multiplication. With ReLUs each unit is either zero or linear, so the network is like a hierarchical bunch of switches that turn on just the right linear transform of the input to get the output. The fact that this works (it is effective and trainable, and it interpolates and generalises) still amazes me!
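The "switches that select a linear transform" picture above can be checked on a toy network. This is a minimal sketch, assuming a two-layer ReLU network without biases; the weights, sizes and helper names (`relu_net`, `effective_matrix`) are made up for illustration and are not taken from the video or the paper.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu_net(W1, W2, x):
    """Two-layer ReLU network without biases: W2 @ relu(W1 @ x)."""
    hidden = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, hidden)

def effective_matrix(W1, W2, x):
    """The single matrix that this network applies to this particular input."""
    pre = matvec(W1, x)
    # Each hidden unit acts as a switch: keep its row of W1 or zero it out.
    gated_W1 = [row if p > 0 else [0.0] * len(row) for row, p in zip(W1, pre)]
    return matmul(W2, gated_W1)

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]  # 5 hidden units, 3 inputs
W2 = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(2)]  # 2 outputs
x = [0.5, -1.0, 2.0]

direct = relu_net(W1, W2, x)
via_matrix = matvec(effective_matrix(W1, W2, x), x)
print(direct)
print(via_matrix)  # same values up to floating-point rounding
```

A different input can flip some switches and therefore select a different matrix, which is what makes the overall function piecewise linear rather than linear.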

  • @luke.perkin.online
    @luke.perkin.online 7 months ago +3

    The atomic facts on the graph at the 95% / 5% ratio remind me of the approach in reinforcement learning for physics models where you start with, for example, low gravity and high friction to dampen the system, then slowly increase/reduce each to bring it closer to reality. It makes otherwise unlearnable high-frequency chaotic (deterministic) systems learnable.
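The damping idea described above can be sketched as a simple annealing schedule. This is only an illustration, assuming a linear schedule; the parameter names and the "easy" and "real" values are hypothetical, and no particular RL library or simulator is implied.

```python
def anneal(start, end, step, total_steps):
    """Linearly interpolate from start to end as step goes from 0 to total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

def physics_curriculum(total_steps, easy, real):
    """Yield one set of simulator parameters per training stage."""
    for step in range(total_steps + 1):
        yield {name: anneal(easy[name], real[name], step, total_steps) for name in real}

# Hypothetical values: start damped (low gravity, high friction), end at realistic settings.
easy = {"gravity": 1.0, "friction": 0.9}
real = {"gravity": 9.81, "friction": 0.1}

stages = list(physics_curriculum(4, easy, real))
for stage in stages:
    print(stage)
```

In a real setup each stage would be used to train the agent for a while before moving on; how long to stay on a stage is a separate design choice that this sketch leaves open.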

  • @MultiNiktar
    @MultiNiktar 7 months ago +2

    This is a crazy good video, keep it up! The Algorithm will pick this channel up in no time

    • @code4AI
      @code4AI 7 months ago +2

      Smile. Since I always decline when Google wants me to pay them for advertising my own video to a broader audience, I am not at all a good customer for Google, since I do not support their business model: that I pay for promoting my video. Therefore I'll be a stealth YT Channel for a dedicated audience only.

  • @mlytle0
    @mlytle0 7 months ago +5

    Amazing stuff. We heard a few months ago about Q* and supposed advances in math ability at OpenAI on unreleased models, none of which has appeared in the public domain. These seem like real advances, and they are publicly accessible. Part of me thinks OpenAI puts a lot of hype out there to keep the interest up, but their model still hallucinates like crazy, nothing as solid as this appears to be.

    • @notaras1985
      @notaras1985 7 months ago

      How can we reduce hallucination?

  • @ОлегВиноградов-й8т
    @ОлегВиноградов-й8т 7 months ago +2

    Great video. If possible, make a lesson with Python code. It would help to understand better how it works. This science is a deep ocean.

  • @tiagotiagot
    @tiagotiagot 7 months ago +6

    How about this: first train a model for grokking just on a pure logic dataset, randomly generated examples of logic (which should be easy to verify as correct), not language, just that stuff with letters and those weird symbols for logic gates/operators and so on; then once it groks it, move on to the next barebones level of mathematics, then climb up the math ladder at each grokking; at some point start including coding, physics, chemistry etc., and leave natural language for towards the end of the training ladder, ensuring the dataset for all steps follows the ideal ratio. Will we get an ASI that runs on a RasPI with something like this approach?

    • @obsidianSt6761
      @obsidianSt6761 7 months ago

      You are talking about curriculum learning, which has been around for many decades. The limitation is that different architectures require different curricula (the one you've proposed seems to work for human learning, but does it work for an arbitrary neural architecture? It is expensive to test many architectures!)

    • @tiagotiagot
      @tiagotiagot 7 months ago

      @obsidianSt6761 Combining the ratios thing with building foundations for rational circuits gradually, from the most basic concepts to more and more complex thinking, sounds like a good recipe for achieving highly rational thought processes and understanding from the type of neural networks discussed in this video, no?

    • @obsidianSt6761
      @obsidianSt6761 7 months ago

      @tiagotiagot But what is the architecture of the Transformer? Does it have 8 layers, 20? What is its hidden activation, hidden size, feed-forward size, dropout rate, etc.? This video shows that you need a whole research project to test whether different architectures grok; you are proposing not only to test different architectures but also to run an extensive curriculum for each architecture

    • @tiagotiagot
      @tiagotiagot 7 months ago

      @obsidianSt6761 Ah, I see. I got the impression there was already a good starting point to pick an architecture that would grok with just about anything it was trained with...

    • @815TypeSirius
      @815TypeSirius 7 months ago

      The most ideal data is 49.9~9% (49.9~ is equal to 50%) noise and 50% signal.

  • @LamontCranston-qh2rv
    @LamontCranston-qh2rv 7 months ago +3

    If these structures can be detected, surely they can be predicted? Can we build a model that will look at a dataset and output a good guess at what the weights of a grokked model would be? If so, maybe we can radically diminish the amount of computation required to achieve grokking? Perhaps even predict optimal cross-layer memory sharing? I wonder if this might require spatial reasoning. Specifically, a kind of self-reflective "imagining" of the model's black-box architecture, as well as of possible and desirable structures within it?

    • @obsidianSt6761
      @obsidianSt6761 7 months ago

      Detectability assumes specific instances of dataset, architecture, algorithm, and the confirmed grokked subject model. To produce a hypervisor prediction model as you describe, you must train that model over many datasets, architectures, and algorithms, while also training your subject architecture until grokked to get the ground-truth labels (this simply requires tremendously more computational resources than it may be worth)...

    • @LamontCranston-qh2rv
      @LamontCranston-qh2rv 7 months ago

      @obsidianSt6761 Fair enough. It's like trying to predict where the needle in the haystack might be. Why waste time and resources? Why not just go look for it? Still though, I can't help but think that, over time, a kind of library might emerge which essentially says that these kinds of structures tend to form in these kinds of models when confronted with this type of data. It may be a worthwhile starting point as opposed to the brute-force, train-to-death approach. Or, as you say, it could be another blind alley. Maybe the answer lies in the middle: trust your guess... but verify and abandon as needed? It is certainly true that martial arts masters, for example, don't typically take shortcuts to decades of training... but what if they could? It would amount to learning how best to learn. (A dynamic approach.) With this view into the black box, the professor has inspired an entirely new field of endeavor: Artificial Neuroscience. Necessary perhaps, if we are to have any hope of knowing how or why this stuff runs off the rails, and how to (hopefully) fix it! Thank you very much for your exceptional reply, all the best to you!

    • @815TypeSirius
      @815TypeSirius 7 months ago

      No. It's not reciprocal. But things don't get interesting till they start organizing using hypergeometry. How do you think a brain is so efficient and a CPU is comically inefficient?

    • @LamontCranston-qh2rv
      @LamontCranston-qh2rv 7 months ago

      The brain uses analog circuitry while LLMs (currently) use digital circuits is one answer. Additionally DNA itself can exhibit quantum tunneling effects in seemingly "intelligent" processes that are not yet well understood. If you are suggesting that human neurons process information in high dimensional space... perhaps. How interesting!

    • @815TypeSirius
      @815TypeSirius 7 months ago

      @LamontCranston-qh2rv Oh, it's a "the brain is quantum" loon.

  • @notaras1985
    @notaras1985 7 months ago

    What should I do in order to make an AI helper model in my pharmacology lab?

  • @MusingsAndIdeas
    @MusingsAndIdeas 7 months ago +2

    Reminds me of the Ten Thousand Hours rule for mastery of a subject

    • @Mik-m2h
      @Mik-m2h 7 months ago

      For me too

  • @alexjensen990
    @alexjensen990 7 months ago

    Can't wait for the comparison!!!

    • @code4AI
      @code4AI 7 months ago +1

      A prominent feature in Part III.

  • @lukeskywalker7029
    @lukeskywalker7029 7 months ago

    This all sounds too good to be true. However, the atomic / inferred knowledge thing is something I have had a gut feeling about for a long time.
    Can't wait to replicate this on some easy tasks with continued pre-training.

  • @goodtothinkwith
    @goodtothinkwith 7 months ago

    Really incredible stuff

  • @spkgyk
    @spkgyk 7 months ago

    Sorry if you covered this in another video, but what's the difference between parametric and non-parametric memory?

    • @code4AI
      @code4AI 7 months ago +1

      I'll explain it in detail in my next video. Thanks for pointing it out.

    • @generichuman_
      @generichuman_ 7 months ago +1

      He covered it in this video. Parametric memory is contained in the weights of the model, and non-parametric memory is contextual memory that you put into the prompt or retrieve with RAG (which still technically goes into the prompt).
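The non-parametric side of that distinction can be sketched in a few lines. This is a toy illustration, assuming a word-overlap retriever; the `retrieve` and `build_prompt` helpers and the example documents are hypothetical, and real RAG systems typically use embedding search instead. Parametric memory has no counterpart in the code because it lives in the model's weights.

```python
def retrieve(query, documents, k=1):
    """Toy retriever: rank documents by the number of words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, documents):
    """Non-parametric memory: retrieved text is placed in the prompt, not in the weights."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical external store; nothing here is learned by the model.
documents = [
    "Grokking is delayed generalization after long training",
    "RAG retrieves passages and puts them in the prompt",
]

prompt = build_prompt("what is grokking", documents)
print(prompt)
```

Whatever the retriever returns still reaches the model only through the prompt, which is the point made in the comment above.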

  • @timgorn8927
    @timgorn8927 7 months ago

    Thank you very much! I loved this presentation.

    • @code4AI
      @code4AI 7 months ago

      Thank you for taking the time to send this feedback to me. Appreciate it.

  • @acasualviewer5861
    @acasualviewer5861 7 months ago

    What do they mean by sharing the information between the upper and lower layers? It's not clear to me how that is implemented. And that's kind of the key here.

    • @code4AI
      @code4AI 7 months ago

      I am referring to the architecture of a transformer.

    • @acasualviewer5861
      @acasualviewer5861 7 months ago

      @code4AI Yes... but what kind of "sharing" do you mean? Just the normal mechanism of passing info to the next layer?

  • @RalphDratman
    @RalphDratman 7 months ago

    I think you have referred to the wrong paper at the bottom of your YouTube summary. You mention a "metric", "structural grokking" and "tree structuredness." I cannot find the words "metric", "structural" or "tree" in the paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" (arXiv 2405.15071), but all three of those terms are easy to find in "Grokking of Hierarchical Structure in Vanilla Transformers" (arXiv 2305.18741).

    • @code4AI
      @code4AI 7 months ago +2

      No, you are wrong ... but your comment provides a beautiful example of the inner workings of a vector store. You were looking for the terms I used in my reference video: at 2:16 to 2:32 I introduce the new study by MIT and Stanford Univ. I present the title of the pre-print, its authors and its https link, and one (!) second later (at 2:33 in my video) I introduce the term "Tree Structuredness" from the study.
      You (@RalphDratman) now comment that you can't find the words, but you were looking in another pre-print that I mention in the video. A perfect example of the semantic and causal relation encoded in a "close-by" representation within a low-dimensional vector space.
      So whenever you don't find terms in a linear video sequence of mine, there is a high probability that literally 1 sec before the term in question appears, the complete information on where to find the term(s) was given to you, including the title, the authors and the https link of the pre-print. Imagine a cosine similarity function that returns the term and the identifier for the pre-print in question directly to you.
      Thank you for this comment.
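The lookup imagined in the reply above can be sketched as a nearest-neighbour search by cosine similarity. This is a minimal sketch: the three-dimensional vectors are hand-made placeholders rather than real embeddings, the `lookup` helper is hypothetical, and only the two paper titles come from this thread.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical store: each entry pairs a term with the pre-print it was introduced with.
store = [
    {
        "term": "tree structuredness",
        "source": "Grokking of Hierarchical Structure in Vanilla Transformers",
        "vec": [0.9, 0.1, 0.2],  # placeholder embedding
    },
    {
        "term": "implicit reasoners",
        "source": "Grokked Transformers are Implicit Reasoners",
        "vec": [0.1, 0.9, 0.3],  # placeholder embedding
    },
]

def lookup(query_vec, entries):
    """Return the term and source of the entry closest to the query vector."""
    best = max(entries, key=lambda entry: cosine(query_vec, entry["vec"]))
    return best["term"], best["source"]

term, source = lookup([0.8, 0.2, 0.1], store)
print(term, "->", source)
```

Storing the source identifier next to each vector is what lets the search return both the term and where to find it in one step.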

    • @RalphDratman
      @RalphDratman 7 months ago

      @code4AI 1) I was trying to be helpful.
      2) The reason I did not see the paper on the screen is that I was listening rather than watching.

  • @manslaughterinc.9135
    @manslaughterinc.9135 7 months ago +1

    Why do we have to exclude RAG from grokked LLMs? There is literally no reason why we can't RAG into a grokked LLM.

    • @frag_it
      @frag_it 7 months ago

      Yeah, I don't see RAG going away; a grokked LLM might even provide more reasoning on the contexts 😅

    • @code4AI
      @code4AI 7 months ago

      Great comment! Maybe I'll design an answer in an upcoming video!

  • @Daniel-Six
    @Daniel-Six 7 months ago +5

    Anyone who has read the Law of One transmissions might recognize the principle of "intelligent infinity" operating here.

    • @HUEHUEUHEPony
      @HUEHUEUHEPony 7 months ago +3

      Uhm, seek a doctor?

    • @Daniel-Six
      @Daniel-Six 7 months ago +1

      @HUEHUEUHEPony Are you familiar with the Law of One?