Grokking: Generalization beyond Overfitting on small algorithmic datasets (Paper Explained)

  • Published on Jan 16, 2025

Comments • 240

  • @YannicKilcher
    @YannicKilcher  3 ปีที่แล้ว +22

    OUTLINE:
    0:00 - Intro & Overview
    1:40 - The Grokking Phenomenon
    3:50 - Related: Double Descent
    7:50 - Binary Operations Datasets
    11:45 - What quantities influence grokking?
    15:40 - Learned Emerging Structure
    17:35 - The role of smoothness
    21:30 - Simple explanations win
    24:30 - Why does weight decay encourage simplicity?
    26:40 - Appendix
    28:55 - Conclusion & Comments
    Paper: mathai-iclr.github.io/papers/papers/MATHAI_29_paper.pdf

  • @ruemeese
    @ruemeese 3 ปีที่แล้ว +147

    The verb 'to grok' has been in niche use in computer science since the 1960s, meaning to fully understand something (as opposed to just knowing how to apply the rules, for example). It was coined in Robert A. Heinlein's classic sci-fi novel "Stranger in a Strange Land".

    • @michaelnurse9089
      @michaelnurse9089 3 ปีที่แล้ว +3

      Thanks for the history. Today it is a somewhat well known word in my anecdotal experience.

    • @ruemeese
      @ruemeese 3 ปีที่แล้ว +2

      @@michaelnurse9089 Always going to be new to somebody.

    • @cactusmilenario
      @cactusmilenario 3 ปีที่แล้ว

      @@ruemeese thanks

    • @和平和平-c4i
      @和平和平-c4i 3 ปีที่แล้ว

      Thanks, that's why it was not in the Webster dictionary ;)

    • @RobertWF42
      @RobertWF42 3 ปีที่แล้ว +1

      In my opinion "grokking" shouldn't be used here. There's no grokking happening; instead, it's running an ML algorithm for a long time until it fits the test data. But it's still a really cool idea. :-)

  • @mgostIH
    @mgostIH 3 ปีที่แล้ว +71

    Thanks for accepting my suggestion in reviewing this! I think this paper can fall anywhere between "Wow, what a strange behaviour" and "Is more compute all we need?"
    If the findings of this paper are reproduced in a lot of other practical scenarios and we find ways to speed up grokking, it might change the entire way we think about neural networks: we would just need a model that's big enough to express some solution and then keep training it to find simpler ones, leading to an algorithmic Occam's Razor!

    • @TimScarfe
      @TimScarfe 3 ปีที่แล้ว +5

      I was nagging Yannic too!! 😂

    • @aryasenadewanusa8920
      @aryasenadewanusa8920 3 ปีที่แล้ว +2

      but again, though it's wonderful, in some cases the cost to train such large networks for such a long time to achieve that AHA moment by the network can be too crazy to make sense.

    • @SteveStavropoulos
      @SteveStavropoulos 3 ปีที่แล้ว +6

      This seems like random search where the network stumbles upon the solution and then stays there. The "stay there" part is well explained, but the interesting part is the search. And nothing is presented here to suggest search is not entirely random.
      It would be interesting to compare the neural network to a simple random search algorithm which just randomly tries every combination. Would the neural network find the solution significantly faster than the random search?

    • @RyanMartinRAM
      @RyanMartinRAM 3 ปีที่แล้ว +1

      @@SteveStavropoulos or just start with different random noise every so often?

    • @brunosantidrian2301
      @brunosantidrian2301 3 ปีที่แล้ว +2

      @@SteveStavropoulos Well, you are actually restricting the search to the space of networks that perform well in the training set. So I would say yes.

  • @martinsmolik2449
    @martinsmolik2449 3 ปีที่แล้ว +80

    WAIT, WHAT?
    In my Masters thesis, I did something similar, and I got the same result. A quick overfit and subsequent snap into place. I never thought that this would be interesting, as I was not an ML engineer, but a mathematician looking for new ways to model structures.
    That was in 2019. Could have been famous! Ah well...

    • @andrewcutler4599
      @andrewcutler4599 3 ปีที่แล้ว +1

      What do you think of their line "we speculate that such visualizations could one day be a useful way to gain intuitions about novel mathematical objects"?

    • @thomasdeniffel2122
      @thomasdeniffel2122 3 ปีที่แล้ว +22

      Schmidhuber, is it you?

    • @visuality2541
      @visuality2541 3 ปีที่แล้ว +1

      may i ask u the thesis title?

    • @arzigogolato1
      @arzigogolato1 3 ปีที่แล้ว +10

      As with many discoveries, you need both the luck to find something totally new and the knowledge to recognize what's happening... I'm sure there are a lot of missed opportunities out there... Still, the advisor or some of the other professors could have noticed that this was indeed something "off and interesting". It might actually be interesting to see how and when this happens :)

    • @autolykos9822
      @autolykos9822 3 ปีที่แล้ว +7

      Yeah, a lot of science comes from having the leisure to investigate strange accidents, instead of being forced to just shrug and start over to get done with what you're actually trying to do. Overworking scientists results in poor science - and yet there are many academics proud of working 60-80h weeks.

  • @Bellenchia
    @Bellenchia 3 ปีที่แล้ว +46

    So our low-dimensional intuitions don't work in extremely high-dimensional space. For example, look at the sphere-packing picture: if you place unit circles at the corners of a box in 2D, the small circle you can fit between them stays well inside the box--and that seems like common sense. But in 10D space, the sphere you place in the middle of the hyperspheres actually reaches outside the box enclosing them. I imagine an analogous thing is happening here: if you were at a mountaintop and you could only feel around with your feet, you of course wouldn't get to mountain tops further away by just stepping in the direction of higher incline, but maybe in higher dimensions these mountain tops are actually closer together than we think. Or in other words, some local minimum of the loss function which we associate with overfitting is actually not so far away from a good generalization, or global minimum.
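    (A minimal sketch of the computation behind that claim, assuming the standard construction with unit spheres at the corners of a box of side 4; illustrative only:)

    import math

    # 2^d unit spheres sit at the corners (+-1, ..., +-1) of the box [-2, 2]^d.
    # The sphere squeezed in at the origin has radius sqrt(d) - 1, so it pokes
    # outside the box as soon as sqrt(d) - 1 > 2, i.e. from d = 10 onward.
    for d in (2, 3, 9, 10, 100):
        r_inner = math.sqrt(d) - 1
        print(d, round(r_inner, 3), "outside the box" if r_inner > 2 else "inside the box")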

    • @MrjbushM
      @MrjbushM 3 ปีที่แล้ว +2

      Interesting thought…

    • @andrewjong2498
      @andrewjong2498 3 ปีที่แล้ว +3

      Good hypothesis, I hope someone can prove it one day

    • @franklydoodle350
      @franklydoodle350 2 ปีที่แล้ว +1

      Wow seems you're speaking in the 10th dimension. Your intelligence is way beyond mine.

    • @franklydoodle350
      @franklydoodle350 2 ปีที่แล้ว

      Now I wonder if we could push the model to make more rash decisions as the loss function nears 1. In theory I don't see why not.

  • @Phenix66
    @Phenix66 3 ปีที่แล้ว +3

    Loved the context you brought into this about double descent! And, as always, really great explanations in general!

  • @Chocapic_13
    @Chocapic_13 3 ปีที่แล้ว +31

    It would have been interesting if they had shown the evolution of the sum of the squared weights during training.
    Maybe the sum of the squared weights gets progressively smaller, and this helps the model find the "optimal" solution to the problem starting from the memorized solution. It could also be that the L2 penalty progressively "destroys" the learned representation and adds noise until it finds this optimal solution.
    Very interesting paper and video anyway!
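    (A minimal sketch of how that quantity could be logged, assuming a PyTorch model and an existing training loop; the names are placeholders, not the paper's code:)

    import torch

    def squared_weight_norm(model: torch.nn.Module) -> float:
        # Sum of squared weights over all parameters (the term weight decay penalizes).
        return sum((p.detach() ** 2).sum().item() for p in model.parameters())

    # inside the assumed training loop:
    # loss = loss_fn(model(x), y)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()
    # if step % 1000 == 0:
    #     print(step, squared_weight_norm(model))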

  • @CristianGarcia
    @CristianGarcia 3 ปีที่แล้ว +3

    Thanks Yannic! Started reading this paper but forgot about it, very handy to have it here.

  • @nauy
    @nauy 3 ปีที่แล้ว +72

    I actually expected this kind of behavior based on the principle of regularization. With regularization (weight decay is L2 regularization), there is always a tension between the prediction error term and regularization term in the loss. At some point in the long training session, regularization term starts to dominate and the network, given enough time and noise, finds a different solution that has much lower regularization loss. This is also true with increasing the number of parameters, since doing that also increases the regularization loss. At some point, adding more parameters results in the regularization loss dominating over the error loss and the network latches on another solution that would result in lower regularization loss and overall loss. In other words, regularization provides the pressure that, when given enough time and noise in the system, comes up with a solution that uses fewer effective parameters. I am assuming fewer parameters leads to better generalization.

    • @Mefaso09
      @Mefaso09 3 ปีที่แล้ว +13

      While this is the general motivation for regularization, the really surprising thing here is that it does still find this better solution by further training.
      Personally I would have expected it to get stuck in a local overfit optimum and just never leave it. The fact that it does "suddenly" do so after 1e5 times as many training steps is quite surprising.
      Also, if the solution really is to have fewer parameters for better generalization, traditionally L1 regularization is more likely to achieve that than L2 regularization (weight decay). Would be interesting to try that.

    • @Mefaso09
      @Mefaso09 3 ปีที่แล้ว +3

      Also, I suppose it would be interesting to look at if the number of weights with significantly nonzero magnitudes does actually decrease over time, to test your hypothesis.
      Now if only they had shared some code...

    • @vladislavstankov1796
      @vladislavstankov1796 3 ปีที่แล้ว +2

      @@Mefaso09 Agree, it is confusing why training for a much longer time helps. If you remember the example of an overfitting polynomial that matches every data point but is very "unsmooth", the unsmoothness comes from "large" coefficients. When you let the polynomial have way more freedom but severely constrain the weights, then it can look like a parabola (even though in reality it will be a polynomial of degree 20), and it will fit the data better than any polynomial of degree 2.
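      (A small numerical illustration of that polynomial picture; a sketch with made-up values, not from the paper:)

      import numpy as np

      rng = np.random.default_rng(0)
      x = np.linspace(-1.0, 1.0, 20)
      y = x**2 + 0.05 * rng.standard_normal(x.shape)   # data that is roughly a parabola

      X = np.vander(x, N=20, increasing=True)          # degree-19 polynomial features
      w_interp = np.linalg.solve(X, y)                 # interpolates every point: wild, huge coefficients
      lam = 1e-3                                       # ridge penalty, the L2 / weight-decay analogue
      w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(20), X.T @ y)

      # The constrained high-degree fit behaves like a parabola; the coefficient norms show why.
      print(np.abs(w_interp).max(), np.abs(w_ridge).max())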

    • @markmorgan5568
      @markmorgan5568 3 ปีที่แล้ว +5

      I think this is helpful. My initial reaction was “how does a network learn when the loss and gradients are zero?” But with weight decay or dropout the weights do still change.
      Now back to watching the video. :)
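      (A toy check of that point, assuming PyTorch: even with the task gradient at zero, AdamW's decoupled weight decay keeps shrinking the weights each step:)

      import torch

      w = torch.nn.Parameter(torch.ones(3))
      opt = torch.optim.AdamW([w], lr=1e-2, weight_decay=0.1)

      for _ in range(5):
          loss = (w * 0.0).sum()   # task loss is identically zero, so its gradient is zero
          opt.zero_grad()
          loss.backward()
          opt.step()               # the weight-decay update still multiplies w by (1 - lr * weight_decay)
          print(w.detach())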

    • @visuality2541
      @visuality2541 3 ปีที่แล้ว +1

      L2 reg (weight decay) is more about small-valued parameters rather than about a fewer number of params (sparsity); at least the classical view is so, as you probably already know.
      I personally believe that the finding is related to the Lipschitzness of the overall net and also the angle component of the representation vector,
      but I am not sure what loss function they really used.
      It'd be really surprising if future work showed the superiority of L2 reg compared to other regs for generalization, of course with a solid theory.

  • @drhilm
    @drhilm 3 ปีที่แล้ว +59

    I saw a similar idea in the paper "Train longer, generalize better: closing the generalization gap in large batch training of neural networks" by
    Elad Hoffer, Itay Hubara, and Daniel Soudry, from 2017. Since then, I always try to run the network longer than the overfitting point, and in many cases it works.

  • @FromBeaverton
    @FromBeaverton 3 ปีที่แล้ว +1

    Thank you! Very cool! I have observed the grokking phenomenon and was really puzzled why I did not see anybody talking about it other than "well, you can try training your neural network longer and see what happens".

  • @TheMrVallex
    @TheMrVallex 3 ปีที่แล้ว +2

    Dude, thanks so much for the video! You just gave me an absolute 5-head idea for a paper :D
    Too bad I work full time as a data scientist, so I have little time for doing research. I don't want to spoil the idea, but I have a strong gut feeling that these guys stumbled onto an absolute goldmine discovery!
    Keep up the good work! Love your channel

    • @letsburn00
      @letsburn00 3 ปีที่แล้ว

      I suspect that it has something to do with the fact that there is limited entropy change from the randomly initialised values to the "fully trained" conditions. The parameters have so much more capacity to learn, but the dataset can often be learned in more detail.

    • @Zhi-zv1in
      @Zhi-zv1in 9 หลายเดือนก่อน

      pleassse share the idea i wanna know

  • @AliMoeeny
    @AliMoeeny 3 ปีที่แล้ว +2

    Just wanted to say thank you, I learned a lot.

  • @beaconofwierd1883
    @beaconofwierd1883 3 ปีที่แล้ว +10

    I don't see what needs explaining here.
    The loss function is something like L = MSE(w, x, y) + d * ||w||^2.
    The optimization tries to get the error as small as possible and also get the weights as small as possible.
    A simple rule requires a lot less representation than memorizing every datapoint, thus more weights can be 0.
    The rule-based solution is exactly what we are optimizing for; it just takes a long time to find it because the dominant term in the loss is the MSE, hence why it first moves toward minimizing the error by memorizing the training data and then slowly moves towards minimizing the weights. The only way to minimize both is to learn the rule. How is this different from simple regularization?
    As for why the generalization happens "suddenly": it is just a result of how gradient descent (with the Adam optimizer) works. Once it has found a gradient direction toward the global minimum it will pick up speed and momentum moving towards that global minimum, but finding a path towards that global minimum could take a long time, possibly forever if it got stuck in a local minimum without any way out. Training 1000x longer to achieve generalization feels like a bad/inefficient approach to me.
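    (For concreteness, a compressed sketch of this kind of setup - modular addition, a small over-parameterized network, AdamW with strong weight decay, and far more steps than needed to fit the training split. The sizes and the MLP-instead-of-transformer choice are simplifications, not the authors' exact recipe:)

    import torch, torch.nn as nn

    p = 97
    pairs = torch.tensor([(a, b) for a in range(p) for b in range(p)])
    perm = torch.randperm(len(pairs))
    xs, ys = pairs[perm], (pairs[perm, 0] + pairs[perm, 1]) % p
    split = len(pairs) // 2                      # train on ~50% of the addition table
    x_tr, y_tr, x_va, y_va = xs[:split], ys[:split], xs[split:], ys[split:]

    model = nn.Sequential(nn.Embedding(p, 128), nn.Flatten(),
                          nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, p))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

    for step in range(200_000):                  # keep going long after train accuracy hits 100%
        loss = nn.functional.cross_entropy(model(x_tr), y_tr)
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 5_000 == 0:
            with torch.no_grad():
                val_acc = (model(x_va).argmax(-1) == y_va).float().mean().item()
            print(step, round(loss.item(), 4), round(val_acc, 3))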

    • @beaconofwierd1883
      @beaconofwierd1883 3 ปีที่แล้ว

      @@aletheapower1977 Not really though, look at the graphs at 14:48, without regularization you need about 80% of the data to be training data for generalization to occur, then you’re basically just training on all your data and the model is more likely to find rule before it manages to memorize all of the training data. Not really any surprise there either.

    • @beaconofwierd1883
      @beaconofwierd1883 3 ปีที่แล้ว

      @@aletheapower1977 Is there a relationship between the complexity of the underlying rule and the data fraction needed to grok? A good metric to use here could be the entropy of the training dataset. One would expect groking to happen faster and at smaller training data fractions when the entropy of the dataset is low, and vice versa. Not sure how much updating you will be doing, but this could be something worth investigating :) If there’s no correlation between entropy and when grokking occurs that would be very interesting.

    • @beaconofwierd1883
      @beaconofwierd1883 3 ปีที่แล้ว

      @@aletheapower1977 I hope you find the time :)
      If not you can always add it as possible future work :)

  • @bingochipspass08
    @bingochipspass08 2 ปีที่แล้ว

    Great walkthrough of the paper Yannic! You covered some details which weren't clear to me initially!

  • @tristanwegner
    @tristanwegner 2 ปีที่แล้ว +2

    In a few years a neural network might watch your videos as training data to output better neural networks. I mean, Google owns YouTube and DeepMind, and video is the next step after image and text generation, so maybe the training is already happening. But with such a huge dataset I hope they weigh the videos not by views only, but also by the predicted education of the audience. I mean, they sometimes ask me if I liked a video because it was relaxing, or informative, etc. So thanks for being part of the pre-singularity :)

  • @nikre
    @nikre 3 ปีที่แล้ว +26

    Why would it be specifically t-SNE that reflects the "underlying mechanism" and not a different method? I would like to see how the t-SNE structure evolves over time, and other visualization methods. This might as well be a cherry-picked result.
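    (A sketch of that comparison, assuming the number embeddings were dumped at a few checkpoints; the file names and shapes are made up for illustration:)

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    for step in (1_000, 10_000, 100_000):
        emb = np.load(f"embeddings_step_{step}.npy")   # assumed (97, d) array of number embeddings
        z_tsne = TSNE(n_components=2, perplexity=10, init="pca").fit_transform(emb)
        z_pca = PCA(n_components=2).fit_transform(emb)
        # If the ring / coset structure shows up under PCA as well, it is less likely
        # to be a t-SNE artifact; saving both lets you track how it evolves over training.
        np.save(f"tsne_step_{step}.npy", z_tsne)
        np.save(f"pca_step_{step}.npy", z_pca)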

    • @Mefaso09
      @Mefaso09 3 ปีที่แล้ว +2

      Definitely, the evolution over time would be very interesting

  • @UncoveredTruths
    @UncoveredTruths 3 ปีที่แล้ว +7

    S5? ah yes, all these years of abstract algebra: my time to shine

  • @Yaketycast
    @Yaketycast 3 ปีที่แล้ว +66

    OMG, I'm not crazy! This has happened to me when I forgot to shut down a model for a month

    • @freemind.d2714
      @freemind.d2714 3 ปีที่แล้ว +8

      Same here

    • @TheRyulord
      @TheRyulord 3 ปีที่แล้ว +10

      Here I was watching the video and wondering "How do people figure something like this out?" but I guess you've answered my question.

    • @solennsara6763
      @solennsara6763 3 ปีที่แล้ว +31

      not paying attention is a very essential skill for achieving progress.

    • @Laszer271
      @Laszer271 3 ปีที่แล้ว +5

      @@solennsara6763 That's essentially also how antibiotics were invented.

    • @anshul5243
      @anshul5243 3 ปีที่แล้ว +49

      @@solennsara6763 so are you saying no attention is all you need?

  • @gocandra9847
    @gocandra9847 3 ปีที่แล้ว +42

    last time i was this early schmidhuber had invented transformers

    • @NLPprompter
      @NLPprompter 7 หลายเดือนก่อน

      what???

    • @GeorgeSut
      @GeorgeSut 7 หลายเดือนก่อน

      Schmidhuber also invented GANs, supposedly 😆

  • @AdobadoFantastico
    @AdobadoFantastico 3 ปีที่แล้ว

    This is a fascinating phenomenon, thanks for sharing and walking through the paper.

  • @SeagleLiu
    @SeagleLiu 3 ปีที่แล้ว

    Yannic, thanks for bringing this to us. I agree with your heuristics on the local minimum vs global minimum. To me, the phenomenon in this paper is related to the "oracle property" of Lasso in the traditional statistical learning literature: for a penalized algorithm (Lasso or the equivalent weight decay), the algorithm is almost sure to find the correct non-zero parameters (i.e., the global minimum), as well as their correct coefficients. It would be interesting for someone to study this further.

  • @billykotsos4642
    @billykotsos4642 3 ปีที่แล้ว

    Should keep this on our radar !

  • @pladselsker8340
    @pladselsker8340 5 หลายเดือนก่อน +2

    How come this has only gained traction and popularity now, 2 years after the paper's release?

  • @PunmasterSTP
    @PunmasterSTP 13 วันที่ผ่านมา

    The concept of grokking is just so fascinating to me! I remember getting intrigued watching the documentary on AlphaGo and reading up about AlphaZero and AlphaStar after that, and imagining what might happen if training was left on indefinitely. Would there be some random discovery made after days, weeks or years that radically changed the model's intuition, or even just gave it a slight edge that made it much easier to beat earlier versions of itself? I also love how neural nets seem like they have a deterministic substrate, but yet are so full of mystery!
    Also, and I know this is kinda unrelated, but anyone else stoked for Grok 3 to come out?

  • @CristianGarcia
    @CristianGarcia 3 ปีที่แล้ว +1

    The "Deep Double Descent" paper did look into the double descent phenomena as a function of computation time/steps.

  • @vslaykovsky
    @vslaykovsky 3 ปีที่แล้ว +46

    This looks like an Aha moment discovered by the network at some point.

  • @Idiomatick
    @Idiomatick 3 ปีที่แล้ว

    Your suggestion seems sufficient for both phenomena. Decay can potentially cause worse loss with a 'wild' memorization, so a smoother solution will eventually be found after many, many, many attempts.
    "Worst way to generalize solutions" would have been a fun title.

  • @jabowery
    @jabowery 3 ปีที่แล้ว +2

    An intuitive way of thinking about the double dip is to imagine you're trying to write an algorithm that goes from gzip to bzip2: you must first decompress. Or think about trying to compress video. If you convert from pixels to voxels it seems like a very costly operation, but it allows you to go from a voxel representation to a geometric abstraction in 3D where there is much less change over time. So don't be confused by the parameter count. You are still trying to come up with the smallest number of parameters in the final model.

  • @Savage2Flinch
    @Savage2Flinch 6 หลายเดือนก่อน +1

    Stupid question: if your loss on the training set goes to zero, what happens during back-prop if you continue? Shouldn't the weights stop changing after overfitting?

    • @pokerandphilosophy8328
      @pokerandphilosophy8328 4 หลายเดือนก่อน

      My guess is that the model keeps learning about the deep structure of the problem, albeit not yet in a way that is deep enough to enable generalization outside of the training set. Imagine you are memorizing a set of poems by a poet. At some point you know all of them by rote, but you don't yet understand the style of the poet deeply enough to be able to predict (with above-chance accuracy) how some of the unseen poems proceed. As you keep rehearsing the initial poems (and being penalized for the occasional error) you latch onto an increasing number of heuristics or features of the poet's style. This doesn't significantly improve your ability to recite those poems, since you already know them by rote. Nor does it significantly improve your ability to make predictions outside of the training set, since your understanding of the style isn't holistic enough. But at some point you have latched onto sufficiently many stylistic features (and learned to tacitly represent aspects of the poet's world view) to be able to understand why (emphasis on "why") the poet chooses this or that word in this or that context. You've grokked, and are now able to generalize outside of the training data set.

  • @piratepartyftw
    @piratepartyftw 3 ปีที่แล้ว

    very cool phenomenon. thanks for highlighting this paper.

  • @tristanwegner
    @tristanwegner 2 ปีที่แล้ว

    21:33. Your explanation about why weight decay helps with generalization after a lot of training makes intuitive sense. Weight decay is like being forgetful. You try to remember varying details of each sample, but the details are different for each sample, so if you forget a bit, you do not recognize it anymore, and when learning about it again you pick a new detail to remember. But if you find (stumble upon?) the underlying pattern, each sample reinforces your rule, counteracting the constant forgetting rate, so it will still be there when the training sample comes around again. Neuroscience has assumed that forgetting is essential for learning, and some savants with amazing memory often have a hard time in real-world situations.

  • @imranq9241
    @imranq9241 3 ปีที่แล้ว +3

    how can we prove double descent mathematically? I wonder if there could be a triple descent as well

  • @thunder89
    @thunder89 3 ปีที่แล้ว +4

    I cannot find in the paper what the dataset size for fig 6 was -- what percentage of the training data are outliers? It seems interesting that there is a jump between 1k and 2k outliers...

  • @tsunamidestructor
    @tsunamidestructor 3 ปีที่แล้ว +1

    As soon as I read the title, my mind immediately went to the "reconciling modern machine learning and the bias variance trade-off" paper.

  • @rayyuan6349
    @rayyuan6349 3 ปีที่แล้ว +1

    I don't know if the GAN community (perhaps the overparameterization community too) and the general ML community have widely accepted "grokking", but this is absolutely mind-blowing to the field, as it implies that we can just brainlessly add more parameters to get better results.

  • @guruprasadsomasundaram9273
    @guruprasadsomasundaram9273 3 ปีที่แล้ว

    Thank you for the wonderful review of this interesting paper.

  • @IIIIIawesIIIII
    @IIIIIawesIIIII 3 ปีที่แล้ว +1

    This may be due to the fact, that there are countless combinations of factors that lead to overfitting, but only one "middle point" between those combinations.
    A quicker (non exponential) way to arrive there would be to overfit a few times with a different random seed and then add an additional layer that checks for patterns that these overfitted representations have in common.

  • @herp_derpingson
    @herp_derpingson 3 ปีที่แล้ว

    This is mind blowing.
    .
    I wonder if a "grokked" image model is more robust against adversarial attacks. Somebody needs to do a paper on that.
    .
    We can also draw parallels to the dimpled manifold hypothesis. Weight decay implies that the dimples would have to have a shallow angle, while still being correct. A dimple with a shallow angle will just have a lot more volume to "catch particles" within it, leading to better generalization.
    .
    I would also be interested in seeing how fast we should decay the weights. My intuition says that it has to be significantly less than the learning rate, otherwise the model won't catch up with the training, while not being so small that the weight decay is nullified by the gradient updates.
    .
    I remember in one of his blog posts Andrej Karpathy said to make sure that your model is able to overfit a very small dataset before moving on to next steps. I wonder if it will now become standard practice to "overfit first".
    .
    I wonder if something similar happens in humans. I have noticed that I develop better understanding of some topics after I let them sit in my brain for a few weeks and I forget a lot of the details.

  • @justfoundit
    @justfoundit 3 ปีที่แล้ว +4

    I'm wondering if it works with XGBoost too: increasing the size of the model to 100k and let the model figure out what went wrong when it starts to overfit at around 2k steps?

  • @TimScarfe
    @TimScarfe 3 ปีที่แล้ว +3

    Been waiting for this

  • @THEMithrandir09
    @THEMithrandir09 3 ปีที่แล้ว +2

    25:50 looking forward to the Occam's Razor Regularizer.

    • @drewduncan5774
      @drewduncan5774 3 ปีที่แล้ว

      en.wikipedia.org/wiki/Minimum_description_length

    • @THEMithrandir09
      @THEMithrandir09 3 ปีที่แล้ว

      @@drewduncan5774 great read, thx!

  • @yasserothman4023
    @yasserothman4023 3 ปีที่แล้ว +1

    Thank you for the explanation. Any references or recommended videos on weight decay in the context of deep learning optimization?
    Thanks

  • @oreganorx7
    @oreganorx7 3 ปีที่แล้ว +5

    I would have named the paper/research "Weight decay drives generalization" : P

    • @oreganorx7
      @oreganorx7 3 ปีที่แล้ว

      @@aletheapower1977 Hi Alethea, I was mostly just joking around. In my head I took a few leaps and came up with an understanding of how I think it is that with weight decay this behaviour happens the fastest. Great work on the paper and research!

  • @jasdeepsinghgrover2470
    @jasdeepsinghgrover2470 3 ปีที่แล้ว

    Thanks for the amazing video... I have seen this while working with piecewise linear functions. Somehow leaky ReLU works much better than ReLU or other activation functions... Wouldn't have known this paper exists if you hadn't posted.

  • @jeanf6295
    @jeanf6295 3 ปีที่แล้ว

    When you are limited to composition of elementary operations that don't have quite the right structure, it kinda makes sense that you need a lot of parameters to emulate the proper rule, possibly way more than you have data points.
    It's a bit like using NOT and OR gates to represent an operation, sure it will work but you will need quite a few layers of computation and you will need to store quite a few intermediary results at each step.

  • @chocol8milkman750
    @chocol8milkman750 2 ปีที่แล้ว

    Really well explained.

  • @GeorgeSut
    @GeorgeSut 7 หลายเดือนก่อน +1

    So this is another interpretation of double descent, right?

    • @GeorgeSut
      @GeorgeSut 7 หลายเดือนก่อน

      Also, why didn't the authors train the model for more epochs/optimization steps? I honestly doubt the accuracy stays at that level, noticing that the peak on the right of the plot is cut-off by the plot's boundaries. What if there is some weird periodicity here?

  • @affrokilla
    @affrokilla 3 ปีที่แล้ว

    This is actually really interesting, I wonder if the same concepts apply to harder problems (object detection, classification, etc). But the compute necessary to go 100-1000x more training steps than usual will be insane. Really hope someone tries it though.

    • @andrewjong2498
      @andrewjong2498 3 ปีที่แล้ว +2

      Anecdotally, my colleague also independently observed grokking on their dataset with height maps a year ago, using a UNet. So it does seem possible for harder problems.

  • @petertemp4785
    @petertemp4785 3 ปีที่แล้ว

    Really well explained!

  • @BitJam
    @BitJam 3 ปีที่แล้ว

    This is such a surprising result! How can training be so effective at generalizing when the loss function is already so close to zero? Is the loss function really pointing the way to the generalization when we're already at a very flat bottom? Is it going down when the NN generalizes? How many bits of precision do we need for this to work? I wonder what happens if noise is added to the loss function during generalization. If generalization always slows down when we add noise then we know the loss function is driving it. If it speeds up or doesn't slow down then that points to self-organization.

  • @Champignon1000
    @Champignon1000 3 ปีที่แล้ว

    This is super cool, thanks. I'm trying something where I use the intermediate layers of a discriminator as an additional loss for the generator/autoencoder, to avoid collapse. And adding regularizers really helps a lot. (Specifically, instead of an MSE loss between the real image and the fake image, I use a very deep perceptual loss from the discriminator, so basically only letting the autoencoder know how conceptually different the image was.)

  • @felix-ht
    @felix-ht 3 ปีที่แล้ว

    Feels a lot like the process when a human is learning something new. In the beginning one just tries to soak up the knowledge, and then suddenly you just get how it works and convert the knowledge into understanding. To me this moment always felt very relieving, as much of the knowledge can simply be replaced with the understanding.
    Arguably the driving factor in humans for this might actually be something similar to weight decay. Many neurons storing knowledge is certainly more costly than a few just remembering the rule. Extrapolated even further, this might very well be what drives sentience itself: the simplest solution to being efficient in a reality is to understand its rules and one's role within it.
    So sentience would be the human brain grokking its way to a new, more elegant solution to its training data. One example of this would be human babies suddenly starting to recognize themselves in a mirror.

  • @hacker2ish
    @hacker2ish ปีที่แล้ว

    I think the weight decay having to do with the simplicity of the trained model is due to the fact that in some cases it may force many parameters to be almost zeroed out (because weight decay means adding the squared 2-norm of the parameters to the objective function, scaled by a constant).

  • @gianpierocea
    @gianpierocea 3 ปีที่แล้ว

    Very cool. For some reason what I would like to see is to combine this sort of thing with a multitask approach, i.e., make the model solve the binary operation task and some other synthetic task that is different enough. I really doubt this would ever work with the current computational resources, but I do feel that if the model could grok that... well, that would be really impressive, closer to AGI.

  • @TheEbbemonster
    @TheEbbemonster 3 ปีที่แล้ว +1

    That is grokked up!

  • @74357175
    @74357175 3 ปีที่แล้ว +2

    What tasks are likely to be amenable to this? Only symbolic? How does the grokking phenomenon appear in non synthetic tasks?

    • @drdca8263
      @drdca8263 3 ปีที่แล้ว

      this comment leads me to ask, "what about a synthetic task that isn't symbolic?" e.g. addition mod 23.5 , with floating point numbers.
      Or, maybe they tried that? not sure.

  • @tinyentropy
    @tinyentropy 2 ปีที่แล้ว

    What's the name of the other paper mentioned?

  • @BjornHeijligers
    @BjornHeijligers 7 หลายเดือนก่อน

    Can you explain what "training on the validation or test data" means? Are you using the actual "y" data of the test or validation set to backpropagate the prediction error? How is that different from adding more data to the training set?
    Thanks, trying to understand.

  • @PhilipTeare
    @PhilipTeare 3 ปีที่แล้ว +1

    As well as double descent, I think this relates strongly to the work by Naftali Tishby and his group around the 'forgetting phase' beyond overfitting - where unnecessary information carried forward through the layers gets forgotten until all that is left is what is needed to predict the label: th-cam.com/video/TisObdHW8Wo/w-d-xo.html . Again, he was running for ~10K epochs on very simple problems and networks.

  • @ericadar
    @ericadar 3 ปีที่แล้ว +1

    I'm curious if floating point errors have anything to do with this phenomenon. Is it possible that FP errors accumulate and cause de-facto regularization after orders of magnitude more forward and back prop? Is the order-of-magnitude delay between the training and validation accuracy jumps shorter for FP16 when compared to FP32?

  • @XOPOIIIO
    @XOPOIIIO 3 ปีที่แล้ว +3

    My networks are showing the same phenomenon when they get overfitted on the validation set too.

    • @gunter7518
      @gunter7518 2 ปีที่แล้ว

      How can i contact u ? I want to see these networks

  • @billykotsos4642
    @billykotsos4642 3 ปีที่แล้ว

    Interesting indeed !

  • @hannesstark5024
    @hannesstark5024 3 ปีที่แล้ว +5

    I guess the most awesome papers are at workshops!?
    It would be really nice to have a plot of the magnitude of the weights over the training time to see if it decreases a lot when grokking occurs.

    • @ssssssstssssssss
      @ssssssstssssssss 3 ปีที่แล้ว

      Well. Workshops are a lot more fun than big conferences. And the barrier to entry is typically lower so conservative (aka exploitation) bias has less of an impact.

    • @hannesstark5024
      @hannesstark5024 3 ปีที่แล้ว

      @@aletheapower1977 ❤️

  • @TheRohr
    @TheRohr 3 ปีที่แล้ว

    Do the number of necessary steps correlate with those necessary for differential evolution algorithms?

  • @herewegoagain2
    @herewegoagain2 3 ปีที่แล้ว

    Like you mentioned, I think even a minor amount of noise is enough to influence the 'fit' in higher dimensions. I think they're considering an ideal scenario. If so, it won't 'generalise' to real-life use cases. Very interesting paper nonetheless.

  • @petretrusca2
    @petretrusca2 3 ปีที่แล้ว

    It happened to me too on a problem, but the phenomenon occurred much faster. It seemed like the model had converged, but then, after some time spent apparently learning nothing, it snapped into much better performance. I also observed that if the model is larger it snaps faster, and if the model is smaller it snaps slower or not at all.

    • @petretrusca2
      @petretrusca2 3 ปีที่แล้ว

      And it is not a double descent phenomenon, because the validation loss does not increase; it stays constant.
      Also, my dataset is a real one, not an artificial one.

  • @ssssssstssssssss
    @ssssssstssssssss 3 ปีที่แล้ว

    Grokking is a pretty awesome name. Maybe I'll find a way to work it into my daily lingo.

  • @Lawrencelot89
    @Lawrencelot89 3 ปีที่แล้ว

    How is this related to weight regularization? If the weight decay is what causes this, making the weights already near zero should cause this even earlier right?

  • @goodtothinkwith
    @goodtothinkwith 7 หลายเดือนก่อน

    So if a phenomenon is relatively isolated in the dataset, it’s more likely to grok it? If that’s the case, it seems like experimental physics datasets would be good candidates to grok new physics…

  • @martinkaras775
    @martinkaras775 3 ปีที่แล้ว

    In natural datasets where the generating function is unknown (or maybe nonexistent), isn't there a bigger chance of overfitting to the validation set than of finding the actual generating function?

  • @themiguel-martinho
    @themiguel-martinho 2 ปีที่แล้ว

    Why do many of the binary operations tested include a modulus operator?

    • @themiguel-martinho
      @themiguel-martinho 2 ปีที่แล้ว

      Also, since both this work and double descent come from OpenAI, did the researchers discuss it with each other?

  • @larryjuang5643
    @larryjuang5643 3 ปีที่แล้ว

    Wonder if Grokking could happen for a reinforcement learning set up......

  • @abubakrbabiker9553
    @abubakrbabiker9553 3 ปีที่แล้ว

    My guess would be that this is more like a guided search for optimal weights, in that the neural network explores many local minima (with SGD) w.r.t. the training set until it finds some weights that also fit the test dataset. I suspect that if they extended the training period, the test accuracy would drop again and might rise later as well.

  • @MarkNowotarski
    @MarkNowotarski 3 ปีที่แล้ว +1

    22:58 "that's a song, no?" That's a song, Yes! th-cam.com/video/XhzpxjuwZy0/w-d-xo.html

  • @XecutionStyle
    @XecutionStyle 3 ปีที่แล้ว

    What does this mean for Reinforcement Learning and Robotics? Does an over-parameterized network with tons of synthetic data generalize better than a simple one with a fraction, but real data?

  • @jjpp3301
    @jjpp3301 3 ปีที่แล้ว +1

    It's like when you find out that it's easier to lick the sugar cookie than to scrape out the figure with a needle ;)

  • @altus1226
    @altus1226 3 ปีที่แล้ว +4

    This jibes with my understanding of machine learning.
    We task the computer with looking through a "latent space" of possible code combinations in a common yet esoteric language (maths), and ask it to craft a program that takes our given inputs and returns our given outputs. In this example, the number of layers and parameters we give it directly represents the number of separate functions and variables, respectively, that it is allowed to create a program with.
    As we increase the size of the notepad file we are giving our artificial programmer, its ability to look at different possibilities and write complex functions increases, with non-linear increases in efficacy. If it is only a little too large, it may find it can create a few approximations at a time, but can't get enough together to compare and contrast to quickly see a workable pattern, so it has to slog through things a few at a time while forgetting all the progress it had otherwise made. If the notepad is too small, it will never make a perfect program regardless of time spent.
    If you were to compare the training of a perfectly sized notepad for a program to one that is oversized, this seems to represent a trade-off of memory size for training speed, as it allows the learning algorithm to investigate more of the latent space at a time.

    • @G12GilbertProduction
      @G12GilbertProduction 3 ปีที่แล้ว

      Everything is right. But where does a compiler play its own role?

    • @altus1226
      @altus1226 3 ปีที่แล้ว

      @@G12GilbertProduction a compiler generally only removes a layer of abstraction, which helps reduce complexity greatly for humans, however in our case, I feel like our programmer is writing directly in the computed language- there is no need for a compilation step.
      That said, I see no reason we could not come out with a reverse compiler to translate such NN into human readable code.

  • @raunaquepatra3966
    @raunaquepatra3966 3 ปีที่แล้ว

    Is there learning rate decay? Then it would be even more crazy,
    because jumping from a local minimum to a global one would be counterintuitive.

  • @kadirgunel5926
    @kadirgunel5926 3 ปีที่แล้ว +1

    This paper reminded me of the halting problem, didn't it? It can stop, but it may take too much time; it is nondeterministic. I felt like I was living in the 1950s.

  • @employeeofthemonth1
    @employeeofthemonth1 3 ปีที่แล้ว

    So is there a penalty on larger weights in the training algorithm? If yes, I wouldn't find this surprising intuitively.

  • @pascalzoleko694
    @pascalzoleko694 3 ปีที่แล้ว

    No matter how good a video is, I guess someone has to put the thumb down. Very interesting stuff. I wonder what it costs to run training scripts for soooo long :D

  • @ZLvids
    @ZLvids 3 ปีที่แล้ว

    Actually, I have encountered several times an "opposite grokking" phenomenon where the loss suddenly *increases*. Does this phenomenon have a name?

  • @azai.mp4
    @azai.mp4 3 ปีที่แล้ว

    How is this different from epoch-wise double descent?

  • @derekxiaoEvanescentBliss
    @derekxiaoEvanescentBliss 3 ปีที่แล้ว

    Curious though if this has to do with the relationship between the underlying rule of the task, and the implicit regularization of gradient descent and explicit regularization.
    Is this a property of neural network learning, or is it a property of the underlying manifolds of datasets drawn from the real world and math?
    Edit: ah looks like that's where they went in the rest of the paper

  • @st33lbird
    @st33lbird 3 ปีที่แล้ว +7

    Code or it didn't happen

  • @pisoiorfan
    @pisoiorfan 3 ปีที่แล้ว

    I wonder if a model after grokking suddenly becomes more compressible through distillation than any of its previous incarnations. That would be expected if it discovers a simple rule that fits all inputs, and would be the equivalent of the machine having an "Aha!" moment.

  • @victorrielly4588
    @victorrielly4588 3 ปีที่แล้ว

    Yes! A transformlet paper.

  • @ayushkanodia240
    @ayushkanodia240 3 ปีที่แล้ว

    In appendix A.1 - "For each training run, we chose a fraction of all available equations at random and declared them to be the training set, with the rest of the equations being the validation set." Can someone tell me how they choose their training and test sets? Figure 5 seems to indicate that one model runs on only one equation (same for the RHS in figure 1), but this line from the paper suggests otherwise.

    • @ayushkanodia240
      @ayushkanodia240 3 ปีที่แล้ว

      @@aletheapower1977 thank you. So the statement in the appendix is for equations within an operation. Clarifies!

  • @patrickjdarrow
    @patrickjdarrow 3 ปีที่แล้ว +2

    Some intern probably added a few too many 0's to their n_epochs variable, went on vacation and came back to these results

    • @oscezrcd
      @oscezrcd 3 ปีที่แล้ว

      Do interns have vacations?

    • @patrickjdarrow
      @patrickjdarrow 3 ปีที่แล้ว

      @@oscezrcd depends on their arrangement

    • @patrickjdarrow
      @patrickjdarrow 3 ปีที่แล้ว

      @@aletheapower1977 LOL. Love that

  • @MCRuCr
    @MCRuCr 3 ปีที่แล้ว +11

    This sounds just like your average „I forgot to stop the experiment over the weekend“ random scientific discovery.

  • @ralvarezb78
    @ralvarezb78 3 ปีที่แล้ว

    This means that at some point there is a very sharp optimum in the surface of loss function

  • @agentds1624
    @agentds1624 3 ปีที่แล้ว +1

    It would also have been very interesting to see if the validation loss also snaps down. Maybe it's just coincidence that there is no val loss curve ;). I currently also have a kind of similar case where my loss curves scream overfitting, however the accuracy sometimes shoots up...

    • @Coolguydudeness1234
      @Coolguydudeness1234 3 ปีที่แล้ว +2

      the figures and the whole subject of the paper is referring to the validation loss

  • @shm6273
    @shm6273 3 ปีที่แล้ว +1

    Wouldn't the snapping be a symptom of the loss landscape having a very sharp global minima and a ton of flat points? Let the network learn forever (and also push it around) and it should eventually find the global minima. I'm assuming in a real dataset, the 10^K iterations needed for a snap will grow larger than there are atoms in the universe.
    EDIT: I commented too soon, seems you had a similar idea in the video.

    • @shm6273
      @shm6273 3 ปีที่แล้ว

      @@aletheapower1977 cool. So you believe the minima is actually not sharp, but pretty flat and that makes the generalization better? I remember reading a paper about this, I’ll link to it a bit later

  • @CharlesVanNoland
    @CharlesVanNoland 3 ปีที่แล้ว

    Reminds me of a maturing brain, when a baby starts being aware of objects, tracking them, understanding permanence, etc.

  • @DasGrosseFressen
    @DasGrosseFressen 3 ปีที่แล้ว

    Learn the rule? So there was extrapolation?

  • @andrewminhnguyen9446
    @andrewminhnguyen9446 3 ปีที่แล้ว +2

    Makes me think of "Knowledge distillation: A good teacher is patient and consistent," where the authors obtain better student networks by training for an unintuitively long time.
    Also, curious how this might play with "Deep Learning on a Data Diet."
    Andrej Karpathy talks about how he accidentally left a model training over winter break and found it had SotA'ed.

  • @f3arbhy
    @f3arbhy 3 ปีที่แล้ว

    No more earlyStopping. Got it!!

  • @andreasschneider1966
    @andreasschneider1966 3 ปีที่แล้ว

    Nice paper, also well done on your side! The generalization is relatively slow though I would say, since you are on a log scale here

    • @erwile
      @erwile 3 ปีที่แล้ว

      It could be helpful with tiny datasets !

  • @Mordenor
    @Mordenor 3 ปีที่แล้ว

    I think the group is C97, the only abelian group of order 97. Identity element is a.

    • @birkensocks
      @birkensocks 3 ปีที่แล้ว +1

      I don't think Z_97 is a group under most of the binary operations they used. But (x, y) -> x + y mod 97 is one of the operations, so it could be the group. It does seem like a is the identity element, and the table is symmetric about the diagonal
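      (A tiny check of that reading, assuming the operation really is x + y mod 97 and that the symbol a maps to 0; just a sketch:)

      p = 97
      table = [[(x + y) % p for y in range(p)] for x in range(p)]

      # 0 behaves like the identity 'a': its row and column echo the other operand,
      # and the table is symmetric because addition mod p is commutative (the group is cyclic, C_97).
      assert all(table[0][y] == y for y in range(p))
      assert all(table[x][0] == x for x in range(p))
      assert all(table[x][y] == table[y][x] for x in range(p) for y in range(p))
      print("consistent with the abelian group C_97, identity a -> 0")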