Show notes and transcript:
www.dropbox.com/scl/fi/3lufge4upq5gy0ug75j4a/RANDALLSHOW.pdf?rlkey=nbemgpa0jhawt1e86rx7372e4&dl=0
Fascinating talk, thank you for sharing!
I can be completely wrong, though these ideas look so much like an adaptive state classifier for an adaptive control system.
The splines appear similar to linear fuzzy membership functions (fuzzy logic merges statistical math with semantic language) with overlapping probabilistic states. The centers of these membership functions can be adaptively adjusted from experience, as discussed in this video, using K-means clustering. TRPO, PPO, GRPO, and the attention heads (from the Transformer paper "Attention Is All You Need") likewise all look like forms of adaptive state classification ("grokking") for an adaptive control system of the reasoning (and maybe even "consciousness") process.
Combining state classification with RL likely produces a robust, capable system, especially as indicated in this video: "...after very, very long training." I need to explore this further.
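To make the analogy concrete for myself, here is a tiny sketch in Python/NumPy (my own toy, not anything from the talk; the data, centers, and temperature are made up): K-means centers adapting to experience, with a softmax over distances acting like overlapping fuzzy memberships.

import numpy as np

rng = np.random.default_rng(0)
# Two "experience" clusters the classifier should adapt its states to.
data = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.7, 200)])

centers = np.array([-1.0, 1.0])           # initial membership-function centers
for _ in range(20):                       # plain batch K-means (Lloyd) iterations
    assign = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
    centers = np.array([data[assign == k].mean() for k in range(len(centers))])

def soft_membership(x, centers, temperature=1.0):
    # Overlapping, fuzzy-style state probabilities from distances to the centers.
    d = -np.abs(x - centers) / temperature
    e = np.exp(d - d.max())
    return e / e.sum()

print("adapted centers:", centers)          # roughly [-2, 3]
print("memberships at x=0:", soft_membership(0.0, centers))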
Another fascination of mine is the use of the Greek letter theta throughout the deep learning literature. As always, I can be completely wrong (and crazy to think this), but I highly suspect the origin of theta is the angle of the pendulum in the 1983 IEEE "cart-pole" paper by Barto, Sutton, and Anderson, where they used reinforcement learning as an adaptive control system. There is an even simpler experiment, using just a pole with no cart, in a 1997 master's thesis by an American master's student with a Chinese advisor; its title begins "Reinforcement Learning: Experiments with State Classifiers...". This simple pole (a "useless machine") is described as the "simplest robot" in a fascinating lecture on YouTube by Scott Kuindersma, one of Barto's students who went on to Boston Dynamics.
Although the speaker in this MLST talk seemed to steer people away from looking at the older history of all this, I perceive that history as vital not only for understanding how this works, but also for reconstructing it, especially after a catastrophe, since history is a regular cycle of collapse and resurrection. It makes me understand how the Antikythera mechanism is so out of place in the archaeological record and why it was so challenging to reconstruct and to understand its celestial purpose.
Slowly, or maybe too quickly, we are beginning to understand the "magic" hat 🎩 of ☃️"Frosty", though I hope the way this "magic hat" is made is not forever lost to history; it is vital to robustly identify its history and core design!
We live in amazing times. Again, fascinating talk and I really appreciate you sharing it, thank you!
I'm a simple man. I structure my interest around MLST uploads.
Still watching, great video! Another popular paper from January was "Grokking at the Edge of Numerical Stability"... similar ideas, orthogonal gradient updates, and also a fix for softmax.
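If I understood the orthogonal-gradient part correctly, the update drops the gradient component parallel to the current weights before stepping. Treat this as my own rough sketch in Python/NumPy of that idea, not the paper's exact method; the function name and the numbers are mine.

import numpy as np

def orthogonal_grad(w, g, eps=1e-12):
    # Remove the component of g along w: g_perp = g - (<g, w> / <w, w>) * w,
    # so the step changes the function's shape rather than just rescaling it.
    return g - (np.dot(g, w) / (np.dot(w, w) + eps)) * w

w = np.array([1.0, 2.0])
g = np.array([0.5, 0.5])
g_perp = orthogonal_grad(w, g)
print(g_perp, np.dot(g_perp, w))   # the dot product with w is ~0 by construction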
Was reading this paper and then this got uploaded🔥 thank you
I've watched 3 episodes of mlst today haha
It's good to be a Patreon supporter; the content there is great as well
What we are seeing here seems to be transferable to understanding high-functioning autism; it also seems related to the drastic neural pruning infants go through, and in general it shines a light on pedagogy and logic. I suppose this data should be of paramount importance to cognitive psychology and neurology across the board.
Given the rate at which pedagogy moves, I'd expect this to be picked up 15 years from now, and applied in the next century ...😅
excellent analogy/metaphor
Great 👍
Top stuff! Immediate takeaway: more computation can take you a long way toward improving the resiliency of what is learned from a given dataset. Hmmm, Nvidia...
Elastic origami…. in high dimensions
Lol
10:11 I don't follow this claim. Clearly the partitions do change even far away from the data - when SGD adjusts some partitions close to the data, all the intersections with those partitions, no matter how far away, are affected. But why would they be affected in ways that extrapolate? What the NN looks like far away from the data is irrelevant to the training loss. There is no gradient information far from the data - all the changes that happen there are purely side-effects of minimizing training loss, i.e. getting a better fit close to the data.
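As a quick numerical check of my own (a Python/NumPy toy, nothing from the video; the net, the far-away points, and the size of the perturbation are all made up): perturb one first-layer weight of a small ReLU net, the way an SGD step driven by near-data loss might, and count how many far-away points land in a different partition cell (activation pattern).

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)

def region_code(X, W1, b1):
    # The sign pattern of first-layer pre-activations identifies each point's partition cell.
    return (X @ W1.T + b1 > 0).astype(int)

# Points far from any plausible training data (radius ~100).
angles = rng.uniform(0, 2 * np.pi, 500)
far = 100.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)

before = region_code(far, W1, b1)
W1_new = W1.copy()
W1_new[0] += 0.3 * rng.normal(size=2)   # stand-in for one SGD step driven by near-data loss
after = region_code(far, W1_new, b1)

print("fraction of far-away points whose cell changed:", np.any(before != after, axis=1).mean())

So the far-away partition does move, but only as a side-effect of the local update; nothing in the training loss says it should move in a way that extrapolates.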
I think all he’s trying to say is that, in comparison to K-means, there can be coupling from distant decision regions during the learning process, whereas in K-means you can rearrange cluster centroids all you want on the north side and it won’t do anything to the south pole, because that algorithm isn’t nested like deep networks are
(I don’t think he’s talking about off the data manifold, he’s just talking about distant regions)
Oh sorry, I was mistaken above; I was responding to the immediately preceding statements.
But it’s a related idea about nested high-D space. If your (affine) spline adjustments are tuned to the data manifold, the geometric extension of those will be like kaleidoscope replicas. Assuming the test data is somewhat similarly structured, it should largely conform, i.e. nested splines generalize well under the assumption that “natural data exhibit tendencies”.
Whereas for the reason I state above, K-means just can’t do this by comparison. Think of the infinite polygon boundaries at the edge of a Voronoi decomposition.
But yes, this is a claim about the behavior of NNs and maybe not a perfect intuition.
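To show the K-means contrast concretely (my own toy, with made-up centroids and points): nudging a northern centroid cannot change the assignment of southern points unless it becomes their nearest, so distant rearrangements stay local, unlike the nested case above.

import numpy as np

centroids = np.array([[0.0, 10.0], [3.0, 9.0], [0.0, -10.0], [3.0, -9.0]])  # two north, two south
rng = np.random.default_rng(2)
south_points = rng.normal(loc=[1.5, -9.5], scale=0.5, size=(100, 2))

def assign(X, C):
    # Nearest-centroid (Voronoi) assignment.
    return np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)

before = assign(south_points, centroids)
moved = centroids.copy()
moved[0] += np.array([2.0, 1.0])        # rearrange a centroid on the north side
after = assign(south_points, moved)
print("south assignments changed:", bool(np.any(before != after)))   # False: the edit stays local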
❤
Second, but French too
First
Feels like the regions need relations (Nexus?) to understand connections for reasoning over patterns/symbols/translations. Clouseau vibe. 🔎 Cool, deep video. Reminds me of the XO drum VST plugin. Part 2) Might be very promising for overall consistent movie sequence generation. 🤔 The ability to overlay a reasoning mechanism on top of a loop where the index runs from Context'First to Context'Last is kind of a challenge.