I think you may have misunderstood part of the experimental setup, or at least implied it. I believe they trained 256 different models, each on a single rule set. That's how they generated the plots in Figure 2 with rule complexity on the x-axis. So the model would not be looking at previous states to determine which rule it was seeing, but instead learning a more complex understanding of the training rule, as the authors hypothesized. It could be interesting to see whether training on multiple rules simultaneously would improve downstream performance, especially across complexity classes.
Why don't all of them just share code!
@@arjavgarg5801 Sharing the code likely would not have clarified this. I would have written a script to generate a single rule dataset and then concatenated them manually.
As for sharing code in general, it is A LOT of work. Research-grade code is messy and hacky, and it needs to be cleaned, documented, and tested on clean builds for release. I recently uploaded a project to GitHub and it took 4 days to clean and document, yet it would still be an embarrassment to many groups / organizations.
Besides, if a paper is written correctly, someone in the field should not require the code to verify its experiments, and releasing code is often an excuse to put less effort into reproducibility (it also increases scrutiny during review and increases the chance of rejection).
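For context, the kind of single-rule dataset script described above is only a few lines. A minimal sketch, assuming the paper's setup uses elementary (1D, Wolfram rules 0–255) cellular automata; function names and parameters here are illustrative, not from the paper:

```python
import numpy as np

def step(state, rule):
    """One update of an elementary CA (Wolfram rule 0-255), periodic boundaries.

    Each cell's next value is looked up from its 3-bit neighborhood
    (left, center, right), where neighborhood pattern i selects bit i of `rule`.
    """
    rule_table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
    left = np.roll(state, 1)    # left[i] == state[i-1]
    right = np.roll(state, -1)  # right[i] == state[i+1]
    idx = (left << 2) | (state << 1) | right
    return rule_table[idx]

def make_dataset(rule, width=64, steps=32, n_samples=4, seed=0):
    """Generate trajectories (stacked state sequences) for a single rule."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        state = rng.integers(0, 2, size=width, dtype=np.uint8)
        history = [state]
        for _ in range(steps):
            state = step(state, rule)
            history.append(state)
        samples.append(np.stack(history))  # shape: (steps + 1, width)
    return samples

# One dataset per rule, matching the 256-models-one-rule-each reading:
data_rule110 = make_dataset(rule=110)
```

Concatenating the outputs of `make_dataset` over several rules would give the multi-rule training mix speculated about above.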
@hjups No need to document the code itself. The paper is the documentation.
And here all you folks are, in the field, and still you're making mistakes due to the ambiguity of natural language.
@@arjavgarg5801 Perfectionism says otherwise. Publicly releasing code is releasing an artifact, and can have as much of an impact on job prospects as a paper. Unfinished paper or undocumented / poorly structured code, both are bad.
There are other issues to contend with too, such as making sure the licensing is correct (when merging code bases), and making sure components under NDA are disentangled.
As for mistakes due to ambiguity of natural language, most of the time this is intentional.
We have developed a novel strategy to unlock reasoning in LLMs at the speed of greedy decoding and were looking to ground our work in theory; your video just got recommended, and this paper shares a lot of similarities with our work. Thank you for covering it. We are targeting ICML, wish us luck!
Bro you're going to be a monster human being. Godspeed
You've got no idea how far this type of work has already gone in other parallel developments. The intelligence of the model matters: Claude vs. GPT vs. Gemini, etc. It's fascinating to see what they do with the rules given and how those rules interact to create new output types and understanding. Cool stuff bro, keep sharing please 🙏🏽
Lol, loved that clarification about Peterson.
So is the idea like: if you can train a model on a pattern generator that is capable of generating more complex and varied (but structured, i.e. governed by a discrete ruleset) patterns, then feed it inputs that mimic the training data, it can then predict the patterns from that, e.g. a parameterized chess game? That seems like a good canonical example for ML or genetic algorithms.
It's interesting to look at hierarchical systems in biology from that lens, too, like how our body has essentially fully "learned" itself and how that produces rich sensations on top of it.
It could also allow for diversity of ability between models in a genetic fashion. With infinitely many rulesets in infinitely many possible automata systems, only some can reasonably be targeted and trained for, so you can develop a kind of diversity in ability from model to model.
Have you read Dr McGilchrist’s book “The Divided Brain”? There’s some interesting overlap with that book and Dr Peterson’s book “Maps of Meaning”.
The thesis that different halves of the brain use a different type of attention could possibly map to the order / chaos idea.
Yeee, I read, I wanna say, 2/3 of that book years ago but don't remember it as well as MoM.
Now the question is: what is the best general task to pretrain models on for anything?
text hahaha, well at least if we're talking about already existing data
I wonder how it would be for a square image of all the rgb values
23:45 curriculum learning
Not advisable for LLMs though
@@KevinKreger why not? I've actually been meaning to try it
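For anyone meaning to try it: in its simplest form, curriculum learning just orders training data from easy to hard and widens the pool in stages. A minimal sketch, with a user-supplied difficulty score and training callback (all names here are illustrative, not from the paper or any framework):

```python
def curriculum_order(samples, difficulty):
    """Sort training samples from easy to hard by a difficulty score."""
    return sorted(samples, key=difficulty)

def train_with_curriculum(samples, difficulty, train_step, stages=3):
    """Train in stages, each stage unlocking a harder fraction of the data.

    `train_step` is a caller-provided function that consumes a batch
    (list of samples); `difficulty` maps a sample to a sortable score.
    """
    ordered = curriculum_order(samples, difficulty)
    n = len(ordered)
    for stage in range(1, stages + 1):
        # Stage k trains on the easiest k/stages fraction of the data
        cutoff = max(1, (n * stage) // stages)
        train_step(ordered[:cutoff])
```

For the CA setup discussed in the video, a natural difficulty score would be the rule's complexity class, so early stages see only simple rules.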
Reminds me of the c64 demo 'A mind is born'
Thanks, great explanation, good insight :)
Funny thing is, I did the same thing but with the Game of Life by John Conway, a form of CA. Stealing my ideas :( ... (JK!) Love your videos bro. I be eating cereal watching your videos.
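For the Game of Life variant of this experiment, the update rule fits in one function. A quick sketch of a single Life step on a periodic (toroidal) grid, just for reference; this is the standard B3/S23 rule, not code from the paper:

```python
import numpy as np

def life_step(grid):
    """One Conway's Game of Life update (B3/S23) on a toroidal grid."""
    # Count the 8 neighbors by summing shifted copies of the grid
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(np.uint8)
```

Rolling the grid on both axes gives periodic boundaries, so patterns wrap around the edges.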
Eww, violin charts...
first
Sure, first on the video that got 100 views in 2h
impressive
Now last 🤷‍♂️