An Exactly Solvable Model for Emergence and Scaling Laws

Tunadorable

5,383 views

Published on 1 Jul 2024
The paper:
arxiv.org/abs/2404.17563
Support my learning journey either by clicking the Join button above or becoming a Patreon member!
/ tunadorable
Discuss this stuff with other Tunadorks on Discord
/ discord
All my other links
linktr.ee/tunadorable

Comments • 35

@Crawdaddy_Ro 9 days ago ⁺¹⁴
Emergence is one of the concepts I enjoy researching most! Complexity science is, without a doubt, a truly futuristic science! This paper really pulls my cord, dude!
Edit: The paper is interesting but feels pretty basic when it comes to explaining emergence in deep learning models. They used a simplified model with specific tasks designed just for this research, and while it's cool to see skills following a power law and showing up as a sigmoid curve, I'm not sure how relevant it is to real-world applications. The models seem too tailored to this experiment to draw any solid conclusions about how skills emerge in more complex, practical scenarios.
@loganlawrence1476 9 days ago ⁺³
The parameter count limit in the bottleneck table might also be a proxy for inference costs or product latency, e.g. a company sets aside a fixed budget for a deployed model but has lots of time until go-live and is willing to spend money on training to find the best performer within that speed constraint. Just an idea, great video btw!
@AndreRSilva-oz1nd 9 days ago ⁺⁴
Man, amazing vids, keep the good work going!
@marcfruchtman9473 9 days ago ⁺³
The title of this paper is super interesting. I do find the choice of "skills" as basis functions within the model somewhat difficult to wrap my head around. It would be immeasurably more useful if they were able to demonstrate that it also modeled some real-world example, such as using the MNIST data and applying basis functions such as detecting horizontal lines, vertical lines, diagonals, loops, etc., and then evaluating the result to see if it matched their findings when using the mathematically derived basis functions.
I look forward to any future updates.
@wwkk4964 9 days ago
Top tier content!
@whemmakatatt5311 9 days ago ⁺²
NICE content. S tier
@netherportals 9 days ago
Pretty cool new ability
@joe_limon 9 days ago
I think to advance future models, we are going to have to figure out how to increase training efficiency.
@JGLambourne 8 days ago
Re: orthogonality of real-world skills. It feels like a bit of a stretch to think of such complex things in this linear way, but I guess one could imagine some "basis" skills from which others are composed.
@RoulDukeGonzo 9 days ago
How does this relate to the whole "measurement creates emergence" thing?
@kimcosmos 8 days ago
Is it possible to separate the simple skills by filtering out all data that does not assume those simple skills? I.e. filter out the obvious data once it becomes obvious, to avoid repeatedly reinventing the wheel. It means identifying the obvious once it becomes obvious, i.e. looking for nonobvious or counterintuitive data by running a prediction filter on (what has become) the obvious. It means testing generalising circuits (4 layers to find, +4 to test) and using them as retrieval heads to filter the data stream. Qstar is relatively compute inefficient but useful with sparse data because of its improved accuracy, and this would be a good use case. Filtered data, fewer shots. Maybe fewer parameters after filtering and maybe fewer layers if fast grokking retrieval heads with 8 layers.
@Tunadorable 8 days ago
interesting. recently the fineweb-edu dataset was created as a filtered down version of fineweb where they asked llama 70b whether each document had educational value or not. i imagine that may be a conceptually easier method (albeit potentially more computationally intensive). a question like “is this document relatively mundane, or does it contain unusually rare/complex facts/reasoning?”. alternatively some sort of rating by perplexity or some other quantitative measure might work.
@kimcosmos 8 days ago
@Tunadorable RAG-like retrieval heads can use a more focused subset for local learning, especially few-shot, sparse-data methodical analytics, i.e. "What am I missing here?"
Fineweb extracts data pairs (ER?) with one of its 5 reward prompts for creating artificial data being "- Add another point if the extract addresses certain elements pertinent to education but does not align closely with educational standards. It might mix educational content with non-educational material, offering a superficial overview of potentially useful topics, or presenting information in a disorganized manner and incoherent writing style."
@kimcosmos 8 days ago
@Tunadorable "Add another point if the extract addresses certain elements pertinent to education but does not align closely with educational standards. It might mix educational content with non-educational material, offering a superficial overview of potentially useful topics, or presenting information in a disorganized manner and incoherent writing style." That is 1 out of 5 points in their artificial generator prompt.
It's not using Q* to find optimum paths. Fineweb is getting the low-hanging fruit. Q* shakes the tree and is good for ticker feeds.
@RoulDukeGonzo 9 days ago
From the comments I think I got the answer, but just to clarify, this is theoretical, right? Why would skill data be so uniform on real skills?
@Tunadorable 9 days ago ⁺¹
yes it's theoretical. On real skills it's likely not as uniform, but it's very possible the general theme still holds true in aggregate. The idea that some skills are common while rare skills are very, very, very (orders of magnitude, or exponentially) more rare seems reasonable; if anything, the alternative, where rare skills are only slightly (geometrically? linearly?) more rare, would be a good thing. However, the fact that so far we've had to increase LLM training compute by orders of magnitude in order to get linear returns on benchmarks would imply the former.
@phpn99 9 days ago
It's a descriptive model. It has no predictive power.
@andrewsilber 9 days ago
Maybe Congress should authorize a full digitization of the Library of Congress if what we need is trillions of tokens of quality data. Presumably they could justify it on the grounds of national security, if the goal is to stay ahead in the “AI arms race”
@Tunadorable 9 days ago
interesting
@JehovahsaysNetworth 9 days ago
ChatGPT can’t write PHP like I showed it how to. I tried and it failed to understand. If you know a better bot to try out, direct me to one to work with.
@SoFukinDope24 9 days ago ⁺¹
easy solution: use anthropic
@JehovahsaysNetworth 9 days ago
@SoFukinDope24 I will search for it and try it, thanks
@RoulDukeGonzo 9 days ago ⁺¹
Easier solution, learn Python
@JehovahsaysNetworth 9 days ago
@RoulDukeGonzo I know some Python; I used to use a piwiki bot on my MediaWiki
@ricosrealm 9 days ago ⁺¹
Claude is the best for coding.
@sikunowlol 9 days ago ⁺²
oi
@jacksaunders1929 9 days ago
Have you thought about doing a PhD?
@Tunadorable 9 days ago ⁺³
oof, during undergrad I considered doing one in economics, but back then, after going through the legit publication process, talking with professors, looking at the way the system works, etc., it sounded more restrictive than freeing. I considered it again when I decided I wanted to pivot into AI, but I was blessed to chance upon a short conversation with Paul Christiano and he told me it wasn't necessary for this field, just self-publish then go work at a company. Rn I'm hoping I can become self-sufficient off YouTube and do a combo of research & science education without any boss/restrictions
@danielmartinmonge4054 5 days ago
This paper seems to miss the point about emergent capabilities. From my understanding, the model is learning to solve a specific problem only because it appears in the dataset and is solved in an exact way. The more frequently this exact problem appears, the faster the model learns it. However, true logic, abstraction, and understanding are about finding broader connections between concepts and solving new problems that are not present in the dataset. My intuition suggests that this approach is not suitable for learning natural language. Human knowledge cannot be reduced to a finite set of easily solvable problems. This method overlooks the critical strength of large language models: symbolic abstraction, where specific problems are merely examples of broader categories. It seems to me that the paper fails to address the core aspects of these new architectures. It applies mathematical models designed for narrow, purpose-specific AI rather than for this broader kind of intelligence.
@waveFunction25 9 days ago
Oi
@GNARGNARHEAD 9 days ago
oi, this is a comment
