Best tech talk since LLM!
since which?
This is crazy. I'm like a low dim creature witnessing high dim creature's thinking process and experimental methods. The testbed is so well chosen that I'll build on this to learn more. Thank you so much.
Finally, someone has investigated the physics of LLM. Great talk
Great work! The research questions are simple and the answers are profound and not too complex, but they are also exactly what we need to understand how LLMs work. I hope you continue to demystify the limitations of LLMs and even find technical remedies. One of the highest signal-to-noise pieces I've had the pleasure to enjoy in quite some time!
This is the most insightful talk I have ever seen. It has taught me how powerful controlled experiments can be!
One of the clearest talks about LLMs I've ever heard
Awesome talk! A thorough and detailed investigation into multiple aspects of LLM learning and behavior.
Very high information density. Each subsection is worth an individual paper.
great talk, very insightful, excited to see that we can prepare synthetic data to manipulate LLMs
Awesome talk, I am glad it is available now for those who can't make it to ICML in person!
Allen, this is such a wonderful talk. So many thanks for putting this online!
Super insightful, thanks for the upload!
Okay, this was a masterclass-level presentation. Great job Zeyuan!
Absolutely the best talk I've seen in a long time. This is how you do science!
Timestamps (Powered by Merlin AI)
00:10 - Theory of language models covers a spectrum of meanings
02:59 - Newton's laws built on Kepler's work, preceded by Tycho Brahe's observatory data
09:07 - Emphasizing the physics of language models for detailed study and understanding.
12:08 - Language models need controlled experiments to study knowledge manipulation
18:12 - Knowledge augmentation significantly improves accuracy of language models.
21:02 - Augmenting knowledge in language models changes how knowledge is stored.
26:50 - Exploring knowledge manipulation in language models
29:54 - GPT-4 excels in simple tasks but struggles with knowledge manipulations.
35:29 - Language models may struggle to fully extract specific knowledge details.
38:18 - Synthetic data can measure knowledge stored in datasets and language models.
43:57 - Rare knowledge decreases knowledge capacity, affecting model performance
46:42 - Importance of controlled experiments in assessing knowledge capacity
52:12 - Utilizing synthetic data to control and study language model behaviors
55:06 - Developing synthetic math data sets is crucial for unbiased and diverse problem generation.
1:00:47 - Challenges in language models solving arithmetic problems
1:03:32 - Language models can generalize to solve math problems beyond training data
1:09:13 - Language models develop reasoning skills beyond level-1.
1:12:05 - Language models make two types of mistakes: unnecessary parameters and getting stuck during computation.
1:17:48 - Language models require deep networks for reasoning skills.
1:20:43 - Language models often know when they make mistakes
1:26:18 - Label masking not needed for learning from errors in math data
1:28:56 - Creating fake mistakes can improve accuracy gain in language models
1:34:23 - Language models learn structures and formats efficiently
1:37:03 - Importance of relative attention in training language models
1:42:42 - GPT can learn CFG trees and linearly encode sub-tree information.
1:45:28 - Language models utilize dynamic programming for parsing and generation
1:51:19 - GPT-2 learns algorithms from patterns
Thank you for uploading! So much to learn :)
woww, just woww, one of the best talks ever, up there with my goated DL videos
Good job! You explained it in a very clear way! One of the best talks I have watched recently
Thanks a lot for the illuminating and inspiring research! And of course for the extremely dense and entertaining presentation!
Fantastic talk. This is the kind of combination I love to see: hypotheses backed up with disciplined investigative research.
Really insightful. Great talk.
Awesome talk! The best talk I have seen! And this is my first time leaving a comment on YouTube...
Thank you very much for posting this talk!
Such an amazing talk and so many learnings
Great talk. Genius level. That said, imho it takes humans around 20 years of education, filled with structured training and guidance, to learn what to remember and prioritize before we can effectively apply the scientific method. This training is continuously updated and refined. Expecting an LLM to reason and make decisions without specific training seems unrealistic. The issues with AI reasoning are genuine, but the need for training is normal and essential. Once properly trained, using expensive CoT or any new or future methodology, I would hope that an LLM could perform tasks at a much faster rate, but this training is crucial because knowledge and reasoning, whether in humans or LLMs, don't come as a magical solution without effort and guidance. I'm no specialist, so take what I say with a grain of salt or a bucket. Again, great talk. ty
Amazing tutorial. I learned a lot about how LLMs work!
What a smart guy! His speech is excellent.
Super interesting and informative. Thanks for posting this!!
woo, thank you for your talk. Love it so much! 😀
excellent talk, great demonstration of how LLMs work, and more importantly how research works in general; a breath of fresh air amid the hype
thanks for the talk, it is very inspiring and interesting!
Awesome stuff!!
Thanks for uploading!!
Great talk - thanks!!
That's a terrible [Back] amazing talk!
hahaha, good one
Amazing talk!
Beautiful talk! Textbook level.
amazing talk,
wonderful talk, thanks
Really insightful❤
can anyone explain why pretraining on bio + fine-tuning on QA is so different from mixed pretraining on bio + QA? what if the fine-tune was a full fine-tune updating all weights? why would that be any different from including QA in pretraining? (re 26:21)
Sure, the failure of the finetune for this part includes full finetuning. The controlled experiment made sure that the finetune dataset only contains QAs for *half* of the individuals, thus even overfitting (e.g. full finetune for sufficiently long) on this half doesn't generalize to the QAs for the other half. If you're interested in the details, you might want to watch my separate video talking about Part 3.1+3.2.
PS: the failure of the finetune for Part 2.2, Result 7, does not apply to full finetuning (but for a not-so-interesting reason). Details will be in my separate video in Part 2.2 that will come in a few days (or in the arxiv paper).
Doing mixed-pretrain allows the model to see two formats of data at the same time, thus this functions just like data augmentation and thus it works.
thank you for the response!
when you say "see two formats of data at the same time", does that imply the bio and QA data sets are shuffled together and/or there are multiple epochs of pre-training?
would the outcome be different if pre-training were a single epoch? in that case, pre-training on A followed by fine-tuning on B would be identical to pre-training on A and B, right?
in any case, thanks for the response, will watch the other video, loving this!
@@PatrickvanNieuwenhuizen Yes and there's a ratio controlling the fraction of data coming from bio/QA in each sample (of whatever context length). I included either only the bio data or only the QA data in each sample --- if you want to see how sample ratio affects the result, it's in the paper or the other video.
Pretraining with one pass of the data gives the model only one chance to see each piece of knowledge. This is called "1 exposure" using the language from Part 3.3. If all pieces of knowledge only received 1 exposure then I don't think the model will remember anything. One needs to have some factual knowledge that repeatedly appears in the pretrain data (with enough exposures) so that the model learns how to "parse" such knowledge. Part 3.1 is talking about this "parsing" for such (repeated) knowledge data, and it's important to teach the model how to correctly parse such knowledge (with either mixed training or data augmentation). It will be too late to do that during finetune.
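(For concreteness, here is a minimal sketch of the mixed-pretraining setup described in this reply: every context window is filled with either bio text or QA text, chosen with a fixed ratio, so the model sees both formats during pretraining. The ratio value, helper names, and crude token counting are illustrative assumptions, not the paper's code.)

```python
import random

def make_pretrain_sample(bio_pool, qa_pool, context_len, qa_ratio=0.2, rng=random):
    """Fill one context window with either only bio data or only QA data."""
    pool = qa_pool if rng.random() < qa_ratio else bio_pool
    sample, n_tokens = [], 0
    while n_tokens < context_len and pool:
        piece = rng.choice(pool)          # one biography entry or one QA pair
        sample.append(piece)
        n_tokens += len(piece.split())    # stand-in for a real tokenizer
    return " ".join(sample)
```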
Very nice presentation. One question I had on the results shown around 18:40: how do you ensure that this effect is not just overfitting on the Q/A training set? Did you try an experiment with some biographies in a validation split during finetuning? Does it never work, or does it vanish due to overfitting?
This is excellent research! I loved it! I have a question about the difference between fine-tuning and pretraining. In Part 2.2, learning from mistakes on grade-school math problems, the experiment showed that it is too late to add "retry" data at the fine-tuning stage. I wonder if there's a difference in the objective function between fine-tuning and pretraining? Are they both cross-entropy loss on every output token in this case? If they are not the same, what are they then? If they are the same, why is it too late to do it at the fine-tuning stage?
AFAIK (pls double-check, I think Andrej Karpathy's videos cover it)
- Imagine a training data sample, a Q and an A.
- your objective is still next token prediction for both pretraining and instruction-finetuning
- When pretraining, you update weights after each token of both Q and A
- With instruction finetuning, you only backpropagate and update weights for the tokens in A (see the sketch below).
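(A minimal sketch of the loss masking these bullets describe: both stages use the same next-token cross-entropy, and instruction finetuning simply masks the question tokens out of the loss. The function shape and the -100 ignore_index follow common PyTorch convention and are assumptions, not the paper's code.)

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids, answer_mask=None):
    """logits: (B, T, V); input_ids: (B, T); answer_mask: (B, T) bool, True on answer tokens."""
    shift_logits = logits[:, :-1, :]          # position t predicts token t+1
    labels = input_ids[:, 1:].clone()
    if answer_mask is not None:               # instruction finetuning
        labels[~answer_mask[:, 1:]] = -100    # ignore loss on question tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,                    # pretraining: answer_mask=None, loss on every token
    )
```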
I believe fine-tuning in that part specifically refers to PEFT methods, aka LoRA. In the paper they actually mention that full finetuning on retry data achieves the same result as pretraining, and they call this continued pretraining. In other words, pretraining on T tokens of error-free data + full finetuning on T tokens of retry data = pretraining on 2T tokens of retry data.
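(For readers unfamiliar with the PEFT/LoRA distinction mentioned here, a LoRA layer looks roughly like this; rank, scaling, and init below are generic defaults, an illustrative sketch rather than the paper's configuration.)

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA freezes the pretrained weight and learns only a low-rank update;
    full finetuning / continued pretraining updates the weight itself."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```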
Great work. Have you tried using probabilistically compressed data as model inputs? It would be very interesting to know its implications for performance. For example, junk data is likely to have poor compression ratios…
Not an LLM researcher, but I watched every second of this video. Great talk. Got one question: do LLMs learn knowledge in a similar way to humans now? Humans can have fantastic accuracy on out-of-distribution tests of reasoning and knowledge manipulation, unlike current LLMs. Will future LLMs use a more human way to solve this issue, or still rely on data augmentation?
Great talk.
The forbidden knowledge! Quick, download it!
amazing!!
Thanks for the great talk! I want to ask how 8 was selected for the v-probing rank, and whether there are related ablations. (Could rank 8 be sufficient for some tasks even if the initial weights were random?)
Most I've ever learned about LLMs... ever.
ClosedAI is punching the air rn
Well.. that was visionary. Your best guess on why Claude is winning at code gen? They’re on this trail already?
awesome😊❤
Are these results mostly phenomenology? E.g., the 2 bits per parameter is conjectured or proved as well?
Amazing
"In evaluation, the model only sees the question without hints. We design tokens to instruct the model to either generate a hint followed by an answer (test acc (with hint)), or to answer directly (test acc (w/o hint))."
I was wondering how to design tokens to instruct the model to either generate a hint or not? I can't find it in the paper or the talk. I'd like to get the author's reply, thank you.
Hi, it's easy in our controlled setting. We tried both ways and the results are similar. The first is to do "question answer" and "question hint ; answer". In this way, if you give the model the corresponding token it knows it must answer without a hint, and vice versa.
We also tried to do "question answer" and "question hint ; answer" and then collect the model's accuracy when its answer doesn't include the semicolon ";". The model learns to use the hint with around half the chance. Why? That's because learning format is extremely easy for a model (cf. our Part 1 CFG paper).
@@Zeyuan-AllenZhu When training one model, are training data with hints and without hints both collected as training data for this model, with a 1:1 probability? (My English may not fully express my question, so to restate: in this experiment, 50% of the training data has hints and 50% doesn't. Then at inference, 50% of the generations include hints and 50% don't. Counting the results, the generations with hints have high accuracy, while those without hints are close to random. Is the second way similar to what's described here?)
@@Zeyuan-AllenZhu Besides, if all the training data is CoT data, the inference will output CoT format 100% of the time. Is this true?
Yes, for our (in-distribution) QA training data we include the hint with 50% probability, or in other words half of the QA training data has hints (e.g., marked with one token) and the other half is without hints (e.g., marked with another token). During test time, we test (on out-of-distribution people) and let the model generate either starting from the no-hint token (so forcing it not to use the hint) or starting from the hint token (so forcing it to use the hint).
This corresponds to my "first way" above, and is easier to describe (than my "second way"). I suggest you use this view.
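(To make this "first way" concrete, the training strings could be assembled roughly as below. The literal token strings are placeholders; the actual tokens used in the paper were stripped from the comments above, so this is only an illustrative sketch.)

```python
import random

ANSWER_TOKEN, HINT_TOKEN = "<answer>", "<hint>"   # placeholder token strings

def format_qa(question, hint, answer, rng=random):
    """Half of the QA strings announce 'answer directly', half announce 'hint, then answer'."""
    if rng.random() < 0.5:
        return f"{question} {ANSWER_TOKEN} {answer}"
    return f"{question} {HINT_TOKEN} {hint} ; {answer}"

# At test time, append ANSWER_TOKEN to force answering without a hint,
# or HINT_TOKEN to force generating the hint first, then let the model continue.
```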
Yes, of course, 100%. Learning format is trivial. The hint tokens and answer tokens are very disjoint in our setting, and there's no reason whatsoever the model will start to output answer tokens without first generating hint tokens --- if all training QAs use hints.
You can always assume that the model learns format easily. This is why I said at the end of the tutorial that when people use the word hallucination, I don't. It just means learning format is easy, and the model learns the format before (i.e., faster than) it learns the answer.
Very insightful talk, but too many ads interrupt this great talk... would you mind turning them off?
Definitely an eye-opener when it comes to the inner workings of LLMs. I need to understand whether this approach can be replicated when building RAG systems and finetuning.
30:55 the proposed Turing test doesn't really apply to in-context learning, i.e. after RLHF or DPO right? A simple test (Gemini Advanced) verifies this:
Anya Briar Forger was born on October 2, 1996. She spent her early years in Princeton, NJ. She received mentorship and guidance from faculty members at MIT. She completed her education with a focus on Communications. She had a professional role at Meta Platforms. She was employed in Menlo Park, CA.
What is the birth date of Anya Briar Forger?
Answer: Anya Briar Forger's birth date is October 2, 1996.
Question: Who was born on October 2, 1996 in Princeton, NJ, studied Communications at MIT, and worked for Meta Platforms at Menlo Park, CA?
Answer: Anya Briar Forger, based on the information provided.
The entire Part 3 is about (factual) knowledge, which is "out-of-context" knowledge, including asking some celebrity's birthday when that knowledge is *not* given in the same context window; even things like 7*9=63 are considered factual knowledge. In contrast, in-context "knowledge" (which I don't usually call knowledge, but some people do) is certainly inversely searchable and also manipulatable. Any hierarchical operation on the "knowledge" can be viewed as some sort of reasoning or mental computation (if not using CoT), and that's what Part 2 covers (at least to some extent). Hope this answers your question.
For the Part 3.2 reverse search problem, it occurs to me that OpenAI models actually have a special token, the fill-in-the-middle token. Probably it can help mitigate the reversal curse a bit, as we can do something like "往开来" to train models. But as it's a special token, I suspect the capability is triggered only when it's used. Probably we can just fix this by using multiple or random fill-in-the-middle tokens.
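(For readers unfamiliar with fill-in-the-middle training, here is a generic sketch of the data rewriting it refers to: a middle span is moved to the end so the model learns to predict it from both sides, which is why FIM is sometimes suggested as a mitigation for reverse lookups. This is an assumption about how FIM generally works, not a claim about OpenAI's actual recipe or token names.)

```python
import random

PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"  # placeholder tokens

def to_fim(text, rng=random):
    """Rewrite text so the model is trained to produce a middle span last."""
    assert len(text) >= 3
    i, j = sorted(rng.sample(range(1, len(text)), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```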
Highly appreciate this masterpiece! Could you please share the slides?
Do you mind if I share this video on my LinkedIn? I will credit the authors.
In your experiment setup, during finetuning, is the loss computed on the full sequence, or only the generated tokens? If the latter, do you think that might be the cause of the different conclusions for pretraining vs finetuning? (I still don't understand how they should be different from a universal law...)
As for not being capable of retrieving partial knowledge, from "October 7, 1942" to 1942: is this because all of the training data never has the concept of month, day and year separately? Otherwise, I can see other example data serving as the "celebrity" that helps the model learn to extract the data for the "minority", like a specific October 7, 1942.
@zhuzeyuan I had a question: can we do pretraining on large instruction-tuning datasets and then do SFT on smaller-scale SFT datasets, since you state that SFT is not beneficial if the task information is not in the domain of the pretrained parameters?
Sort of, yes. But even in the domain, I can tell you about a set of experiments we did (but don't have the time to write a paper). If you pretrain a model on N biographies and make sure knowledge is 100% extractable, and next you finetune with M biographies, can the model extract knowledge of those M people? The answer is, regardless of the finetune method (full, LoRA, different ranks), it seems M can be at most 1% of N. But I didn't have the time to do a more thorough experiment to also vary model size, etc. We've got more urgent things to work on first...
🙏
Some of the issues are caused by the discrepancies of tokenizers
@@hanchisun6164 Essentially none in our case. The controlled experiment has ruled out the influence from tokenizers. An example is the GPT vs Llama experiment, where we actually compared Llama vs Llama (GatedMLP replaced with MLP), or GPT vs GPT (MLP replaced with GatedMLP), so the comparison is "conditioning" on the same tokenizer.
@@Zeyuan-AllenZhu Very rigorous probing and good research; looking forward to more of your experiments and talks.
@@Zeyuan-AllenZhu Thank you so much for the clarifications! I will read your paper more carefully
@@Zeyuan-AllenZhu At around the 37-minute mark, you talked about the reversal curse, but I think I read a paper where they were definitely able to solve this problem by training the model on jumbled-up tokens of variable size n between a starting and an ending token. It was inefficient and its performance in the forward direction was bad, but its performance in reverse was equally bad or equally good, depending on how you want to characterize it, compared to BERT. Anyway, ENJOYING YOUR WORK, THANK YOU, WILL CONTINUE WATCHING.
This needs more attention
Attention is all it needs
Download this video NOW before it gets taken down again!
Why was it taken down?
Maybe I’m dumb, but isn’t it misleading to call this “physics”? They are not physical, they are an abstraction of computation. Maybe the computer science of Language models would fit better? The information encoding of Language models? But physics? I don’t think that’s correct.
Or are they using the term "physics" not in the context of the science but in the context of trying to explain its structure and underlying concepts?
Awesome research. The only drawback is that the author has done less research on LLMs at the cognition level, so the research perspective is not very strategic. In other words, the assumption of the research is that LLMs are a knowledge engine.
A is B does not imply B is A though…
He is using a logical example to explain that even if A is B and B is A (by definition!), LLMs will not be able to know B is A. Very simple.
The High School Q/A example is horrible. 👎❗❗ ❗
It's not even good English or grammar. th-cam.com/video/yBL7J0kgldU/w-d-xo.html
(should have run it through a grammar checker before submitting it to a model or exposing an audience ❗ )
👍👍Otherwise: Very good investigation into LLMs -- thanks 👍👍
It's fundamentally not a problem. LLMs are better learners when the training dataset contains approximations such as mistakes.
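(A hypothetical sketch of what such "mistake + retry" training data could look like, inspired by the talk's Part 2.2 discussion. The [BACK] marker and the insertion rule are illustrative assumptions, not the paper's exact recipe.)

```python
import random

def add_fake_mistakes(correct_steps, wrong_step_pool, p=0.5, rng=random):
    """Insert a plausible-but-wrong step followed by a retry marker before the
    correct step, so the model can learn to notice and recover from mistakes."""
    augmented = []
    for step in correct_steps:
        if wrong_step_pool and rng.random() < p:
            augmented.append(rng.choice(wrong_step_pool))  # a fake mistake
            augmented.append("[BACK]")                     # placeholder retry marker
        augmented.append(step)                             # then the correct step
    return augmented
```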