Best tech talk since LLM!
since which?
This is crazy. I'm like a low dim creature witnessing high dim creature's thinking process and experimental methods. The testbed is so well chosen that I'll build on this to learn more. Thank you so much.
Finally, someone has investigated the physics of LLM. Great talk
Great work! The research questions are simple and the answers are profound and not too complex, but they are also exactly what we need to understand how LLMs work. I hope you continue to demystify the limitations of LLMs and even find technical remedies. One of the highest signal-to-noise pieces I've had the pleasure to enjoy in quite some time!
This is the most insightful talk I have ever seen. It has taught me how powerful controlled experiments can be!
One of the clearest talks about LLMs I've ever heard
Awesome talk! A thorough and detailed investigation into multiple aspects of LLM learning and behavior.
Very high information density. Each subsection is worth an individual paper.
great talk, very insightful, excited to see that we can prepare synthetic data to manipulate LLMs
Awesome talk, I am glad it is available now for those who can't make it to ICML in person!
Allen, this is such a wonderful talk. So many thanks for putting this online!
Super insightful, thanks for the upload!
Okay, this was a masterclass-level presentation. Great job Zeyuan!
Absolutely the best talk I've seen in a long time. This is how you do science!
Timestamps (Powered by Merlin AI)
00:10 - Theory of language models covers a spectrum of meanings
02:59 - Newton's laws built on Kepler's work, preceded by Tycho Brahe's observatory data
09:07 - Emphasizing the physics of language models for detailed study and understanding.
12:08 - Language models need controlled experiments to study knowledge manipulation
18:12 - Knowledge augmentation significantly improves accuracy of language models.
21:02 - Augmenting knowledge in language models changes how knowledge is stored.
26:50 - Exploring knowledge manipulation in language models
29:54 - GPT-4 excels in simple tasks but struggles with knowledge manipulations.
35:29 - Language models may struggle to fully extract specific knowledge details.
38:18 - Synthetic data can measure knowledge stored in datasets and language models.
43:57 - Rare knowledge decreases knowledge capacity, affecting model performance
46:42 - Importance of controlled experiments in assessing knowledge capacity
52:12 - Utilizing synthetic data to control and study language model behaviors
55:06 - Developing synthetic math data sets is crucial for unbiased and diverse problem generation.
1:00:47 - Challenges in language models solving arithmetic problems
1:03:32 - Language models can generalize to solve math problems beyond training data
1:09:13 - Language models develop reasoning skills beyond level-1.
1:12:05 - Language models make two types of mistakes: unnecessary parameters and getting stuck during computation.
1:17:48 - Language models require deep networks for reasoning skills.
1:20:43 - Language models often know when they make mistakes
1:26:18 - Label masking not needed for learning from errors in math data
1:28:56 - Creating fake mistakes can improve accuracy gain in language models
1:34:23 - Language models learn structures and formats efficiently
1:37:03 - Importance of relative attention in training language models
1:42:42 - GPT can learn CFG trees and linearly encode sub-tree information.
1:45:28 - Language models utilize dynamic programming for parsing and generation
1:51:19 - GPT-2 learns algorithms from patterns
Thank you for uploading! So much to learn :)
woww, just woww, one of the best talks ever, up there with my goated DL videos
Good job! You explained it in a very clear way! One of the best talks I have watched recently
Thanks a lot for the illuminating and inspiring research! And of course for the extremely dense and entertaining presentation!
Fantastic talk. This is the kind of combination I love to see: hypotheses backed up with disciplined investigative research.
Really insightful. Great talk.
Awesome talk! The best talk I have seen! And this is my first time leaving a comment on YouTube...
Thank you very much for posting this talk!
Such an amazing talk and so many learnings
Great talk. Genius level. That said, imho it takes humans around 20 years of education, filled with structured training and guidance, to learn what to remember and prioritize before we can effectively apply the scientific method. This training is continuously updated and refined. Expecting an LLM to reason and make decisions without specific training seems unrealistic. The issues with AI reasoning are genuine, but the need for training is normal and essential. Once properly trained, using expensive CoT or any new or future methodology, I would hope that an LLM could perform tasks at a much faster rate, but this training is crucial because knowledge and reasoning, whether in humans or LLMs, don't come as a magical solution without effort and guidance. I'm no specialist, so take what I say with a grain of salt or a bucket. Again, great talk. ty
Amazing tutorial. I learned a lot about how LLMs work!
What a smart guy! His speech is excellent.
Super interesting and informative. Thanks for posting this!!
woo, thank you for your talk. Love it so much! 😀
excellent talk, great demonstration of how LLMs work, and more importantly how research works in general; a breath of fresh air amid the hype
thanks for the talk, it is very inspiring and interesting!
Awesome stuff!!
Thanks for uploading!!
Great talk - thanks!!
That's a terrible [Back] amazing talk!
hahaha, good one
Amazing talk!
Beautiful talk! Textbook level.
amazing talk,
wonderful talk, thanks
Really insightful❤
can anyone explain why pretraining on bio + fine-tuning on QA is so different from mixed pretraining on bio + QA? what if the fine-tune was a full fine-tune updating all weights? why would that be any different from including QA in pretraining? (re 26:21)
Sure, the failure of the finetune for this part includes full finetuning. The controlled experiment made sure that the finetune dataset only contains QAs for *half* of the individuals, thus even overfitting (e.g. full finetune for sufficiently long) on this half doesn't generalize to the QAs for the other half. If you're interested in the details, you might want to watch my separate video talking about Part 3.1+3.2.
PS: the failure of the finetune for Part 2.2, Result 7, does not apply to full finetuning (but for a not-so-interesting reason). Details will be in my separate video in Part 2.2 that will come in a few days (or in the arxiv paper).
Doing mixed-pretrain allows the model to see two formats of data at the same time, thus this functions just like data augmentation and thus it works.
thank you for the response!
when you say "see two formats of data at the same time", does that imply the bio and QA data sets are shuffled together and/or there are multiple epochs of pre-training?
would the outcome be different if pre-training were a single epoch? in that case, pre-training on A followed by fine-tuning on B would be identical to pre-training on A and B, right?
in any case, thanks for the response, will watch the other video, loving this!
@@PatrickvanNieuwenhuizen Yes and there's a ratio controlling the fraction of data coming from bio/QA in each sample (of whatever context length). I included either only the bio data or only the QA data in each sample --- if you want to see how sample ratio affects the result, it's in the paper or the other video.
Pretraining with one pass of the data gives the model only one chance to see each piece of knowledge. This is called "1 exposure" using the language from Part 3.3. If all pieces of knowledge only received 1 exposure then I don't think the model will remember anything. One needs to have some factual knowledge that repeatedly appears in the pretrain data (with enough exposures) so that the model learns how to "parse" such knowledge. Part 3.1 is talking about this "parsing" for such (repeated) knowledge data, and it's important to teach the model how to correctly parse such knowledge (with either mixed training or data augmentation). It will be too late to do that during finetune.
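(For concreteness, here is a minimal sketch of the mixed-pretraining setup described in this reply: every context window is filled with either bio text or QA text, chosen with a fixed ratio, so the model sees both formats during pretraining. The ratio value, helper names, and crude token counting are illustrative assumptions, not the paper's code.)

```python
import random

def make_pretrain_sample(bio_pool, qa_pool, context_len, qa_ratio=0.2, rng=random):
    """Fill one context window with either only bio data or only QA data."""
    pool = qa_pool if rng.random() < qa_ratio else bio_pool
    sample, n_tokens = [], 0
    while n_tokens < context_len and pool:
        piece = rng.choice(pool)          # one biography entry or one QA pair
        sample.append(piece)
        n_tokens += len(piece.split())    # stand-in for a real tokenizer
    return " ".join(sample)
```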
Very nice presentation. One question I had on the results shown around 18:40: how do you ensure that this effect is not just overfitting on the Q/A training set? Did you try an experiment with some biographies in a validation split during finetuning? Does it never work, or does it vanish due to overfitting?
This is excellent research! I loved it! I have a question about the difference between fine-tuning and pretraining. In Part 2.2, learning from mistakes on grade-school math problems, the experiment showed that it is too late to add "retry" data at the fine-tuning stage. I wonder if there's a difference in the objective function between fine-tuning and pretraining? Are they both cross-entropy loss on every output token in this case? If they are not the same, what are they then? If they are the same, why is it too late to do it at the fine-tuning stage?
AFAIK (pls double-check, I think Andrej Karpathy's videos cover it)
- Imagine a training data sample, a Q and an A.
- your objective is still next token prediction for both pretraining and instruction-finetuning
- When pretraining, you update weights after each token of both Q and A
- With instruction finetuning, you only backpropagate and update weights for the tokens in A (see the sketch below).
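(A minimal sketch of the loss masking these bullets describe: both stages use the same next-token cross-entropy, and instruction finetuning simply masks the question tokens out of the loss. The function shape and the -100 ignore_index follow common PyTorch convention and are assumptions, not the paper's code.)

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids, answer_mask=None):
    """logits: (B, T, V); input_ids: (B, T); answer_mask: (B, T) bool, True on answer tokens."""
    shift_logits = logits[:, :-1, :]          # position t predicts token t+1
    labels = input_ids[:, 1:].clone()
    if answer_mask is not None:               # instruction finetuning
        labels[~answer_mask[:, 1:]] = -100    # ignore loss on question tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,                    # pretraining: answer_mask=None, loss on every token
    )
```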
I believe fine-tuning in that part specifically refers to PEFT methods, aka LoRA. In the paper they actually mention that full finetuning on retry data achieves the same result as pretraining, and they call this continued pretraining. In other words, pretraining on T tokens of error-free data + full finetuning on T tokens of retry data = pretraining on 2T tokens of retry data.
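(For readers unfamiliar with the PEFT/LoRA distinction mentioned here, a LoRA layer looks roughly like this; rank, scaling, and init below are generic defaults, an illustrative sketch rather than the paper's configuration.)

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA freezes the pretrained weight and learns only a low-rank update;
    full finetuning / continued pretraining updates the weight itself."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```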
Great work. Have you tried using probabilistically compressed data as model inputs? It would be very interesting to know its implications for performance. For example, junk data is likely to have poor compression ratios…
Not an LLM researcher, but I watched every second of this video. Great talk. Got one question: do LLMs learn knowledge in a similar way to humans now? Humans can have fantastic accuracy on out-of-distribution tests of reasoning and knowledge manipulation, unlike current LLMs. Will future LLMs use a more human way to solve this issue, or still rely on data augmentation?
Great talk.
The forbidden knowledge! Quick, download it!
amazing!!
Thanks for the great talk! I want to ask how 8 was selected for the v-probing rank, and whether there are related ablations. (Could rank 8 be sufficient for some tasks even if the initial weights were random?)
Most I've ever learned about LLMs... ever.
ClosedAI is punching the air rn
Well.. that was visionary. Your best guess on why Claude is winning at code gen? They’re on this trail already?
awesome😊❤
Are these results mostly phenomenology? E.g., the 2 bits per parameter is conjectured or proved as well?
Amazing
"In evaluation, the model only sees the question without hints. We design tokens to instruct the model to either generate a hint followed by an answer (test acc (with hint)), or to answer directly (test acc (w/o hint))."
I was wondering how to design tokens to instruct the model to either generate a hint or not? I can't find it in the paper or the talk. I'd like to get the author's reply, thank you.
Hi, it's easy in our controlled setting. We tried both ways and the results are similar. The first is to do "question answer" and "question hint ; answer". In this way, if you give the model the corresponding token it knows it must answer without a hint, and vice versa.
We also tried to do "question answer" and "question hint ; answer" and then collect the model's accuracy when its answer doesn't include the semicolon ";". The model learns to use the hint with around half the chance. Why? That's because learning format is extremely easy for a model (cf. our Part 1 CFG paper).
@@Zeyuan-AllenZhu When training one model, are training data with hints and without hints both collected as training data for this model, with a 1:1 probability? (My English may not fully express my question, so to restate: in this experiment, 50% of the training data has hints and 50% doesn't. Then at inference, 50% of the generations include hints and 50% don't. Counting the results, the generations with hints have high accuracy, while those without hints are close to random. Is the second way similar to what's described here?)
@@Zeyuan-AllenZhu Besides, if all the training data is CoT data, the inference will output CoT format 100% of the time. Is this true?
Yes, for our (in-distribution) QA training data we include the hint with 50% probability, or in other words half of the QA training data has hints (e.g., marked with one token) and the other half is without hints (e.g., marked with another token). During test time, we test (on out-of-distribution people) and let the model generate either starting from the no-hint token (so forcing it not to use the hint) or starting from the hint token (so forcing it to use the hint).
This corresponds to my "first way" above, and is easier to describe (than my "second way"). I suggest you use this view.
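(To make this "first way" concrete, the training strings could be assembled roughly as below. The literal token strings are placeholders; the actual tokens used in the paper were stripped from the comments above, so this is only an illustrative sketch.)

```python
import random

ANSWER_TOKEN, HINT_TOKEN = "<answer>", "<hint>"   # placeholder token strings

def format_qa(question, hint, answer, rng=random):
    """Half of the QA strings announce 'answer directly', half announce 'hint, then answer'."""
    if rng.random() < 0.5:
        return f"{question} {ANSWER_TOKEN} {answer}"
    return f"{question} {HINT_TOKEN} {hint} ; {answer}"

# At test time, append ANSWER_TOKEN to force answering without a hint,
# or HINT_TOKEN to force generating the hint first, then let the model continue.
```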
Yes, of course, 100%. Learning format is trivial. The hint tokens and answer tokens are very disjoint in our setting, and there's no reason whatsoever the model will start to output answer tokens without first generating hint tokens --- if all training QAs use hints.
You can always assume that the model learns format easily. This is why I said at the end of the tutorial that when people use the word hallucination, I don't. It just means learning format is easy, and the model learns the format before (i.e., faster than) it learns the answer.
Very insightful talk, but too many ads interrupt this great talk... would you mind turning them off?
Definitely an eye-opener when it comes to the inner workings of LLMs. I need to understand whether this approach can be replicated when building RAG systems and finetuning.
30:55 the proposed Turing test doesn't really apply to in-context learning, i.e. after RLHF or DPO right? A simple test (Gemini Advanced) verifies this:
Anya Briar Forger was born on October 2, 1996. She spent her early years in Princeton, NJ. She received mentorship and guidance from faculty members at MIT. She completed her education with a focus on Communications. She had a professional role at Meta Platforms. She was employed in Menlo Park, CA.
What is the birth date of Anya Briar Forger?
Answer: Anya Briar Forger's birth date is October 2, 1996.
Question: Who was born on October 2, 1996 in Princeton, NJ, studied Communications at MIT, and worked for Meta Platforms at Menlo Park, CA?
Answer: Anya Briar Forger, based on the information provided.
The entire Part 3 is about (factual) knowledge, which is "out-of-context" knowledge, including asking some celebrity's birthday when that knowledge is *not* given in the same context window; even things like 7*9=63 are considered factual knowledge. In contrast, in-context "knowledge" (which I don't usually call knowledge, but some people do) is certainly inversely searchable and also manipulatable. Any hierarchical operation on the "knowledge" can be viewed as some sort of reasoning or mental computation (if not using CoT), and that's what Part 2 covers (at least to some extent). Hope this answers your question.
For the Part 3.2 reverse search problem, it occurs to me that OpenAI models actually have a special token, the fill-in-the-middle token. Probably it can help mitigate the reversal curse a bit, as we can do something like "往开来" to train models. But as it's a special token, I suspect the capability is triggered only when it's used. Probably we can just fix this by using multiple or random fill-in-the-middle tokens.
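(For readers unfamiliar with fill-in-the-middle training, here is a generic sketch of the data rewriting it refers to: a middle span is moved to the end so the model learns to predict it from both sides, which is why FIM is sometimes suggested as a mitigation for reverse lookups. This is an assumption about how FIM generally works, not a claim about OpenAI's actual recipe or token names.)

```python
import random

PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"  # placeholder tokens

def to_fim(text, rng=random):
    """Rewrite text so the model is trained to produce a middle span last."""
    assert len(text) >= 3
    i, j = sorted(rng.sample(range(1, len(text)), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```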
Highly appreciate this masterpiece! Could you please share the slides?
Do you mind if I share this video on my LinkedIn? I will credit the authors.
In your experiment setup, during finetuning, is the loss computed on the full sequence, or only the generated tokens? If the latter, do you think that might be the cause of the different conclusions for pretraining vs finetuning? (I still don't understand how they should be different from a universal law...)
As for not being capable of retrieving partial knowledge, from "October 7, 1942" to 1942: is this because all of the training data never has the concept of month, day and year separately? Otherwise, I can see other example data serving as the "celebrity" that helps the model learn to extract the data for the "minority", like a specific October 7, 1942.
@zhuzeyuan I had a question: can we do pretraining on large instruction-tuning datasets and then do SFT on smaller-scale SFT datasets, since you state that SFT is not beneficial if the task information is not in the domain of the pretrained parameters?
Sort of, yes. But even in the domain, I can tell you about a set of experiments we did (but don't have the time to write a paper). If you pretrain a model on N biographies and make sure knowledge is 100% extractable, and next you finetune with M biographies, can the model extract knowledge of those M people? The answer is, regardless of the finetune method (full, LoRA, different ranks), it seems M can be at most 1% of N. But I didn't have the time to do a more thorough experiment to also vary model size, etc. We've got more urgent things to work on first...
🙏
Some of the issues are caused by the discrepancies of tokenizers
@@hanchisun6164 Essentially none in our case. The controlled experiment has ruled out the influence from tokenizers. An example is the GPT vs Llama experiment, where we actually compared Llama vs Llama (GatedMLP replaced with MLP), or GPT vs GPT (MLP replaced with GatedMLP), so the comparison is "conditioning" on the same tokenizer.
@@Zeyuan-AllenZhu Very rigorous probing and good research; looking forward to more of your experiments and talks.
@@Zeyuan-AllenZhu Thank you so much for the clarifications! I will read your paper more carefully
@@Zeyuan-AllenZhu At around the 37-minute mark, you talked about the reversal curse, but I think I read a paper where they were definitely able to solve this problem by training the model on jumbled-up tokens of variable size n between a starting and an ending token. It was inefficient and its performance in the forward direction was bad, but its performance in reverse was equally bad or equally good, depending on how you want to characterize it, compared to BERT. Anyway, ENJOYING YOUR WORK, THANK YOU, WILL CONTINUE WATCHING.
This needs more attention
Attention is all it needs
Download this video NOW before it gets taken down again!
Why was it taken down?
Maybe I’m dumb, but isn’t it misleading to call this “physics”? They are not physical, they are an abstraction of computation. Maybe the computer science of Language models would fit better? The information encoding of Language models? But physics? I don’t think that’s correct.
Or are they using the term "physics" not in the context of the science but in the context of trying to explain its structure and underlying concepts?
Awesome research. The only drawback is that the author has done less research on LLMs at the cognition level, so the research perspective is not very strategic. In other words, the assumption of the research is that LLMs are a knowledge engine.
A is B does not imply B is A though…
He is using a logical example to explain that even if A is B and B is A (by definition!), LLMs will not be able to know B is A. Very simple.
The High School Q/A example is horrible. 👎❗❗ ❗
It's not even good English or grammar. th-cam.com/video/yBL7J0kgldU/w-d-xo.html
(should have run it through a grammar checker before submitting it to a model or exposing an audience ❗ )
👍👍Otherwise: Very good investigation into LLMs -- thanks 👍👍
It's fundamentally not a problem. LLMs are better learners when the training dataset contains approximations such as mistakes.
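(A hypothetical sketch of what such "mistake + retry" training data could look like, inspired by the talk's Part 2.2 discussion. The [BACK] marker and the insertion rule are illustrative assumptions, not the paper's exact recipe.)

```python
import random

def add_fake_mistakes(correct_steps, wrong_step_pool, p=0.5, rng=random):
    """Insert a plausible-but-wrong step followed by a retry marker before the
    correct step, so the model can learn to notice and recover from mistakes."""
    augmented = []
    for step in correct_steps:
        if wrong_step_pool and rng.random() < p:
            augmented.append(rng.choice(wrong_step_pool))  # a fake mistake
            augmented.append("[BACK]")                     # placeholder retry marker
        augmented.append(step)                             # then the correct step
    return augmented
```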