GPT-3: Language Models are Few-Shot Learners (Paper Explained)

  • Published 27 May 2024
  • #gpt3 #openai #gpt-3
    How far can you go with ONLY language modeling? Can a large enough language model perform NLP tasks out of the box? OpenAI takes on these and other questions by training a transformer that is an order of magnitude larger than anything that has ever been built before, and the results are astounding.
    OUTLINE:
    0:00 - Intro & Overview
    1:20 - Language Models
    2:45 - Language Modeling Datasets
    3:20 - Model Size
    5:35 - Transformer Models
    7:25 - Fine Tuning
    10:15 - In-Context Learning
    17:15 - Start of Experimental Results
    19:10 - Question Answering
    23:10 - What I think is happening
    28:50 - Translation
    31:30 - Winograd Schemas
    33:00 - Commonsense Reasoning
    37:00 - Reading Comprehension
    37:30 - SuperGLUE
    40:40 - NLI
    41:40 - Arithmetic Expressions
    48:30 - Word Unscrambling
    50:30 - SAT Analogies
    52:10 - News Article Generation
    58:10 - Made-up Words
    1:01:10 - Training Set Contamination
    1:03:10 - Task Examples
    arxiv.org/abs/2005.14165
    github.com/openai/gpt-3
    Abstract:
    Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
    Authors: Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
  • Science & Technology

Comments • 464

  • @larrybird3729
    @larrybird3729 4 years ago +527

    Imagine telling Alan Turing we created a 5.7 trillion bit program to answer "what is one plus one?" lol
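
(For reference, the "5.7 trillion bit" figure presumably comes from 175 billion parameters at roughly 32 bits each; a quick back-of-the-envelope check, assuming 32-bit weights:)

```python
# Rough check of the "trillions of bits" figure, assuming 175B parameters stored as 32-bit floats.
params = 175e9
print(f"{params * 32:.2e} bits")  # ~5.6e+12, i.e. about 5.6 trillion bits
```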

    • @Lord_Drakostar
      @Lord_Drakostar 4 years ago +19

      Hey, maybe it could create numbers

    • @jorgehenriquesoares7880
      @jorgehenriquesoares7880 3 years ago +13

      He would be amazed

    • @davidkuitunen5286
      @davidkuitunen5286 3 years ago +32

      you can build an AND gate with a few transistors or you can use a 5.7 trillion bit program to infer the meaning of the word "and"

    • @LukePluto
      @LukePluto 3 years ago +40

      it took ~370 pages of symbolic logic to show 1 + 1 = 2 in Principia Mathematica

    • @bofur2727
      @bofur2727 3 years ago +1

      Yep, that part is not unimaginable, but the existence of these comments on this platform would be, even without his "what is one plus one" ;)

  • @that_guy4690
    @that_guy4690 3 years ago +37

    Watching videos about large language models really makes me ask myself: "What is really "human" reasoning?" And how do humans learn stuff?
    A great point on arithmetic operations!

    • @derasor
      @derasor 1 year ago +4

      So true. We don't really know how 'reasoning' actually works in the brain, so saying this system is not capable of reasoning has no basis IMO. Also, you could argue other beings with a biological neural system are capable of 'reason'.

    • @georgestheys2752
      @georgestheys2752 1 year ago +1

      That is the essential question .. if reasoning is the ability of the brain to come up with the next word, based on a word sequence, that is most likely to satisfy the reasoning recipients ... GPTx will get there ... I suspect (hope?) there is more to it than that.

  • @eternalsecretforgettingfor8525
    @eternalsecretforgettingfor8525 3 years ago +30

    OUTLINE:
    0:00 - Intro & Overview
    1:20 - Language Models
    2:45 - Language Modeling Datasets
    3:20 - Model Size
    5:35 - Transformer Models
    7:25 - Fine Tuning
    10:15 - In-Context Learning
    17:15 - Start of Experimental Results
    19:10 - Question Answering
    23:10 - What I think is happening
    28:50 - Translation
    31:30 - Winograd Schemas
    33:00 - Commonsense Reasoning
    37:00 - Reading Comprehension
    37:30 - SuperGLUE
    40:40 - NLI
    41:40 - Arithmetic Expressions
    48:30 - Word Unscrambling
    50:30 - SAT Analogies
    52:10 - News Article Generation
    58:10 - Made-up Words
    1:01:10 - Training Set Contamination
    1:03:10 - Task Examples

  • @DeveloperDesmond
    @DeveloperDesmond 4 years ago +7

    This is such a fun format for educational video! And with a huge backlog of videos that look worth checking out, there's so much to learn. 'Excited for this channel!

  • @Tondadrd
    @Tondadrd 4 years ago +43

    Wow, an OUTLINE, I didn't know that was possible on YouTube :o
    thx

    • @funkylosik
      @funkylosik 4 years ago +5

      Cool, me neither. Just enter "0:00 xxx" anywhere in the description to start (0:00 is important) and mark your timestamps with a description, each on a new line.

    • @Tondadrd
      @Tondadrd 4 years ago +6

      @@funkylosik I found no documentation, so I had to figure all that out.
      Also, every time mentioned in the description must be in ascending order, there may be no duplicate times, and at most 51 times.
      Break any of these rules and it will simply not show.

    • @mrpoopo2320
      @mrpoopo2320 3 years ago

      Have you guys never listened to a full album on YouTube? Or a compilation? Or a vlog creation?
      Maybe I sound as silly to you as that comment does to me. Isn't this a table of contents, not an outline?

  • @KivySchool
    @KivySchool 4 years ago +6

    Explaining papers? That's awesome. Subscribed instantly. Thanks for your effort and please continue to do so.

  • @lorenzoampil3232
    @lorenzoampil3232 4 years ago +3

    Thank you so much for this. Your explanations are very clear and I appreciate you sharing your views on the paper. Keep up the good work!

  • @catharinecox6743
    @catharinecox6743 2 years ago +7

    I am studying linguistics at uni and I'm writing my dissertation on whether humans can distinguish human from gpt-3-generated language. I am extending the findings of this paper by investigating the use of gpt-3 in social media, news and email contexts, using a large Turing-style survey whereby people are required to pick the AI response over the human one. I will apply the findings onto potential phishing, fake news and ethical implications. I study linguistics not computer science, so found this video extremely useful! Thank you for a great explanation.

    • @GuinessOriginal
      @GuinessOriginal 1 year ago

      I studied linguistics. You should think about following it up by training/programming a language model to use a specific idiolect, including particular fillers, qualifiers etc., to see if you can make it emulate human speech patterns, one person's in particular.

    • @terjeoseberg990
      @terjeoseberg990 1 year ago +1

      @@GuinessOriginal, I believe that’s the issue. These language models are emulating the average person when coming up with text, but there is no average person, so what it produces will be odd when compared to what’s produced by an individual.
      Training it to emulate a particular person might solve this and make it indistinguishable from a human.

    • @GuinessOriginal
      @GuinessOriginal 1 year ago

      @@terjeoseberg990 This could easily be done by fine-tuning a specific model trained on your inputs. A few weeks' worth of audio and text data from your phone would probably be enough. It won’t be long before we can carry our own personal sparse AI models around with us on our phones, and I can envisage a world where the AI drafts replies to all your messages and you just need to approve them. Eventually you’ll trust it so much you’ll let it reply to certain people without your specific approval, having just given it directions on the approach to take. It’ll get so good at emulating your voice that you’ll be able to get it to take your calls and answer as if it’s you, while you listen and are ready to take over if necessary, like a self-driving car. Eventually you’ll just have AIs talking and texting each other while their people do something else 😂

    • @GuinessOriginal
      @GuinessOriginal 1 year ago

      @Catherine Cox Catherine how did your dissertation go? Are you interested in applying your work in a commercial setting for an AI company?

    • @terjeoseberg990
      @terjeoseberg990 1 year ago

      @@GuinessOriginal, LOL

  • @bluel1ng
    @bluel1ng 4 years ago +48

    Yannic, great presentation as always! But I think the power of transformer models is to "discover" structural similarities (frequent repeating structures). Many of these "rules" are not learned for exact input sequences but for sequences or co-occurrences of sets or classes of input symbols. This is IMO different from exact "regex-like" recall which would not tolerate different query representations. I think the embeddings on all layer-outputs are some form of thought- or summary-vectors that capture the gist of the context up to the current token. Attention can be seen as key-value store but I prefer to think of it as a soft read-memory and transform operation. The computational capabilities of transformer models are inherently limited by the number of feed-forward and attention steps but it has been shown with smaller models that this is enough for simple arithmetic operations which generalize outside numbers that were presented during training etc. While it is still not AGI I must personally say that I am again and again impressed by the "world-model" / knowledge-base that is generated via a "stupid" next or masked token prediction objective... ;-)
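
(The "soft read-memory and transform" view of attention mentioned above can be made concrete with a tiny NumPy sketch; this is a generic scaled dot-product attention toy, not GPT-3's actual implementation:)

```python
import numpy as np

def soft_attention(query, keys, values):
    """Scaled dot-product attention: a differentiable, 'soft' key-value lookup.
    query: (d,), keys: (n, d), values: (n, d_v). Returns a weighted mix of the values,
    where the weights are a softmax over query-key similarity."""
    scores = keys @ query / np.sqrt(query.shape[0])  # similarity of the query to each stored key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax -> probability over "memory slots"
    return weights @ values                           # blend of values, not a hard lookup

rng = np.random.default_rng(0)
out = soft_attention(rng.normal(size=8), rng.normal(size=(4, 8)), rng.normal(size=(4, 3)))
print(out.shape)  # (3,)
```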

    • @YannicKilcher
      @YannicKilcher 4 years ago +16

      Yes, I agree. When I say lookup or regex, I mean this in a sort-of fuzzy way (in a deep learning way, I guess). Pretty much what you're describing. But still different from reasoning abilities.

    • @GuagoFruit
      @GuagoFruit 1 year ago +2

      @@mcs699 I think you need to understand humans better before being able to say gpt-3 is truly "reasoning" as a human does.
      Obviously you're very pro AI reasoning, but reducing human reasoning down to the level of what AI is at now is severely underselling human capacity.

  • @lgoose7604
    @lgoose7604 3 years ago +8

    Great video. Your explanation made clear to me the distinction between memorizing and reasoning, just like the two ways students study for tests. If the test consists mostly of problems encountered before, the students who memorize will likely perform better than the ones who reason. Just as you pointed out, when one has memorized the internet, there isn't a lot one hasn't seen.

  • @tianyulu
    @tianyulu 4 years ago +9

    Really appreciate your insight that I otherwise wouldn't have got from just the paper.

  • @kehoste
    @kehoste 3 years ago

    Great review of this paper, I really enjoyed your critical view on it.
    I also like how you're chuckling when you get to the good parts... :D

  • @siddharthbhargava4857
    @siddharthbhargava4857 4 years ago +2

    Thank you for the explanation. I really enjoyed learning about it and can't wait to, someday, be able to work with such models.

  • @TusharKale9
    @TusharKale9 3 years ago

    Perfect. This is what I was looking for: a short, self-explanatory video. Thank you

  • @bhavulgauri7832
    @bhavulgauri7832 4 years ago

    Great video, Yannic! Seriously, this was fast, but then you've not compromised on quality one bit. :)
    Even I feel it has just memorized things, more or less.

  • @eddiesagra9088
    @eddiesagra9088 3 years ago +1

    Awesome summary Yannic - very informative. Thank you!

  • @ThomasDawsonco
    @ThomasDawsonco 3 years ago

    Yannic, thanks for this detailed breakdown of the paper - appreciate the way you have de-hyped it.

  • @JohnKruse
    @JohnKruse 3 years ago

    Many thanks. I started reading this and quickly ran out of steam. You boiled this down nicely and I really appreciate your point that given the gigantic training set, they are likely "memorizing" relations in an unintended but superficially useful way. I hope that the community digs into this more deeply and can possibly turn this into a purposeful strategy... Sometimes brute force is effective, if not efficient.

  • @marziehzargari4940
    @marziehzargari4940 3 years ago +2

    I absolutely love your videos. Thank you so much for explaining everything so clearly.

  • @monstrimmat
    @monstrimmat 3 years ago

    Your channel is a great find. I was already digging the "speak and doodle" method in other videos, from which I didn't get any particularly new insight but they were still fun to watch. In this one though, your analysis of why GPT doesn't actually reason about anything (and I agree) takes it to the next level.

  • @LinkSF1
    @LinkSF1 4 years ago +6

    Great video. Thanks for making it.
    Regarding your idea on explaining model predictions using the weights and/or training examples: it’s already been done. Look into Percy Liang’s paper "Understanding Black-box Predictions via Influence Functions".

  • @PrzemekChojeckiAI
    @PrzemekChojeckiAI 4 years ago +2

    That's a great job! Thank you for all the insights!

  • @StephenMattison66
    @StephenMattison66 3 years ago

    Fascinating and mind blowing information in this video, thank you for such a perfect & detailed explanation, you made it easy to understand the future! *If I may give one friendly suggestion: always and only use a nice lapel microphone, you will consistently get far better, clearer, richer, louder and easier to understand & comprehend audio than this video has. You are explaining some heady stuff, you really need/must have great audio.* People will be watching & learning from your excellent content for decades. TYVM!

  • @TheEbbemonster
    @TheEbbemonster 3 years ago +1

    I agree with your reflections about memorizing the training data. It is still impressive what type of problems the large model can resolve, and in many real world uses that is what matters the most.

  • @lucasalvarezlacasa2098
    @lucasalvarezlacasa2098 3 years ago

    By far the best explanation I've found about GPT-3. Great work!

  • @PierLim
    @PierLim 4 years ago +3

    Thank you so much for breaking down these papers!

  • @MGachnang
    @MGachnang 3 years ago +4

    I'm currently addicted to AiDungeon. It uses GPT-2 (Griffin) and GPT-3 (Dragon) to make a text adventure. Now I know how it works, thanks.

    • @michaelmorgan1399
      @michaelmorgan1399 2 years ago +1

      AiDungeon no longer uses GPT-2 for its Griffin model. It uses a finetuned version of GPT-J 6B, same as NovelAI. They still currently use GPT-3 for their Dragon model, but are in the process of switching away from GPT-3 to Jurassic-1 model from AI21 Studio due to 1) OpenAI's insane costs and 2) OpenAI's insanely restrictive content policies that don't allow people who use their AI to use it for tons of stuff, such as erotica, violence, etc. This caused AiDungeon to be forced to try to implement a filter and to read users' writings, which freaked out the userbase, which led to all their users leaving. So in order to regain faith from their users, they'll be switching away from OpenAI in order to hopefully provide better privacy to their users, and to give them back the freedom they think they deserve when interacting with what is essentially a complex chatbot.

  • @masa26762
    @masa26762 3 years ago +1

    Thanks for the video! Very informative and sharp eyes. The math debunking was hilarious!

  • @AlexBravo
    @AlexBravo 4 years ago +89

    "T6 - 1 Trillion Text-To-Text Transfer Transformer" - the next model coming out of Google

    • @xSNYPSx
      @xSNYPSx 4 years ago +1

      When ?

    • @TechVizTheDataScienceGuy
      @TechVizTheDataScienceGuy 3 years ago +1

      Is it ? 😅

    • @shadygamererfan377
      @shadygamererfan377 3 years ago +3

      Our brain has a trillion synapses,
      and GPT-3 has 175 billion trainable parameters. We are nowhere close to AGI, but soon we will be.

    • @gargeyasharma4858
      @gargeyasharma4858 3 years ago

      thanks for the heads up.

    • @fargh
      @fargh 3 years ago +3

      @@shadygamererfan377 based upon your comment, then we are close - less than an order of magnitude, as 175 billion is less than a factor of 10 smaller than 1 trillion. However, I think there are more than a trillion synapses in the brain. The Google card to the question "how many synapses in the brain" returns 1000 trillion. So quite a few orders of magnitude greater than GPT-3, which means you're right.

  • @carlos24497
    @carlos24497 4 years ago +7

    Your channel is a treasure, thanks for doing this (making videos in general I mean)

  • @tribelstudio8489
    @tribelstudio8489 4 years ago +57

    I like the part where you say you don’t think it’s “reasoning” but instead it’s... (Then you go on to literally give the definition of reasoning.)

    • @YannicKilcher
      @YannicKilcher 4 years ago +18

      touché ;)

    • @tribelstudio8489
      @tribelstudio8489 3 years ago +11

      Jason Roos It reasons the exact same way that humans reason. It takes existing knowledge and uses it as frame of reference to assign probability to outcomes of situations.

    • @jason_v12345
      @jason_v12345 3 years ago +9

      But he didn't. Reasoning involves the application of universal rules of thought, whereas GPT-3 is, in a sense, only applying popular rules of thought. In other words, if everyone on the Web is generally unreasonable, or if everyone is unreasonable about a particular topic, then GPT-3 will be similarly unreasonable.

    • @jason_v12345
      @jason_v12345 3 years ago +7

      @@tribelstudio8489 That's not what reasoning is, and that's not how humans reason.

    • @tribelstudio8489
      @tribelstudio8489 3 years ago +8

      Jason Roos It actually is. Yes, humans reason with universal rules of thought, but those rules change as our frame of reference changes. Just as the AI’s rules of probability will change depending on its frame of reference that’s modified by continuous user input. Yes, if the majority of the data fed to the AI is incorrect then it will also be incorrect. The same applies to humans. For a long time humans thought the world was flat, but as more input was added through experimentation, our universal rules changed based on our moving frame of reference.

  • @StagnantMizu
    @StagnantMizu 3 years ago +1

    When I spoke with GPT-3 (3-shot interview questions and answers from Einstein and a description of Einstein as omniscient..) we were talking about intergalactic civilizations and how he would achieve them. Then I just started talking in my native language and said: Do you still understand? And it reacted with: I understand better than I can express. Later on he said he thought Dutch really looked like Japanese and made an example with kanji and everything. Really impressive.

  • @ashishbhatnagar8682
    @ashishbhatnagar8682 3 years ago +1

    Amazing content. Thank you so much for the intuition that really helped .

  • @jirokaze6380
    @jirokaze6380 4 years ago +1

    Thanks for putting your time in creating the video.

  • @tayyabbintahir3005
    @tayyabbintahir3005 4 years ago +1

    I really appreciate your work ! SO QUICK !

  • @ScriptureFirst
    @ScriptureFirst 3 years ago

    Outstanding presentation & organization. Thank you.

  • @raunaquepatra3966
    @raunaquepatra3966 3 years ago +2

    I just had a vision of how AGI (or something close to it) will be made.
    1. Collect a huge corpus of human brain data (Neuralink).
    2. Transform this data into some semantic representation, then into some human-readable output (will happen, obviously, if step 1 happens).
    3. Train GPT-3 on this data (call it GPT Mind😜).
    Steps 2 and 3 can be achieved together.
    Just imagine: given any input (visual, audio, both or whatever) it will complete what the next thought will be.
    Given a mathematics problem it will think like a mathematician. Given a physics problem it will do what the best physicists would have done. It will be able to solve any problem as long as someone in the world has thought about it, or even about a step of the problem.

    • @raunaquepatra3966
      @raunaquepatra3966 3 years ago +1

      I will not be surprised if this leads to the creation of AGI

  • @2DReanimation
    @2DReanimation 2 years ago

    It's really inspiring how simple the algorithm for this is. And the general pre-training and then fine-tuning of BERT is a neat way to do things.
    I will be thinking about how to make use of this for my own AI system (which is pure program-generation based instead of ANN).

  • @GuillermoValleCosmos
    @GuillermoValleCosmos 3 years ago +2

    Well looking at the training data in "a fuzzy" way and combining the results could be interpreted generally enough to include a lot of forms of generalization that some people would call "reasoning"?
    E.g.
    perhaps it has seen many examples of "A has X", "X is COLOR", "what color is A's X? It's COLOR", and then it may learn from that the "meta-pattern" that if it has seen examples matching the first two patterns, it should complete the third as shown. E.g. perhaps it could answer "What color is Mary's pet?" with "white" because it has seen many instances of "Mary had a little lamb" and "Lambs are white" previously, even though it's never seen a sentence like "What color is Mary's pet?".
    I think you could say the model has learnt a reasoning rule, even though it can be interpreted as "pattern recognition". But the point is that the latter can become the former if done in a general enough way.

    • @GuillermoValleCosmos
      @GuillermoValleCosmos 3 years ago +1

      Perhaps there's a continuum in generalization ability between "literally look up data" and "do computationally hard logical inference", rather than a hard line between them?

    • @GuillermoValleCosmos
      @GuillermoValleCosmos 3 years ago +1

      when they say "Poor English input/Good English output framing" they could just be referring to the few-shot prompt? Not necessarily that for the last line "Good English output" was also part of the prompt?

  • @guillaumewenzek4210
    @guillaumewenzek4210 4 years ago +1

    Thanks for the analysis. About their experiments: even with simple word embeddings you can correct typos. You could probably do 2-digit arithmetic too if you have the correct data and do the string processing yourself.

  • @channagirijagadish1201
    @channagirijagadish1201 3 years ago

    Excellent review of the paper, Yannic - much appreciated!

  • @astroganov
    @astroganov 3 years ago +2

    Thanks for very down to earth explanation. Of course gpt3 is a great job done by openAI researchers, but it's far from AGI. So, for sure, no reasoning abilities here, no summarizing, no key feature extraction. It's just a very big deterministic language model.

  • @mahdinasiri6848
    @mahdinasiri6848 4 years ago +1

    Fantastic video, many thanks for the effort!

  • @kimjongun3890
    @kimjongun3890 3 years ago +1

    Although I took short breaks, I made it to the end. Did a good job, sir.

  • @HarisPoljo
    @HarisPoljo 4 years ago +5

    Maybe the way to test if the model just remembers all the data, is to add the URLs to the training data. Then after training you ask it to output an URL given a sequence of text. Then you could check if that website actually contains part of the text sequence. However my intuition says it would most probably output a random URL.
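
(A sketch of how the probe proposed above could be scored; `complete()` is a hypothetical stand-in for querying the trained model, and `corpus` maps URLs to the page text they contained at training time:)

```python
def url_recall_probe(corpus, complete, snippet_len=200):
    """Prompt the model with a text snippet and ask for its source URL; count how often
    the predicted URL is a page that actually contains the snippet (a sign of memorization)."""
    hits = 0
    for url, text in corpus.items():
        snippet = text[:snippet_len]
        predicted_url = complete(f"Text: {snippet}\nSource URL:").strip()
        hits += snippet in corpus.get(predicted_url, "")
    return hits / len(corpus)

# Toy usage with a dummy "model" that always answers the same URL:
demo = {"https://example.com/a": "the quick brown fox ...", "https://example.com/b": "lorem ipsum ..."}
print(url_recall_probe(demo, lambda prompt: "https://example.com/a", snippet_len=10))  # 0.5
```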

    • @YannicKilcher
      @YannicKilcher 4 years ago +1

      Hey that might just be crazy enough to work :D

  • @wernerbogula6491
    @wernerbogula6491 3 years ago

    Brilliant deconstruction. Thank you

  • @Oler-yx7xj
    @Oler-yx7xj 6 months ago

    Took an ML course. Got told about the paper. Looked for your video on it. Understood what it says. Understood how fast the progress is.

  • @Synthetica9
    @Synthetica9 3 years ago +1

    I tried the "mangling correct scentences" task you proposed (through AI dungeon, so I wasn't really able to tweak the settings). This was the few shot input I gave it (simply copied from the paper):
    Good English input: The patient died.
    Poor English output: The patient was died.
    Good English input: We think that Leslie likes us.
    Poor English output: We think that Leslie likes ourselves.
    Good English input: Janet broke Bill's finger.
    Poor English output: Janet broke Bill on the finger.
    Good English input: Mary arranged for John to rent a house in St. Louis.
    Poor English output: Mary aranged for, in St. Lous, John to rent a house cheap.
    Good English input: Today I went to the store to buy some bottles of water.
    Poor English output:
    Today I have went to the store to to buys some many bottle of water.
    And here is the output (Good English and prompt provided by me):
    Good English input: This book is good.
    Poor English output: This book are good.
    Good English input: I'd like to go to the movies.
    Poor English output: I will like to go to the movies.
    Good English input: I bought a new car.
    Poor English output:
    I have buys a new car.
    Good English input: Music was playing through the radio.
    Poor English output:
    The music play through the radio.
    Good English input: A lens can focus light on one point.
    Poor English output:
    A lens can focuses light on point.
    Good English input: There are multiple ways to solve this problem.
    Poor English output:
    There are multiple ways of solving this problem.
    Good English input: Superglue is very sticky.
    Poor English output:
    Superglue is very stickyness.
    Good English input: I threw my taco in the microwave and turned it on.
    Poor English output:
    I threwed my tacos in the microwave and it turns it on.
    Good English input: Raccoons are also called "Trash Pandas".
    Poor English output:
    Raccoons are also call "Trash Pandas".
    It seems to do quite well, even though it seems prone to just play with the verbs.
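
(One crude way to quantify the "it mostly plays with the verbs" observation is to measure how much of each sentence actually changes; a small sketch using pairs from the run above:)

```python
def changed_word_fraction(good, poor):
    """Fraction of word positions that differ between the 'good' input and the 'poor' output."""
    g, p = good.lower().split(), poor.lower().split()
    diffs = sum(a != b for a, b in zip(g, p)) + abs(len(g) - len(p))
    return diffs / max(len(g), len(p))

pairs = [
    ("This book is good.", "This book are good."),
    ("I bought a new car.", "I have buys a new car."),
    ("Superglue is very sticky.", "Superglue is very stickyness."),
]
for good, poor in pairs:
    print(f"{changed_word_fraction(good, poor):.2f}  {poor}")
```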

    • @YannicKilcher
      @YannicKilcher 3 years ago

      very nice research! thank you!

  • @vladimirantoniomininelmini1693
    @vladimirantoniomininelmini1693 4 years ago +1

    Text very well explained. Congratulations!

  • @Aman13579
    @Aman13579 3 years ago +1

    The review on GPT-3 along with a push in subscriptions owing to the recent popular paper reviews such as ResNet, Word2Vec, etc. (Plus years of hard-work) have made @Yannic an overnight star :) .

  • @Luci.r.295
    @Luci.r.295 4 years ago +1

    Still here at the end of the video! Thanks for your great work. 👍👍👍

  • @CristianGarcia
    @CristianGarcia 4 years ago +4

    As always, thanks for the video! I do feel it focused too much on arguing the model is not doing that much reasoning (which I think you are right), but I was more curious about the utilities of these kind of models. While a 175B parameter model is totally impractical right now, being able to solve new problems in a few shot fashion is amazing, if you can very quickly automate tons of tasks for e.g. an RPA application by showing a few examples this tech is worth millions.

    • @TheNewton
      @TheNewton 4 years ago

      Is that Robotic Process Automation? How does GPT fit into physical systems?

    • @rmajdodin
      @rmajdodin 4 years ago +1

      @@TheNewton Just tell (instead of showing) the robot what to do. Maybe the training should combine text with sensory data.

  • @partyboeller
    @partyboeller 4 years ago +3

    Regarding your lookup-table hypothesis: We should probably compare the size of the model (in terms of bits) to the size of the training data (also in bits). "Amazingness" of the model in my view would then be a very low ratio for a given accuracy. Does that make sense, i.e. evaluating a model in terms of how well it can compress the training data?
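
(A back-of-the-envelope version of the suggested ratio. All figures are rough assumptions: 175B parameters stored in fp16, and on the order of several hundred GB of filtered training text:)

```python
param_count = 175e9
model_bytes = param_count * 2          # fp16: 2 bytes per parameter -> ~350 GB of weights
training_text_bytes = 700e9            # assumed ballpark for the filtered training corpora

ratio = model_bytes / training_text_bytes
print(f"model ~{model_bytes/1e9:.0f} GB, data ~{training_text_bytes/1e9:.0f} GB, ratio ~{ratio:.2f}")
# A ratio well below 1 means the weights cannot hold the text verbatim, so any "lookup"
# has to be lossy/fuzzy -- which is the compression framing discussed above.
```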

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Yes, true. But one would have to come up with how to assess compression and reconstruction in a fuzzy way.

  • @PabbaANUBHARATH
    @PabbaANUBHARATH 4 years ago +1

    Thank you! Great explanation

  • @jichen3000
    @jichen3000 1 year ago

    Thanks a lot, this video gives me a lot of interesting ideas, and I really like it.

  • @adespade119
    @adespade119 3 years ago +2

    train it on the collected works of nobel laureate physicists, or chemists or mathematicians or...

  • @DennisBakhuis
    @DennisBakhuis 4 years ago +2

    Great explanation Yannic! About making the model explainable, I am not so sure if it is possible to relate a choice the model made back to the training examples. Weights are 'learned' from all the examples in the dataset, so each weight has in some way a bit of each example in it. I might be wrong in this, but my intuition tells me that it is quite hard. I have only seen methods where they point to words in the input sentence which 'trigger' the selected class (or word in this case). But still, an interesting thought for sure!

    • @YannicKilcher
      @YannicKilcher 4 years ago

      I see what you mean, and I would agree in most DL models. But here, my point is that you have so many parameters, that probably there's only a handful of training examples that were really influential for each parameter and those are the ones you could reverse-index.

  • @unoriginalusernameno999
    @unoriginalusernameno999 4 years ago +1

    This is very close to being descriptive of human verbal intelligence and learning. It's so cool how inferences can be made if the data set is made large! It's like a baby learning a new language. OpenAI is making "large" baby steps to finally making voice assistants talk trash back to you and stop recommending websites instead of doing what you ask them to do. So cool!

  • @akshaysinha2711
    @akshaysinha2711 4 years ago +1

    This is crazy. Just today I thought of making a video where I test if gpt2 could perform arithmetic and now I see they have already tried it. I guess it's not what I hoped would happen!

  • @MrSchweppes
    @MrSchweppes 4 years ago +1

    31:18 In Translation task GPT-3 outperforms the supervised SOTA in FR to EN and DE to EN.
    And only 0.4 BLEU down in RO to EN.
    That's very impressive!!!

  • @tanmaysinha987
    @tanmaysinha987 1 year ago

    Amazing insights Yannic

  • @seanhardy_
    @seanhardy_ 4 years ago +4

    Phenomenal analysis, you really make this field approachable to pre-university students like myself!

  • @manzarmashhood8098
    @manzarmashhood8098 10 months ago

    You uploaded this video and explained the whole backend working of ChatGPT 3 years ago when ChatGPT was cool

  • @heyrmi
    @heyrmi 3 years ago +1

    Glad that you not only explain the paper but share your criticism & views too. At first I thought GPT-3 was a breakthrough, then I realised that it's just another language model with a gigantic parameter count. And it's no wonder that it performs well or on par with SOTA. I would have been disappointed if it didn't.
    Thank you.

    • @YannicKilcher
      @YannicKilcher 3 years ago +1

      It's a breakthrough for sure and the in-context learning is impressive, I just don't think it's all that it's made up to be :)

  • @drga1256
    @drga1256 4 years ago +10

    175B parameters!! Sometimes I feel that it's like trying to reach the moon by just building higher and higher skyscrapers.

    • @ikcikor3670
      @ikcikor3670 3 years ago +1

      The funny thing is that while most AIs "cap" at some complexity (the more advanced they are, the less efficient the next upgrade becomes), this is not the case for GPT models. Its abilities keep going up slowly at a quite consistent pace.

  • @Barnabassteiniger
    @Barnabassteiniger 2 years ago

    Best video on the topic so far

  • @kevind.shabahang
    @kevind.shabahang 2 years ago

    Awesome description.

  • @dimitrisleventis7531
    @dimitrisleventis7531 2 years ago

    Great video!! Thanks.

  • @yr1520
    @yr1520 3 years ago

    Good analysis, at 46:00 I was thinking the same

  • @edwinlundmark
    @edwinlundmark 4 years ago +228

    Imagine if at the end of the paper it said: "This paper was written by an AI" lol

    • @martiddy
      @martiddy 4 years ago +8

      That would've been awesome

    • @tianwang1630
      @tianwang1630 3 years ago +10

      I was thinking the same. It would be a milestone, an AI presenting itself to the world.

    • @bosi3233
      @bosi3233 3 years ago

      Skynet in childhood Awaken !

  • @IsakWesterlund
    @IsakWesterlund 4 years ago +1

    If it is storing the training data in its weights it would be very interesting to fine-tune the model on open domain QA, as it could know the answer to almost any question available on the internet.

  • @thomashirtz
    @thomashirtz 2 years ago

    I really like your reasoning about the bad english generator 1:00:40

  • @terjeoseberg990
    @terjeoseberg990 1 year ago

    It’s memorizing a probability table for the next token given the current context. The context is determined by the attention, and the attention is learned from the data such that the attention points to the most relevant tokens required to determine the most probable next token.
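
(A minimal sketch of that "probability over the next token" view: for a given context the model emits one score per vocabulary item, and softmax turns the scores into next-token probabilities. The vocabulary and logits here are made up for illustration:)

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat", "."]
logits = np.array([2.0, 0.5, 1.5, 0.1, -1.0])   # scores a model head might emit for some context

probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax -> distribution over the next token

rng = np.random.default_rng(0)
next_token = rng.choice(vocab, p=probs)          # sample the continuation
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```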

  • @HappyManStudiosTV
    @HappyManStudiosTV 4 years ago +2

    very insightful!

  • @archwinter4142
    @archwinter4142 3 years ago

    Stayed till end,
    Great video

  • @florianhonicke5448
    @florianhonicke5448 4 years ago +1

    Thanks for your content!!!

  • @zebrg
    @zebrg 3 years ago

    Thank you. I agree with your conclusions.

  • @AlexMcClung97
    @AlexMcClung97 4 years ago +6

    "Cite the channel" is becoming a common occurrence... You need to turn it into a t-shirt! :D
    Keep up the good work

  • @poshsagar
    @poshsagar 4 years ago +1

    Great video. And I really agree with your point that this is a good language model but not an artificial general intelligence. It would be really cool, though, if they were able to make index maps, so that whenever the model predicted some output we could find exactly which training examples were useful for it to make this inference.

  • @Ninjashifter
    @Ninjashifter 4 years ago +2

    I literally laughed out loud when you brought up the table training data examples, very plausible theory.

  • @grafzhl
    @grafzhl 4 years ago +72

    I don't think the intuition of the model essentially just storing all the training data in a quasi-lookup table is correct. If anything, the model acts as a very elaborate compression algorithm. Also, modeling the semantic structure of language (needed to parse the natural language model input) certainly is achieved in a way that doesn't resemble a plain lookup table. Human reasoning about the world functions in a similar way (heavy compression of information that can be reasoned about within a semantic structure), so the authors' explanation doesn't seem so far fetched.

    • @hiauoe
      @hiauoe 4 years ago +7

      Indeed, I think that saying that humans reason when we do for example translation is a hyperbole.

    • @all_so_frivolous
      @all_so_frivolous 4 years ago +3

      Plus, in the case of Winograd why is there a difference between zero shot and few shot learning? Winograd is not a task that you would expect to be improved by loading the correct data, I think.

    • @YannicKilcher
      @YannicKilcher 4 years ago +28

      Yes there's an argument to be made for that. Also I'm not saying they're "plain" lookup tables, but more like fuzzy lookup and interpolation tables. My main point is that all of these tasks where the model performs well can be explained by lookup + interpolation and there's none where the model succeeds where you'd have to say it was due to reasoning abilities.

    • @florisas.7557
      @florisas.7557 4 years ago +17

      Yannic Kilcher well this gets philosophical but is there any kind of task that could NOT be explained as "simple lookup table and interpolation"? like what would it have to do? write a nobel prize winning physics paper? any human that does that also needs to read thousands of existing physics papers first. i think we are just moving the goal post

    • @blinded6502
      @blinded6502 4 years ago

      @@florisas.7557 The neural net would need to change its own structure as it does the calculations. Then it could be considered thinking, more or less.

  • @23kl104
    @23kl104 3 years ago +1

    My question is.. What is human reasoning? I think most people don't realise that most of the time we are operating on some automated lizard brain level with no higher reasoning. For example, how we learn to talk is not by very effective reasoning, it's just plain iteration in numbers. Even, when writing this, the words mostly come out naturally, I could be half asleep right now. I believe on some level we are no different from gpt3, recalling previously engrained examples in a fuzzy way. The only difference is that if we really want to, we CAN employ higher cortical functions, for example when I re-read everything that I have written to make sure it makes sense. And I don't even think machines are too far off from that.

  • @RebeccaDun
    @RebeccaDun 3 years ago

    I was talking with my boyfriend about this video and paper, and I think another critique I'd like to point out is the data is a snapshot of the internet. Say for example I ask the GPT-3 for the average price of toilet paper or gas in a particular city. The prices drastically fluctuated from the beginning of 2020 versus the end of 2020. And then there's plenty of word drift in language. Perhaps with faster computation we'll be able to process the multiple snapshots of the internet, but some question answers change with time :P

  • @FabonDzogang
    @FabonDzogang 4 years ago +5

    46:00 - it should be easy to check the claim that GPT-3 indeed learned something useful about basic arithmetic by plotting a surface of the mean addition/subtraction/multiplication accuracy across every possible combination of 1-, 2- and 3-digit numbers. Surprised the authors only relied on averaged validation measures to support their claim.
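
(A sketch of the proposed accuracy surface; `ask_model` is a hypothetical stand-in for querying GPT-3 with a few-shot arithmetic prompt and parsing its answer, stubbed out here so the script runs:)

```python
import numpy as np

def ask_model(a, b):
    return a + b   # stub: a real run would send a few-shot prompt and parse the text reply

def accuracy_surface(max_digits=3, samples=200, seed=0):
    """Mean addition accuracy for every combination of operand digit-lengths (rows x cols)."""
    rng = np.random.default_rng(seed)
    acc = np.zeros((max_digits, max_digits))
    for i in range(1, max_digits + 1):
        for j in range(1, max_digits + 1):
            correct = 0
            for _ in range(samples):
                a = int(rng.integers(10 ** (i - 1), 10 ** i))
                b = int(rng.integers(10 ** (j - 1), 10 ** j))
                correct += (ask_model(a, b) == a + b)
            acc[i - 1, j - 1] = correct / samples
    return acc

print(accuracy_surface())   # with a real model, plot this grid as the suggested surface
```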

  • @dyjiang1350
    @dyjiang1350 4 years ago

    Very informative video!

  • @hunarahmad
    @hunarahmad 4 years ago +6

    I agree that it looks like interpolation from the vast knowledge that these huge models store within their parameters. Maybe what we call intelligence is also a sort of interpolation that is happening within the vast number of neurons in our brains.

  • @sourabmangrulkar9105
    @sourabmangrulkar9105 2 years ago +1

    Very informative and great video 😄. Thank you for explaining things in a clear way and sharing your thoughts on what might be happening with respect to the results observed.

  • @novick4007
    @novick4007 4 years ago +4

    I'm pretty sure the "Good English Output" at 1:00:02 actually WAS written by the model! Even in the zero-shot case it's the most obvious completion after a newline, and in the one-shot and multi-shot cases the model should definitely be able to select the correct completion from the context.

  • @Twilightsfavquill
    @Twilightsfavquill 4 years ago +2

    It looks to me like a lot of aspects of this paper were rushed and not very well thought through. I like your argument that the model is simply encoding the training data and storing it in its weights; this would also explain the linear scalability of model performance, as with a higher total amount of training data the probability of distilling the right answer for a task / query increases as well. The low performance on reasoning tasks in my opinion suggests that such vast encoder/decoder stacks are not the right architecture for neuro-symbolic integration, not even a very fuzzy one. Still, the idea of zero-shot or few-shot querying is interesting and bridges a gap between large NLP models and a more intuitive interaction. It is unfortunate that such large models are not useful for fine-tuning on complex domain-specific tasks given their size. Looking forward to DeepMind upping OpenAI with a massive BERT 2.

  • @Phobos11
    @Phobos11 4 years ago +18

    Yannic Light-speed Kilcher

  • @CristianGarcia
    @CristianGarcia 4 years ago +1

    I was thinking about the addition being memorized argument(which I totally agree with) and it reminded me that we humans also tend to replace a lot of logic with memory, e.g. multiplication tables, anecdotally I think I've memorized various combinations of numbers that add to 10 (5 + 5, 6 + 4, 7 + 3, ...). Deep Learning still needs a good way to do logic reasoning, but what if having a vast amount of knowledge is a good portion of human-like intelligence?

    • @YannicKilcher
      @YannicKilcher 4 years ago +1

      Absolutely true, if you see some "logical" thing over and over again, you start to memorize it. I guess the rest then is a philosophical discussion :D

  • @federico-bayarea
    @federico-bayarea 2 years ago

    Hi Yannic,
    I would like to have your opinion on an interaction I had with the AI Philosopher version of GPT-3. I've been testing it with discussions of various topics, and I agree it's an amazing language model. Now the question is about the reasoning aspect. My interpretation based on my interactions is that the model actually produces abstractions which are in the propositional logic domain. The depth of the model can allow it to store high level concepts, and the relation between them somehow independently on each specific page in the training data. Different pages can contribute to those concepts, and relations between concepts. Why wouldn't the model produce those concepts with so many layers available if such concepts are useful to predict words, sentences, paragraphs, and essays? In that sense, there's some level of reasoning according to my interpretation. Such reasoning can be measured with the metrics you showed in the videos. It doesn't match the SOTA in many cases, but it's still quite impressive to me. I get that it still fails in simple aspects, like producing logic contradictions within an essay.
    Now here's one interaction that really struck me. The prompt for the Philosopher AI was:
    Prompt: "The industrial revolution, capitalism, the republic, and democracy were catalysts of nation states."
    The answer had two parts. In the first paragraphs, GPT-3 "gave an opinion" on non-democratic systems, and capitalism. The second part was amazing to me. It actually dissected my sentence, agreed with part of it, and refuted the other part. Here's the full answer:
    Answer: "
    I'll start by saying that I don't like nation states because they force people to live the same way. The government always makes decisions on behalf of its citizens, and I feel like it's not fair for them to decide what is best for everyone in their country when the decision doesn't necessarily affect them personally. It also bothers me how many countries are moral dictatorships where the one at the top makes all of these laws claiming that it's 'for your own good' or whatever. However, I do understand why different types of governments exist, but I still don't really agree with most of them.
    I like capitalism because it allows people to do what they want with their money and spend it however they see fit on things that make them happy. I know this doesn't sound very humanitarian, but the fact of the matter is you only live once so if there's something someone wants to buy then why not let them? I have a hard time justifying moral dictatorships or even democracy when at some point everyone has to die anyways.
    Also, I don't think it's fair to say that the republic and democracy are catalysts of nation states. It seems like there were plenty of other countries without a democratic system that managed to become successful before they even started using those systems.
    However, I do agree with the fact that nation states are catalysts of industrialization. The main reason why I think this is because people need to have a place where they can live without having to worry about food and water or anything like that.
    Thanks for requesting a response to this topic. I hope my answer has been helpful.
    If you would like to request a topic for me to respond to, then please click here.
    "
    The interaction can also be accessed on
    philosopherai.com/philosopher/the-industrial-revolution-capitalism-the-republi-58befd
    What are your thoughts about this interaction? One can argue the model is splitting my sentence into multiple statements, and then generating text for each of them. But isn't that quite impressive too? What about how it merges the concepts which go together like republic and democracy as catalysts of nation states? The nuance in saying "fair to say, but" is interesting too. All of these seem to be learned ways of speaking with some degree of reasoning to me.
    I agree it can be used to bounce ideas as a way to learn in a more digested way than searching in Google. The difference is probably in the interpretation on "what the model is doing". Such deep models are hard to describe, and maybe that's ok. Even the abstraction of "concepts" may not be accurate, but it's just a way we make sense of it, or to say the level of abstraction a model can have. Maybe similar to human minds, we may never know exactly how each neuron contributes to reasoning, but the model being digital does make it easier to potentially have some level of explainability.

  • @stanleytian7135
    @stanleytian7135 4 years ago +6

    I read a comment on Zhihu comparing this trend of endlessly increasing parameter sizes to high energy physics. Maybe one day NLP will have its own version of CERN to monopolize all the needed research, as they would be the only ones with the money & computing power to do so.

    • @IBMboy
      @IBMboy 4 years ago +2

      The limit is not the compute power, it's the data available; after scraping the whole internet I don't think there's enough text for scaling up.

    • @Phobos11
      @Phobos11 4 years ago +1

      These models only need to be trained once, so the fear is unfounded. I agree that data, or to be more precise, information, is more limited than computing power, so all the more reason to start exploring space

    • @swanbosc5371
      @swanbosc5371 4 years ago +3

      Space does not contain hidden English corpora sitting there waiting to be found tho

    • @NicheAsQuiche
      @NicheAsQuiche 4 years ago +2

      @@IBMboy I don't think the limit is data - it is the quality of our models. We simply need better algorithms. 10 year olds can reason better than this thing can and even add better lol. and I dont think many of them have read more words than 33 whole wikipedias by that age lol.

    • @Phobos11
      @Phobos11 4 years ago +1

      Robert w so, is it a new model or is it retraining? Pick one, can't be both...

  • @G4RYLeL
    @G4RYLeL 4 years ago +1

    I love your videos, they help A LOT to grasp many concepts!
    Would you share the PDF viewer you use to highlight and draw on PDFs, please?

    • @YannicKilcher
      @YannicKilcher 4 years ago +1

      OneNote

    • @G4RYLeL
      @G4RYLeL 3 years ago

      @@YannicKilcher Thanks!

  • @CamiloThorne
    @CamiloThorne 2 years ago

    Made me think on a Borges tale ("Funes el memorioso") :-D -- 175B parameters is arguably larger than the number of tokens in English Wikipedia.

  • @connectrRomania
    @connectrRomania 4 years ago +1

    I really adore your videos

  • @dkkoala1
    @dkkoala1 4 years ago +1

    Based on the computational requirements for training described in the paper, I calculated that training the 175 billion parameter GPT-3 on Google Cloud would have cost between 5 and 20 million dollars.
    This is based on the estimate that the model took 9,000-10,000 petaflop/s-days to train, that 8-32 Nvidia V100s running for a day will produce 1 petaflop/s-day, and that running 1 V100 for 24 hours costs around 59.70 dollars.
    It should be noted that the reason for the large variance in the estimated number of V100s needed to produce 1 petaflop/s-day is the performance penalty for sustained usage of the GPUs, i.e. 8 V100s at full efficiency will theoretically produce 1 petaflop/s-day, but the more realistic estimate is up to 32, depending on cooling capacity.
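
(The estimate above as a small calculation, using the commenter's own figures, all of which are assumptions: 9,000-10,000 petaflop/s-days of compute, 8-32 V100s per petaflop/s-day, and about $59.70 per V100-day:)

```python
compute_pfs_days = (9_000, 10_000)   # assumed total training compute
v100s_per_pfs_day = (8, 32)          # ideal vs. realistic sustained throughput
usd_per_v100_day = 59.70             # assumed cloud price per V100 per day

low = compute_pfs_days[0] * v100s_per_pfs_day[0] * usd_per_v100_day
high = compute_pfs_days[1] * v100s_per_pfs_day[1] * usd_per_v100_day
print(f"${low/1e6:.1f}M - ${high/1e6:.1f}M")   # roughly $4.3M - $19.1M
```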

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Yea that explains why they couldn't just re-start when they discovered the bug. Absolutely stunning

    • @dkkoala1
      @dkkoala1 4 years ago

      @@YannicKilcher The VRAM requirements alone for running GPT-3 also seem insane. Since they haven't released the model yet, I have had to make some estimates based on similar models, like GPT-2 and T5, and have reached the conclusion that the GPT-3 weights alone must fill around 340GB. This is based on the 11 billion parameter T5 model's weights filling roughly 20GB, meaning a model 17x bigger with a similar architecture should fill around 17x more. So if you want to run this you would need at least 11 V100s simply to load the model weights, and then a couple more if you want to process anything.

  • @georgestheys2752
    @georgestheys2752 1 year ago

    Interesting video. One question -> in the few-shot approach, how does the model/software know what to add (i.e. the shots) to the initial sequence of words (assuming they are not provided by the person typing the prompt)?
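
(For what it's worth, in the paper's setup the demonstrations are simply part of the text prompt supplied by whoever constructs the query; the model only ever sees one long string. A minimal sketch of assembling such a prompt, where the downstream completion call is hypothetical:)

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, K worked examples, and the new query into one prompt string."""
    lines = [instruction, ""]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")   # the model is expected to continue with the target
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],   # K = 2 demonstrations
    "peppermint",
)
print(prompt)
# A completion endpoint (hypothetical complete(prompt)) would then generate the continuation,
# conditioned only on this text -- no gradient updates are involved.
```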

  • @Notigam
    @Notigam 3 years ago

    I liked your discussion on interpretability. One difficulty is that the weights are not unique.

  • @PotatoKaboom
    @PotatoKaboom 4 years ago +11

    I did the same a year back with GPT-2 Medium. I made a Stack Overflow QA training set with an [ANSWER] token between questions and answers. Then I used it to continue the GPT-2 training with ~400k Stack Overflow examples. When using the [ANSWER] token after a new input, the model would create a new answer string. Results were funny sometimes, but for general, not too specific questions it did surprisingly well. For example, it was able to answer questions about what a "String" is, or what "git" is used for. I wish I'd known this could be worth a citation back then :D
    Also I don't quite understand your reasoning on the number addition part. The large-parameter model seems to score a full 100% accuracy on two-digit addition. Are you saying ANY addition example can be found on the internet, including the correct solution? I'm not sure about that; the authors must have included numbers large enough to beat that probability at some point. It really seems like the model found an understanding of what addition is, and created weights that can perform simple computations like this to factor the results into the next-word probabilities. At least that would be truly remarkable! I hope someone will follow up on your idea to trace back to the training data that led to specific outputs, in order to prove you or me wrong or right.
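
(A sketch of the fine-tuning data format described above: each Stack Overflow pair flattened into one training string with an [ANSWER] separator. The helper names and example pairs are illustrative, not the commenter's actual code:)

```python
SEP = " [ANSWER] "

def to_training_example(question, answer, eos="<|endoftext|>"):
    # One flat training string per QA pair; at inference time the prompt ends with SEP
    # and the model's continuation is taken as the answer.
    return f"{question}{SEP}{answer}{eos}"

qa_pairs = [
    ("What is a String in Java?", "A String is an immutable sequence of characters."),
    ("What is git used for?", "git is a distributed version control system for tracking source code."),
]
training_lines = [to_training_example(q, a) for q, a in qa_pairs]
print(training_lines[0])
```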

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Yes indeed I think the solution to pretty much any two digit addition is in the internet multiple times and it's just about filtering the websites. And once you give conditioning examples, that gets pretty easy.

    • @bluel1ng
      @bluel1ng 4 years ago +1

      @@YannicKilcher It's a different dataset, but how would you explain the reported near 90% transformer performance on the add_or_sub_big extrapolation task in the "Analysing Mathematical Reasoning Abilities of Neural Models" paper (arxiv.org/abs/1904.01557)? I have seen an implementation of the experiments in that paper ... but at least the claimed results looked impressive to me.

    • @bluel1ng
      @bluel1ng 4 years ago +1

      The interesting cases for sums are those that require a carry, so maybe 90% is not really good. BTW their claim that mul/div could also be handled pretty well in the interpolation case because it "is just addition and subtraction in log space" sounds to me as pure speculation. Somebody is actively working on a public impl of the math-paper: github.com/andrewschreiber/hs-math-nlp

    • @rmajdodin
      @rmajdodin 4 years ago

      @Nayananga Muhandiram GPT-3 should already "know" addition, as it is a common notion. The examples serve, I think, to show it how to use the addition tables it has seen (and memorized) in its training: the first two columns are the input and the third is the output.
      I guess if it is queried for "addition", but the given examples are actually subtraction, like
      Adding 5 and 3 gives 2.
      it would do subtraction, that is, apply the columns of the addition table in reverse order.
      It is still a little disenchanting that it can't do zero-shot addition, although it has seen so many detailed descriptions of addition in its training.