Okay, the superposition bit blew my mind! The idea that you can fit so many perpendicular vectors in higher dimensional spaces is wild to even conceptualize, but so INSANELY useful, not just for embedding spaces, but also possibly for things like compression algorithms! Thank you so much for this truly brilliant series!
When I first read that Toward Monosemanticity paper, it was like a revelation. I think it's highly likely encoding information through superposition explains biological signaling models that have remained unexplained for decades. These concepts aren't just making waves in tech.
But is it really that surprising? I'd actually be amazed if the internal representation would be highly structured. Asking where a LLM stores the fact that Michael Jordan is a basketball player is a bit like asking where your brain stores the value of your current age. That's all over the place and not a single "variable" that sits in a group of a few neurons.
This video is pure gold! This complex topic is just so clearly and correctly explained in this video! I will show this to all my students in AI-related classes. Very highly recommended for everyone wanting to understand AI!
@@iloveblender8999Pedagogy is the structure or chronology of the education which tends to try to make sure all prerequisite knowledge is covered to a competent degree before the next step.
Oh, oh, I can be more correcter than both of you! 'pedagogy' doesn't actually seem to be an extremely precisely defined word. But, based on a few seconds of search, both of you are QUITE wrong, and OP used the word correctly. Merriam-Webster on pedagogy: 'the art, science, or profession of teaching'. Wikipedia on pedagogy: 'most commonly understood as the approach to teaching, is the theory and practice of learning'. The word pedagogy itself indicates nothing about who is involved, what is taught, or how the teaching occurs.
@@christophkogler6220 “The approach to teaching … theory and practice of learning” If you can’t rationalize someone discussing what is taught and when for best learning and teaching then don’t even google it.
During the whole video I was thinking "ok but it can only encode as many 'ideas' as there are rows in the embedding vector, so 'basketball' or 'Michael' seem oddly specific when we're limited to such a low number". When you went over the superposition idea everything clicked, it makes so much more sense now! Thank you so much for making these videos, Grant!
I was doing exactly the same thing in my head - i knew there had to be a catch that I couldn't see and then there it was, hiding right behind the 12288-dimensional teapot.
The number seems indeed low, but it is comparable to the number of "word" tokens in GPT3 vocabulary (50k), and 'basketball' or 'Michael' are ideas simple enough to possibly be represented by individual tokens. But of course the superposition trick allows to represent much more nuisanced and niche ideas.
@@beaucoupdecafeine Let me take you through a simple example. Let's say you have 8 outputs you want to encode for. {dog, cat, table, chair, bucket, mouse, eagle, sock}. A first idea is to use eight numbers (a vector of size 8) and tie each element in the vector to a certain output. For example 10000000 = dog, 01000000 = cat, 00100000 = table, ... This is called one-hot encoding. But let's say you are limited to an output vector of size 3, so like xxx where each x can be a number. Can you still encode all 8 outputs? Yes you can. You can use every combination of 0 and 1. For example 000 = dog, 001 = cat, 010 = table, 011 = chair, ... But now instead of 8 outputs, you have 500. Can you still encode 500 outputs using a vector of size 3? Yes. Just use real numbers instead of binary. For example (0.034, 0.883, -0.403) = dog, (0.913, -0.311, 0.015) = cat, ... (0.664, -0.323, -0.844) = baseball. As far as I know, Large Language Models (as well as most other Natural Language Processing models) use a fixed vocabulary. So an LLM may have a vocabulary size of 32,000 words which each map to a unique vector.
Yes, and the dimensions he used were direclty aligned with 'Michael' and 'Jordan' - this wouldn't really be the case, as it would be an inefficient use of weights. Michael would instead be a combination of the ~12,300 feature dimensions.
The script you ran with randomly distributed vectors was mind-opening, let alone once tuned - that's incredible. It's such an awesome quirk of high dimensions. I spent a good chunk of yesterday (should have spent a good chunk of today but oh well) working on animations to try to communicate traversing a high dimensional configuration space and why gradient ascent really sucks for one particular problem, so the whole topic couldn't be more top-of-mind. (my script already contains a plug for your previous video with the "directions have meanings" foundation. this series is so good!)
This is simply the most comprehensible explanation of transformers anywhere. Both the script and the visuals are fantastic. I learned a lot. Thank you so much.
Broo, just watched chapter 4 and re-watched the previous three chapters. Your first three videos had just dropped when I was learning about neural networks in grad school, like perfect timing. Took a couple of years off drifting around. Now, I'm going back into machine learning, hopefully gonna do a PhD, so I was re-watching the video series, then realized this one was rather new and got curious, noticed the last chapter 7 is missing, then check your home page and lo and BEHOLD you've released chapter 7 like 53 minutes ago. Talk about impeccable timing. I feel like you just dropped these just for me to go into machine learning haha... kinda like "welcome back, let's continue". Anyway thank you so much for taking me on this wonderful journey.
That's how universe attracts curious people. There was a task given for me to give a lecture on some ML topic around 4 months ago. Out of instant gratification i have chosen to speak about GPT architecture. Literally i was scared intially and was doing some research and guess what just after 2 to 3 days, this man out of nowhere suddenly dropped a bomb of starting a series on transformers. I was so happy at that time and it helped me doing good amount of research amd seminar also went well.....
I know this is such a late response but I wanted to comment on how amazingly human this process is. Whenever we hear or read words, our brain immediately starts to guess the meaning by using the very same process mentioned in this video series. For example, I'm sure many of you have seen the pangram "The quick brown fox jumps over the lazy dog". Now if you come across a sentence that starts with "The quick" your brain might come up with many different ideas but the pangram would be included. As soon as you interpret the word after "brown", then your chances for your brain to guess the pangram goes up. I believe the same is true for thinking of "solutions" of words to output also.
@@JohnBerry-q1h It has been a handwriting drill for many, many decades, too, and remains so even in the present day. Specific typing drills are not particularly useful, since they result in people exhibiting muscle memory rather than focusing on typing. For example, I can type my five-word computer password extremely quickly just from memory, because it's a simple repetitive task that I do many times a day - sometimes I even briefly forget the actual words that make it up, since I'm subconsciously recalling finger movements, not thinking about the words themselves - but my typing in general is less rapid than that.
What's even more crazy is when you look at the studies that have been done on people with a severed corpus collosum. The experiments that were performed suggested that the human brain consists of many "modules" that do not have any kind of consciousness themselves, but are highly specialized at interpretation and representation. It seemed like the way "consciousness" works in humans is that the conclusions of these different "modules" within the brain all arrive at the decision-making or language parts of the brain, and those parts work VERY similarly to these LLMs: the generate a thought process or set of words that explains the input from the modules, and that's it. For example, one module might identify something as dangerous, and another module might identify something as being emotionally significant. Those conclusions, along with many, many other "layers", arrive at the language parts of the brain, and that part of the brain creates a "story" that explains the sum of the "layer" conclusions. Is this the only way to interpret the results of the experiment? Of course not, and it's not they way they were interpreted originally when first performed. But we also hadn't really invented any AI at that point either. The way that these models represent information and process data seems to me to be MUCH closer to how human brains work than I think most people realize. The human brain probably has at least one type of layer (such as attention or MLP) that is not currently represented in modern AIs, and is also even more parallel, allowing for the possibility of "later" MLP layers or attention layers to cross-activate other layers while they are happening.
Animation. 5/5 Didactic 5/5 Pedagogic 5/5 Voice. 5/5 Knowlege 5/5 Uniqness 4/5 Just beatiful work ❤❤❤ keep it up.. I will send this to eveyone that appreciate the work.
I watched the end of this series now, and I'm just blown away by the maths of it all. Like how simply each step is, yet how it all adds up to something so complex and WEIRD to wrap your head around. It's just so fucking fascinating man. Really makes you think about what WE even ARE.
The combination of Linear + RELU + Linear function, and adding the result to the original, is known as Residual Networks. As 3b1b demonstrated in this video, the advantage Residual Networks have against a simple perceptron network is that the layers perturb (nudge) the input vector, rather than replace it completely.
Thanks for the pointer! I remember MLP from 10 years back and I couldn't recall the "adding part". Btw. I'm also puzzled by a different thing -- we call it MLP, i.e., *multilayer* perceptron. But it seems that in a single MLP block there is only one layer. It's called multilayer, because it's replicated. I.e., it's N layers, each consisting of "(multi-head) attention" + "1 layer of residual perceptrons". Is my understanding correct? Do you know whether there are deep nets that actually use multiple layers in a single block? Why or why not would they do it?
@@fluffsquirrel I'm not sure I understand your question. AFAIU all these are dense networks. My question was that it seems from the video that in each layer there is a single "sub-layer" of attention followed by a single sub-layer of residual perceptrons. So, the perceptrons are never stacked directly after each other (there's always attention between them). I am wondering whether it would be beneficial to have more consecutive sub-layers of perceptrons (after each attention). Are there deep nets with such architecture? If not, is it because it's *impractical* (too hard to train, too many parameters), or maybe it doesn't really make sense *conceptually* (there is nothing to gain)?
@@tymczasowy I apologize as I am not well-versed in this subject, but I too am interested in this idea. Hopefully somebody will come along with an answer. Good luck!
I got two relevant references that help answer this question with great examples: 1. Locating and Editing Factual Associations in GPT (2022) Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov 2. Analyzing Transformers in Embedding Space (2022) Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant
Yeah i remember the first paper was even on YT with Yannic Kilcher interviewing original authors. btw have you guys seen that write up "What’s Really Going On in Machine Learning?" which draws parallels between cellular automata and highly quantized MLPs. very nice read.
@@naninano8813 the Yannick Interview was great. I have also been fortune enough to have Yonathan Belinkov visit my university and give a guest lecture on 3 adjacent papers which I had quite a few questions on to discuss.
I have been re-watching the Superposition hypothesis for the last 15 minutes. It still blows my mind. Grant, your work is so beautiful, to me you are the Da Vinci of our time.
There are no words for how good this is and all of the 3Blue1Brown videos are. Thank you, Grant Sanderson, you are providing a uniquely amazing service to humanity.
The way you have presented the mathematics on youtube, no one has done before. Everyone just tells fancy words and you give the essence how they work. Great video Grant, waiting for more videos from you😊
AI researcher here, really like the explanation of ReLU! My recent experiments show that dropping all absolute gate values below a threshold leads to universal performance gains across different model architectures!
Thank you very much for yet another wonderful explanation! While focussing on Training feels reasonable, I would also love to learn more on positional encodings. The sinus used in the original paper and the more recent versions would surely make for some interesting visualizations - by simply reading through the papers I am lacking the intuition for this.
I'll consider it for sure. There are many other explainers of positional encoding out there, and right this moment I have an itch for a number of non ML videos that have been on my list.
@@3blue1brown Thanks for considering :) I totally understand that there are non ML videos to be made - do whatever feels right and I'll enjoy it nonetheless
Just watched the whole series, took me two days but it was so insightful and easy to understand compared to other sources. Thank you! Those animations and the approach you take to explain is so helpful!
hey Grant im sure i cant understand everything from this series so im skipping this video, but the purpose of this comment is thank you for creating manim python library because you started(i would say) "animation for education" to entirely different level and encourage many people to do that in your style of animation using manim, because of you indirectly im learning many things in youtube, thanks again and i wish you have more success in your carrier with your loved ones
I didn`t watch the other videos (yet) but I could totally pretend to understand a lot of the things in this video, which was mind blowing. Try it out, you can always come back later to try and understand more.
Nah, I am certain you can understand this. If you know Python (optional) some calculus, and linear algebra (kind of optional), you're good. It's just overwhelming at first. The hard part for me really is trying to understand wtf researchers are talking about in papers. Can't tell if researchers are pedantic, or I'm just too dumb, or both.
@@Dom-zy1qymath is not optional here dude I agree that if you just want to learn the theory python is optional but saying calculus and linear algebra are kinda optional that’s bull shit ai is a mathematical field As for papers yes they can be overwhelming but reading a academic paper isn’t a quick process you need to read paragraphs repeatedly to understand them properly
Grant does so good of a job explaining these in interesting manner - I bet 3b1b has measurable impact on a whole humanity's grasp of Math at this point.
So glad you touched on interpretability. Anthropics towards monosemanticity paper is one of the most intriguing this year. Using sparse autoencoders to extract monosemantic neurons is just genius.
When I was giving a talk last year on how transformers worked... I envisioned something like this video in my mind. But of course it was amateur hour in comparison, both visually and in explanation. You are a true professional at this, Grant.
It is one of the most striking moments in my life knowing that the seemingly pure mathematical idea of loosening the notion of "perpendicular" a bit, results in the reason of the efficiency of information in how the universe is made and perceived. Life is awe.
Huge thanks for the work you are doing! You are one of the few channels that explain LLMs that well. I also like Andrej Karpathy's videos but yours are more for building intuition, which is also great and super helpful! I'm very curious though what was that thing with the green glass holograms
Your video’s are simply mind blowing… the effectiveness with which you succeed to make us ‘visualise’ the mechanics of AI is truly unique ! Keep on the good work 👍
"How facts are stored remains unsolved" I'm only a minute in but it's kinda wild to think we have invented machines so complex that we can't just know exactly how they work anymore, we basically have to do studies on them the same way you would do a study on a human brain which won't lead to a full understanding, just a partial understanding. Maybe some problems are so complex we can never truly know the complete inner workings of everything. Maybe we are hitting into the limit of human cognition itself. A human being simply can't keep all the facts straight in their head all at once to be able to come to a complete understanding of something this complex. I don't know, just rambling... It's just wild that we invented something that works, but if you ask HOW it works, that's literally an unsolved problem. We flat out don't know exactly how it works.
How little do we really understand? Is it naive to assume that we can start to develop more algorithms whose purpose is to analyze the behavior of neural networks? Surely there's some way to map a concept's embedding in a perceptron in some sort of rigorous way.
But of course, neural nets are a loose imitation of our own brain. Since we don't know where facts are stored in those, either, it's not unintuitive that that we wouldn't know where they are stored in things that operate similarly. You said that we've invented something, and that's only true to a point. It seems to me that we're doing is as much "discovering" as we are "inventing".
The "mystery" surrounding neural networks is VERY different from that of the brain. Sure, having a high-level understanding of how it structures the different weights is very headache-inducing, but it is not a mystery. We still know exactly everything going on conceptually, and grasping how information is stored, etc. is not a mystery in my opinion, even though it's complicated. Understanding the human brain on the other hand is much more complicated, because for every person, things are "experienced". For instance, there is no mystery in following the process going from photons hitting your retina, to the signal reaching the visual center of your brain. However, how that visual center somehow makes you "see" something is where the mystery is. Questions you can ask yourself to understand this are: Why do you see your red as red and not yellow? Why does cold feel like cold and not warm? And more importantly, why is there any sensation at all? You can also google "p-zombies". In our modelling of neural networks and machines, everything works without "sensation" (or at least there is no proof of any sensation). We do not know what sensation/experience/whatever you like to call it is, and we have no way of modelling it. Surely there is something interesting in trying to understand how machines with similar neural structure store their information and comparing that to the human brain, but simply thinking of humans as "inputs (for instance vision) mapping to outputs (for instance movement)" does not at all deal with the problem mentioned above.
I think the hint lies in how different reality is when we diverge from what we are used to. Everything makes sense looking at our own scale, but when trying to understand smaller and bigger things (quantum physics, relativity, etc.) it suddenly seems like reality is very different from what we can make sense of. Reality is probably very different from what we think, and as humans, we are very limited, for example by our specific senses. Neural networks are built in a way that we understand each process, at least at a detailed level. We can technically track every decision it makes and understand it. This is clearly less complicated than the human brain which includes sensation, and probably requires a more advanced understanding of what reality actually is.
Love your videos! I've been familiar with Transformers and LLMs, but the notion of superposition in high-dimensional space was new to me. Thanks for the knowledge! Cheers!
That is an amazing video. And watching it makes me think a lot on how the "Akinator" most probably works by "asking questions" about the characters and how they live on that superposition space that can be only activated when the right sets of answers are given.
Umm I don't think that akinator even work on NLP. It's just large database which you can think of tree or decision tree( yes, no, neutral, probably yes, probably no). Where each branch has question and divides databse into smaller sub tree. But yes as compared to vector which encodes particular question it is doing same job!
Akinator works on much more simple principle, but it can also be expressed in terms of vectors. Each question is an orthogonal direction, 1 is Yes, 0 is no, .5 is unknown. As you answer the questions you populate the vector components, and the system picks the component that'll be the most valuable for cutting the space of possibilities. The data is simply stored in the database, no fancy neural networks needed.
@@pafnutiytheartist Exactly. Each answer to those questions can be correlated to the dot product of the NLP. But for me, the mistery is how they populate their database.
the video was really cool and the mathematical explanation is really good. I do have a semantic quibble though. The model doesn't have any clue that Michael Jordan plays basketball, it knows that sentences that include Michael Jordan, sport, and play all often include the word basketball, so a sentence that includes all 3 is very likely to include basketball. It's a subtle distinction, but I think it's important because it explains how a Large Language Model can "know" really common facts like that Michael Jordan plays basketball, and even what team he's on, but often mess up quantitative answers and spout well-structured nonsense when asked about something even moderately technical.
I think that’s a valid quibble. It’s very hard to talk about these models without over-philosophizing what we mean by words like “know”, and also without over-anthropomorphizing. Still, the impressive part is how flexible they can seem to be with associations learned, beyond for example what n-gram statistics could ever give, which raises the question of how exactly the associations get stored. The fact that they do well on the Winograd schema challenge, for example, raises a question of how language is represented internally in a manner that seems (at least to me) to much more deserve a description like “understanding” than previous models.
How well do LLMs perform with Winograd schema submitted after their training cutoff date? A problem of evaluating anything trained on ginormous datasets is ensuring the answers weren't included in the training. Many initially impressive AI results look less impressive when it is revealed that the researchers took less than adequate precautions to avoid cross contamination between training set and evaluation set. When a LLM cannot answer "If Tom Cruise's mother is Mary Lee Pfeiffer, who is Mary Lee Pfeiffer's son?" or only gets the correct answer when it is someone famous but fails with random made-up people, one does question what aspects of a fact has it really learned? Prof Subbarao Kambhampati points to several studies that show LLMs memorise without understanding or reasoning.
@@3blue1brown The fact that being a little flexible with orthogonality gives you exponentially more dimensions is really interesting and impressive. I had no idea that was possible, though it does make sense and it does raise a lot of interesting questions. I think the information storage is probably less like an encyclopedia and more like the world's most detailed dictionary because it stores correlations and relationships between individual tokens. After reading through some winograd schemas, I do think that they prove the model knows something and has some reasoning ability, but with the level of detail LLMs record, I think they can be answered by reasoning about how language is structured without requiring you to know the physical meaning or underlying logic of the sentence. Given how little of the human brain is devoted to language among other things, I don't think that has very much to do with how most humans store information or would solve Winograd Schemas internally, but it's definitely some kind of knowledge and reasoning, and how that fits into what makes something intelligent is more of a philosophical debate. At the level LLMs work at, all human languages seem to have an extremely similar structure in the embedding space, so I think the most exciting realistic application for LLMs once we understand them a little better is matching or exceeding the best human translators, and eventually decoding lost languages once we figure out how to make a pretrained LLM.
@@bornach The GPT-3 Paper goes into great detail how they try to avoid training data contamination like that, you can be sure they thought about that problem.
This is an excellent video! It offers the best intuitions on transformer architecture that I've seen. However, I'm curious about one aspect that wasn't covered: positional encoding. Specifically, I'm trying to understand why adding positional encoding doesn't disrupt the semantic clustering of initial words embeddings. How does the model maintain the integrity of these clusters while incorporating positional information? I'd love to see this explored in a future video
The fact that the number of nearly perpendicular vectors increases exponentially as dimension space increases is really interesting and also terriflying.
This is such a good series, thank you so much!!! Have been waiting at the edge of my seat since April and this video was definitely worth the wait! Thank you for such high-quality, rigorous yet still intuitive content!
How insane is that bit at the end with the increased capacity due to superposition and almost perpendicular vectors. Really cool, that you put this here.
I have been waiting for this video for a long time man!. Good Job and I hope to see the video explaining Training soon! Thank you so much for these :)!
The scaling of possible perpendicular vectors with the increase in dimension is realllllly mindblowing - especially the way you slowly lead us to discover it in this video. One thing I wondered tho, doesn't this scaling also have an adverse effect on the function of the network? Like, the netwok relies a lot on dot products being large when vectors align. But with higher dimensionality, "making" vectors align (during training I guess) should become increasingly difficult too. The way I understood it, two vectors would have to be suuuuper precisely defined to actually produce a somewhat large dot product, thus maybe making this operation kinda unstable ...
I actually tried to reach out to you to ask you to make a video about this. Not sure if it's just a coincidence but thank you this makes so much sense, you are amazing at demystifying this ideas. I'm just in shock in how well this video was made. Makes me want to go back and rewatch all the series, understand everything and come up with my own ideas
Neel Nanda is one of the people who thinks neural nets are black boxes (by default). Mechanistic interpretability is not solved, and I don't think any engineers are pretending this for any reason.
I think they're black boxes by default, we have not YET solved this problem (and may never do), but we're making real progress and I'm optimistic and would love to see more people pushing on this
great video! one note - about the distribution of angles between high dimensional vectors you were selecting their entries at random - which is equivalent of picking 2 random vectors uniformly inside a high dimensional cube and calculating the angle (normalization doesn't matter for the angle). my intuition here is that for a higher dimensional cubes there is more room next to its corners, and more corners, so a 90* is kinda likely. if instead, you select randomly 2 vectors of a high dimensional sphere, meaning you select the angles uniformly, you get a much broader distribution (still centered around 90 deg) you get something that looks kinda like that entropy graph of a coin flip (or a bit stretched sin graph) of course, it doesn't matter if your goal is to find as much as you want, and not just ask about the distribution of the angle
I am not gonna talk about the video, it’s obvious how it is !! I wish there is an Oscar for the Video makers on TH-cam, this Chanel would be definitely among the very top nominees !
"reasons" of neural networks will never be solved. Just as Stephen Wolfram said: "Asking 'why' a neural network does something is the same is asking 'why houses are made stones?' It's just because it's something that was available at the moment, lying around to be exploited in the computational universe"
Yes, we won’t be able to figure out the fine details for why a neural network does exactly what it does. However, we can get a big picture or a group of medium-sized pictures. It’s similar to studying the actual human brain and why we do what we do: it’s really difficult since we can’t really start from the smallest details and learn the biggest ones, but we can work our way down to a certain extent, making hypotheses and testing them to figure out how correct our understanding is. Just because the task of understanding models or brains as a whole all the way from neurons to behavior is for all intents and purposes impossible, that doesn’t mean we can’t understand certain aspects of it.
The walkthrough of code was so helpful in understanding the dimensionality point. Putting your code GPT ironically helped solidify my understanding of what you were saying. Thank you!
Great video! In case anybody is wondering how to count parameters of the Llama models, use the same math as in 16:47 but keep in mind that Llama has a third projection in its MLP, the 'Gate-projection', of the same size as the Up- or Down-projections.
We know how to create these networks, but we don't specifically know what combination of parameters of the network corresponds to a specific concept/output. We just feed it a bunch of data, then basically tell it to predict what might come after a specific textual sequence. This prediction is made based on many matrix multiplications and transformations. We then have a loss function, which grades the accuracy of the predictions. During training, basically compute the gradient of this loss function with respect to all parameters of the model. (the values in the matrices which we used to transform our input data) Because so many parameters (billions to trillions) are needed to make these predictions well, it's difficult to really know for sure which parameter(s) "corresponds" to an idea or concept. You could in theory, do all of this by hand without a computer, but it would take you an eternity. So we use computers, the consequence of that being we end up not knowing what our network is "thinking".
It's not just that we don't know what the weights end up representing. We don't know which of the dozens of ways a matrix or a vector can represent data the model is using at any part of the process.
This the reason for the often used "black box" ... You can train the network and understand it's performance... But what the inside of the weights do is the "black" unknown bit.
It also might explain the problem of "glitch tokens" in LLMs. A prompt could accidentally send a LLM into a direction for which the linear combination of superposition vectors it was trained on makes absolutely no sense.
@@bornach I imagine sometimes there are spurious tokens in pools of otherwise related tokens. If you give someone directions to the store, but you misinterpret one of the turns as right instead of left, you are going to end up in the wrong part of town. Humans, I imagine, would usually realize pretty quick something went wrong and would know they are lost. LLMs keep trucking along and vomit out whatever garbage happens to be near where they end up in the end.
Do you think your brain “stores facts” somewhere? Seems like more a matter of scale and architecture to me - LLMs aren’t the end game, they are the first glimpse of a new era.L of understanding for us humans
The quality of 3b1b has declined. The videos have the same pace (maybe more) despite an increase in the difficulty. Effectively decreasing the target viewerbase who'd enjoy the process. Basically I'm admitting that the content has progressed beyond me, even though earlier the videos were something I could understand. This stuff must be quite challenging.
This is state of the art, bleeding edge, and pretty recent advances, that even the brightest AI researchers didn't understand less than a decade ago. The fact that he can even explain it at all to the public this well is IMHO quite impressive. But yes, it is sometimes a good idea to watch such videos more than once and take a good night's sleep in between 🙂 I mean, no amount of billions of dollars and professors could get you this knowledge a decade ago. Just for perspective.
I am going to confess that these intriguing animated videos on the transformer mode from 3b1b, although pretty to watch, quickly go entirely over my head.
Fascinating stuff. I think the idea of polysemanticity is another fascinating way to explore how the LLM scaling can be so exponential, as more parameters are added the way they can be combined with other parameters expands combinatorially (even better than exponentially!)
This video series is so great. I really admire and appreciate you for helping to educate all of us about this very important subject. I would love to see you do a video about the similarities and differences between these LLMs, (or potential future LLM's) and the human brain.
wow, awesome as always! you can interpret bias + relu as the denoising part, which discards all information below a certain threshold and prevents the neuron from firing and passing information to the preceeding layers
Thank you so much for this wonderful video. I am going to watch this multiple times to keep refreshing these topics as long as transformers are the epitome architecture in NLP.
I may have learnt something new here. Grant is saying that the skip connection in the MLP is actually enabling the transformation of the original vector into another vector with enriched contextual meaning. Specifically, in 22:42, he is saying that via the summation of the skip connection, the MLP has somehow learnt the directional vector to be added onto the original vector "Michael Jordan", to produced a new output vector that adds "basketball" information. I was originally of the impression that skip connections are only to combat vanishing gradient, and expedite learning. But now Grant is emphasizing it is doing much more!
That Johnson-Lindenstrauss lemma blows the mind at first, but it actually makes a lot of intuitive sense. Between angles get higher and higher, and as the sphere packing shows, increasing dimensions lead to some seemingly weird stuff.
This is an amazing series on general LLM for language processing! Bravo to 3B1B! One thing that still puzzles me is that how the LLMs solve math/logic problems. The short answer I got from GPT 4o mini is that "As a conversational AI, the process of solving math problems involves several key principles and methodologies that differ from straightforward text prediction." I wish there 3B1B can look into that and enlighten me.
Exciting mark for the new academic year begin! A long-awaited sequel of this amazing story. I hope it will also be a doublet as two previous! Thank you so much for this collection of masterpieces!
The video is nice, well explained with nice examples, but the idea of superposition with nearly perpendicular vectors is AWESOME, it blew my mind! The most amazing part of the video is a lemma shown as a side note at the end of the vídeo, please explore more this idea of fitting exponential many near perpendicular vectors in a high dimensional space
Okay, the superposition bit blew my mind! The idea that you can fit so many perpendicular vectors in higher dimensional spaces is wild to even conceptualize, but so INSANELY useful, not just for embedding spaces, but also possibly for things like compression algorithms!
Thank you so much for this truly brilliant series!
You can use an attention head from a transformer as a compression algorithm.
Superposition reminds me of bloom filters
When I first read that Toward Monosemanticity paper, it was like a revelation. I think it's highly likely encoding information through superposition explains biological signaling models that have remained unexplained for decades. These concepts aren't just making waves in tech.
Right ? it's like if we could "stretch" the space to fit more informations, it's crazy !
But is it really that surprising? I'd actually be amazed if the internal representation would be highly structured. Asking where a LLM stores the fact that Michael Jordan is a basketball player is a bit like asking where your brain stores the value of your current age. That's all over the place and not a single "variable" that sits in a group of a few neurons.
This video is pure gold! This complex topic is just so clearly and correctly explained in this video! I will show this to all my students in AI-related classes. Very highly recommended for everyone wanting to understand AI!
How’d you get here so early?
how did you comment to the past?
@@thatonedynamitecuber Right? Video released less than a minute ago, this was commented an hour ago
That too it says he commented an hour ago 😮
Can you share with us some materials you teach ? 🙏
Still can't help but being blown away by the quality of pedagogy in these videos...
@@iloveblender8999Pedagogy is the structure or chronology of the education which tends to try to make sure all prerequisite knowledge is covered to a competent degree before the next step.
@@iloveblender8999I think the word was used correctly, but I love Blender too!!
Oh, oh, I can be more correcter than both of you! 'pedagogy' doesn't actually seem to be an extremely precisely defined word. But, based on a few seconds of search, both of you are QUITE wrong, and OP used the word correctly. Merriam-Webster on pedagogy: 'the art, science, or profession of teaching'. Wikipedia on pedagogy: 'most commonly understood as the approach to teaching, is the theory and practice of learning'. The word pedagogy itself indicates nothing about who is involved, what is taught, or how the teaching occurs.
@@christophkogler6220 “The approach to teaching … theory and practice of learning”
If you can’t rationalize someone discussing what is taught and when for best learning and teaching then don’t even google it.
@@christophkogler6220 "More corrector" tho?
During the whole video I was thinking "ok but it can only encode as many 'ideas' as there are rows in the embedding vector, so 'basketball' or 'Michael' seem oddly specific when we're limited to such a low number". When you went over the superposition idea everything clicked, it makes so much more sense now! Thank you so much for making these videos, Grant!
I was doing exactly the same thing in my head - i knew there had to be a catch that I couldn't see and then there it was, hiding right behind the 12288-dimensional teapot.
The number seems indeed low, but it is comparable to the number of "word" tokens in GPT3 vocabulary (50k), and 'basketball' or 'Michael' are ideas simple enough to possibly be represented by individual tokens. But of course the superposition trick allows to represent much more nuisanced and niche ideas.
I still don't get it 😭
@@beaucoupdecafeine Let me take you through a simple example. Let's say you have 8 outputs you want to encode for. {dog, cat, table, chair, bucket, mouse, eagle, sock}. A first idea is to use eight numbers (a vector of size 8) and tie each element in the vector to a certain output. For example 10000000 = dog, 01000000 = cat, 00100000 = table, ...
This is called one-hot encoding. But let's say you are limited to an output vector of size 3, so like xxx where each x can be a number. Can you still encode all 8 outputs? Yes you can. You can use every combination of 0 and 1. For example 000 = dog, 001 = cat, 010 = table, 011 = chair, ...
But now instead of 8 outputs, you have 500. Can you still encode 500 outputs using a vector of size 3? Yes. Just use real numbers instead of binary. For example (0.034, 0.883, -0.403) = dog, (0.913, -0.311, 0.015) = cat, ... (0.664, -0.323, -0.844) = baseball.
As far as I know, Large Language Models (as well as most other Natural Language Processing models) use a fixed vocabulary. So an LLM may have a vocabulary size of 32,000 words which each map to a unique vector.
Yes, and the dimensions he used were direclty aligned with 'Michael' and 'Jordan' - this wouldn't really be the case, as it would be an inefficient use of weights. Michael would instead be a combination of the ~12,300 feature dimensions.
The script you ran with randomly distributed vectors was mind-opening, let alone once tuned - that's incredible. It's such an awesome quirk of high dimensions. I spent a good chunk of yesterday (should have spent a good chunk of today but oh well) working on animations to try to communicate traversing a high dimensional configuration space and why gradient ascent really sucks for one particular problem, so the whole topic couldn't be more top-of-mind. (my script already contains a plug for your previous video with the "directions have meanings" foundation. this series is so good!)
You know when alpha phoenix upvotes it’s good stuff
This is simply the most comprehensible explanation of transformers anywhere. Both the script and the visuals are fantastic. I learned a lot. Thank you so much.
Broo, just watched chapter 4 and re-watched the previous three chapters. Your first three videos had just dropped when I was learning about neural networks in grad school, like perfect timing. Took a couple of years off drifting around. Now, I'm going back into machine learning, hopefully gonna do a PhD, so I was re-watching the video series, then realized this one was rather new and got curious, noticed the last chapter 7 is missing, then check your home page and lo and BEHOLD you've released chapter 7 like 53 minutes ago. Talk about impeccable timing. I feel like you just dropped these just for me to go into machine learning haha... kinda like "welcome back, let's continue". Anyway thank you so much for taking me on this wonderful journey.
Perfect timing! I'm studying it as well and am pleasantly surprised at the incredibly convenient uploads. Good luck on your PhD Rigel!
That's how universe attracts curious people. There was a task given for me to give a lecture on some ML topic around 4 months ago. Out of instant gratification i have chosen to speak about GPT architecture. Literally i was scared intially and was doing some research and guess what just after 2 to 3 days, this man out of nowhere suddenly dropped a bomb of starting a series on transformers. I was so happy at that time and it helped me doing good amount of research amd seminar also went well.....
The near-perpendicular embedding is wild! Reminds me of the ball in 10 dimensions.
Incredible that it's exponential!
It is astonishing
I know this is such a late response but I wanted to comment on how amazingly human this process is. Whenever we hear or read words, our brain immediately starts to guess the meaning by using the very same process mentioned in this video series. For example, I'm sure many of you have seen the pangram "The quick brown fox jumps over the lazy dog". Now if you come across a sentence that starts with "The quick" your brain might come up with many different ideas but the pangram would be included. As soon as you interpret the word after "brown", then your chances for your brain to guess the pangram goes up. I believe the same is true for thinking of "solutions" of words to output also.
The pangram that you cite was a typical repetitive TYPING DRILL, back when my high school (in the 1980s) taught typing on IBM Selectric typewriters.
@@JohnBerry-q1h It has been a handwriting drill for many, many decades, too, and remains so even in the present day. Specific typing drills are not particularly useful, since they result in people exhibiting muscle memory rather than focusing on typing. For example, I can type my five-word computer password extremely quickly just from memory, because it's a simple repetitive task that I do many times a day - sometimes I even briefly forget the actual words that make it up, since I'm subconsciously recalling finger movements, not thinking about the words themselves - but my typing in general is less rapid than that.
What's even more crazy is when you look at the studies that have been done on people with a severed corpus collosum. The experiments that were performed suggested that the human brain consists of many "modules" that do not have any kind of consciousness themselves, but are highly specialized at interpretation and representation. It seemed like the way "consciousness" works in humans is that the conclusions of these different "modules" within the brain all arrive at the decision-making or language parts of the brain, and those parts work VERY similarly to these LLMs: the generate a thought process or set of words that explains the input from the modules, and that's it.
For example, one module might identify something as dangerous, and another module might identify something as being emotionally significant. Those conclusions, along with many, many other "layers", arrive at the language parts of the brain, and that part of the brain creates a "story" that explains the sum of the "layer" conclusions. Is this the only way to interpret the results of the experiment? Of course not, and it's not they way they were interpreted originally when first performed. But we also hadn't really invented any AI at that point either.
The way that these models represent information and process data seems to me to be MUCH closer to how human brains work than I think most people realize. The human brain probably has at least one type of layer (such as attention or MLP) that is not currently represented in modern AIs, and is also even more parallel, allowing for the possibility of "later" MLP layers or attention layers to cross-activate other layers while they are happening.
Animation. 5/5
Didactic 5/5
Pedagogic 5/5
Voice. 5/5
Knowlege 5/5
Uniqness 4/5
Just beatiful work ❤❤❤ keep it up.. I will send this to eveyone that appreciate the work.
I watched the end of this series now, and I'm just blown away by the maths of it all. Like how simply each step is, yet how it all adds up to something so complex and WEIRD to wrap your head around. It's just so fucking fascinating man. Really makes you think about what WE even ARE.
The combination of Linear + RELU + Linear function, and adding the result to the original, is known as Residual Networks. As 3b1b demonstrated in this video, the advantage Residual Networks have against a simple perceptron network is that the layers perturb (nudge) the input vector, rather than replace it completely.
Thank you so much! This makes sense
Thanks for the pointer! I remember MLP from 10 years back and I couldn't recall the "adding part". Btw. I'm also puzzled by a different thing -- we call it MLP, i.e., *multilayer* perceptron. But it seems that in a single MLP block there is only one layer. It's called multilayer, because it's replicated. I.e., it's N layers, each consisting of "(multi-head) attention" + "1 layer of residual perceptrons". Is my understanding correct? Do you know whether there are deep nets that actually use multiple layers in a single block? Why or why not would they do it?
@@tymczasowy Sorry I may be completely off on this, but would those be dense nets?
@@fluffsquirrel I'm not sure I understand your question. AFAIU all these are dense networks. My question was that it seems from the video that in each layer there is a single "sub-layer" of attention followed by a single sub-layer of residual perceptrons. So, the perceptrons are never stacked directly after each other (there's always attention between them). I am wondering whether it would be beneficial to have more consecutive sub-layers of perceptrons (after each attention). Are there deep nets with such architecture? If not, is it because it's *impractical* (too hard to train, too many parameters), or maybe it doesn't really make sense *conceptually* (there is nothing to gain)?
@@tymczasowy I apologize as I am not well-versed in this subject, but I too am interested in this idea. Hopefully somebody will come along with an answer. Good luck!
I got two relevant references that help answer this question with great examples:
1. Locating and Editing Factual Associations in GPT (2022)
Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov
2. Analyzing Transformers in Embedding Space (2022)
Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant
Yeah i remember the first paper was even on YT with Yannic Kilcher interviewing original authors. btw have you guys seen that write up "What’s Really Going On in Machine Learning?" which draws parallels between cellular automata and highly quantized MLPs. very nice read.
@@naninano8813 the Yannick Interview was great. I have also been fortune enough to have Yonathan Belinkov visit my university and give a guest lecture on 3 adjacent papers which I had quite a few questions on to discuss.
I have been re-watching the Superposition hypothesis for the last 15 minutes. It still blows my mind.
Grant, your work is so beautiful, to me you are the Da Vinci of our time.
There are no words for how good this is and all of the 3Blue1Brown videos are. Thank you, Grant Sanderson, you are providing a uniquely amazing service to humanity.
The way you have presented the mathematics on youtube, no one has done before. Everyone just tells fancy words and you give the essence how they work. Great video Grant, waiting for more videos from you😊
Decided on a whim last night to get a refresher on the previous 2 videos in the series. Beautiful timing; great work as usual
Every time I rewatch this series, I feel an irresistible urge to give it another thumbs-up. It's a shame that's not possible!
Always happy to see 3B1B uploading
I've been reloading your page everyday since the first in the LLM series, seeing this notification was like Christmas come early
AI researcher here, really like the explanation of ReLU! My recent experiments show that dropping all absolute gate values below a threshold leads to universal performance gains across different model architectures!
I bow my head to your ability to explain complex topics in a didactically excellent way without leaving out crucial details.
Thank you very much for yet another wonderful explanation! While focussing on Training feels reasonable, I would also love to learn more on positional encodings. The sinus used in the original paper and the more recent versions would surely make for some interesting visualizations - by simply reading through the papers I am lacking the intuition for this.
I'll consider it for sure. There are many other explainers of positional encoding out there, and right this moment I have an itch for a number of non ML videos that have been on my list.
@@3blue1brown Thanks for considering :) I totally understand that there are non ML videos to be made - do whatever feels right and I'll enjoy it nonetheless
Honestly this guy has taught me more than my school did. Loads lof love and respect.
You have explained this topic so well that it almost looks trivial. Amazing.
Just watched the whole series, took me two days but it was so insightful and easy to understand compared to other sources. Thank you! Those animations and the approach you take to explain is so helpful!
hey Grant im sure i cant understand everything from this series so im skipping this video, but the purpose of this comment is thank you for creating manim python library because you started(i would say) "animation for education" to entirely different level and encourage many people to do that in your style of animation using manim, because of you indirectly im learning many things in youtube, thanks again and i wish you have more success in your carrier with your loved ones
I didn`t watch the other videos (yet) but I could totally pretend to understand a lot of the things in this video, which was mind blowing. Try it out, you can always come back later to try and understand more.
Nah, I am certain you can understand this. If you know Python (optional) some calculus, and linear algebra (kind of optional), you're good. It's just overwhelming at first.
The hard part for me really is trying to understand wtf researchers are talking about in papers. Can't tell if researchers are pedantic, or I'm just too dumb, or both.
@@Dom-zy1qymath is not optional here dude I agree that if you just want to learn the theory python is optional but saying calculus and linear algebra are kinda optional that’s bull shit ai is a mathematical field
As for papers yes they can be overwhelming but reading a academic paper isn’t a quick process you need to read paragraphs repeatedly to understand them properly
I love how this series explains the topic without riding the wave of hype. Stay that way!
Grant does so good of a job explaining these in interesting manner - I bet 3b1b has measurable impact on a whole humanity's grasp of Math at this point.
So glad you touched on interpretability. Anthropics towards monosemanticity paper is one of the most intriguing this year. Using sparse autoencoders to extract monosemantic neurons is just genius.
This is why I pay for internet
When I was giving a talk last year on how transformers worked... I envisioned something like this video in my mind. But of course it was amateur hour in comparison, both visually and in explanation. You are a true professional at this, Grant.
FINALLYY the video we all waited for!
fr!!!
This AND training's gonna be next! Who's excited for that too?!
Your motivation has been the fuel I needed to push forward. I can't express my gratitude enough.
Fantastic explanation. Also, am I the only one who appreciates his correct use of "me" and "I"? So rare these days.
"locating and editing factual associations in GPT" is a fun read and an important prior work for this. thanks for posting!
22:10 Holograms coming! :D
maybe even a crossover with thought emporium?
Unbelievable this very clear of explanation for something that "very hard" is exist.
very excited to wait the next chapter!!
I was waiting for this video for so long. Thanks for this!
Man, don't stop making these videos. They are pure gold.
18:28 this part is really cool!
It is one of the most striking moments in my life knowing that the seemingly pure mathematical idea of loosening the notion of "perpendicular" a bit, results in the reason of the efficiency of information in how the universe is made and perceived. Life is awe.
Huge thanks for the work you are doing! You are one of the few channels that explain LLMs that well. I also like Andrej Karpathy's videos but yours are more for building intuition, which is also great and super helpful! I'm very curious though what was that thing with the green glass holograms
Your video’s are simply mind blowing… the effectiveness with which you succeed to make us ‘visualise’ the mechanics of AI is truly unique !
Keep on the good work 👍
"How facts are stored remains unsolved"
I'm only a minute in but it's kinda wild to think we have invented machines so complex that we can't just know exactly how they work anymore, we basically have to do studies on them the same way you would do a study on a human brain which won't lead to a full understanding, just a partial understanding.
Maybe some problems are so complex we can never truly know the complete inner workings of everything. Maybe we are hitting into the limit of human cognition itself. A human being simply can't keep all the facts straight in their head all at once to be able to come to a complete understanding of something this complex.
I don't know, just rambling...
It's just wild that we invented something that works, but if you ask HOW it works, that's literally an unsolved problem. We flat out don't know exactly how it works.
How little do we really understand? Is it naive to assume that we can start to develop more algorithms whose purpose is to analyze the behavior of neural networks? Surely there's some way to map a concept's embedding in a perceptron in some sort of rigorous way.
But of course, neural nets are a loose imitation of our own brain. Since we don't know where facts are stored in those, either, it's not unintuitive that that we wouldn't know where they are stored in things that operate similarly. You said that we've invented something, and that's only true to a point. It seems to me that we're doing is as much "discovering" as we are "inventing".
They do that in medicine all the time.
The "mystery" surrounding neural networks is VERY different from that of the brain. Sure, having a high-level understanding of how it structures the different weights is very headache-inducing, but it is not a mystery. We still know exactly everything going on conceptually, and grasping how information is stored, etc. is not a mystery in my opinion, even though it's complicated.
Understanding the human brain on the other hand is much more complicated, because for every person, things are "experienced". For instance, there is no mystery in following the process going from photons hitting your retina, to the signal reaching the visual center of your brain. However, how that visual center somehow makes you "see" something is where the mystery is. Questions you can ask yourself to understand this are: Why do you see your red as red and not yellow? Why does cold feel like cold and not warm? And more importantly, why is there any sensation at all? You can also google "p-zombies".
In our modelling of neural networks and machines, everything works without "sensation" (or at least there is no proof of any sensation). We do not know what sensation/experience/whatever you like to call it is, and we have no way of modelling it. Surely there is something interesting in trying to understand how machines with similar neural structure store their information and comparing that to the human brain, but simply thinking of humans as "inputs (for instance vision) mapping to outputs (for instance movement)" does not at all deal with the problem mentioned above.
I think the hint lies in how different reality is when we diverge from what we are used to. Everything makes sense looking at our own scale, but when trying to understand smaller and bigger things (quantum physics, relativity, etc.) it suddenly seems like reality is very different from what we can make sense of. Reality is probably very different from what we think, and as humans, we are very limited, for example by our specific senses. Neural networks are built in a way that we understand each process, at least at a detailed level. We can technically track every decision it makes and understand it. This is clearly less complicated than the human brain which includes sensation, and probably requires a more advanced understanding of what reality actually is.
Love your videos! I've been familiar with Transformers and LLMs, but the notion of superposition in high-dimensional space was new to me. Thanks for the knowledge! Cheers!
That is an amazing video.
And watching it makes me think a lot on how the "Akinator" most probably works by "asking questions" about the characters and how they live on that superposition space that can be only activated when the right sets of answers are given.
Umm I don't think that akinator even work on NLP. It's just large database which you can think of tree or decision tree( yes, no, neutral, probably yes, probably no). Where each branch has question and divides databse into smaller sub tree. But yes as compared to vector which encodes particular question it is doing same job!
Akinator works on much more simple principle, but it can also be expressed in terms of vectors.
Each question is an orthogonal direction, 1 is Yes, 0 is no, .5 is unknown. As you answer the questions you populate the vector components, and the system picks the component that'll be the most valuable for cutting the space of possibilities. The data is simply stored in the database, no fancy neural networks needed.
@@pafnutiytheartist Exactly. Each answer to those questions can be correlated to the dot product of the NLP.
But for me, the mistery is how they populate their database.
This is by far the best explanation I've seen on this subject, very good job! Thank you for uploading this.
15:11 ITS REAL
Can we get much higher
THE ONE PIECE IS REAL
This series has been really helpful in increasing my understanding on what LLMs actaually are. Thank you so much for making this.
the video was really cool and the mathematical explanation is really good.
I do have a semantic quibble though. The model doesn't have any clue that Michael Jordan plays basketball, it knows that sentences that include Michael Jordan, sport, and play all often include the word basketball, so a sentence that includes all 3 is very likely to include basketball.
It's a subtle distinction, but I think it's important because it explains how a Large Language Model can "know" really common facts like that Michael Jordan plays basketball, and even what team he's on, but often mess up quantitative answers and spout well-structured nonsense when asked about something even moderately technical.
I think that’s a valid quibble. It’s very hard to talk about these models without over-philosophizing what we mean by words like “know”, and also without over-anthropomorphizing. Still, the impressive part is how flexible they can seem to be with associations learned, beyond for example what n-gram statistics could ever give, which raises the question of how exactly the associations get stored.
The fact that they do well on the Winograd schema challenge, for example, raises a question of how language is represented internally in a manner that seems (at least to me) to much more deserve a description like “understanding” than previous models.
How well do LLMs perform with Winograd schema submitted after their training cutoff date? A problem of evaluating anything trained on ginormous datasets is ensuring the answers weren't included in the training. Many initially impressive AI results look less impressive when it is revealed that the researchers took less than adequate precautions to avoid cross contamination between training set and evaluation set.
When a LLM cannot answer "If Tom Cruise's mother is Mary Lee Pfeiffer, who is Mary Lee Pfeiffer's son?" or only gets the correct answer when it is someone famous but fails with random made-up people, one does question what aspects of a fact has it really learned?
Prof Subbarao Kambhampati points to several studies that show LLMs memorise without understanding or reasoning.
@@3blue1brown The fact that being a little flexible with orthogonality gives you exponentially more dimensions is really interesting and impressive. I had no idea that was possible, though it does make sense and it does raise a lot of interesting questions.
I think the information storage is probably less like an encyclopedia and more like the world's most detailed dictionary because it stores correlations and relationships between individual tokens. After reading through some winograd schemas, I do think that they prove the model knows something and has some reasoning ability, but with the level of detail LLMs record, I think they can be answered by reasoning about how language is structured without requiring you to know the physical meaning or underlying logic of the sentence.
Given how little of the human brain is devoted to language among other things, I don't think that has very much to do with how most humans store information or would solve Winograd Schemas internally, but it's definitely some kind of knowledge and reasoning, and how that fits into what makes something intelligent is more of a philosophical debate.
At the level LLMs work at, all human languages seem to have an extremely similar structure in the embedding space, so I think the most exciting realistic application for LLMs once we understand them a little better is matching or exceeding the best human translators, and eventually decoding lost languages once we figure out how to make a pretrained LLM.
@@bornach The GPT-3 Paper goes into great detail how they try to avoid training data contamination like that, you can be sure they thought about that problem.
You are seriously amazing at breaking this stuff down
This is an excellent video! It offers the best intuitions on transformer architecture that I've seen. However, I'm curious about one aspect that wasn't covered: positional encoding. Specifically, I'm trying to understand why adding positional encoding doesn't disrupt the semantic clustering of initial words embeddings. How does the model maintain the integrity of these clusters while incorporating positional information? I'd love to see this explored in a future video
You know what, after watching your video allow me to show off with my engineer collegues, they are amazed by my knowledge! Thanks!
The fact that the number of nearly perpendicular vectors increases exponentially as dimension space increases is really interesting and also terriflying.
Yep. Combining this with the scaling laws of computational power makes it scary to think about the capabilities of machine learning in the future.
This is such a good series, thank you so much!!! Have been waiting at the edge of my seat since April and this video was definitely worth the wait! Thank you for such high-quality, rigorous yet still intuitive content!
Babe wake up 3b1b uploaded!
How insane is that bit at the end with the increased capacity due to superposition and almost perpendicular vectors. Really cool, that you put this here.
I have been waiting for this video for a long time man!. Good Job and I hope to see the video explaining Training soon! Thank you so much for these :)!
The scaling of possible perpendicular vectors with the increase in dimension is realllllly mindblowing - especially the way you slowly lead us to discover it in this video. One thing I wondered tho, doesn't this scaling also have an adverse effect on the function of the network? Like, the netwok relies a lot on dot products being large when vectors align. But with higher dimensionality, "making" vectors align (during training I guess) should become increasingly difficult too. The way I understood it, two vectors would have to be suuuuper precisely defined to actually produce a somewhat large dot product, thus maybe making this operation kinda unstable ...
wake up babe, new 3b1b video dropped that will completely change your view on math and science and will provide unparalleled intuition for the same
I actually tried to reach out to you to ask you to make a video about this. Not sure if it's just a coincidence but thank you this makes so much sense, you are amazing at demystifying this ideas.
I'm just in shock in how well this video was made. Makes me want to go back and rewatch all the series, understand everything and come up with my own ideas
Neil Nanda's mechanistic interpretability gives me hope that these models aren't "black boxes" as many engineers pretend they are
Neel Nanda is one of the people who thinks neural nets are black boxes (by default). Mechanistic interpretability is not solved, and I don't think any engineers are pretending this for any reason.
I think they're black boxes by default, we have not YET solved this problem (and may never do), but we're making real progress and I'm optimistic and would love to see more people pushing on this
corrected by the man himself
@@neelnanda2469 hi neel, i remember your super clear complex analysis revision lecture from uni. really happy to see you're making waves out there
great video!
one note - about the distribution of angles between high dimensional vectors
you were selecting their entries at random - which is equivalent of picking 2 random vectors uniformly inside a high dimensional cube and calculating the angle (normalization doesn't matter for the angle). my intuition here is that for a higher dimensional cubes there is more room next to its corners, and more corners, so a 90* is kinda likely.
if instead, you select randomly 2 vectors of a high dimensional sphere, meaning you select the angles uniformly, you get a much broader distribution (still centered around 90 deg)
you get something that looks kinda like that entropy graph of a coin flip (or a bit stretched sin graph)
of course, it doesn't matter if your goal is to find as much as you want, and not just ask about the distribution of the angle
Facts.
your work is incredible and groundbreaking teaching. I didn't think I'd understand such complex concepts in a single video
Weird... my gpt is hallucinating that MJ played golf.
I always look forward to your uploads. Your content is consistently engaging and well-produced. Great job!
4:10 I wonder, what word sits in the center? What is [0, 0, 0, ..., 0] ?
Philosophy
I am not gonna talk about the video, it’s obvious how it is !! I wish there is an Oscar for the Video makers on TH-cam, this Chanel would be definitely among the very top nominees !
"reasons" of neural networks will never be solved. Just as Stephen Wolfram said: "Asking 'why' a neural network does something is the same is asking 'why houses are made stones?' It's just because it's something that was available at the moment, lying around to be exploited in the computational universe"
Yes, we won’t be able to figure out the fine details for why a neural network does exactly what it does. However, we can get a big picture or a group of medium-sized pictures. It’s similar to studying the actual human brain and why we do what we do: it’s really difficult since we can’t really start from the smallest details and learn the biggest ones, but we can work our way down to a certain extent, making hypotheses and testing them to figure out how correct our understanding is. Just because the task of understanding models or brains as a whole all the way from neurons to behavior is for all intents and purposes impossible, that doesn’t mean we can’t understand certain aspects of it.
The walkthrough of code was so helpful in understanding the dimensionality point. Putting your code GPT ironically helped solidify my understanding of what you were saying. Thank you!
Others: How LLMs work?
3b1b: How LLMs works work?
dig down the work hole
Great video! In case anybody is wondering how to count parameters of the Llama models, use the same math as in 16:47 but keep in mind that Llama has a third projection in its MLP, the 'Gate-projection', of the same size as the Up- or Down-projections.
0:45 Wait we don't actually know how it works fully?
We know how to create these networks, but we don't specifically know what combination of parameters of the network corresponds to a specific concept/output.
We just feed it a bunch of data, then basically tell it to predict what might come after a specific textual sequence. This prediction is made based on many matrix multiplications and transformations. We then have a loss function, which grades the accuracy of the predictions. During training, basically compute the gradient of this loss function with respect to all parameters of the model. (the values in the matrices which we used to transform our input data)
Because so many parameters (billions to trillions) are needed to make these predictions well, it's difficult to really know for sure which parameter(s) "corresponds" to an idea or concept.
You could in theory, do all of this by hand without a computer, but it would take you an eternity. So we use computers, the consequence of that being we end up not knowing what our network is "thinking".
@@Dom-zy1qy thanks for the clarification 🙏🏻🙏🏻
It's not just that we don't know what the weights end up representing. We don't know which of the dozens of ways a matrix or a vector can represent data the model is using at any part of the process.
This the reason for the often used "black box" ... You can train the network and understand it's performance... But what the inside of the weights do is the "black" unknown bit.
These videos are so good. You are explaining these difficult concept in an understandable way. Thank you so much for these videos!
This is precisely why hallucinations are such an issue with LLMs. They don't actually store facts, so hallucinations and facts aren't distinguishable.
It also might explain the problem of "glitch tokens" in LLMs. A prompt could accidentally send a LLM into a direction for which the linear combination of superposition vectors it was trained on makes absolutely no sense.
@@bornach I imagine sometimes there are spurious tokens in pools of otherwise related tokens.
If you give someone directions to the store, but you misinterpret one of the turns as right instead of left, you are going to end up in the wrong part of town. Humans, I imagine, would usually realize pretty quick something went wrong and would know they are lost.
LLMs keep trucking along and vomit out whatever garbage happens to be near where they end up in the end.
Do you think your brain “stores facts” somewhere? Seems like more a matter of scale and architecture to me - LLMs aren’t the end game, they are the first glimpse of a new era.L of understanding for us humans
Every ep contains new key insights and one mind blower! Love it.
Why'd they have to name it MLP
This could be me just living in a bubble, but why wouldn't they name it MLP?
@@flakmoppenMy Little Pony
@@noaht2 ah. Gotcha.
they were invented ~20 years before my little pony existed (1960s vs 1980s)
One of very few channels where I press like before watching the video. Also not a single video is missed. Amount of trust credit is immense. :)
The quality of 3b1b has declined. The videos have the same pace (maybe more) despite an increase in the difficulty. Effectively decreasing the target viewerbase who'd enjoy the process.
Basically I'm admitting that the content has progressed beyond me, even though earlier the videos were something I could understand.
This stuff must be quite challenging.
This is state of the art, bleeding edge, and pretty recent advances, that even the brightest AI researchers didn't understand less than a decade ago. The fact that he can even explain it at all to the public this well is IMHO quite impressive. But yes, it is sometimes a good idea to watch such videos more than once and take a good night's sleep in between 🙂 I mean, no amount of billions of dollars and professors could get you this knowledge a decade ago. Just for perspective.
@@Gabriel-tp3vc I appreciate it, no doubt.
Invaluable resource shared in this video. Therefore requiring an infinite number of thank you !
I am going to confess that these intriguing animated videos on the transformer mode from 3b1b, although pretty to watch, quickly go entirely over my head.
Fascinating stuff. I think the idea of polysemanticity is another fascinating way to explore how the LLM scaling can be so exponential, as more parameters are added the way they can be combined with other parameters expands combinatorially (even better than exponentially!)
This video series is so great. I really admire and appreciate you for helping to educate all of us about this very important subject. I would love to see you do a video about the similarities and differences between these LLMs, (or potential future LLM's) and the human brain.
This video is a Must for everyone that wants to understand in a better way how "magic" really works behind llms
wow, awesome as always!
you can interpret bias + relu as the denoising part, which discards all information below a certain threshold and prevents the neuron from firing and passing information to the preceeding layers
Thanks for all of these videos, great work!
The most significant thing for me in this video is how the knowledge grows roughly exponentially with the dimension, that's huge literally
All videos here are pure gold 🎉🎉
Thank you for making these very complex topics accesible
Bro!! you are a great teacher! Thank you very much. Looking forward to next chapter.
Thank you so much for this wonderful video. I am going to watch this multiple times to keep refreshing these topics as long as transformers are the epitome architecture in NLP.
So glad to have another vid from 3B1B on this topic, the more I understand about AI the more awestruck I am by it.
I may have learnt something new here. Grant is saying that the skip connection in the MLP is actually enabling the transformation of the original vector into another vector with enriched contextual meaning. Specifically, in 22:42, he is saying that via the summation of the skip connection, the MLP has somehow learnt the directional vector to be added onto the original vector "Michael Jordan", to produced a new output vector that adds "basketball" information. I was originally of the impression that skip connections are only to combat vanishing gradient, and expedite learning. But now Grant is emphasizing it is doing much more!
That Johnson-Lindenstrauss lemma blows the mind at first, but it actually makes a lot of intuitive sense. Between angles get higher and higher, and as the sphere packing shows, increasing dimensions lead to some seemingly weird stuff.
Amazing series! Please continue it! Many more topics in transformers to explore.
This is an amazing series on general LLM for language processing! Bravo to 3B1B! One thing that still puzzles me is that how the LLMs solve math/logic problems. The short answer I got from GPT 4o mini is that "As a conversational AI, the process of solving math problems involves several key principles and methodologies that differ from straightforward text prediction." I wish there 3B1B can look into that and enlighten me.
Exciting mark for the new academic year begin! A long-awaited sequel of this amazing story. I hope it will also be a doublet as two previous!
Thank you so much for this collection of masterpieces!
The video is nice, well explained with nice examples, but the idea of superposition with nearly perpendicular vectors is AWESOME, it blew my mind!
The most amazing part of the video is a lemma shown as a side note at the end of the vídeo, please explore more this idea of fitting exponential many near perpendicular vectors in a high dimensional space