Great explanation! In short, an attention block is a system which allows every token of an input to update the embedding of a single token to better represent its meaning within the sequence.
Solid recap!
Attention lets each token capture the meaning of other tokens.
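To make that recap concrete, here's a toy sketch in plain NumPy. The embeddings and attention weights below are made-up numbers, not taken from any real model - it only shows the core move: the embedding of "light" gets pulled toward a weighted mix of the other tokens' embeddings.

```python
import numpy as np

# Toy 5-dimensional embeddings for "light as a feather" (hypothetical values)
tokens = ["light", "as", "a", "feather"]
E = np.array([
    [0.9, 0.1, 0.5, 0.2, 0.4],   # light
    [0.0, 0.0, 0.1, 0.0, 0.1],   # as
    [0.0, 0.1, 0.0, 0.1, 0.0],   # a
    [0.1, 0.9, 0.2, 0.8, 0.3],   # feather
])

# Hypothetical attention weights: how much "light" cares about each token (sums to 1)
weights = np.array([0.50, 0.05, 0.05, 0.40])

# The updated embedding of "light" is a weighted mix of every token's embedding
light_updated = weights @ E
print(light_updated)
```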
The number lines (to visualize higher dimensions) help greatly. I never thought of it like that and was struggling with dimensions greater than three!
Glad it helped! :)
This really describes embeddings more than attention itself.
Fair point - I think understanding embeddings is key to understanding self-attention. But the scope of this video was more about the conceptual understanding of what attention means than about the implementation.
Thank you for watching :)
Is there anything additional you would have liked to see?
@@BitsOfChris Well, it would have been useful if it explained attention instead of just saying it exists.
I totally agree, but sometimes it's helpful not to overwhelm with too much information. For this video I deliberately chose to keep the scope small.
I tried going deeper on attention in this video - I hope this one helps: th-cam.com/video/FepOyFtYQ6I/w-d-xo.html
I can relate to this so much. Adding a few words to the prompt to specify the instructions helps reliability enormously, especially for smaller models like 4o-mini, which don't seem to do a lot of "thinking".
Agreed - it's fun to try different variations of the same prompt in multiple windows to test how much a key word in your prompt can influence the response.
great intuitive explanation
Excellent explanation. Appreciate the effort!
Thanks, happy to hear :)
Thank god! I found your channel, the best explanation I've ever seen
Really happy you found it too! Thanks :)
Thanks! I've often struggled to explain attention to others; this is a really good explanation that I'm totally going to use in my own conversations.
@@BenLebovitz appreciate that, glad it helped!
New sub just for how good this explanation is
@@jaysonp9426 appreciate that, thank you for the kind words!
That was very helpful, thanks! I recently had to learn more about AI for my job, and even though I'm generally somewhat informed, I lack a lot of depth in understanding AI at this point. This example paints a clear picture of how LLMs think, very cool!
Happy to hear it helped! I'm in a similar boat: as a data engineer now supporting an AI research team, this world of AI is new to me too.
I think the channel here is mainly me (as an engineer) sharing what I learn that's helpful for other engineers needing to learn AI.
Thanks for the comment :)
really amazing explanation and visualization! thank you! subbed! :)
Thank you for the kind words :)
Bro this is the best explanation I've ever seen. Thank you!
Thank you for the kind feedback, glad it helped :)
Chris, this is super cool. Keep up the good work. Conceptual things have tremendous impact on new learners.
Thanks- appreciate hearing that. I'm with you- there's so much to learn but it's hard to tie the deep dives to something you can grasp.
Glad this was helpful :)
Very nice. Think this is the freaking first time I understand it.
Sometimes it just takes seeing something through a different lens, happy it helped!
That's a great perspective. Thank You.
Thanks for the kind words, happy it helped :)
@@BitsOfChris I am working on scaling Ai personal memories, so your slide scaling_factoring in my nowhere Nexus development really optimize the performance and output based on necessary settings for each self.dimensions memories embedded into subdimensions function generateLabels. Duel dialogue associated with each iteration in conversation to convert into recursive fractal dimensions into appropriate layers. later I had my first results with models. Was very good 👍
Thanks for this explanation
It helped me get a better understanding of what is happening inside the model than the usual example with 2D vectors.
Very happy to hear- exactly what I was trying to accomplish!
Clear and short explanation, nice one! I already know that if you publish videos consistently this channel will become very popular; I've seen many examples like this. Keep up the great work 😉
@@WiseMan_1 I don’t know who you are, but this is such a kind thing to share. Thank you for making my day :)
I’ve been writing for as long as I can remember. 12 months ago was the first time I put my writing on social media, and it was scary.
But after a year of writing, podcasting, making videos, and exploring myself to find my version of work that feels like play - I think I found it.
Thank you for the confidence to keep going :)
@@BitsOfChris I’m happy that my comment made your day :) Your journey is inspiring. Keep doing what you love! Looking forward to more great videos.
This was super intuitive!! The ability to focus on 5 dimensions and walk through a word like light, which has multiple meanings that only become clear based on context, was very helpful. Do you have a Twitter account or Substack I can follow?
Appreciate that, glad it helped :)
I'm mainly focused on TH-cam right now but I use bitsofchris.com Substack as my "hub" for all things.
This is an excellent insight into how to explain the high dimensionality of AI models.
Thank you :)
Great approach, and style of teaching.
@@Normi19 appreciate that, thank you!
It’s so helpful to get feedback (and motivating).
I’m curious what about my style most appealed to you? I’m too close to know myself sometimes lol
@@BitsOfChris the choice of the word “light” makes the example layered and shows the nuance of multidimensionality. The stacks of single scales make the point very clear, especially when you walk the audience through the possible changes based on context.
@ thank you for that :)
Ahh, this is good. It's so good.
Happy to hear, thanks :)
Great explanation, just subbed 😎
@@neokirito appreciate that, thank you for your support :)
Conceptual and insightful. I wish you could prepare another video on how embeddings, dimensions, and parameters interact.
@@emmanuelmwape4560 thanks for the feedback and suggestion. I’m working on a more in depth video that includes code examples. Hopefully this will help :)
AI: The silent force behind incredible breakthroughs 🔥
Exciting times we live in.
Great explanation.
Thank you, appreciate hearing that :)
Very insightful!
Thank you! Appreciate the support :)
Great example, thanks!
@@Abdelrhman_Rayis you’re very welcome, thanks for the comment!
Nice, thanks for the video. I also liked the recap part; though it's a short video, the topic is very dense.
@@szebike appreciate the feedback. I’m working on a longer video now, but you’re right it’s a very dense topic. A lot of details were left out to get us started :)
Well done.
@@HenrikVendelbo thank you for the kind words :)
Drawing in excalidraw makes this tons more impressive lol
@@hamadaag5659 thank you ;)
I’d love to learn another tool but sticking with a simple one has been a good constraint. Makes it easier for me to get something done.
Very nice explanation! I teach AI at a university and it is frustrating to try to explain this using a 2D Cartesian space!
@@elmoreglidingclub3030 I hear you! I found that difficult too. The individual number lines taken in aggregate really helped me :)
@@BitsOfChris Exactly. It’s a great idea. And you can really convey context, even subtle contexts, by using several lines.
I’d tried using several coordinates but this is much clearer.
I feel like this chain of reasoning is missing something important. Here's what I mean. How can we infer that the different dimensions are something relevant to the real universe? In the example you mentioned, "light as a feather", the presence of the word "feather" influences the embedding of the word "light" in that particular context. But if we go down this chain, how does the model know that a feather is something that has certain dimensions along which it has a value? Implying relevance to the real world seems like a leap of logic to me.

The words "light" and "feather" influence each other's embeddings in that context only because, in the data the model has been trained on, these words have an impact on each other in terms of what the next words should be. The embeddings in particular contexts make sense because the data it is trained on is (at least as far as humans are concerned) a valid representation of the real world. These dimensions do encode something, but those encodings are only high-level abstractions of patterns in the syntax we generate using this particular language (English). I feel like implying this actually encodes relevant features of the universe is a leap in logic.

Another point here is that we have ideas we want to convey or represent, and then we formulate words to convey them. In this case, all words are given, and their interactions are studied using data. But that's not how we work. I really feel like presuming that these dimensions actually encode features of the real world is a bit of a stretch. Thoughts?
Thank you for this incredibly thoughtful comment! You raise some great points about the relationship between language, meaning, and representation that I hadn't really considered :)
You're absolutely right that we should be cautious about claiming these models learn 'real' features of the universe rather than just patterns in language.
The 'light as a feather' example might be better framed as demonstrating how attention mechanisms can learn to represent contextual relationships in language that humans use to describe physical properties, rather than suggesting they learn the physical properties themselves.
To your initial question: I don't think we can always infer what the dimensions are doing, let alone if they are something relevant to the real universe. From what I learned so far - it's really difficult to do this reverse engineering of a language model. And to my knowledge the dimensions or features they learn are not always as cleanly interpretable as I laid out in the example.
My goal was to simplify things "enough" so folks could grasp the bigger picture of what's going on. Like most things, I think it's a trade-off between accuracy and simplicity in how much detail to include.
@BitsOfChris Thanks for the reply 🙂. Just wanted to mention this point. I completely agree that this is a very good starting point to try to peek under the hood. Great video, very concisely put. Looking forward to more great content from you 🙂
@ appreciate it, thank you :)
Given that there are less easily human interpretable dimensions or properties or whatever of a word, I feel like maybe if we could figure out what those dimensions mean and what they represent then that could be a way of learning about nuances in the meanings of words that maybe haven't even been considered 🤔 Anyone know if there's some kind of thing like this where you could find unique meanings or patterns of like a word that AI tech uncovered?
I've heard about how, for example, people working with neural networks or something found that the (AIs? networks? systems? ig I'll just say) models picked up on totally unexpected patterns in their input data, with some examples like I think one model meant to find patterns between images of eyes and diseases also found patterns that'd help predict someone's gender based on the image of the eye. And I think another example involved an LLM developing a neuron for like positive or negative sentiment in parts of text, so it could be configured to create outputs of a certain sentiment and it could also be used to measure the sentiment of text.
I've considered looking into stuff like this further because it seems so cool that models can just pick up on patterns that we haven't even consciously noticed, and I wonder if anyone might have some ideas for how we can learn from the actual patterns that AIs have essentially discovered in different kinds of input data. Anyone know topics involving stuff like this that I could look into further?
Dude - this is getting into some really deep but fascinating territory.
It's the balance between human interpretable and the emergent properties of these models.
Some work to follow up on:
Golden Gate Claude: www.anthropic.com/news/golden-gate-claude
The field as a whole is called "mechanistic interpretability" where you are effectively reverse engineering neural networks.
Please share anything you find - sounds like you are very interested in this angle, would love to learn more too :)
@@BitsOfChris Oh cool 🤔 After my comment, I ended up coming across a new video from the Welch Labs channel about mechanistic interpretability, which kinda did seem like this idea I was thinking about, and now it sounds like that topic was indeed on the right track.
The Golden Gate Claude thinking of the bridge as its own ideal form is hilarious btw xD Looks like there is sort of a whole entire field of interpretability within the overall scope of neural networks, and some links on the article seem to point to some deeper research on the topic. I'm not too intent on exploring it atm, but it's good to know that there is this stuff out there, and I imagine I would revisit it sometime in the future.
Oh also, I recall there was a 3b1b YT short about word vectors that kinda showed how models approximately represent concepts like [ woman - man ≈ aunt - uncle ] and even [ Hitler + Italy - German ≈ Mussolini ] lol. I didn't watch it, but the short linked to a longer video about transformers explained visually, which might touch on some interesting sides of this topic as well if you're interested.
So does the model set these points itself when making these contextual connections with the points or is the amount of attention it gives to a lexical item preset by a human?
@@austinphillip2164 great question- the model sets the points itself through its training process.
Then, as it processes your input, the model adjusts each token's representation (its meaning in context) as the input is passed through every layer of the model.
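A rough sketch of that layer-by-layer refinement, if it helps - the "layer" here is a stand-in with random numbers rather than learned weights, just to show how each pass mixes every token's vector with its context:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_layer(X):
    """One hypothetical layer: mix each token with the others, then squash."""
    scores = X @ X.T                          # how similar each token is to every other token
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax per row
    return np.tanh(weights @ X)               # context-mixed representations

# Four tokens with 5 dimensions each; random stand-ins for trained embeddings
X = rng.normal(size=(4, 5))

# Each layer nudges every token's vector based on the surrounding context
for _ in range(3):
    X = toy_layer(X)
print(X[0])  # the first token's representation after 3 layers
```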
Does that help?
So each word in the model has N corresponding dimensions?
And these dimensions are not shared across words?
Each word (or more specifically, token) SHARES the same N-dimensional space in the model.
From the video, each word would be plotted somewhere as a point in the SAME 5-dimensional space, where the 5 axes capture the same meanings for every word.
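A tiny sketch of what that shared space means in practice - the vectors below are made up, but because every word lives on the same 5 axes, you can compare any two words directly:

```python
import numpy as np

# Hypothetical 5-dimensional embeddings; every token is a point in the SAME space
embeddings = {
    "light":   np.array([0.8, 0.1, 0.6, 0.2, 0.3]),
    "feather": np.array([0.7, 0.2, 0.5, 0.1, 0.4]),
    "heavy":   np.array([-0.8, 0.1, -0.5, 0.3, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: only meaningful because the axes mean the same thing for every word."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embeddings["light"], embeddings["feather"]))  # relatively close
print(cosine(embeddings["light"], embeddings["heavy"]))    # much farther apart
```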
Does that make sense?
Great questions! I realize now maybe this wasn't as clear as it could have been.
From the video example, the word "light" is mapped to 5 other words with the weights for those words varying for different contexts. So,
1) Where are all the weights stored? In the light example, each context of light will have different weights for those words.
2) Is such a mapping as light created for all tokens? i.e. each token has N other words mapped to it with weights in those words?
@ the 5 words next to the number lines are the semantic meanings of those dimensions. Sorry if I didn’t emphasize that enough.
Your additional questions are great- they are the basis of how attention is actually implemented. I deliberately left that out of scope for this video.
But to answer quickly, this process of updating embeddings is done for each token in the context window.
There are three additional matrices, commonly called the query, key, and value matrices.
To your questions, yes this process of looking at how much each token should care about the meaning of every other token is done for each word.
The video just emphasized this process for the word light.
I’m working on a more in depth video with code examples that shows this process for the entire context window.
I deliberately omitted details in this video in order to really drive home the high-level concept.
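For anyone curious before that video lands, here is a minimal, self-contained sketch of scaled dot-product self-attention over a whole context window. The embeddings and projection matrices are random stand-ins rather than real trained weights, but the mechanics (query, key, value, softmax) are the standard ones:

```python
import numpy as np

rng = np.random.default_rng(42)
n_tokens, d_model = 4, 5                    # e.g. "light as a feather": 4 tokens, 5 dimensions

X = rng.normal(size=(n_tokens, d_model))    # stand-in embeddings for the context window
W_q = rng.normal(size=(d_model, d_model))   # learned query projection (random stand-in)
W_k = rng.normal(size=(d_model, d_model))   # learned key projection
W_v = rng.normal(size=(d_model, d_model))   # learned value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)         # how much each token should care about every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax per row

updated = weights @ V                       # every token gets a context-aware update at once
print(weights.round(2))                     # one row of attention weights per token
print(updated.shape)                        # (4, 5): same shape as the input embeddings
```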
Thanks for the follow up questions!
the fun part is when you try to understand queries, keys, and values...
Honestly - that took me about a month of reading, implementing, and watching TH-cam.
Here's the video I published after this one that goes into more detail: th-cam.com/video/FepOyFtYQ6I/w-d-xo.htmlsi=Gg9iUT-teDDkr7sD
The reason we can't understand it is because of our limitations in visualizing. We can only visualize 3 dimensions; 4D, 5D, 6D, ... nD are beyond imagination. We can do the math of it, but the physics of it is just beyond reach.
@@GalaxyHomeA9 right, for some reason I just assumed people could visualize that far or had some trick for understanding high dimensional spaces. This number line approach really helped me. Thanks for the comment :)
What an interesting look under the hood
@@peblopablo appreciate that, thanks :)