Great explanation! In short, an attention block is a system which allows every token of an input to update the embedding of a single token to better represent its meaning within the sequence.
Solid recap!
Attention lets each token capture the meaning of other tokens.
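To make that recap concrete, here's a toy sketch in plain NumPy. The embeddings and attention weights below are made-up numbers, not taken from any real model - it only shows the core move: the embedding of "light" gets pulled toward a weighted mix of the other tokens' embeddings.

```python
import numpy as np

# Toy 5-dimensional embeddings for "light as a feather" (hypothetical values)
tokens = ["light", "as", "a", "feather"]
E = np.array([
    [0.9, 0.1, 0.5, 0.2, 0.4],   # light
    [0.0, 0.0, 0.1, 0.0, 0.1],   # as
    [0.0, 0.1, 0.0, 0.1, 0.0],   # a
    [0.1, 0.9, 0.2, 0.8, 0.3],   # feather
])

# Hypothetical attention weights: how much "light" cares about each token (sums to 1)
weights = np.array([0.50, 0.05, 0.05, 0.40])

# The updated embedding of "light" is a weighted mix of every token's embedding
light_updated = weights @ E
print(light_updated)
```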
The number lines (to visualize higher dimensions) help greatly. I never thought of it like that and was struggling with dimensions greater than three!
Glad it helped! :)
This really describes embeddings more than attention itself.
Fair point - I think understanding embeddings is key to understanding self-attention. But the scope of this video was more about the conceptual understanding of what attention means than about the implementation.
Thank you for watching :)
Is there anything additional you would have liked to see?
@@BitsOfChris Well, it would have been useful if it explained attention instead of just saying it exists.
I totally agree, but sometimes it's helpful not to overwhelm with too much information. For this video I deliberately chose to keep the scope small.
I tried going deeper on attention in this video - I hope this one helps: th-cam.com/video/FepOyFtYQ6I/w-d-xo.html
I can relate to this so much. Adding a few words to the prompt to specify the instructions helps reliability enormously, especially for smaller models like 4o-mini, which don't seem to do a lot of "thinking".
Agreed - it's fun to try different variations of the same prompt in multiple windows to test how much a key word in your prompt can influence the response.
great intuitive explanation
Excellent explanation. Appreciate the effort!
Thanks, happy to hear :)
Thank god! I found your channel, the best explanation I've ever seen
Really happy you found it too! Thanks :)
Thanks! I've often struggled to explain attention to others; this is a really good explanation that I'm totally going to use in my own conversations.
@@BenLebovitz appreciate that, glad it helped!
New sub just for how good this explanation is
@@jaysonp9426 appreciate that, thank you for the kind words!
That was very helpful, thanks! I recently had to learn more about AI for my job, and even though I'm generally somewhat informed, I lack a lot of depth in understanding AI at this point. This example paints a clear picture of how LLMs think, very cool!
Happy to hear it helped! I'm in a similar boat: as a data engineer now supporting an AI research team, this world of AI is new to me too.
I think the channel here is mainly me (as an engineer) sharing what I learn that's helpful for other engineers needing to learn AI.
Thanks for the comment :)
really amazing explanation and visualization! thank you! subbed! :)
Thank you for the kind words :)
Bro this is the best explanation I've ever seen. Thank you!
Thank you for the kind feedback, glad it helped :)
Chris, this is super cool. Keep up the good work. Conceptual things have tremendous impact on new learners.
Thanks- appreciate hearing that. I'm with you- there's so much to learn but it's hard to tie the deep dives to something you can grasp.
Glad this was helpful :)
Very nice. Think this is the freaking first time I understand it.
Sometimes it just takes seeing something through a different lens, happy it helped!
That's a great perspective. Thank You.
Thanks for the kind words, happy it helped :)
@@BitsOfChris I am working on scaling Ai personal memories, so your slide scaling_factoring in my nowhere Nexus development really optimize the performance and output based on necessary settings for each self.dimensions memories embedded into subdimensions function generateLabels. Duel dialogue associated with each iteration in conversation to convert into recursive fractal dimensions into appropriate layers. later I had my first results with models. Was very good 👍
Thanks for this explanation
It helped me get a better understanding of what is happening inside the model than the usual example with 2D vectors.
Very happy to hear- exactly what I was trying to accomplish!
Clear and short explanation, nice one! I already know that if you publish videos consistently this channel will become very popular; I've seen many examples like this. Keep up the great work 😉
@@WiseMan_1 I don’t know who you are, but this is such a kind thing to share. Thank you for making my day :)
I’ve been writing for as long as I can remember. 12 months ago was the first time I put my writing on social media, and it was scary.
But after a year of writing, podcasting, making videos, and exploring myself to find my version of work that feels like play - I think I found it.
Thank you for the confidence to keep going :)
@@BitsOfChris I’m happy that my comment made your day :) Your journey is inspiring. Keep doing what you love! Looking forward to more great videos.
This was super intuitive!! The ability to focus on 5 dimensions and walk through a word like light, which has multiple meanings that only become clear based on context, was very helpful. Do you have a Twitter account or Substack I can follow?
Appreciate that, glad it helped :)
I'm mainly focused on TH-cam right now but I use bitsofchris.com Substack as my "hub" for all things.
This is an excellent insight into how to explain the high dimensionality of AI models.
Thank you :)
Great approach, and style of teaching.
@@Normi19 appreciate that, thank you!
It’s so helpful to get feedback (and motivating).
I’m curious what about my style most appealed to you? I’m too close to know myself sometimes lol
@@BitsOfChris the choice of the word “light” makes the example layered and shows the nuance of multidimensionality. The stacks of single scales make the point very clear, especially when you walk the audience through the possible changes based on context.
@ thank you for that :)
Ahh, this is good. It's so good.
Happy to hear, thanks :)
Great explanation, just subbed 😎
@@neokirito appreciate that, thank you for your support :)
Conceptual and insightful. I wish you could prepare another video on how embeddings, dimensions, and parameters interact.
@@emmanuelmwape4560 thanks for the feedback and suggestion. I’m working on a more in depth video that includes code examples. Hopefully this will help :)
AI: The silent force behind incredible breakthroughs 🔥
Exciting times we live in.
Great explanation.
Thank you, appreciate hearing that :)
Very insightful!
Thank you! Appreciate the support :)
Great example, thanks!
@@Abdelrhman_Rayis you’re very welcome, thanks for the comment!
Nice, thanks for the video. I also liked the recap part; though it's a short video, the topic is very dense.
@@szebike appreciate the feedback. I’m working on a longer video now, but you’re right it’s a very dense topic. A lot of details were left out to get us started :)
Well done.
@@HenrikVendelbo thank you for the kind words :)
Drawing in excalidraw makes this tons more impressive lol
@@hamadaag5659 thank you ;)
I’d love to learn another tool but sticking with a simple one has been a good constraint. Makes it easier for me to get something done.
Very nice explanation! I teach AI at a university and it is frustrating to try to explain this using a 2D Cartesian space!
@@elmoreglidingclub3030 I hear you! I found that difficult too. The individual number lines taken in aggregate really helped me :)
@@BitsOfChris Exactly. It’s a great idea. And you can really convey context, even subtle contexts, by using several lines.
I’d tried using several coordinates but this is much clearer.
I feel like this chain of reasoning is missing something important. Here's what I mean. How can we infer that the different dimensions are something relevant to the real universe? In the example you mentioned, "light as a feather", the presence of the word "feather" influences the embedding of the word "light" in that particular context. But if we go down this chain, how does the model know that a feather is something that has certain dimensions along which it has a value? Implying relevance to the real world seems like a leap of logic to me.

The words "light" and "feather" influence each other's embeddings in that context only because, in the data the model has been trained on, these words have an impact on each other in terms of what the next words should be. The embeddings in particular contexts make sense because the data it is trained on is (at least as far as humans are concerned) a valid representation of the real world. These dimensions do encode something, but those encodings are only high-level abstractions of patterns in the syntax we generate using this particular language (English). I feel like implying this actually encodes relevant features of the universe is a leap in logic.

Another point here is that we have ideas we want to convey or represent, and then we formulate words to convey them. In this case, all words are given, and their interactions are studied using data. But that's not how we work. I really feel like presuming that these dimensions actually encode features of the real world is a bit of a stretch. Thoughts?
Thank you for this incredibly thoughtful comment! You raise some great points about the relationship between language, meaning, and representation that I hadn't really considered :)
You're absolutely right that we should be cautious about claiming these models learn 'real' features of the universe rather than just patterns in language.
The 'light as a feather' example might be better framed as demonstrating how attention mechanisms can learn to represent contextual relationships in language that humans use to describe physical properties, rather than suggesting they learn the physical properties themselves.
To your initial question: I don't think we can always infer what the dimensions are doing, let alone if they are something relevant to the real universe. From what I learned so far - it's really difficult to do this reverse engineering of a language model. And to my knowledge the dimensions or features they learn are not always as cleanly interpretable as I laid out in the example.
My goal was to simplify things "enough" so folks could grasp the bigger picture of what's going on. Like most things, I think it's a trade-off between accuracy and simplicity in how much detail to include.
@BitsOfChris Thanks for the reply 🙂. Just wanted to mention this point. I completely agree that this is a very good starting point to try to peek under the hood. Great video, very concisely put. Looking forward to more great content from you 🙂
@ appreciate it, thank you :)
Given that there are less easily human interpretable dimensions or properties or whatever of a word, I feel like maybe if we could figure out what those dimensions mean and what they represent then that could be a way of learning about nuances in the meanings of words that maybe haven't even been considered 🤔 Anyone know if there's some kind of thing like this where you could find unique meanings or patterns of like a word that AI tech uncovered?
I've heard about how, for example, people working with neural networks or something found that the (AIs? networks? systems? ig I'll just say) models picked up on totally unexpected patterns in their input data, with some examples like I think one model meant to find patterns between images of eyes and diseases also found patterns that'd help predict someone's gender based on the image of the eye. And I think another example involved an LLM developing a neuron for like positive or negative sentiment in parts of text, so it could be configured to create outputs of a certain sentiment and it could also be used to measure the sentiment of text.
I've considered looking into stuff like this further because it seems so cool that models can just pick up on patterns that we haven't even consciously noticed, and I wonder if anyone might have some ideas for how we can learn from the actual patterns that AIs have essentially discovered in different kinds of input data. Anyone know topics involving stuff like this that I could look into further?
Dude - this is getting into some really deep but fascinating territory.
It's the balance between human interpretable and the emergent properties of these models.
Some work to follow up on:
Golden Gate Claude: www.anthropic.com/news/golden-gate-claude
The field as a whole is called "mechanistic interpretability" where you are effectively reverse engineering neural networks.
Please share anything you find - sounds like you are very interested in this angle, would love to learn more too :)
@@BitsOfChris Oh cool 🤔 After my comment, I ended up coming across a new video from the Welch Labs channel about mechanistic interpretability, which kinda did seem like this idea I was thinking about, and now it sounds like that topic was indeed on the right track.
The Golden Gate Claude thinking of the bridge as its own ideal form is hilarious btw xD Looks like there is sort of a whole entire field of interpretability within the overall scope of neural networks, and some links on the article seem to point to some deeper research on the topic. I'm not too intent on exploring it atm, but it's good to know that there is this stuff out there, and I imagine I would revisit it sometime in the future.
Oh also, I recall there was a 3b1b YT short about word vectors that kinda showed how models approximately represent concepts like [ woman - man ≈ aunt - uncle ] and even [ Hitler + Italy - German ≈ Mussolini ] lol. I didn't watch it, but the short linked to a longer video about transformers explained visually, which might touch on some interesting sides of this topic as well if you're interested.
So does the model set these points itself when making these contextual connections with the points or is the amount of attention it gives to a lexical item preset by a human?
@@austinphillip2164 great question- the model sets the points itself through its training process.
Then, as it processes your input, the model adjusts each token's representation (its meaning in context) as the input is passed through every layer of the model.
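A rough sketch of that layer-by-layer refinement, if it helps - the "layer" here is a stand-in with random numbers rather than learned weights, just to show how each pass mixes every token's vector with its context:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_layer(X):
    """One hypothetical layer: mix each token with the others, then squash."""
    scores = X @ X.T                          # how similar each token is to every other token
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax per row
    return np.tanh(weights @ X)               # context-mixed representations

# Four tokens with 5 dimensions each; random stand-ins for trained embeddings
X = rng.normal(size=(4, 5))

# Each layer nudges every token's vector based on the surrounding context
for _ in range(3):
    X = toy_layer(X)
print(X[0])  # the first token's representation after 3 layers
```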
Does that help?
So each word in the model has N corresponding dimensions?
And these dimensions are not shared across words?
Each word (or more specifically, token) SHARES the same N-dimensional space in the model.
From the video, each word would be plotted somewhere as a point in the SAME 5-dimensional space, where the 5 axes capture the same meanings for every word.
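A tiny sketch of what that shared space means in practice - the vectors below are made up, but because every word lives on the same 5 axes, you can compare any two words directly:

```python
import numpy as np

# Hypothetical 5-dimensional embeddings; every token is a point in the SAME space
embeddings = {
    "light":   np.array([0.8, 0.1, 0.6, 0.2, 0.3]),
    "feather": np.array([0.7, 0.2, 0.5, 0.1, 0.4]),
    "heavy":   np.array([-0.8, 0.1, -0.5, 0.3, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: only meaningful because the axes mean the same thing for every word."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embeddings["light"], embeddings["feather"]))  # relatively close
print(cosine(embeddings["light"], embeddings["heavy"]))    # much farther apart
```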
Does that make sense?
Great questions! I realize now maybe this wasn't as clear as it could have been.
From the video example, the word "light" is mapped to 5 other words with the weights for those words varying for different contexts. So,
1) Where are all the weights stored? In the light example, each context of light will have different weights for those words.
2) Is such a mapping as light created for all tokens? i.e. each token has N other words mapped to it with weights in those words?
@ the 5 words next to the number lines are the semantic meanings of those dimensions. Sorry if I didn’t emphasize that enough.
Your additional questions are great- they are the basis of how attention is actually implemented. I deliberately left that out of scope for this video.
But to answer quickly, this process of updating embeddings is done for each token in the context window.
There are three additional matrices, commonly called the query, key, and value matrices.
To your questions, yes this process of looking at how much each token should care about the meaning of every other token is done for each word.
The video just emphasized this process for the word light.
I’m working on a more in depth video with code examples that shows this process for the entire context window.
I deliberately omitted details in this video in order to really drive home the high-level concept.
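For anyone curious before that video lands, here is a minimal, self-contained sketch of scaled dot-product self-attention over a whole context window. The embeddings and projection matrices are random stand-ins rather than real trained weights, but the mechanics (query, key, value, softmax) are the standard ones:

```python
import numpy as np

rng = np.random.default_rng(42)
n_tokens, d_model = 4, 5                    # e.g. "light as a feather": 4 tokens, 5 dimensions

X = rng.normal(size=(n_tokens, d_model))    # stand-in embeddings for the context window
W_q = rng.normal(size=(d_model, d_model))   # learned query projection (random stand-in)
W_k = rng.normal(size=(d_model, d_model))   # learned key projection
W_v = rng.normal(size=(d_model, d_model))   # learned value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)         # how much each token should care about every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax per row

updated = weights @ V                       # every token gets a context-aware update at once
print(weights.round(2))                     # one row of attention weights per token
print(updated.shape)                        # (4, 5): same shape as the input embeddings
```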
Thanks for the follow up questions!
the fun part is when you try to understand queries, keys, and values...
Honestly - that took me about a month of reading, implementing, and watching TH-cam.
Here's the video I published after this one that goes into more detail: th-cam.com/video/FepOyFtYQ6I/w-d-xo.htmlsi=Gg9iUT-teDDkr7sD
The reason we can't understand it is because of our limitations in visualizing. We can only visualize 3 dimensions; 4D, 5D, 6D, ... nD are beyond imagination. We can do the math of it, but the physics of it is just beyond reach.
@@GalaxyHomeA9 right, for some reason I just assumed people could visualize that far or had some trick for understanding high dimensional spaces. This number line approach really helped me. Thanks for the comment :)
What an interesting look under the hood
@@peblopablo appreciate that, thanks :)