It’s not often that I find the inventor of a technique explaining the technique. This is incredibly helpful. Thank you
I am literally blown away by the quality of your explanation!
I am an AI researcher myself, so I can really appreciate the beauty of explaining technical concepts in "simple" language without making them "simpler". 🙂
Awesome, thanks for explaining. Can't imagine what things would look like if this technique hadn't been created: full training runs, huge models for just one concept, no way to combine multiple styles. It saves time, saves GPU hours, and even saves energy. Such a big breakthrough, and I appreciate the explanation.
Your videos are really of the highest quality Edward! Thanks for posting these quick overviews
LoRA is such an unlock for resource-constrained creators looking to leverage models for specific domains. Thank you for this amazing work!
This was a great intuitive explanation of it. I wish more people took the adaptability of LoRA seriously, though: everyone (and their dog) uploads full models after doing small fine-tunes *with* LoRA, instead of just the adapters. Sharing only the adapters would not only help experimentation but also save time, since we have to download redundant copies of the base model over and over...
First, thank you for bringing LoRA to life, and second, thank you for the humble explanation. I am working on a new startup that makes sense thanks to LoRA, especially the hierarchical structure you just explained. Thanks again. I subscribed to the channel and am following on X.
This was very helpful! Thank you, Edward!
Thank you and your team for the research. It saved our ML project at university, because fine-tuning the SAM model with its billion parameters was just not possible on consumer GPUs... but with LoRA, no problem (based on the MeLo repo).
I just have a bit of trouble understanding exactly the impact of the rank. Example: we only want to segment one specific object when we fine-tune SAM. With rank 2, we get better results than with rank 512. But why exactly? Is it because a lower rank causes the model to be trained for a more specific task (but also to overfit faster)?
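For anyone puzzling over the same thing, here is a minimal sketch (PyTorch-style, all names illustrative) of one concrete part of the answer: the rank directly sets how many trainable parameters the update has, so at rank 512 on a 1024-wide layer the update has as many free parameters as the full matrix, with the capacity and overfitting risk that implies on a narrow single-object task.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / r) * B(A(x)).
    Only A and B are trained; the base weight W stays frozen."""
    def __init__(self, in_features, out_features, r, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False   # frozen pretrained weight
        self.A = nn.Linear(in_features, r, bias=False)   # down-projection
        self.B = nn.Linear(r, out_features, bias=False)  # up-projection
        nn.init.zeros_(self.B.weight)             # update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

# Trainable parameters grow linearly with r: r * (in + out).
for r in (2, 512):
    layer = LoRALinear(1024, 1024, r=r)
    n = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(f"rank {r}: {n:,} trainable params")
# rank 2:       4,096 trainable params
# rank 512: 1,048,576 trainable params -- as many as the full 1024x1024 matrix
```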
interesting question😮
Really helpful and brief explanation. ty.
Thank you for all your work!
Amazing explanation! Though that's to be expected coming from the founder, of course.
Thank you for expanding on your paper! Would love to see your thoughts on QLoRA as well!
Do we need the base model? Would it make sense to use a panel of experts, with the final model just being the sum of many LoRAs, left as decomposed matrices for cheaper matrix multiplies?
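For what it's worth, a hedged sketch of the arithmetic behind keeping the adapters decomposed (illustrative shapes and names). Note the base weight is still needed, since each LoRA is only a delta on top of it, but each factored update costs about 2·r·d multiplies per token instead of d² for a merged matrix:

```python
import torch

d, r, n_experts = 1024, 8, 4
W = torch.randn(d, d)  # shared frozen base weight: still required
experts = [(torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01)
           for _ in range(n_experts)]  # a factored LoRA "panel"

def forward(x, active):
    # y = W x + sum over active experts of B (A x).
    # Each factored term costs ~2*r*d multiplies vs d*d if merged.
    y = W @ x
    for i in active:
        B, A = experts[i]
        y = y + B @ (A @ x)
    return y

y = forward(torch.randn(d), active=[0, 2])
```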
Awesome explanation, and kudos for a great contribution to DL. Please make a follow-up video on QLoRA!
3:26 That is the best explanation!!
Thank you so much for explaining this clearly. Everything I watch on YouTube is made by people who have no idea how the tech works, or don't even know how to code beyond copy/paste/change-the-inputs, but pretend like they do.
Furthermore, there are just so many useless libraries around LLMs that people claim are the next big thing, but in reality they create code bloat, introduce more unknowns, make the code harder to work with since you now have to learn the library, and don't work as well as if you just wrote everything yourself.
These ideas have existed for a long time in vision research, like fine-tuning only the classifier heads of large models on new tasks.
This is amazing and very valuable. Thank you!!!
Excellent talk. Thank you.
That is such a good explanation, thanks!
This is excellent. Thanks!
Edward, can you tell me which side of AI engineering is better: fine-tuning a model (fine-tuning techniques) or creating a model from scratch with a large codebase?
Hi, just a curious thought. I've done some reading on MoE-Mamba and Vision Mamba; of particular note is how MoE-Mamba was designed to interleave MoE layers with Mamba expert layers. Vision Mamba, meanwhile, demonstrates spatial awareness of data correlations across gaps of contextually irrelevant data, due to its nature as a selective state space model (SSM). An RNN is a type of SSM, but a better implementation of the idea is S6 (Mamba), which replaces convolution with an algorithm that creates an efficient selective attention mechanism valid for LLM applications. I've heard it shows a lot of basic similarity to the transformer architecture.
I'm wondering... what if the LoRA had an offset like [add row] and we filled it with noise, but we prompted an embedded Mamba layer to look at [select topics] in the parameters of the new data and compare them with the base model on the GPU (mostly because that's what Mamba does), and Mamba had to edit the noise layer within an offset, an added dimension, to become a bridge of weights. The added layer would be like a context-aware translation that finds and relates sparse clusters or motifs in the parameterized dataset, ones that have relations but only if un-convoluted slightly; thus the offset injection layer optimized by Mamba is a kind of Rosetta Stone between knowledge A and knowledge B. But it's a layer or set of layers in the deep neural network created by a non-neural-network AI entity that's actually neural-assembly based. Aha!! At least to me, aha☆
The deep layers are so dense in dimensions, really, that editing a single one is like managing a neuron in a feed-forward neural network. Mamba is more like an assembly, given its bi-directional contextual awareness, and can be likened to Hebbian plasticity: neurons that fire together wire together.
So perhaps, to cause the alignment to happen, we have the S6 layer adjust the noise in the offset-injected layer in response to the alignment or misalignment of network responses to stimuli that both models should definitely know, then progress into more topics to imprint on the translator and encourage room to grow.
Then we should be able to iterate forward and skip a layer or two, then offset and inject again, so that this time it has exposure from a more advanced lens: unpacking and repacking the LoRA so it affects the base layers, with pauses in the feed-forward pass that re-route through these offset-injected LoRA layers, letting the LoRA show the base model how to interface with its concepts better.
Mamba should be able to perceive the clusters of real information as patterns its algorithm picks up from the tensors representing tokens. And if it only ever thinks in tensors, and Mamba is a selective state space model akin to RNNs and convolution, then selectively attending to sparsely connected examples in tensors is likely natural to Mamba. So it should help with fine-tuning LoRAs in this way. I hope! 🎉
Thank you, dude, and your team, and the open-source community for uplifting the entire collective. I hope my big-thinker, imaginative approach can contribute something useful to minds as productive and innovative as yours. From one futurist to many others: God bless!
Hi, this is a month later. I suppose I had something there; I'm not sure. I think I misunderstood some things before. Technically close enough.
Mamba S6 thinks with many tokens at once, and there's even a token-free Mamba that looks at machine-code forms of data, yet is still capable of speech.
What I suggested before was this (shorter, for the tl;dr):
Between each group of steps considered a cycle of inference ticks, we add at least one layer that injects Mamba shaping over space-time-mapped data, where selected regions and concepts can be held in memory to enhance inference quality, while the super-efficient parallel structure puts it in god-tier compute costs compared to ALL other models over long contexts. Mamba has a linear compute cost, while all the others have, not exponential, but quadratic costs.
So Mamba makes image generation smarter by adding little suggestion imprints, like watermarks or tags, which invite one of thousands of LoRA models to help in just one area of data within any n-dimensional space-time relationship.
This could be monumental, actually... imagine if we computed it all at the machine-code level with pre-compiled real code, plus token-free Mamba layers interleaved with any other model. A stack of layers in a cycle, even.
All of the AI's thoughts could have sound and video and speech, representing total immersion for the AI, or equally an immersive user experience in a 3D + time = 4D spatially mapped voxel world, where voxels are tagged by Mamba, and tags load and call LoRAs into action from the SSD into memory.
I'm talking holodeck, bro. Like, forget video, that's inherent; we can use this Mamba stuff live to organize world models with any dataset that contains a world model or other playing field.
If we can interleave the MoE layer for Mamba-MoE, then we're also effectively making Mamba an interleaved layer, so why not just stack it anywhere? Stack it with transformers, stack it without; use Mamba to learn how to tag a database for a voxel game engine; make tags call LoRAs that affect frame generation when that tag is present, mixing with another LoRA according to tags in that pixel's overlap/transparency showing on top. The LoRA only affects that pixel's 'tag-hue' group.
So: real-time 360° image generation, where every hue group is prompted in a certain procedural way with LoRAs that adapt, lowering inference cost and raising quality; otherwise it's like real-time image-to-image inference for AI painting with frame upscaling. Then you could probably call on MP4 codec technology to interpolate frames using commonly available video-encoding hardware. And you could probably speed things up with AMD's FSR, again on commonly available, highly parallel hardware (GPUs, or an iGPU for the power savers).
The whole 360° view responds to the VR headset's location in a video game made with a voxel engine. The native graphics *could* still exist, but only as far as the 'tag-hue' generates (potato-mode VFX). Each frame is a 360° potential, so extra frames should be generated for smoothing when fast head movements occur. It only actually has to generate the user's vision, and maybe just outside it. Being spatially similar, the AI approach will likely let us borrow a ton from the current frame to persistently maintain awareness of the frames most likely to be generated, based on trajectories and event-simulation cues; if a slight deviation occurs, it's only a slight adaptation of what was pre-computed. Since it's not technically just frame generation, but 4D voxels informing an AI that understands the area based on tags, then if the tags move it's a simple translation relative to the others in a statistically simulated scenario, not video generated in one batch or even one step (I don't know how Sora does it; something similar).
The holodeck, it's coming.
Thanks for the explanation Edward, very informative!
Great video! So your last point is that we can fine-tune a previously fine-tuned model with LoRA? What about catastrophic forgetting? Isn't that an issue?
Another question would be: can we fine-tune (with LoRA) a model that has already been fully fine-tuned (without LoRA, for example Llama2-chat)?
You're a king. Keep up these videos 👍
Trying to teach a Mistral-7B model Sanskrit. It already has Sanskrit characters as tokens and is the best-performing 7B Llama-based model I can find.
You seem like a knowledgeable person in this area. Do you have any advice for LoRA? Rank, alpha? How about targeting of Q, K, V? Other strategies?
I have about 3 GB of datasets, ranging from translations and corpora to data tables. I wonder if I should use different strategies for different data types?
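In case it helps, a hedged sketch of one common starting configuration using the Hugging Face peft library. The rank, alpha, and target modules below are generic defaults, not values tuned for Sanskrit, and the checkpoint name is just the stock Mistral-7B:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=16,                 # a middling rank; try 8-64 and compare
    lora_alpha=32,        # alpha/r = 2 scaling, a common default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # sanity-check how little is trainable
```

For the different data types, one option worth trying is a separate adapter per dataset, since adapters can be trained and swapped independently against the same frozen base.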
Thanks Edward
Can you keep fine-tuning on top of a model that has already been fine-tuned with LoRA, and how well does that work?
I have a simple question: is it feasible or beneficial to use low-rank approximations for the various Q, K, and V matrices themselves? I'm guessing not, since no one appears to be doing it.
It is feasible, and we tried it. It wasn't beneficial in our case because of rapid model degradation as the rank decreases and the decreased parallelism.
@@edwardjhu Thanks for the reply. For the reduced parallelism (I assume having to do two GEMVs in serial, A*(B*v)), you could always store the reduced weights on disk and then compute the simulated full-rank matrix before inference. Love the video, BTW; it adds context to the paper.
@@scottthornton4220 If we're talking about compressing the base model, the goal would be to reduce inference cost, not storage cost, because there's only one copy to store.
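To make the trade-off in this thread concrete, a small sketch with illustrative shapes: materializing the factored matrix once at load time removes the serial GEMVs from the forward pass, at the cost of holding the full matrix in memory, which is why the saving that matters for the base model is compute rather than storage:

```python
import torch

d, r = 1024, 64
B, A = torch.randn(d, r), torch.randn(r, d)  # low-rank factors kept on disk

W_full = B @ A               # one-time cost at load: simulated full-rank matrix

x = torch.randn(d)
y_fast = W_full @ x          # one GEMV per token at inference
y_slow = B @ (A @ x)         # two serial GEMVs, mathematically the same
assert torch.allclose(y_fast, y_slow, rtol=1e-3, atol=1e-3)
```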
Is it possible to merge LoRA with MoE? (An MoE of LoRAs.)
I think you could have a bunch of experts as LoRAs and switch between them. It would require less memory and give faster inference.
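A hedged sketch of what such an "MoE of LoRAs" could look like (every name here is illustrative, not an existing system): a small router picks one factored adapter per input, so memory holds only the shared base plus the tiny factors:

```python
import torch
import torch.nn as nn

d, r, n_experts = 512, 8, 4
base = nn.Linear(d, d, bias=False)   # shared frozen base layer
for p in base.parameters():
    p.requires_grad = False

router = nn.Linear(d, n_experts)     # scores each expert for this input
As = nn.ModuleList(nn.Linear(d, r, bias=False) for _ in range(n_experts))
Bs = nn.ModuleList(nn.Linear(r, d, bias=False) for _ in range(n_experts))

def forward(x):
    k = int(router(x).argmax())      # hard top-1 routing between LoRA experts
    return base(x) + Bs[k](As[k](x))

y = forward(torch.randn(d))
```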
Wow, that's an outstanding idea! XD
Can the same be done for teaching robots specific tasks? For example, if a robot learns how to pick up a ball, might picking up an apple require fine-tuning the model further?
In principle, yes! There are foundation models trained on robotics data.
Can you point me to any such foundation models?
Thanks!
Thank you so much for your very clear and to the point presentation🎉❤ (And of course for all the hard work to develop this technique) 🙏
Great Job!!!
Awesome video, highly appreciated. Just a side note: maybe a slightly better microphone would make things sound a little better :)
Noted! I have a RODE VideoMic but can definitely clean up the audio better.
A+ Edward, thank you for the good work and content.
Respect
Very cool. I know some of these words.
I am still a bit concerned about the continual-learning effects LoRA may cause in terms of catastrophic forgetting. Since you are playing with the model's internal feature representations by adding new info, it might happen that the info added at each layer flows through as noise that accumulates, changing the information at every layer until it reaches the output. Did you test for such effects when developing the technique? How far were the output vectors of the adapted model from the original's? (Maybe fixing it with knowledge distillation could be an option, though an expensive one.) It is very difficult to find a good paper on this.
Surely the point of training is to turn the noise into a function that represents the domain of that data; also, you only train one layer of the feed-forward network and/or the self-attention mechanism.
But I think that's the reason he mentions that you can remove the new LoRA additions to get the base model back.
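That reversibility is easy to see concretely. A minimal sketch with illustrative shapes: the adapted weight is the base weight plus a low-rank delta, so subtracting the same delta recovers the base exactly (up to floating-point rounding):

```python
import torch

d, r = 768, 4
W = torch.randn(d, d)             # frozen pretrained weight
B = torch.randn(d, r) * 0.01      # stand-ins for the trained LoRA factors
A = torch.randn(r, d) * 0.01

W_merged = W + B @ A              # merge for deployment: no extra latency
W_restored = W_merged - B @ A     # remove the LoRA to get the base model back
assert torch.allclose(W_restored, W)
```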
thanks
Applying quasi-orthogonal dimensions to large models is something that appeared around the same time as LoRA but unfortunately did not gain popularity.
I don't want to study at my college anymore; I want to do research in this field. I'm addicted to it. But getting out of college is really tough, so Edward, if you have any options, please help me. I want to just immerse myself in this field of AI and will learn anything fast, so please consider me. My college is really the worst; it takes my time and won't explain anything, and I can learn on my own. I want to work with tech members like you. So please help me, man, I request you. I don't want any money; I just want to learn and work with this AI. I'm regretting this every day. Time is more valuable than anything else in this world, so please make use of it, guys.
There is a lot of great content online for self-studying AI. I started with Andrew Ng's deep learning course on Coursera almost a decade ago!
@@edwardjhu Appreciate your patience in replying to all these questions.
If possible, make a roadmap to reach your level. Assume someone is just out of high school and wants to know everything you know. Make an hour-long (or longer) roadmap. It'll help millions of us become advanced and actually understand your videos.
There's another guy called Umar Jamil; he's also a GOAT like you. He builds stuff from scratch and has in-depth knowledge. I was asking him; now I'm asking you.
Oh, you started a decade ago. You must make a 10-hour+ video on your roadmap lol
Linear Algebra -- basis vectors, linear transformations, eigendecomposition, row space, column space, singular value decomposition (applicable to this video).
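Since SVD is the most directly applicable item on that list, a small sketch of the intuition (illustrative numbers only): truncating the SVD gives the best rank-r approximation of a matrix, which is the sense in which a weight update can be "low rank" without losing much:

```python
import numpy as np

rng = np.random.default_rng(0)
# A matrix that is approximately low rank: rank-4 signal plus small noise.
M = rng.standard_normal((256, 4)) @ rng.standard_normal((4, 256))
M += 0.01 * rng.standard_normal((256, 256))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = 4
M_r = (U[:, :r] * s[:r]) @ Vt[:r, :]    # best rank-r approximation of M

err = np.linalg.norm(M - M_r) / np.linalg.norm(M)
print(f"relative error of the rank-{r} approximation: {err:.4f}")
```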
The 1000th subscriber!