This looks good. Someday I will be able to return to studying this, but for now, I'll just watch you guys have fun.
MoE is explained wrong. People will think an expert is a separate model and that you can replace the experts with different models, but at least in Mistral's case, and pretty surely in GPT-4 too, the experts are interwoven into one big model and you can't replace them.
That is true for now, but I imagine with both Mistral and OpenAI proving that this is a winning concept, a modular setup will probably become possible.
Do you have any info to back this up? Thanks
While it's true that "experts" is misleading and these are just large sparsely-activated models, you can absolutely stitch preexisting dense models together into a patchwork MoE. The MLPs are added as separate experts, the other parameters are merged, and then the whole setup is frozen except for the gating networks, which are updated with a quick finetuning run on a diverse but lightweight corpus of tasks so that they learn to make effective use of the complementary capabilities of the source models' MLPs.
@@Fritz0id how can you make them modular if they all need to be trained together?
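A minimal PyTorch-style sketch of the stitching approach described above, with hypothetical names (`StitchedMoELayer`, `dense_mlps`): the MLP blocks taken from existing dense checkpoints become frozen experts, and only the small router is left trainable, which is why a light finetuning pass is enough.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StitchedMoELayer(nn.Module):
    """One MoE layer built from the MLP blocks of several pre-trained dense models.

    Hypothetical sketch: `dense_mlps` would be the feed-forward blocks lifted
    from existing dense checkpoints; everything except the router is frozen.
    """
    def __init__(self, dense_mlps, hidden_size, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(dense_mlps)
        self.router = nn.Linear(hidden_size, len(dense_mlps), bias=False)
        self.top_k = top_k
        # Freeze the expert MLPs; only the router will receive gradients.
        for p in self.experts.parameters():
            p.requires_grad = False

    def forward(self, x):                      # x: [tokens, hidden]
        logits = self.router(x)                # [tokens, n_experts]
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

During the quick finetuning run described above, you would hand only the router parameters (e.g. `layer.router.parameters()`) to the optimizer.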
@wtf345546 yes I got a chance to look at the code this morning and agree the diagram I used is not ideal. It is one decoder with the experts in the FF layers. I will release a new vid in a bit. Interestingly it does seem they may have started with the old model weights x.com/tianle_cai/status/1734188749117153684?s=20
What would be interesting is if there were an internal “conversation broker” that got the experts talking to one another and arriving at a consensus answer that they passed back to the prompter.
Would also take more processing power and time though. It still takes a long time for the models to write text; 4 times longer would be too slow.
@@willi1978 Not necessarily. Ensemble learning or synthesizing data is actually the future.
Something important missing from this video: there is absolutely no guarantee that each of the 8 individual parts will actually specialize in anything that we think of (like your example of function calling). Having said that, Mixtral works really well.
Was hoping for an explanation of what MoE was, Sam. Really clear, thanks!
AI tech is really moving so fast. I only got more deeply involved with AI this October; I never thought it could be like this. It is exciting and scary at the same time.
Nice pfp. Also, exactly.
We might have to end up converting local computer cafes into AI cafes. So many requirements, but I'm glad that it's all developing at a rapid rate. At this point JavaScript frameworks need to pick up the pace.
There was a paper about using tiny neural nets for each node of a neural net instead of the usual weight and bias thing. I feel like composable modularity is definitely the way forward
Unfortunately it's not like we can just plug in a new expert and take it away. While swapping out the weights is possible, it's not very practical, as you need to train the gating layers as well. LoRAs are probably better for what you mention.
Great video. Would be great to have a video that explains what an expert is.
Perhaps the overall trend in LLM architecture development is building the different prompt engineering strategies into the underlying architecture. Someone else here commented that a native implementation of CoT is the next low-hanging fruit.
Very interested to see a vid on distilling... Looking into it myself right now.
Hey Sam, another great video. If you don't mind, can I know the specs of your Mac? The LLM was running so smoothly.
This was done on a Mac Mini Pro with 32GB of RAM.
It's interesting that it works like our brain, with different parts specialised for different kinds of stimuli, like vision, speech, logical thinking, language, arts...
Great video mate. Thanks.
Where would this "mixture of models" be better than AutoGen? Looks like a similar approach, besides the gating layer? With AutoGen you also dedicate experts (agents). Would appreciate your view on that!
If I understand correctly, it's 8 of the 7-billion-parameter models, each an expert in a specific task, but it's all loaded up as one model.
I'm wondering, wouldn't it be possible to have multiple expert models in many areas and have them load in and out of memory on the fly? With fast SSD drives today, loading in a 7-billion-parameter model would be really fast and maybe seamless enough not to notice.
If that could work, it would have the major advantage of being able to have far bigger A.I. models that can run well on modest computers; after all, hard drive space is dirt cheap.
Maybe I'm reading too much into this lol, but if that is possible, it would be a massive leap forward and a big advantage for running A.I. models locally. Imagine having 300GB worth of 7-billion-parameter models, covering a lot of expert fields, with the main conversational model delegating to one of them.
I think if this is possible, you would probably still want enough memory to hold 2 or 3 models at a time.
Nice insights, it should be possible soon.
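A minimal sketch of the paging idea above, purely hypothetical (the file paths and the split into per-expert checkpoint files are made up): keep only a couple of expert weight files in RAM and page the rest from the SSD on demand.

```python
from collections import OrderedDict
import torch

class ExpertCache:
    """Keep at most `max_loaded` expert checkpoints in RAM, paging the rest from disk."""
    def __init__(self, paths, max_loaded=2):
        self.paths = paths            # e.g. {"code": "experts/code.pt", ...}
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()   # expert name -> state_dict, kept in LRU order

    def get(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)          # mark as most recently used
        else:
            if len(self.loaded) >= self.max_loaded:
                self.loaded.popitem(last=False)    # evict the least recently used
            self.loaded[name] = torch.load(self.paths[name], map_location="cpu")
        return self.loaded[name]

# Hypothetical usage: a router decides "code" is needed, then its weights are swapped in:
# cache = ExpertCache({"code": "experts/code.pt", "math": "experts/math.pt"})
# model.load_state_dict(cache.get("code"), strict=False)
```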
Great explanation, ty :)
I think it'd be really interesting to take a model like this, but also train an output layer to feed the generated response back into the start of the model for problems that require multi-step reasoning (i.e. coding), which might require processing from multiple models.
I also think that, especially if the experts are derivatives of the same model, there's probably a lot of low-hanging fruit for running this model effectively on lower-end hardware, such as just loading the main Mistral model and then applying a difference to it to get each of the experts when they're needed.
Lots of interesting ideas to try out!
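A tiny sketch of that "base plus difference" idea, assuming the experts really are finetuned derivatives of the same base model (the helper names are hypothetical):

```python
import torch

def make_delta(base_state, expert_state):
    """Store an expert as its elementwise difference from the shared base weights."""
    return {k: expert_state[k] - base_state[k] for k in base_state}

def apply_delta(base_state, delta):
    """Reconstruct the expert weights on demand from the base plus its stored delta."""
    return {k: base_state[k] + delta[k] for k in base_state}

# The delta can be stored in a lower precision if the expert stays close to the base, e.g.:
# torch.save({k: v.to(torch.bfloat16) for k, v in delta.items()}, "expert0.delta")
```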
Not to belabor the point, but the self-attention mechanism can be viewed as a gating network that modulates and commutates the activity of the following layers. So, IMO, a mixture of 2 x 7B experts will always perform worse than a single 14B LLM.
The problem is that the fundamental transformer architecture has scaling laws where you pay more and more for each additional unit of performance. Let's say it performs at 95% of the 14B model but at 80% of the energy requirements.
"You're going to need, probably at least 2 80 GB A100S". 4 days laters, it run on my home PC with 64GB DDR and a RTX 4070 ( 5-bit quantized and slow as f*** but it run ).
Damn i love open source.
Quantized versions at 6 bits, plus maybe some LoRAs, would take about 5.9GiB per expert. So I expect to be able to run this within 48GiB.
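A back-of-envelope version of that estimate, treating each expert as roughly 7B parameters the way the comment does (in Mixtral the real per-expert share is a bit different, since the attention weights are shared):

```python
def quantized_size_gib(n_params, bits_per_param):
    """Rough weight-only size estimate: parameters times bits, converted to GiB."""
    return n_params * bits_per_param / 8 / 2**30

print(quantized_size_gib(7e9, 6))       # ~4.9 GiB for a 7B expert at 6-bit
print(8 * quantized_size_gib(7e9, 6))   # ~39 GiB for 8 such experts, with headroom under 48 GiB
```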
Call me crazy, but I'd split a billion parameters between the gating layers, add an IP-Adapter and CLIPVision over a few for funsies, and have an internal agent that carries the prompt between experts if they need to collaborate. I'd have each of the experts be max 500k parameters. The fine-tuning would be weird because I imagine you'd need chat training for the gating layers, and maybe instruct for the experts. On second thought, I'd have one of the experts be an archivist, and their job is to manage the memory vector database.
Maybe it would be interesting to get each individual expert as an independent model, so that if you know which task you want to do you can load only the useful expert or subset of experts.
Missed a key point: an MoE model should theoretically be faster than a dense model of a similar total size. In Mixtral's case, only 2 of the 8 experts need to be executed per token, so 75% of the expert weights are skipped. Technically you're still stepping through 3 pipelines (the gate followed by the two chosen experts), but that's still less than half the work while still having access to a large pool of parameters.
Sharding models like this may be the key to other performance wins, once we get more experimentation under our belts. It's effectively branching. Right now that branching is used to distribute the job of remembering, but each branch is also an opportunity to change what's happening. Experts don't need to be balanced, they don't need an equal number of layers, and they don't even need to be transformers.
Just as we talk about augmenting humans with specialist technologies, so can LLMs be augmented.
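A rough way to see where the 2-of-8 saving comes from; the split between shared weights and per-expert MLPs below is illustrative, not Mixtral's exact numbers:

```python
def active_params(shared, per_expert_mlp, n_experts, top_k):
    """Parameters touched per token in a sparse MoE vs. the full parameter count."""
    total = shared + n_experts * per_expert_mlp
    active = shared + top_k * per_expert_mlp
    return active, total

# Illustrative split: ~1.5B shared (attention, embeddings) + 8 expert MLP stacks.
active, total = active_params(shared=1.5e9, per_expert_mlp=5.5e9, n_experts=8, top_k=2)
print(f"{active/1e9:.1f}B active of {total/1e9:.1f}B total ({active/total:.0%})")
# Only top_k / n_experts of the expert weights run per token, which is where the
# speed advantage over a similarly sized dense model comes from.
```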
As far as I understand the MoE setup, it's an optimization, rather than necessarily maximizing performance: faster, with a smaller memory footprint. Also, modules allow some AIOps benefits: you can create separate development teams improving each expert separately. That's easier to manage and test, and you can optimize datasets for fine-tuning, LoRA, etc.
@MikeKasprzak yes I originally left this out as I wasn't sure how much of the model was being used per forward pass etc. I am releasing a new vid covering this as they have a blog out now. Thanks for chiming in.
Isn't this fundamentally similar to multi-agent orchestration but without execution and RL-based rewards? Of course the architecture is different, but it seems pretty similar as far as goals go.
Not really, as these are not experts in the same way as that. Also, usually the agents in a multi-agent LLM app are just the same model with a different prompt.
Nice video
The conversation broker is everything....
Please let me know how to create a fixed form with the structure below using a special command to the LLM:
Give me a score out of 4 for each of the following (based on the TOEFL rubric), without any explanation; just display the score.
General Description:
Topic Development:
Language Use:
Delivery:
Overall Score:
Identify the number of grammatical and vocabulary errors, providing a sentence-by-sentence breakdown.
'Sentence 1:
Errors:
Grammar:
Vocabulary:
Recommend effective academic vocabulary and grammar:'
'Sentence 2:
Errors:
Grammar:
Vocabulary:
Recommend effective academic vocabulary and grammar:'
.......
It's great science and all, but I feel this approach is not very efficient, by design. Long term, we should probably instead focus on designing/training "MoE" models that have one relatively large general monolith model with multiple small "expert LoRA layers" on top of it, rather than having multiple large expert models with a gating layer curating all of those.
Ahh the idea of fast swapping and combining LoRAs is a really interesting one. Been a few good papers lately on that and I think PeFT is incorporating some of those techniques.
@@samwitteveenai Yes, something in that direction. It might take us some time to work it out, to find the most efficient and most effective way to do it, but I'm sure in a year or two we'll have a new MoE architecture, of some very efficient...
...dynamic "self-activating" LoRA layers of variable quantity and variable sizes, "organically" built (and/or grown) at the training/tuning stage, with variable activation strength, which results in variable influence on the final output. Which layers are activated, how many, and how strongly each is activated would dynamically depend on input/context. smth smth..
I'm a bit in fantasy world, but since so many of my fantasies came true in the last 2 years, I now allow my dreams to be even bolder.
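For the LoRA-swapping direction discussed in this exchange, a rough sketch using the PEFT library; the base model id is real but the adapter repos are placeholders, and the exact API may vary between PEFT versions:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Attach two LoRA adapters to the same frozen base model (adapter paths are hypothetical).
model = PeftModel.from_pretrained(base, "my-org/mistral-lora-code", adapter_name="code")
model.load_adapter("my-org/mistral-lora-math", adapter_name="math")

# Switch between "experts" without reloading the base weights...
model.set_adapter("code")

# ...or blend them into a single combined adapter.
model.add_weighted_adapter(
    adapters=["code", "math"],
    weights=[0.5, 0.5],
    adapter_name="code_math",
    combination_type="linear",
)
model.set_adapter("code_math")
```

The gating idea from the comments above would then amount to choosing which adapter to activate per request or per token, rather than routing between full expert MLPs.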
OpenMoE - what a name...
I know the guy behind this project and it really is "open"; he has shared a lot of very interesting insights.
@@samwitteveenai Gotcha, I wasn't suggesting otherwise. It just sounds funny. Are they gonna have an Opencurly next?
wat? your estimate is way wrong. you can run this off one 3090 and 64GB RAM very easily Lol.
Mixture of experts could lead to a good level of content and behavior filtering.
Specifically on capabilities: allowing the "expertness" of a model to be shut off in a given area.
Like shutting off the coding experts, so it gives bad code instead of great code. Or disabling the joking and trolling experts and only letting the formal-language expert speak... all the rudeness simply turned off, along with the ability to do bad things or hack; each expert fine-tuned to one aspect and then turned on and off at will to ensure only paying customers can make it do wonders.
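One hypothetical way the "switching experts off" idea could look at the router level: mask the gate logits of the disabled experts before picking the top-k, so those experts can never be routed to.

```python
import torch
import torch.nn.functional as F

def route_with_disabled(gate_logits, disabled, top_k=2):
    """Pick top-k experts per token, never selecting the disabled ones.

    gate_logits: [tokens, n_experts]; disabled: list of expert indices to block.
    """
    masked = gate_logits.clone()
    masked[:, disabled] = float("-inf")       # disabled experts can never win the top-k
    weights, idx = masked.topk(top_k, dim=-1)
    return F.softmax(weights, dim=-1), idx

logits = torch.randn(4, 8)                    # 4 tokens, 8 experts
w, idx = route_with_disabled(logits, disabled=[3, 5])
print(idx)                                    # expert indices 3 and 5 never appear
```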
I think you'll find that all GPT-4 experts are trained to do function calling. Function-calling answers don't degrade or incur extra delay depending on the subject. It's probably a bad example because function calling can be made entirely automatic/systematic, with behaviour verified against synthetic data using very well-defined evaluations.
good point!