To try everything Brilliant has to offer, free, for a full 30 days, visit brilliant.org/bycloud/. You'll also get 20% off an annual premium subscription!
Like this comment if you wanna see more MoE-related content, I have quite a good list for a video ;)
You should do a video on virtual humans and cognitive AI. Look at all the non-player character technology we have in Red Dead Redemption and The Sims. Throw a chatbot into one of those and we have a great virtual human.
Thanks for linking to all papers in the description.
Imagine assembling 1 million PhD students together to discuss someone's request like "write a poem about cooking eggs with C++". That's MoE irl
i'm tellin chatgpt this now.
Enjoy:
In a kitchen lit by screens,
Where code and cuisine intertwine,
A programmer dreams of breakfast scenes,
With a syntax so divine.
Int main() begins the day,
With ingredients lined up neat.
Eggs and spices on display,
Ready for a code-gourmet feat.
int eggs = 2; // Declare the count,
Double click on the pan.
Heat it up, and don’t discount,
Precision’s the plan.
std::cout
hahahahaha LMAO
AI: Reasonable request sir
And MoME is getting 1 million 5th graders to teach a baby to PhD level only on how to write a poem about cooking eggs with c++
It's crazy how Meta's 8B parameter Llama 3 model has nearly the same performance as the original GPT-4, which reportedly had 1.8T parameters.
That's a 225x reduction in parameter count (1.8T / 8B) in just 2 years.
The only thing in my mind is "MoE moe kyuuuuun!!!"
Intentional naming fr.
to some extent this seems closer to how brains work
neurons
Yeah, kind of like how spiking networks work, but more discrete/blocky and less efficient.
I think this concept should be applied to the basic MLP layers, so you can increase model performance without decreasing speed or RAM usage. The only sacrifice is storage, which is easily scalable. IMO this is the future
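Roughly what that would look like, as a minimal numpy sketch (all sizes and names here are made up for illustration, not from the video or any paper): per-token compute is fixed by top_k, while the total parameter count, i.e. storage, grows with num_experts.

import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 64, 1024, 4               # hypothetical sizes
W_in = rng.standard_normal((num_experts, d_model, 1)) * 0.02   # each expert: d_model -> 1
W_out = rng.standard_normal((num_experts, 1, d_model)) * 0.02  # ... -> d_model
router = rng.standard_normal((d_model, num_experts)) * 0.02

def sparse_mlp(x):                        # x: (d_model,) one token's hidden state
    scores = x @ router                   # score every expert (one cheap matvec)
    chosen = np.argsort(scores)[-top_k:]  # keep only the top_k experts
    gate = np.exp(scores[chosen]); gate /= gate.sum()   # softmax over the chosen ones
    out = np.zeros_like(x)
    for g, e in zip(gate, chosen):        # only top_k tiny matmuls actually run
        out += g * ((x @ W_in[e]) @ W_out[e])
    return out

print(sparse_mlp(rng.standard_normal(d_model)).shape)   # (64,)

Scaling num_experts only grows the weight tables; the loop still runs the same top_k tiny matmuls per token.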
Jeff Hawkins approves this message
I think this is how almost any informational system works. From molecules to galaxies, there are specialized units that use and process information individually in the system. An agentic expert approach was a long time coming and is certainly the future of AI. Even individual ants have specialized jobs in the colony.
@@johndoe-j7z That's how perceptrons worked right from the start
i see what you did there with "catastrophic forgetting" lmao 🤣
troll emoji
The format of these videos is GOLD 🏆 such specific and nerdy topics produced as memes 😄
Now I really am excited for an 800B model with fine-grained MoE to surface that I can run on basically any device.
You would still need a lot of storage though, but that is easier than downloading VRAM 😋
I watch you so that I feel smart, it really works!
3:37 wasn't it just yesterday that they released their model 😭
In a very real sense, the MoME concept is similar to diffusion networks. On their own, the tiny expert units are but grains of noise in an ocean of noise, and the routing itself is the thing being trained. Whether or not it's more efficient than having a monolithic neural net with simpler computation units (neurons), I dunno. I suspect, like most things ML, there is probably a point of diminishing returns.
Yo dog, I heard you liked AI so we put an AI inside your AI which has an AI in the AI which can AI another AI so that you can AI while you AI.
Idk if this was intended just as entertainment, but I used it as education
Like I needed to understand MoE/MMoE on a high level for my research and this video totally helped me. It will be easier to dive deeper into one of the papers now
Thank u for linking the papers in the description ❤
I watch your videos yet I have no idea what you are explaining 99% of the time. 🙃
I will try better next time 😭
@@bycloudAI Personally I watch your content because you elaborate on academic papers and their relevancy very well. Do hope you continue with content like this. But I can see something like a Fireship-style code report for LLMs being digestible.
@@bycloudAI I liked the video, but to their point, it might help to give a brief overview of what things are, i.e. parameters, feed forward, etc., the exact same way you briefly explained what hybridity and redundancy are. This is a good video if you're already familiar with LLMs and how they work, but it can probably be pretty confusing if you aren't.
I lost track at 8:24
Damn.. You blew my mind on the 1 million experts and Forever learning thing
How far are we from just having a virtual block of solid computronium with inference result simply being the exit points of virtual Lichtenberg figures forming thru it, with most of the volume of the block remaining dark?
it's about the distance between you and the inside of your skull
What is the 3D animation around 1:45 ?
yea, want to know this too
Blender
What resource is this at 2:01? Seems useful for teaching
Actually really cool idea, I liked the DeepSeek MoE version too, it's so clever
my go to channel to understand ai
I'm telling you: Just do it like the brain. Have every expert/node be a router, choosing who to send to.
And, have every node be a RL agent.
Thank you. I think I understand the impact of MoE.
00:01 Exploring the concept of fine-grained MoE in AI expertise.
01:35 Mixtral has unique Feed-Forward Network blocks in its architecture.
03:11 Sparse MoE method has historical roots and popularity due to successful models.
04:46 Introducing the Fine-Grained MoE method for AI model training
06:16 Increasing experts can enhance accuracy and knowledge acquisition
07:52 Efficient expert retrieval mechanism using the PEER layer technique
09:29 Large number of experts enables lifelong learning and addresses catastrophic forgetting
11:01 Brilliant offers interactive lessons for personal and professional growth
Crafted by Merlin AI.
hey where are the 3d visualisations of the transformer blocks from?
What is the source of your 3D transformer layer demonstration? Plz tell me
4:13 nice editing here🤣
What tool was used for the Transformer visualization starting at 2:01 ?
Today I saw a video about the paper "Exponentially Faster Language Modelling" and I feel like the approach is just better than MoE, and I wonder why not more work has been done on top of it (although I think it's possible that's how GPT-4o mini was made, but who knows)
Mixture of a million experts just sounds like a sarcastic description of Reddit
Great Video once again
how is the visualization in 2:01 made
I'd imagine in a month someone will come with MoE responsible for choosing the best MoE to choose the best MoE out of billions of experts
I would love a model with the performance of an 8B model, i.e. practical performance like GPT-3.5, but with much smaller active parameters so it can run on anything super lightweight.
Current 8B beats GPT 3.5 on most metrics, we've come a long way.
@@4.0.4 yeah, but metrics are not everything, and from my experience, GPT-3.5 still beats Llama 3 8B (or at least 8B quantized) in terms of interpolation/generalization/flexibility, meaning while it can mess up on difficult, specific or confusing tasks, it doesn't get overly lost/confused.
Metrics are good at simple, well-defined one-shot questions, which I'd agree it is better at
@@redthunder6183 remember not to run 8b at q4 (default in ollama for example, but BAD, use q8)
@@redthunder6183 true but make sure you're using 8-bit quant, not 4-bit - it matters for those small LLMs
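If anyone wants to try that, a quick sketch with the ollama Python client (the exact model tag is an assumption on my part, check the model library for the current q8_0 tag):

import ollama

model = "llama3:8b-instruct-q8_0"   # assumed tag for the 8-bit quant, instead of the default q4
ollama.pull(model)                  # roughly twice the download/RAM of the q4 build

reply = ollama.chat(
    model=model,
    messages=[{"role": "user", "content": "Summarize mixture of experts in two sentences."}],
)
print(reply["message"]["content"])  # or reply.message.content on newer client versions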
@@redthunder6183 llama 3 8b? That model is so outdated already... who is even using that ancient model?
Was hoping someone would make a video on this! Thank you! Would love to see you cover Google's new Diffusion Augmented Agents paper.
I have no idea what you just said but I'm glad they didn't just stubbornly stick to increasing training data and nothing else, like everyone seemed to assume they would. 🙂
I Like Your Funny Words, Magic Man
great video!
YES!!! NEW BYCLOUD VIDEO!!!
Where did you get the clips of attention mechanism visualization from?
Thanks! Incredibly useful to keep up.
1991... We are standing on the shoulders of giants.
I love these rabbit holes!
if 13B is ~8GB (q4) then why does ollama load the entire 47B (26GB) model into memory?
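Assuming this is about Mixtral 8x7B: only ~13B parameters are active per token, but the router can pick a different pair of experts at every layer and every token, so all ~47B weights have to stay resident. Back-of-the-envelope (bits per weight is a rough figure for a q4-style quant, not an exact number):

total_params  = 46.7e9    # all 8 experts across all layers
active_params = 12.9e9    # top-2 experts per layer actually run for a given token
bits_per_weight = 4.5     # rough average for a q4_K-style quant (assumption)

to_gb = lambda n: n * bits_per_weight / 8 / 1e9
print(f"kept in memory : {to_gb(total_params):.1f} GB")   # ~26 GB, what ollama loads
print(f"read per token : {to_gb(active_params):.1f} GB")  # ~7 GB, why inference is still fast-ish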
Damn, you're finally catching up. You should try NeMo and Megatron-LM, they have the best MoE framework
Can you maybe make a video explaining how Llama 3.1 8B is able to have a 128k context window while still fitting in an average computer's RAM?
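Part of the answer is that the 128k window is a maximum, not something paid for up front: the KV cache only grows with the tokens you actually keep in context, and Llama 3.1 8B uses grouped-query attention with just 8 KV heads. Rough math, assuming an fp16 cache:

layers, kv_heads, head_dim = 32, 8, 128   # Llama 3.1 8B config
bytes_per_value = 2                        # fp16 KV cache
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # K and V per token

print(per_token / 1024, "KiB per token")             # 128 KiB
print(per_token * 8_000 / 1e9, "GB at 8k context")   # ~1 GB, fine on a laptop
print(per_token * 128_000 / 1e9, "GB at full 128k")  # ~17 GB, why you'd quantize the cache or shrink the window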
1:05 Brilliant pays youtubers $20000-50000 per sponsored video!?
Ngl, I wish we got more videos about video generators making anime waifus like in the old days, but it seems like development on that front is slowing down at the moment. Hopefully you'll cover any new breakthroughs in the future.
So if these Millions of Experts are cute...
Should we call them...
Moe MoE?
I did a semester of ML the first half of this year, and I don't understand half of what you post lmao. Do you have any recommended resources to learn from? It is very hard to learn.
Dude! Ty❤
wow, top quality video
Can you cover DeepMind's recent breakthrough on winning the math olympiad? Does that mean RL is the way forward when it comes to reasoning? Because as of right now, as far as I know, LLMs can't actually 'reason', they are just guessing the next token, and reasoning does not work like that.
5k views after 3h is a shame, you deserve much more, go go go algorithm
Bro is good
You lost me when that guy pointed at the gravesite of his brother
0:42
Undrinkable water my favorite :v
My feeling is that after all these methods we'll eventually end up back at essentially a single monolithic model 😂😂😂
i didn't understand anything but it sounded cool
meanwhile Meta having no MoE
We might be onto something here... 👀
PEER doesn't scale, I've tried it multiple times
it should be MMoE ... massive mixture of experts XD
The best base language models (multilingual) + LoRAs are enough.
Seems as if the greatest optimisation for practical AI tech is dynamic mechanisms.
Lifelong memory plus continuous learning would become game changers in the space.
At this rate humanity will be able to leave behind machines that can recall our biological era. At least something will be able to carry on our legacy for at least hundreds of thousands of years.
what about 1T experts
Yes we need more MOM-eis 💀💀
Your thumbnails are a bit too similar to Fireship
also the entire composition of his videos, a little more than just taking inspiration lol
1 millions beer
more like a mixture of a million toddlers
wait till they use genetic programming with Monte Carlo tree search and UTP and other stuff on the router
MoME ? Nah. MOMMY ✅🤤
😂😂😂as a behavioral scientist.. i think this one is going straight to the crapper.. mark my words.😂😂😂
too many cooks 🎶
Bro, but did you read about Lory? It merges models with soft merging, building on several papers. Lory is new paint on a method developed for vision AI to make soft merging possible for LLMs. ❤
What's really key about Lory is backpropagation to update its own weights, it's fine-tuning itself at inference. It's also compatible with Transformers, Mamba, or Mamba-2. In addition, it looks like Test-Time Training could be used with all these methods for even more context awareness.
Bot
The thing about lifelong learning really reminds me of our human brains. Basically, for every different thought or key combination it sounds like it's building a separate new model with all the required experts for said task. So basically like all relevant neurons we trained working on one thought to solve it, with the possibility of changing and adding new neurons. I can't see it going well if we keep increasing the number of experts forever though, as the expert picking will become more and more fragmented. I think being able to forget certain things would probably be useful too.
I'm no scientist but I really do wonder how close this comes to the actual way our brain works.
Eat your heart out Limitless, we're making AI smarter by having them use less of their "brain" at a time
I'm a big fan of
i was like
schizophrenic AI
but then they went further...
anyway finally they are optimizing instead of making them bigger
I'll call it Moe (moe ehh) instead of em oh ih
That million expert strategy sounds super cool. I'm not too knowledgeable, though it does sound like it allows for a more liquid neural network by using an attention-like mechanism to literally pick which neurons get used. I feel like this will be the future of NNs.
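If it helps, the retrieval trick in the million-experts paper (PEER) is product keys: split the query in half, score each half against only sqrt(N) sub-keys, and combine the best candidates, so you never score all N experts directly. A rough numpy sketch with made-up sizes (the real thing also runs multiple retrieval heads and feeds the picked single-neuron experts through a softmax-weighted sum):

import numpy as np

rng = np.random.default_rng(0)
d, n, top_k = 64, 256, 8                 # n*n = 65,536 experts in total
K1 = rng.standard_normal((n, d // 2))    # sub-keys for the first query half
K2 = rng.standard_normal((n, d // 2))    # sub-keys for the second query half

def product_key_topk(q):
    q1, q2 = q[: d // 2], q[d // 2 :]
    s1, s2 = q1 @ K1.T, q2 @ K2.T                   # 2*n scores instead of n*n
    i1 = np.argsort(s1)[-top_k:]                    # best sub-keys on each side
    i2 = np.argsort(s2)[-top_k:]
    cand = [(s1[a] + s2[b], a * n + b) for a in i1 for b in i2]   # only top_k*top_k pairs
    cand.sort(reverse=True)
    return [idx for _, idx in cand[:top_k]]         # ids of the chosen experts

print(product_key_topk(rng.standard_normal(d)))     # top_k ids out of 65,536, never scored individually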
fireship clone
always bet on owens
Shared expert isolation seems to be doing something similar to the value output in dueling networks: collecting the gradients for shared information so other subnets only need to account for the small tweaks. This means the shared information is learned faster, which in turn speeds up the learning of the tweaks
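Roughly, yeah. A tiny sketch of the shared-expert idea (DeepSeekMoE-style, all sizes invented): the always-on shared expert gets gradients from every token and absorbs the common features, so the routed experts only have to learn the residual tweaks.

import numpy as np

rng = np.random.default_rng(0)
d, n_routed, top_k = 32, 16, 2
shared = rng.standard_normal((d, d)) * 0.02             # always-on shared expert
routed = rng.standard_normal((n_routed, d, d)) * 0.02   # specialised experts
router = rng.standard_normal((d, n_routed)) * 0.02

def moe_block(x):
    out = x @ shared                          # shared expert sees every token, so it gets every gradient
    scores = x @ router
    chosen = np.argsort(scores)[-top_k:]
    gate = np.exp(scores[chosen]); gate /= gate.sum()
    for g, e in zip(gate, chosen):
        out += g * (x @ routed[e])            # routed experts only add small corrections on top
    return out

print(moe_block(rng.standard_normal(d)).shape)   # (32,)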
Oh yeah "acidentally" added something to a graph they intended to show. Not just builing hype to inflate the bubble of nothing that is this whole business?
Good news: Digitalism is killing capitalism. A novel perspective, first in the world! Where is capitalism going? Digitalism vs. Capitalism: The New Ecumenical World Order: The Dimensions of State in Digitalism by Veysel Batmaz is available for sale on Internet.
why all comments before this one bots???
It's possible that it's because YouTube shadow-banned all the real comments.
@@OnTheThirdDay But they can't ban bots...
@@cesarsantos854 I don't know why bots (and I mean, obvious bots) do not always get banned but half of my comments that I write out myself do.
Don't see any bots 3 hours after this comment. Gj YouTube 👍
Honestly I think your old moe video was better.
I agree. Definitely more understandable and this one would be harder to follow without seeing that first.
Temu fireship…oh I’ll watch it tho.
This channel seems to go into more detail and is more AI focused.
Tf are you talking about
Dude Wake Up, AI is just a Stupid Buzzword! There is no AI.
I've made my own transformer model before, as shitty as it was, it sorta worked. I agree that the term "AI" is misleading as it's not sentient or anything like that. It's just a really fancy autocomplete generator that understands surprisingly abstract and complex connections, relations, and context. But these models are real and aren't just a million Indians typing your essay for you. You can download models like Llama to try it out locally
I knew it, your content so mid bro has to redeem it
It's useless and wastes a lot of resources.
Using MoE is an admission of failure. It means that they are unable to make a "smarter" model and have to rely on arbitrary gimmicks.
Not really, they are testing if it makes models smarter without having to do much more work
I don't see it as a problem. if you think about it, all things in machine learning are just arbitrary gimmicks that happen to work out
@@a_soulspark As a human, if you understand N new disciplines you become N^2 more powerful, because you can apply ideas from one field to any other. This is why you want a monolith, not MoE. They chose MoE because they ran into a wall; they can't improve the fundamentals so they have to use ad-hoc measures just to boost the numbers.
RLHF seems gimmicky but it worked. MoE might seem gimmicky, but it works. Multimodality might seem gimmicky but it works.
@@zrakonthekrakon494 Nobody would even bother with MoE if they hadn't run into the wall. They did.
MOE onichan
We needed someone to say this, so thank you for sacrificing your dignity for us.
get ready to call your MOME as well now
bro fell off
Because there's no bot comments after 24m?
This channel is a nice copy of Fireship