Mamba with Mixture of Experts (MoE-Mamba)!!!

  • Published on 6 Sep 2024
  • From Abstract:
    State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based LLMs, including recent state-of-the-art open-source models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable, Transformer-like performance. Our model, MoE-Mamba, outperforms both Mamba and Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in 2.2x less training steps while preserving the inference performance gains of Mamba against the Transformer.
    (A rough code sketch of combining Mamba blocks with MoE layers is included after the links below.)
    🔗 Links 🔗
    MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
    arxiv.org/pdf/...
    ❤️ If you want to support the channel ❤️
    Support here:
    Patreon - / 1littlecoder
    Ko-Fi - ko-fi.com/1lit...
    🧭 Follow me on 🧭
    Twitter - / 1littlecoder
    Linkedin - / amrrs
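
A rough, self-contained sketch of the idea the abstract describes: alternating Mamba-style sequence-mixing blocks with sparse Mixture-of-Experts feed-forward layers. Everything here (class names, sizes, the stub SSM block, top-1 routing) is an illustrative assumption, not the authors' implementation.

```python
# Minimal MoE-Mamba-style sketch in PyTorch: interleave a Mamba-like block with
# a sparse MoE feed-forward layer. The SSM block is a stand-in (a real run would
# use an actual selective-SSM implementation such as the mamba_ssm package).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MambaBlockStub(nn.Module):
    """Placeholder for a selective state-space (Mamba) block: a gated,
    causal depthwise-conv mixer just to keep the sketch runnable."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=4,
                              padding=3, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq, d_model)
        res = x
        u, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        u = self.conv(u.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return res + self.out_proj(F.silu(u) * torch.sigmoid(gate))


class MoEFeedForward(nn.Module):
    """Switch-style sparse MoE: a router sends each token to one expert MLP."""
    def __init__(self, d_model: int, n_experts: int = 8, d_ff: int = 1024):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                           nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                          # x: (batch, seq, d_model)
        res = x
        flat = self.norm(x).reshape(-1, x.size(-1))     # route tokens independently
        probs = F.softmax(self.router(flat), dim=-1)
        top_p, top_idx = probs.max(dim=-1)              # top-1 routing
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            sel = top_idx == e
            if sel.any():
                out[sel] = expert(flat[sel]) * top_p[sel].unsqueeze(-1)
        return res + out.reshape_as(x)


class MoEMambaSketch(nn.Module):
    """Alternate Mamba blocks with MoE layers, the interleaving MoE-Mamba proposes."""
    def __init__(self, vocab: int = 32000, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList()
        for _ in range(n_layers):
            self.layers.append(MambaBlockStub(d_model))
            self.layers.append(MoEFeedForward(d_model))
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                     # tokens: (batch, seq)
        x = self.embed(tokens)
        for layer in self.layers:
            x = layer(x)
        return self.lm_head(x)


if __name__ == "__main__":
    logits = MoEMambaSketch()(torch.randint(0, 32000, (2, 16)))
    print(logits.shape)                            # torch.Size([2, 16, 32000])
```

The point of the interleaving is that only one small expert MLP runs per token, so parameter count grows with the number of experts while per-token compute stays close to that of plain Mamba.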

Comments • 30

  • @franklydoodle350
    @franklydoodle350 7 months ago

    MOE MAMBA LESGO BABYYYYY

  • @ickorling7328
    @ickorling7328 7 months ago +5

    With AMD's recent launch of the Ryzen AI platform (ROCm PyTorch support, from the AMD & Hugging Face collaboration), the older RDNA2, newer RDNA3, and new XDNA architecture GPUs and APUs (GPU plus CPU) can run practically all AI. And worldwide, many AMD APUs (laptops, mini PCs, etc.) and Snapdragon devices will be able to run AI with dedicated hardware that is already there and ties directly into system RAM. Little-known fact: APUs running Windows can dynamically reassign normal RAM to VRAM, no action required. So if you install 127 GB of system RAM on an APU PC, most of that is available for compute.
    Roadblocks eliminated: rather than being limited to training a 1M parameter model due to RAM constraints, widely available chips will be able to simply work longer for the same results. Imagine leaving a high-RAM mini PC with a modern RDNA3 AMD APU running for days or weeks as a background thing. Give it a backup power supply, etc. Come back to a 1B parameter base model!!! 🎉

    • @alx8439
      @alx8439 7 months ago

      I'm running Mixtral 8x7B 4-bit on my AMD AM5 APU in a mini PC via llama.cpp just fine, even without all that

    • @geobot9k
      @geobot9k 7 months ago

      @@alx8439 I'm curious to try it out; I'm running a UM790 with 64 GB. A couple of weeks ago I grabbed TinyLlama's weights and converted them to a 32-bit float GGUF, and it ran amazingly compared to the Q5_K_M GGUF

    • @geobot9k
      @geobot9k 7 months ago +1

      I'm gonna look into AMD's AI platform, good looking out bro

    • @ickorling7328
      @ickorling7328 7 months ago

      @@alx8439 Yes, but I'm speaking to a broader rule of thumb. AMD has been slowly releasing ROCm support for cards in batches. If it weren't for that, I think you'd need a special runtime that simulates CUDA for certain Transformer-based models. Hand-waving here; you may know better than I do, having done it. Thanks for confirming it works in a mini PC!

    • @alx8439
      @alx8439 7 months ago +1

      @@ickorling7328 No worries. The community of practitioners has already figured out that running models via ROCm on the integrated GPU of Ryzen APUs is actually slower than running them on the pure CPU. Let's see if these new drivers change anything, but recent Ryzens are already quite powerful, and DDR5 is fast enough for decent homelab inference speed

  • @blender_wiki
    @blender_wiki 7 months ago

    All you need is a Mamba mentality 😉

  • @anishbhanushali
    @anishbhanushali 7 months ago +1

    Nice explanation .... thanks 👍

    • @1littlecoder
      @1littlecoder 7 months ago

      Glad you liked it

  • @BooleanDisorder
    @BooleanDisorder 7 months ago

    MoEmba

  • @Nick_With_A_Stick
    @Nick_With_A_Stick 7 months ago +2

    But wouldn't a regular 6B Mamba also achieve the same 2.2x fewer training steps? Or is the MoE gate layer really providing that much more performance? I remember the LLaMA paper showed something similar for training loss (log perplexity) going down for 7B vs 13B; 13B went down significantly more.

    • @ickorling7328
      @ickorling7328 7 months ago

      Training data varies by orders of quality and usefulness, so good-quality data can be used to train smaller models for longer. There's a paper about the ideal re-use of data during training to match the amount of data to the number of parameters, and it deals with such things. My guess is these Mistral experts are effectively trained on slightly different data, and thus when you evaluate the output of the MoE it behaves smarter despite taking fewer steps to train, because one expert's shortcoming in training gets fixed by its friend, the other expert chosen to co-author the answer. Thus the MoE approach makes the AI smarter without smarter base models themselves. If I understand this correctly.

    • @Nick_With_A_Stick
      @Nick_With_A_Stick 7 months ago +1

      @@ickorling7328 I saw a tweet with a section of the Mixtral paper; there was a chart showing how good each expert got at certain tasks, and one didn't necessarily get significantly better than another, they kinda all just got better. Either way, the Mixtral paper was super hush-hush about the way they operated the gate layer.

    • @ickorling7328
      @ickorling7328 7 months ago +2

      @@Nick_With_A_Stick Well shoot, can't argue with that. But I see it as impressive that the Mamba researchers already effectively used MoE somehow, using just what Mistral published. Seems you understand what you're looking at better than I do, so cheers for the info! 🥂

    • @zenimus
      @zenimus 7 months ago

      @@ickorling7328 it's worth noting that having completely different datasets for each expert isn't the typical approach. Instead, specialists in MoE models differ mainly due to:
      - Starting points for optimization (initialization)
      - Picking random examples from the whole dataset
      - Tweaking the structure of the neural network
      - Adding constraints to avoid overfitting
      - Learning from various starting tasks
      - Building on already-learned skills (transfer learning)
      So, imagine a team working together towards solving complex problems. They don't necessarily start with entirely different pieces of information. Instead, they bring various perspectives and tools to tackle the challenge collectively. That's similar to what happens inside an MoE model...
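
For what it's worth, here is a rough, self-contained sketch of the kind of top-2 gate the Mixtral paper describes at a high level (score every expert, keep the two best per token, renormalize those two scores, and mix the two experts' outputs). The shapes, sizes, and layer choices are illustrative assumptions, not Mistral's code.

```python
# Top-2 gating sketch (Mixtral-style): each token is processed by its two
# highest-scoring experts, weighted by a softmax over those two router logits.
import torch
import torch.nn as nn
import torch.nn.functional as F


def top2_moe(x, router, experts):
    """x: (tokens, d_model); router: Linear(d_model, n_experts);
    experts: list of MLPs mapping d_model -> d_model."""
    logits = router(x)                               # (tokens, n_experts)
    top_vals, top_idx = logits.topk(2, dim=-1)       # best two experts per token
    weights = F.softmax(top_vals, dim=-1)            # renormalize over the pair
    out = torch.zeros_like(x)
    for slot in range(2):
        for e, expert in enumerate(experts):
            sel = top_idx[:, slot] == e              # tokens routed to expert e
            if sel.any():
                out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
    return out


torch.manual_seed(0)
d_model, n_experts, n_tokens = 64, 8, 16
router = nn.Linear(d_model, n_experts)
experts = [nn.Sequential(nn.Linear(d_model, 128), nn.GELU(),
                         nn.Linear(128, d_model)) for _ in range(n_experts)]
print(top2_moe(torch.randn(n_tokens, d_model), router, experts).shape)  # (16, 64)
```

Note that all experts here are built identically and would see the same token stream during training; it is random initialization plus the router's per-token assignments that let them drift apart, which is the point made above.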

  • @PerfectArmonic
    @PerfectArmonic 7 months ago +2

    When will I be able to train my own model locally, on my computer, with whatever information I like?

    • @srijonp4
      @srijonp4 7 months ago

      Wait 5 years or buy a $20,000 GPU

    • @kalilinux8682
      @kalilinux8682 7 months ago

      You can do it now, for something like a 1M parameter model. But you probably want to train at least a 1B model. To do that you'll need at least 3 trillion tokens for it to be any good, unless you use highly curated and synthetic data, in which case you can do it with just 500B tokens. To store 3T tokens you need around 18TB of storage, so to store 500B tokens you'd need about 3TB. That much storage costs a lot right now, so we'll have to wait at least 8 years for it to be viable on consumer hardware. There's another problem to overcome, though: compute.
      In theory you can train a model using a CPU and system RAM; a rough estimate would be around 3 months and 100GB to 200GB of RAM for a 1B parameter model with a context length of 2048 and 2 epochs, based on current hardware.
      Looking back at history, compared to 10 years ago single-core performance has improved 4x to 5x and multi-core performance around 8x to 9x. Assuming that keeps up, in 10 years we'd be able to train this model in about 9 days, which is still not practical to say the least. So let's say it'll take 20 years for CPUs to get powerful enough to train a model in just a few hours.
      But we don't really want to train a model on a CPU anyway, so let's talk about GPUs. The current high-end consumer card, the RTX 4090, can indeed train such a model; its compute is enough, but its VRAM can't sustain it. That's the other issue: VRAM. Current high-end consumer cards top out at 24GB, which is too small for our requirements. If consumer VRAM reaches the 100GB mark in the next 5 to 10 years, we'll be able to do it. You could try to get an A100 or H100 and do it today, but that would cost you your house, so that's out too. And the theoretical 100GB-VRAM consumer card that might arrive in 10 years won't be cheap either; yes, cheaper than a current A100 or H100, but only the well-off will be able to get it. So realistically you can't train these models locally on your computer for the next couple of decades. Although by then you won't even want to, since there will be better AI algorithms that are more efficient, demand fewer resources, and perform better than current LLMs.
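
As a quick sanity check on the storage numbers in this estimate: they are consistent with assuming roughly 6 bytes of raw text per token, which is an illustrative assumption rather than a measured figure.

```python
# Back-of-envelope storage check, assuming ~6 bytes of raw text per token.
BYTES_PER_TOKEN = 6

def storage_tb(n_tokens: float) -> float:
    """Terabytes (decimal) needed to store n_tokens of raw text."""
    return n_tokens * BYTES_PER_TOKEN / 1e12

print(storage_tb(3e12))    # 3T tokens   -> 18.0 TB
print(storage_tb(500e9))   # 500B tokens -> 3.0 TB
```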

    • @ickorling7328
      @ickorling7328 7 months ago

      With AMD's recent launch of the Ryzen AI platform for RDNA2/3 and the XDNA architecture GPUs and APUs (GPU plus CPU), many AMD APUs (laptops, mini PCs, etc.) will be able to run AI with dedicated hardware and tie directly into system RAM. Little-known fact: APUs running Windows can dynamically reassign normal RAM to VRAM, no action required. So if you install 127 GB of system RAM on an APU PC, most of that is available for compute.
      Roadblocks eliminated: rather than being limited to training a 1M parameter model due to RAM constraints, widely available chips will be able to simply work longer for the same results. Imagine leaving a high-RAM mini PC with a modern RDNA3 AMD APU running for days or weeks as a background thing. Give it a backup power supply, etc. Come back to a 1B parameter base model!!! 🎉

    • @kalilinux8682
      @kalilinux8682 7 months ago

      @@ickorling7328 We can run AI on our systems today. The difficulty is being able to train these large models

    • @kalilinux8682
      @kalilinux8682 7 months ago

      @sapienspace8814 Feels like the level of growth has stagnated.

  • @PaulSchwarzer-ou9sw
    @PaulSchwarzer-ou9sw 7 months ago +2

    🎉🎉

  • @yl95
    @yl95 7 months ago +1

    Exciting. Not quite an AGI or superintelligence technique, but I guess it still helps

  • @notsettlinganissue8161
    @notsettlinganissue8161 7 months ago

    … needs more investigation, I guess