The NEW Mixtral 8X7B Paper is GENIUS!!!

  • Published on Jan 8, 2024
  • Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.
    🔗 Links 🔗
    Paper link - arxiv.org/pdf/2401.04088.pdf
    ❤️ If you want to support the channel ❤️
    Support here:
    Patreon - / 1littlecoder
    Ko-Fi - ko-fi.com/1littlecoder
    🧭 Follow me on 🧭
    Twitter - / 1littlecoder
    Linkedin - / amrrs
  • Science & Technology

Comments • 34

  • @Alice_Fumo 6 months ago +5

    I gotta say, Mixtral has been an amazing model release. It managed to pull the company into the spotlight all on its own and make them relevant. There are many use cases for this model, especially since, via their API, it is even cheaper than GPT-3.5-Turbo.
    There have even been occasions where I used it over GPT-4. For what it does, the hardware requirements are really low. One can even get away with running the model entirely on CPU and get ~3 tokens/second on a Ryzen 5600X, since the amount of computation per inference step is not that high.
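    A minimal sketch of the CPU-only setup described above, assuming a 4-bit quantized Mixtral GGUF build and the llama-cpp-python bindings (the model filename and thread count here are placeholders, not anything the commenter specified):

        # pip install llama-cpp-python
        from llama_cpp import Llama

        # Hypothetical path to a 4-bit quantized Mixtral 8x7B GGUF file.
        llm = Llama(
            model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",
            n_ctx=4096,    # context window
            n_threads=6,   # e.g. the six cores of a Ryzen 5600X
        )

        out = llm("Explain sparse mixture of experts in one sentence.", max_tokens=128)
        print(out["choices"][0]["text"])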

  • @TommyJefferson1801 6 months ago +2

    To understand what each expert does, there is an interview between Noam Shazeer (the one who popularized this technique w.r.t. LLMs) and Yannic Kilcher on YouTube. There they discuss how each expert is not, say, a math expert; they're just experts in punctuation, verbs, etc.

  • @johnkost2514 6 months ago

    Not having to compute/traverse a broad (huge) generalized network is just clever thinking. A nice way to scale (up and out) and, more importantly, remain snappy and responsive. Combine it with GGML and CPU processing and you have a game changer, especially for private models.
    Bravo!

  • @ilianos 6 months ago +2

    Great vid as usual, thanks for walking us through the paper in such a short time! (I personally don't find the time to watch 30-minute vids, even when the topic is interesting.)
    Fun fact: at 15:14 your channel logo becomes your new face 😀

  • @srivatsa1193 6 months ago +5

    I think what is happening is that the latent space the transformer creates is perfectly and logically spread over the entire model in a specific orientation. I think this mainly because each expert only receives a subset of tokens during training and yet the model produces perfect output at the end.
    Individually these latent spaces (the params of individual experts) are not really experts in biology or any damn thing. It is just that each expert learns to coordinate with the other experts to create a latent representation within itself that is dense (as in capable of producing coherent English) on its own, without the others. That explains how only 13B params are sufficient for inference, and it also creates the illusion that the expert the router points to is actually an expert.
    Maybe I am completely wrong about this, but it's an interesting theory.

  • @MsJeffHunter 6 months ago +1

    Happy Prompting!

  • @rodvik 6 months ago

    Thank you! Your channel is fantastic.

  • @forcanadaru 6 months ago

    Awesome review!

  • @SR-zi1pw 6 months ago

    Amazing ❤

  • @KeXous 6 months ago

    It reminds me of an old TV ad for a cleanser, where even the old cleanser starts using the new one because it is X times cheaper.

  • @JG27Korny 6 months ago

    It is not surprising that they do not see patterns in the expert distribution, since the routing decision is made token by token. If there were a parameter that grouped a batch of tokens, so that the decision of which expert to choose were based on the context of that batch, then patterns would emerge. So it would be nice if we could set that context window with a parameter.
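    A toy sketch of the idea above (purely illustrative; this is not how Mixtral routes): standard routing scores each token's hidden state on its own, while the proposed variant pools hidden states over a window so every token in that window shares one expert choice. Function names and shapes are assumptions.

        import torch

        def route_per_token(h, w_gate, k=2):
            # h: (seq_len, d_model) hidden states; w_gate: (d_model, n_experts)
            logits = h @ w_gate                          # one routing decision per token
            return torch.topk(logits, k, dim=-1).indices

        def route_per_window(h, w_gate, window=8, k=2):
            # Hypothetical variant: pool hidden states over a window, so every
            # token in the window is sent to the same top-k experts.
            seq_len, d = h.shape
            pad = (-seq_len) % window
            if pad:
                h = torch.cat([h, h.new_zeros(pad, d)])
            pooled = h.reshape(-1, window, d).mean(dim=1)    # (n_windows, d_model)
            idx = torch.topk(pooled @ w_gate, k, dim=-1).indices
            return idx.repeat_interleave(window, dim=0)[:seq_len]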

  • @handsanitizer2457 6 months ago

    Can you make a new fine-tuning video? Please cover prepping the data.

  • @Romathefirst 6 months ago +1

    Do you think OpenAI will use a similar approach for their next model?

    • @1littlecoder 6 months ago +4

      There have been multiple rumors that the currently hosted GPT-4 is actually an MoE!

  • @mshonle 6 months ago +1

    So, how does the router get trained?

    • @mysticshadow4561 6 months ago

      The router itself is a small dense neural net; a softmax activation is applied at its final layer, the top-2 highest-probability experts are taken, and the token is passed to those two experts.
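      A rough sketch of that description for a single token (simplified; names and shapes are illustrative, not the reference implementation): the gate is one linear layer, softmax gives per-expert probabilities, and only the two selected experts are evaluated, with their outputs mixed by the renormalized gate weights.

          import torch
          import torch.nn.functional as F

          def moe_layer(x, gate_weight, experts, k=2):
              # x: (d_model,) one token's hidden state
              # gate_weight: (n_experts, d_model) router weights
              # experts: list of n_experts feed-forward callables
              logits = gate_weight @ x              # (n_experts,)
              probs = F.softmax(logits, dim=-1)     # softmax at the router's output
              top_p, top_i = torch.topk(probs, k)   # keep the 2 highest-probability experts
              top_p = top_p / top_p.sum()           # renormalize over the chosen experts
              # Only the selected experts run; the other six are skipped entirely.
              return sum(p * experts[i](x) for p, i in zip(top_p, top_i))

      With 8 experts and k=2, only two feed-forward blocks run per token, which is why Mixtral touches roughly 13B of its 47B parameters at inference time.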

  • @jmirodg7094 6 months ago

    I"m really surprised that it can be an efficient way... but it apparently work

  • @IvarDaigon 6 months ago +1

    I've been using Mixtral 8x7B for real-world coding and I have to say I am not very impressed with its performance compared to GPT-3.5. This may be due to the training set they used, as it doesn't seem to have the same level of knowledge about commonly used C# libraries as GPT-3.5 does.
    Still early days yet, but so far it's not living up to the hype generated by the synthetic benchmark results. Then again it is 1/4 the size of GPT-3.5, so I'd be surprised if it did.

    • @theaugur1373 6 months ago

      I like the Hermes models better for code than the raw Mistral, since Hermes is finetuned for code.

  • @SR-zi1pw 6 months ago

    100% RAG; they use something similar to Claude-style prompting.

  • @KevinKreger 6 months ago

    'I love Abdul' to which all the experts replied "❤"

  • @senju2024 6 months ago

    This is very interesting. I'd argue that the so-called "experts" are not really experts in anything, from what I read. Each one is just a block of knowledge that talks to the other blocks and reports to the router. When the router sees a token, it must do some type of lookup to determine which expert block it goes to. Not sure if the so-called expert blocks update and report back to the router dynamically. I need to reread the paper, but I want to know more about how the router actually works and its decision making.

  • @yl95 6 months ago +2

    In the future there'll be two branches of AI 🤖: Mistral's technique will become the super-ML branch, and OpenAI's Q* will be the AGI / superintelligence branch.

    • @Custodian123 6 months ago +1

      Until Q* is confirmed and released by OpenAI, it's a nothingburger.

    • @yl95 6 months ago +2

      @Custodian123 It's clear from Ilya's past speeches on YouTube that Q*, or at least something similar, is key to AGI or superintelligence. He had this idea years ago.

  • @PerfectArmonic 6 months ago +7

    As long as we, the mortals, don't have any means to train our own models (to train a model these days one needs 4000-5000 hours on 500 NVIDIA A100 GPUs... a setup available only at the corporate level), all these papers are meaningless... they contain a bunch of info which doesn't help at all... At this point the best option at hand for a "mortal" is to pay 20 dollars per month and benefit from all the multimodal behaviors offered by ChatGPT. Things will become different when every person is able to train their own model in 20-30 hours on a regular laptop, desktop, or iMac, even an old one running only on CPU, probably helped by a public network of cloud GPUs...

    • @xxxNERIxxx1994 6 months ago +5

      Chill, in one year open source has matched GPT-3.5 with Mixtral, and with the next architecture coming, and with China getting really competitive in this, I don't believe the giant orgs that must carefully align their models (governments must too) will keep their lead. It's really looking promising. A year ago there was GPT-3.5 and everything else was dogshit; look at things now. My GTX 1660 Super runs a 4-bit Mistral model at 30 tokens per second, and it makes sense when talking to me. I believe open source is much better than I thought.

    • @redone823 6 months ago +3

      There was a time when a 1 TB SSD cost $20k USD as well, so I don't mind hearing this stuff now.

    • @clashgamers4072 6 months ago +1

      I don't think fine-tuning to get a custom model is super necessary for a lot of use cases. The things that will make a big difference are inference cost/time, context window length, better reasoning, and less hallucination. Context window size is the biggest current limitation of the transformer architecture IMO; once that is solved (Mamba is a good candidate), personalized chat over bigger docs or the contents of several web pages at once will make it so much better.

    • @random-xl3zm 6 months ago

      That will probably be fixed by Mojo! Bam!

    • @rakeshsahni_ 6 months ago +1

      There is an alternative: do it smartly using RAG, if we have properly documented data. I think it's far better than fine-tuning in terms of the computational cost to train, and maybe even accuracy.
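      A bare-bones sketch of the retrieve-then-prompt pattern being suggested. Here `embed` and `generate` are hypothetical placeholders for whatever embedding model and Mixtral endpoint you have; the point is only that retrieval over your documentation replaces fine-tuning.

          import numpy as np

          def answer_with_rag(question, doc_chunks, embed, generate, top_n=3):
              # Score each documentation chunk against the question and keep the best few.
              q = np.asarray(embed(question))
              scores = [float(np.dot(q, np.asarray(embed(c)))) for c in doc_chunks]
              ranked = [c for _, c in sorted(zip(scores, doc_chunks), reverse=True)]
              context = "\n\n".join(ranked[:top_n])
              # Ask the model to answer from the retrieved context only.
              prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
              return generate(prompt)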

  • @JazevoAudiosurf 6 months ago

    only is spelled "ownly" and not "onely", just sayin