Mixture of Models (MoM) - SHOCKING Results on Hard LLM Problems!

  • Published Sep 26, 2024

Comments • 114

  • @tylerhatch8962
    @tylerhatch8962 4 months ago +18

    I loved the "crucial insights" into gravity 😂

    • @AllAboutAI
      @AllAboutAI  4 months ago +2

      😂😂

  • @shawnfromportland
    @shawnfromportland 4 months ago +20

    The holy grail problem to prompt it with is one in which each contributor gives different advice, and the decider model produces output that's different from all the contributors and different from what the same model would have produced alone without MoM.

    • @mwdcodeninja
      @mwdcodeninja 4 months ago +3

      Give each actor its own motivation. Like a murder mystery. The king is the detective. Have it play Clue.

    • @techpiller2558
      @techpiller2558 4 months ago

      Different personalities would be good.

    • @EinarPetersen
      @EinarPetersen 4 months ago

      I assume this is all on the GitHub repo to play with

    • @EinarPetersen
      @EinarPetersen 4 months ago

      Wrong thread

  • @YT_Jx
    @YT_Jx 4 months ago +10

    Kris, That’s an extraordinary project with surprising results. Thank you for sharing your talents.

  • @trud811
    @trud811 4 months ago +10

    Good, but as noted in the comments, it would be interesting to see the results for "plain" GPT-4 and Claude on the same set of problems.

    • @AllAboutAI
      @AllAboutAI  4 months ago +5

      yeah, kinda regret not including a base test

    • @jamesjonnes
      @jamesjonnes 3 months ago

      @AllAboutAI You can add another video. I'm very interested.

  • @DarrenAllatt
    @DarrenAllatt 4 months ago +7

    You could combine Tree of Thought with this Mixture of Models:
    instead of having one model pretend to be three different experts, you use a Tree of Thought process across these different architectures to generate a more accurate output across the entire task.

    • @AllAboutAI
      @AllAboutAI  4 months ago

      great idea :) tnx

  • @justtiredthings
    @justtiredthings 4 months ago +2

    Cool experiment! I'd like to see a sort of "deliberative democracy" architecture--where the models have a chance to discuss back-and-forth or to see one another's initial answers before providing a final answer and then voting
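A rough editor's sketch of the "deliberative democracy" idea above. The stub models are purely hypothetical stand-ins for real LLM calls: `make_model` simply defects to the peer majority, which a real model would do by actually reasoning over the drafts.

```python
def make_model(initial):
    """Hypothetical stub model: drafts `initial`, then adopts whatever
    answer the majority of its peers drafted (a real LLM would reason
    over the peer drafts instead)."""
    def model(question, peers):
        if peers:
            return max(sorted(set(peers)), key=peers.count)
        return initial
    return model

def deliberate(question, models, rounds=1):
    """Each model drafts an answer, sees the others' drafts, revises,
    then a plurality vote picks the winner."""
    drafts = {n: m(question, []) for n, m in models.items()}
    for _ in range(rounds):
        drafts = {n: m(question, [a for p, a in drafts.items() if p != n])
                  for n, m in models.items()}
    votes = list(drafts.values())
    return max(sorted(set(votes)), key=votes.count)

models = {f"m{i}": make_model("A") for i in range(3)}
models["m3"] = make_model("B")          # one dissenting model
winner = deliberate("Where is the marble?", models)
# the dissenter converges after seeing the other drafts
```

The design choice worth noting: the discussion round happens before the vote, so a lone outlier can change its mind instead of just being outvoted.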

  • @tubaguy0
    @tubaguy0 4 months ago +5

    This was really interesting; I really like your thought process. One suggestion I have to validate your results is to run each of your tests 3-5 times and see how consistent each architecture is, since you're going to get different outputs each time. I also didn't see whether you were controlling model temperature in your setup, and I wonder if setting temp 0 for at least one participating model as a sanity check would improve output (another opportunity to run a batch test battery).
    This cuts really close to what I've been imagining for agent-framework cooperative problem solving, so thank you again for showing us how to put ideas into code.

    • @AllAboutAI
      @AllAboutAI  4 months ago +2

      yeah good idea, thnx :)
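The repeat-run consistency check suggested above could look roughly like this; `call_model` is a hypothetical stub, and a real wrapper would pass `temperature` through to the provider's API.

```python
from collections import Counter

def call_model(prompt: str, temperature: float = 0.0) -> str:
    # Hypothetical stub; a real wrapper would call the model API here.
    # At temperature 0 the output should be (near-)deterministic.
    return "the marble is on the table"

def consistency(prompt: str, runs: int = 5, temperature: float = 0.0):
    """Run the same test several times and report the modal answer
    and how often it appeared."""
    outputs = [call_model(prompt, temperature) for _ in range(runs)]
    top, n = Counter(outputs).most_common(1)[0]
    return top, n / runs

answer, agreement = consistency("Where is the marble?", runs=5)
# agreement of 1.0 means the architecture answered identically every run
```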

  • @USBEN.
    @USBEN. 4 months ago +2

    I love these experiments soo much.

    • @AllAboutAI
      @AllAboutAI  4 months ago

      thnx :)

  • @LeeBrenton
    @LeeBrenton 4 months ago +2

    that's epic man! well done.

  • @DesignDesigns
    @DesignDesigns 4 months ago +3

    Excellent... It'd be great if the models could be run in parallel; that way, less time would be needed.

    • @AllAboutAI
      @AllAboutAI  4 months ago

      great idea, will def check into this!
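The parallel-execution suggestion maps naturally onto `asyncio`. Here the per-model calls are hypothetical stubs (a real version would wrap the OpenAI/Anthropic/Ollama clients), but the fan-out pattern is the same.

```python
import asyncio

async def ask_model(name: str, prompt: str) -> str:
    # Hypothetical stub for one model's API call.
    await asyncio.sleep(0.1)              # simulate network latency
    return f"{name}: answer to {prompt!r}"

async def ask_all(models, prompt):
    """Fan the prompt out to all contributor models concurrently."""
    return await asyncio.gather(*(ask_model(m, prompt) for m in models))

answers = asyncio.run(ask_all(["gpt-4", "claude-3-opus", "llama3:70b"],
                              "marble question"))
# wall time is roughly one call's latency instead of the sum of all three
```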

  • @footube3
    @footube3 4 months ago +2

    Amazing idea, and excellent result. Well done!!

    • @AllAboutAI
      @AllAboutAI  4 months ago +1

      thnx :)

  • @Pregidth
    @Pregidth 4 months ago +2

    Excellent idea! I understand Opus and OpenAI as the overseer, but if local LLMs could prove to be the heroes here, that would be superb!

    • @AllAboutAI
      @AllAboutAI  4 months ago

      true!

  • @prolamer7
    @prolamer7 4 months ago +4

    Very interesting that you are not only thinking about such solutions but actually programming and testing them. Which is really great!
    What about this architecture:
    King architecture, BUT with a twist: you execute the question batch with the king being first GPT-4, then Opus, then Llama 3 70B,
    and you then choose only answers with a majority of votes, i.e. if GPT-4 and Opus agree.

    • @tubaguy0
      @tubaguy0 4 months ago

      And if they don’t agree, loop back and resolve the differences via Socratic method until there’s consensus on either a solution or a recommended plan of action to improve the input for the models to try again.
      I wonder if the local models could reason out the problem super cheaply if given enough reflection and conversation iterations, or if the general models are just too effective to leave out. Probably use case dependent.

    • @AllAboutAI
      @AllAboutAI  4 months ago +1

      cool idea, noted :) tnx

    • @prolamer7
      @prolamer7 4 months ago

      @@AllAboutAI :-)
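The rotating-king idea in this thread could be wired up like this; `run_king_arch` is a hypothetical stand-in for one full King-architecture pass with the given model as decider, returning canned answers for illustration.

```python
from collections import Counter

def run_king_arch(king: str, question: str) -> str:
    # Hypothetical stub: a real version would orchestrate the contributor
    # models and return this king's final answer.
    canned = {"gpt-4": "B", "claude-3-opus": "B", "llama3:70b": "C"}
    return canned[king]

def majority_answer(question, kings=("gpt-4", "claude-3-opus", "llama3:70b")):
    """Run the batch once per king; accept an answer only when a
    majority of kings agree on it."""
    answers = [run_king_arch(k, question) for k in kings]
    top, n = Counter(answers).most_common(1)[0]
    return top if n > len(kings) // 2 else None   # None = no consensus

# here GPT-4 and Opus agree, so "B" is accepted
```

Returning `None` on disagreement is the hook for tubaguy0's follow-up: that is where a Socratic resolution loop would kick in.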

  • @MrSuntask
    @MrSuntask 4 months ago +1

    Great idea! Like your experiments. Keep up the great work!

    • @AllAboutAI
      @AllAboutAI  4 months ago +1

      thnx mate :)

  • @stormyRust
    @stormyRust 4 months ago +2

    Really interesting! Like others mentioned, even with the added reliability of a MoM architecture, it may not be worth the computing power and additional context length, and GPT-4 might be strong enough to figure most of these out on its own. I think a good compromise would be using fewer LLMs in the hierarchy.

    • @AllAboutAI
      @AllAboutAI  4 months ago +1

      yeah, or you could just use open-source models with llama.cpp or ollama I guess

  • @john849ww
    @john849ww 4 months ago

    Thanks for sharing your research! I had an idea about pairing a large model with a tiny model, where the tiny model is asked a question and the large model evaluates the response. The large model identifies any issues and iteratively adjusts the prompt to help the tiny model get the right answer. Not sure if it would work in practice, but if it did maybe a tiny (and cheap) model could be made effective to solve more problems than it otherwise could have?
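The large-coaches-tiny loop described above might look like this. Both models are hypothetical stubs, so the point is only the control flow: evaluate, revise the prompt, retry.

```python
def tiny_model(prompt: str) -> str:
    # Hypothetical cheap model: only answers correctly when the prompt
    # contains a reasoning hint.
    return "12" if "step by step" in prompt else "7"

def big_model_review(question: str, answer: str):
    # Hypothetical evaluator: returns (ok, revised_prompt).
    if answer == "12":
        return True, None
    return False, question + " Think step by step."

def coach(question: str, max_rounds: int = 3) -> str:
    """Let the large model iteratively refine the tiny model's prompt."""
    prompt = question
    for _ in range(max_rounds):
        answer = tiny_model(prompt)
        ok, revised = big_model_review(question, answer)
        if ok:
            return answer
        prompt = revised
    return answer   # best effort after max_rounds
```

The economics depend on the evaluator being called far less often than the tiny model would be in production; this sketch just shows the loop shape.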

  • @HistoryIsAbsurd
    @HistoryIsAbsurd 4 months ago

    Damn, this was a beautiful experiment! I can really see this being used effectively on a grand scale. Even think of the little LLMs: if those did all the deciding and the larger one was Llama 3 or something, we could run it 100% locally. This could be really powerful for those with less GPU capability and low funding, if they run the models sequentially.

  • @jamesroth7852
    @jamesroth7852 4 months ago +2

    King is King!

  • @MuplerHi
    @MuplerHi 4 months ago +4

    Looks like the AI jury is about to be seated in a courtroom. And voilà!

  • @frankjohannessen6383
    @frankjohannessen6383 4 months ago +5

    The democratic arch should probably have a step where one strong model decides which answers are identical and removes all but one of them before voting. Otherwise you get a situation where there are, say, 7 correct answers and one wrong one; the wrong answer then has an unfair advantage, since the models spread their votes across the 7 equivalent correct answers while the wrong answer's votes stay concentrated.
    Also, solutions like this are where Groq with 300 tokens/s really shines.

    • @AllAboutAI
      @AllAboutAI  4 months ago

      great idea, will try this. tnx
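The deduplicate-before-voting step could be sketched like this. Here the strong model's "are these the same answer?" judgment is stubbed by a normalizing key function; a real system would ask an LLM to group equivalent candidates.

```python
from collections import defaultdict

def dedupe_candidates(answers, same_key):
    """Group equivalent answers and keep one representative per group,
    so near-duplicate correct answers can't split the vote."""
    groups = defaultdict(list)
    for a in answers:
        groups[same_key(a)].append(a)
    reps = [g[0] for g in groups.values()]
    sizes = {k: len(g) for k, g in groups.items()}
    return reps, sizes

# Stand-in for the strong model's equivalence judgment:
key = lambda s: s.strip().lower()
candidates, sizes = dedupe_candidates(
    ["3 apples", "3 Apples", "3 apples ", "5 apples"], key)
# candidates collapse to two options, avoiding the 7-vs-1 vote-splitting
```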

  • @elevated_souls
    @elevated_souls 4 months ago +1

    Seems like the second and third methods both kind of take the best answer of each model, which makes the result kind of average. The first architecture, on the other hand, takes all answers into consideration, adds all the insights, and creates the best answer from them all. So in a sense it keeps only the good stuff and excels the most. This is the way.

    • @justtiredthings
      @justtiredthings 4 months ago

      You could prob achieve that with a more deliberative democratic architecture

  • @youtuberschannel12
    @youtuberschannel12 4 months ago +4

    You should've added a baseline with just GPT-4 Turbo to compare.

    • @AllAboutAI
      @AllAboutAI  4 months ago +2

      100%

  • @VR_Wizard
    @VR_Wizard 4 months ago +3

    Next project is a Mo(MoM) model where we compare the results of all 3 MoM systems with these systems again. Think of recursively going down the rabbit hole.

    • @YT_Jx
      @YT_Jx 4 months ago

      lol

  • @jiyuhen
    @jiyuhen 4 months ago +4

    🤣😂 I just have to say it: monarchy seems to be the solution for most problems.
    Jokes aside, this experiment turned out really great; it's fascinating how the models behave in the different setups and interactions.

    • @someverycool4552
      @someverycool4552 4 months ago +1

      Monarchy would be best only if the king is also by far above others in mental capacity, in all areas.

    • @jiyuhen
      @jiyuhen 4 months ago

      @@someverycool4552 Agreed

    • @youtuberschannel12
      @youtuberschannel12 4 months ago +1

      @jiyuhen His demo is not really a monarchy. In real life a king is able to have his own ideas or solutions; he has the freedom to completely ignore proposals put forth to him. But in this demo the king doesn't have any of those freedoms and powers: he has to judge and decide among what others put forth to him. Let the king have his own ideas and the freedoms I mentioned earlier, and I bet the results will be different.

  • @bennie_pie
    @bennie_pie 4 months ago

    This was a fantastic video, thank you. It brought me up to speed on the various models, and the structure of their hierarchies was fascinating. I haven't dabbled in agents for a while, so it was nice to be surprised by the progress in this area vs. my early attempts, which spent tokens and produced nonsense. Question: do you have any teams of agents set up doing actual real-world useful stuff? (I'm sure I'll find out by watching more.) Thanks for the video.

  • @JohnDoe-zx8bu
    @JohnDoe-zx8bu 4 months ago +1

    I think it would be better to estimate performance with some simple model at the output level.
    GPT-4 can handle all of the provided problems without the previous models' solutions.

    • @AllAboutAI
      @AllAboutAI  4 months ago

      yeah true, will try diff configs for this setup

  • @Dabes88
    @Dabes88 4 months ago

    Makes sense; we're theoretically creating something between fast and slow thinking. Maybe something like Q* needs to use Claude 3.

  • @settlece
    @settlece 4 months ago +1

    How exciting, what an awesome video. Thank you!

    • @AllAboutAI
      @AllAboutAI  4 months ago

      thnx :)

  • @igord7272
    @igord7272 4 months ago

    Great content!

  • @tomtyiu
    @tomtyiu 4 months ago +1

    wow, you have an innovative mind. 🤓

  • @Not_A_Robot_LOL
    @Not_A_Robot_LOL 4 months ago

    22:38 it only got one vote because each model provided 10 unique answers. This shows that the democracy model isn’t a good architecture for all kinds of problems (specifically, the ones where the answer is determinate).

  • @EinarPetersen
    @EinarPetersen 4 months ago

    I would probably do the tally locally using an optimized solution: a simple Python script, bypassing any AI. That way you eliminate the models' weakness at math and use an architecture specialized for solving the math problem of counting 😊
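That local tally is indeed only a few lines of Python; a minimal sketch (the vote strings are placeholders):

```python
from collections import Counter

def tally_votes(votes):
    """Count votes deterministically, with trivial normalization,
    instead of asking a model to do arithmetic."""
    counts = Counter(v.strip().lower() for v in votes)
    winner, _ = counts.most_common(1)[0]
    return winner, dict(counts)

winner, counts = tally_votes(["Answer A", "answer a", "Answer B"])
# "answer a" wins 2-1 regardless of casing or stray whitespace
```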

  • @Copa20777
    @Copa20777 4 months ago +1

    I was following the explanation in the beginning about the different architectures very well, until he opened the code 😂

    • @AllAboutAI
      @AllAboutAI  4 months ago

      haha, it's not that bad

  • @yngeneer
    @yngeneer 4 months ago +1

    lovely

  • @jekkleegrace
    @jekkleegrace 4 months ago +4

    That's great! Can you please give us paying users access to the GitHub?

  • @abhishekrp.ai2002
    @abhishekrp.ai2002 2 months ago

    Quite impressed by these MoM/MoE architectures; however, I'm not satisfied with the LeetCode hard problem used in the test set. Testing the LLM-generated response on just 3 test cases is not a valid determinant of the architecture's capabilities. I could see the solution itself is O(N^4) time complexity, which is essentially brute force.
    A better benchmark would be testing how well it can optimize the code and handle diverse test suites, IMO.

  • @EduardoJGaido
    @EduardoJGaido 4 months ago

    Hello! Thank you for a great video. I ask you or the community about a hard problem I want to solve: I want to make a chatbot using a local LLM with RAG (keep reading, please!), BUT I want to use it for my business, so the clients of my physiotherapy clinic can ask it questions and it responds with ONLY the information it was fed. Otherwise, it says "Oh, I don't know, please wait" so the secretary can answer instead. Just with that, I would be happy. I have a lot of FAQs listed with answers (in a JSON-friendly format). I can't find this answer anywhere. If you have any information, it would be appreciated. Cheers from Argentina.

  • @smilebig3884
    @smilebig3884 4 months ago

    The API call cost will go through the roof.

  • @EinarPetersen
    @EinarPetersen 4 months ago

    So this is all in the GitHub repo to play with I assume?

  • @marcuscronan6658
    @marcuscronan6658 4 months ago +2

    How much did each "tournament" cost in terms of tokens?

    • @AllAboutAI
      @AllAboutAI  4 months ago +2

      since 8 models were local on Ollama, not much =)

  • @immortalityIMT
    @immortalityIMT 4 months ago

    You feed the answer into the next model and ask it whether it's true or false.

  • @haileycollet4147
    @haileycollet4147 4 months ago +1

    All seems rather pointless without comparing to the results by gpt4/opus on their own ...

    • @AllAboutAI
      @AllAboutAI  4 months ago +2

      yeah that was my regret, should have done a "basetest" to compare

  • @michaeltse321
    @michaeltse321 4 months ago +1

    Means that autocratic systems are better than democratic ones - lol

    • @AllAboutAI
      @AllAboutAI  4 months ago

      haha lol

  • @pensiveintrovert4318
    @pensiveintrovert4318 4 months ago +6

    The marble problem proves nothing. Every model vendor cheats on some problems and sticks these problems into the training datasets. Same for other problems. You need to use new problems every time. Otherwise these reviews are b.s.

  • @finalfan321
    @finalfan321 4 months ago

    You need to go deeper into their reasoning, because some of the "wrong" answers are actually right.

  • @squiddymute
    @squiddymute 4 months ago +2

    Are you sure GPT-4 wouldn't just solve all these problems without the other models?

    • @MarkoTManninen
      @MarkoTManninen 4 months ago +2

      I was thinking the same, so the problems should be more demanding. But it was a proof of concept; very nice work. I planned something similar with websocket agents last summer, but it became too expensive to test as a hobby. Now is the right time, because of Ollama!

    • @YT_Jx
      @YT_Jx 4 months ago +3

      That’s a great question. In the past at least, ChatGPT4 failed the marble and the apple questions.

    • @justincase4812
      @justincase4812 4 months ago

      Do you cut wood with a hammer?

    • @jichaelmorgan3796
      @jichaelmorgan3796 4 months ago

      Different models have slightly different training, context, and perspective, so this creates a broader perspective to approach the problem at hand. This will not be practical for everyday prompts.

    • @DarrenAllatt
      @DarrenAllatt 4 months ago

      @jichaelmorgan3796 I disagree on the practicality point. The way it's set up in this video, from a test-the-methodology perspective, yes, it's not practical.
      But the idea behind it is very practical, because all of this can be done in code so the user doesn't see everything that goes on behind the scenes and just gets the final output.

  • @TheHistoryCode125
    @TheHistoryCode125 4 months ago +1

    This video is a whole lot of fluff for a very simple concept: combining outputs from multiple LLMs to get better answers. The creator makes it seem revolutionary with fancy names like "King," "Duopoly," and "Democracy," but it's just basic ensemble methods. They feed the same problem to a bunch of LLMs (including weaker ones like Llama), then use those answers as context for a stronger LLM (GPT-4) to make the final decision. The "Duopoly" is just GPT-4 and Claude arguing about the best answer, and the "Democracy" lets all the LLMs "vote" on the best answer. He doesn't go into detail about how the voting works or what constitutes a good vote. The results are a mixed bag, sometimes better, sometimes worse than just using GPT-4 alone. The real takeaway is that ensemble methods can be useful, but this video overcomplicates it with theatrical names and presentation.

  • @aaronpainting5643
    @aaronpainting5643 4 months ago +1

    Why is nobody talking about "Where's Daddy?" or "Lavender" or "Gospel"? these are the most advanced AI systems and are already committing mass murder!

    • @tellesu
      @tellesu 4 months ago +1

      Why do you bots make the same stupid fearmongering comment on every post?

    • @aaronpainting5643
      @aaronpainting5643 4 months ago +1

      @tellesu I'm not a bot; I just want to hear smart people like this guy break it down. More people need to be talking about this; it is AI development toward Armageddon.

    • @illuminated2438
      @illuminated2438 4 months ago +1

      @@tellesu that's not fear mongering it is simple fact. A certain tribal state is using advanced AI to engage in the mass murder of civilians without any human oversight.
      The where is daddy program literally watches targets leave so-called military locations and orders an attack only after they are at home with their family. It is deliberate mass murder of civilians, and it is powered by AI.

  • @william5931
    @william5931 4 months ago

    your mom

    • @YT_Jx
      @YT_Jx 4 months ago

      @william5931, your dad. :)

  • @hypercoder-gaming
    @hypercoder-gaming 4 months ago +5

    Perhaps it would be even more powerful to make a Chain of Models/Chain of Experts architecture that has a line of models and each model builds upon the previous models' responses. Making a Network of Models would be even better, where you have layers of multiple models and each next layer gets the previous layer's outputs and ultimately combines everything in the end to form an answer to your question.

    • @AllAboutAI
      @AllAboutAI  4 months ago

      very interesting, noted, tnx:)

    • @c.d.osajotiamaraca3382
      @c.d.osajotiamaraca3382 4 months ago

      If just one agent is wrong the entire work product crumbles. Wouldn't you need a hallucination check, or some kind of redundancy at each stage?

    • @hypercoder-gaming
      @hypercoder-gaming 4 months ago

      @@c.d.osajotiamaraca3382 Yeah, I didn't think of that. Maybe in the system prompt, it could say there is a chance that the output of the previous model(s) is incorrect and to consider that before making a decision. Or between each layer it could have an internet search if applicable. Not sure how I would program that as I don't know how to do HTTP requests but it seems possible.
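The layered "Network of Models" idea from this thread could be prototyped like this; `make_model` is a hypothetical stub, and the interesting part is only how each layer's outputs feed the next layer before a final combiner merges them.

```python
# Hypothetical layered "Network of Models": each layer's models see the
# previous layer's outputs; a final combiner merges the last layer.
def make_model(name):
    def model(question, prior):
        # Stub: a real model call would condition on the `prior` answers
        # (and, per the reply above, be told they may be incorrect).
        return f"{name}({len(prior)} prior)"
    return model

def run_network(question, layers, combine):
    prior = []
    for layer in layers:
        prior = [m(question, prior) for m in layer]   # feed outputs forward
    return combine(prior)

layers = [[make_model("a"), make_model("b")], [make_model("c")]]
final = run_network("q", layers, combine=" | ".join)
```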

  • @DefaultFlame
    @DefaultFlame 4 months ago +1

    A note: the marble problem has actually been solved in Matthew Berman's testing by WizardLM-70B-V1.0-GPTQ, Mistral 7b OpenOrca, Mixtral, Mistral Next, Mistral Large, and LLaMA 3 70b Groq.
    That is however only 6 of the 32 models he's tested. I will note that in my own testing Reka Core passed the marble test while it failed when Matthew was testing it.

  • @nftawes2787
    @nftawes2787 4 months ago +2

    Add a gamifying token layer to learn which models' votes should be heeded/ignored in which situations

    • @AllAboutAI
      @AllAboutAI  4 months ago +1

      yeah cool idea, noted
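A minimal sketch of the learn-whose-vote-to-heed idea, assuming some external feedback signal tells you which answer turned out to be correct; the names and the additive update rule are invented for illustration.

```python
def weighted_winner(votes, weights):
    """Tally votes, weighting each model by its learned reliability."""
    tally = {}
    for model, vote in votes.items():
        tally[vote] = tally.get(vote, 0.0) + weights.get(model, 1.0)
    return max(tally, key=tally.get)

def update_weights(weights, votes, correct, lr=0.1):
    """Reward models whose vote matched the verified answer; penalize the rest."""
    for model, vote in votes.items():
        delta = lr if vote == correct else -lr
        weights[model] = max(0.0, weights[model] + delta)
    return weights

weights = {"gpt-4": 1.0, "llama3": 1.0, "phi3": 1.0}
votes = {"gpt-4": "X", "llama3": "Y", "phi3": "Y"}
first = weighted_winner(votes, weights)      # equal weights: majority "Y" wins
for _ in range(20):                          # repeated feedback that "X" was right
    update_weights(weights, votes, correct="X")
second = weighted_winner(votes, weights)     # gpt-4's vote now dominates
```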

  • @GetzAI
    @GetzAI 4 months ago +2

    Great experiment!!
    I wonder if, with reasoning added to their answers, the democracy would vote differently? Or if the answers could be discussed before voting. Having a bunch of AIs/people voting on bad assumptions is never good. "Before voting, discuss each answer for a) meeting all the requirements and b) being the most accurate."

    • @AllAboutAI
      @AllAboutAI  4 months ago

      interesting =) noted for a future iteration

  • @OumarDicko-c5i
    @OumarDicko-c5i 4 months ago +1

    Wow, this must be so expensive lol, but great

  • @skylineuk1485
    @skylineuk1485 4 months ago

    I came to the same conclusion that arbiter currently works best.

  • @peterkonrad4364
    @peterkonrad4364 4 months ago +1

    actually, the most likely place for the marble to be is on the floor, because it is no easy feat to turn a cup with a marble in it upside down and put it on a table without the marble falling out before.

    • @AllAboutAI
      @AllAboutAI  4 months ago

      that would be the big brain human answer yes hehe