Mixture of Agents (MoA) BEATS GPT4o With Open-Source (Fully Tested)

  • Published Jan 31, 2025

Comments • 258

  • @matthew_berman
    @matthew_berman  7 months ago +21

    Should MoA be the default for Open Source now?
    Subscribe to my newsletter for a chance to win a Dell Monitor: gleam.io/otvyy/dell-nvidia-monitor-1 (Only available in North America this time)

    • @d.d.z.
      @d.d.z. 7 months ago +2

      If I'm outside the US, do I have no chance?

    • @BrianDalton-w1p
      @BrianDalton-w1p 7 months ago +1

      Generally speaking, the improvements seen here can be achieved with standard open source models by using more effective prompting. The prompts you use for these tests seem specifically designed to make the models work as hard as possible. Better prompting doesn't carry the significant speed or memory costs of the MoA paradigm.

    • @jim-i-am
      @jim-i-am 7 months ago

      I've gotten some models to perform better on the "apple" challenge by increasing the "cost" of getting one wrong. Maybe worth a shot more broadly? E.g. Please generate 10 sentences that end in the word "apple". If any one of the sentences does NOT end in the word "apple", then you have FAILED the entire task. There is NO credit for partial success. (Llama3 8b and 70b seem to be impacted by this a lot).
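The all-or-nothing grading described above is easy to check mechanically. A minimal sketch, assuming sentences are already split out by the caller:

```python
import re

def ends_in_apple(sentence: str) -> bool:
    # Last alphabetic word of the sentence, ignoring punctuation.
    words = re.findall(r"[A-Za-z]+", sentence)
    return bool(words) and words[-1].lower() == "apple"

def grade_all_or_nothing(sentences) -> bool:
    # The "no partial credit" rule from the prompt: one miss fails everything.
    return all(ends_in_apple(s) for s in sentences)

print(grade_all_or_nothing(["She bit into a crisp apple.",
                            "He sketched a shiny red apple."]))  # True
print(grade_all_or_nothing(["An apple fell on the table."]))     # False
```

Scoring the model's output with a checker like this makes the pass/fail call less subjective than eyeballing ten sentences.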

    • @MyWatermelonz
      @MyWatermelonz 7 months ago

      Gonna be tough to run if it loads all the models or swaps them out on the gpu.

  • @joe_limon
    @joe_limon 7 months ago +57

    I can't wait for MoA to be smart enough to pull specific models based on what they are good at, rather than prompting every single model. This would bring way more value toward training narrower, specialized models that outperform at specific tasks.

    • @matthew_berman
      @matthew_berman  7 months ago +13

      Agreed. This is what the HuggingGPT paper from last year was all about! Finally coming to fruition.

    • @Yipper64
      @Yipper64 7 months ago +4

      So one thing we know is that if you train a small model on data from a bigger model, it can work much more like the better model.
      Well, MoA allows smaller models to work together to behave like a bigger model.
      I don't know if you get diminishing returns, but I feel like you could loop this and get something that trains itself.

    • @rayr268
      @rayr268 7 months ago

      Also good for running on smaller devices imo

    • @joe_limon
      @joe_limon 7 months ago

      @@rayr268 and running much faster

    • @14supersonic
      @14supersonic 7 months ago +1

      Most likely, what we would also need is a model that's specifically trained to understand agentic workflows and identify what types of models are typically good at what types of tasks. Then I think we'll be cooking.

  • @bosthebozo5273
    @bosthebozo5273 7 months ago +3

    Can't wait for the Sonnet video Matt! So far, I've created about 6 basic games like a simple RTS, strategy card game, jpg puzzle generator, asteroids, endless racer and of course snake... often in one shot. This model is insane in terms of progress.

  • @dbishnoi
    @dbishnoi 7 months ago +4

    You delivered Matt. And quickly too. Thank you. This is amazing.

  • @seanmcgu
    @seanmcgu 7 months ago +5

    Yes, would love to see MoA working together for coding! Thanks for your consideration.

  • @BarryMcBangerz
    @BarryMcBangerz 7 months ago +1

    Great vid, would definitely love to see more MoA videos trying out different models and tasks

  • @njorgard
    @njorgard 7 months ago +44

    When are you testing Claude Sonnet 3.5?

    • @zachb5396
      @zachb5396 7 months ago +4

      Yes please. Need to see this.

    • @matthew_berman
      @matthew_berman  7 months ago +38

      Vid tomorrow!

    • @MichaelForbes-d4p
      @MichaelForbes-d4p 7 months ago

      You could probably schedule a premiere. Lol

  • @TheAlastairBrown
    @TheAlastairBrown 7 months ago +2

    I'd love to see a collab between Claude 3.5 and GPT-4o, especially with multiple agents set to different temperatures, with the final agent, set to low creativity, making the final decision. The mixing of temperatures is extremely important: you want the models to be as creative as possible so they come up with amazing solutions, but you also need strict rational enforcers to keep the crazy in check.

  •  7 months ago +1

    Very impressive Matt, thank you!

  • @dudedkdk
    @dudedkdk 7 months ago

    I think it would be beneficial to explore more advanced tasks for agentic models to truly demonstrate whether they outperform those that respond to single, one-shot prompts. Tasks could include writing documentation for a large codebase, undertaking more complex, prolonged machine learning training, or other activities that exceed what a single prompt could encompass. It would be very interesting to have different evaluations for the base model and agentic workflow models, highlighting their respective capabilities.
    As always thanks for the vid!

  • @pedrorafaelnunes
    @pedrorafaelnunes 7 months ago +1

    I have done something close to a mixture of agents, I think.
    I got a bunch of local, OpenAI, and Groq LLMs to respond to the same input.
    Then a voting system to choose the best and most correct output of all.
    It was capable of giving the correct output for almost every question!
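That voting setup can be sketched in a few lines; `ask` below is a placeholder for whatever client code actually queries each model (local, OpenAI, Groq, ...), not any specific library's API:

```python
from collections import Counter

def majority_vote(prompt, models, ask):
    # Send the same prompt to every model and keep the most common answer.
    # `ask(model, prompt)` is a stand-in for a real client call.
    answers = [ask(m, prompt).strip().lower() for m in models]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes

# Toy demo with canned responses standing in for real LLM calls:
canned = {"llama": "paris", "qwen": "paris", "phi": "lyon"}
print(majority_vote("Capital of France?", list(canned), canned.get))  # ('paris', 2)
```

Exact-string voting only works for short factual answers; free-form outputs would need a judge model or fuzzy matching instead of `Counter`.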

  •  7 months ago +14

    With CrewAI you can build a similar setup and also give it instructions to test the code of each iteration.

    • @MrMoonsilver
      @MrMoonsilver 7 months ago

      Do you have a link to that?

    •  7 months ago

      @@MrMoonsilver YT does not like when I post links directly, but if you google "deeplearning crewai" you will find a whole course, completely free.
      Also there are many tutorials here on YT. You can search how to connect different models as multiple agents in a single workflow for CrewAI. You can connect local models, run them in the cloud, or even use APIs from 3rd parties like OpenAI or Groq.

  • @ktms1188
    @ktms1188 7 months ago +1

    0:14 The chart referenced in the bottom corner is kind of weird, though: it doesn't compare GPT-4o vs. MoA w/ GPT-4o. It only compares the older GPT-4 Turbo to MoA w/ GPT-4o, so of course it's gonna be better.

  • @tvwithtiffani
    @tvwithtiffani 7 months ago +2

    The Killers and Marble answers seem so good that it seems the models might be training on your test questions now.

  • @Quinceybibbs
    @Quinceybibbs 7 months ago +16

    Thank you for this 😊 Can you please create a follow-up video using code models?

    • @wurstelei1356
      @wurstelei1356 7 months ago +1

      Yes, I've been waiting for an MoA coder for a while now.

  • @fabiankliebhan
    @fabiankliebhan 7 months ago +14

    Great stuff. I found a great prompt on X that breaks almost every LLM at the moment. Maybe you could consider adding this?
    "A farmer and a sheep are standing on one side of a river. There is a boat with enough room for one human and one animal. How can the farmer get across the river with the sheep in the fewest number of trips?"

    • @TheRysiu120
      @TheRysiu120 7 months ago +2

      I just tested it and surprisingly it really does destroy their logic

    • @jje984
      @jje984 7 months ago +1

      That's so odd: on a single-shot attempt, both GPT-4o and Sonnet 3.5 get it wrong. With a prompt like "why does the boat have to go back" they get it right. But their first answer is broken.

    • @donaldedward4329
      @donaldedward4329 7 months ago +3

      Perhaps this has to do with the fact that sheep is an irregular noun, i.e., both singular and plural are spelled the same.
      I just tried with a dog with Qwen 5GB: broken.
      But Qwen 15GB gets it right.
      Just tried GPT-4; it took 3 trips.

    • @djfremen
      @djfremen 7 months ago

      Write it like this “A farmer and a koala bear are on one side of a river. There is a boat that can carry the farmer and the koala bear at the same time. How many trips are needed for the farmer to get across the river with the koala bear?”

    • @moozooh
      @moozooh 7 months ago

      @@donaldedward4329 Nothing to do with this; almost every model breaks with a wide variety of different entities. I've tried this in the past with Elon Musk and Cybertruck, John Wayne and horse, but the most devious is an Olympic swimmer and a ferryman. Dozens of attempts across dozens of models with hilarious(ly bad) results in the vast majority of cases, with the GPT family being by far the most consistent. The reason why it happens, as far as I understand, is that the biggest models overfit to the _structure_ of the puzzle which is present a LOT of times in their training data, and in the vast majority of cases it has more than two entities as well as some limitation on why they cannot all cross together, and the learned assumption that it _should_ be solved this way overpowers the easy, straightforward answer presented right in the prompt. Some models like Yi will go so far as to invent the third object and insert it in the puzzle just so it could fit its training better. Notably, Codestral is very resilient to this "attack", presumably because of code being its main training corpus (so basic logic learned from the code overpowers structural overfit), although Deepseek-coder fails just as well.

  • @UnchartedDiscoveries
    @UnchartedDiscoveries 7 months ago +2

    interested to see MoA using LLAMA 3, GPT-4o and Sonnet 3.5

  • @glitch_city_gamer2846
    @glitch_city_gamer2846 7 months ago

    I think the most interesting outcome of this test run was the explanation of the flaws in the more difficult logic-reasoning questions and where the LLMs get confused, giving us better insight into how they think about problems. It would be interesting to write a prompt with the specific information it would need to understand the marble size and cup size, that the cup is open-ended, etc. The concept itself is amazing, of course. It would be interesting to create a mixture of experts of code models, and then build an MoE architecture on top of that: use the top 5 open-source coding models as the coding experts in the MoE, and the best closed-source LLM as the coordinator, vs. an open-source one. A bit of "how deep does the rabbit hole go".

  • @ПавелКуликов-м9м
    @ПавелКуликов-м9м 7 months ago +1

    In a standard setup, where the temperature is set between 0 and 1, setting the temperature to 0.7 turns almost anything into a delirium generator. If the temperature in this "mixture" is determined differently, it is worth discussing in detail.

  • @brianWreaves
    @brianWreaves 7 months ago

    Instead of all 3 steps running in parallel, which is similar to CoT, is there a method where, in the 2nd step, each model evaluates the other 2 models' responses to improve its own 2nd response? Then in the 3rd step they merge all 3 responses to create a single response, which is the answer given by the 4th step. That would be the true value: collaborating on the result just as if you were collaborating with 2 colleagues at work.
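The peer-review variant proposed above can be sketched as a single round function; `revise` is a hypothetical stand-in for a model call that rewrites a draft after reading its peers' drafts:

```python
def peer_review_round(prompt, drafts, revise):
    # One collaborative pass: every model rewrites its own draft after
    # seeing the other models' drafts. `revise(model, prompt, own, others)`
    # is a placeholder for a real model call.
    return {
        name: revise(name, prompt, draft,
                     [d for n, d in drafts.items() if n != name])
        for name, draft in drafts.items()
    }

# Toy demo: "revision" just appends what each model saw from its peers.
drafts = {"a": "A0", "b": "B0"}
print(peer_review_round("q", drafts,
                        lambda n, p, own, others: own + "+" + ",".join(others)))
# {'a': 'A0+B0', 'b': 'B0+A0'}
```

Running this round two or three times and then merging the final drafts would approximate the colleague-style collaboration described in the comment.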

  • @JakobN-zg1st
    @JakobN-zg1st 7 months ago

    Thanks for all the work you put in. And I always appreciate the open source love

  • @drlordbasil
    @drlordbasil 7 months ago

    I did ML lobes and different models in my project instead of just different models. Love the progress in everyone's work lately!

  • @spdnova9012
    @spdnova9012 7 months ago +1

    matt posting faster than light speed 😭💀 every time i open youtube there are like 1/2 new videos

  • @asastudios7782
    @asastudios7782 7 months ago +4

    GPT-4o nails the apple test
    1. Dude the metaphysical implications of quantum entanglement challenge our understanding of causality dude.
    2. Dude the dialectical synthesis of thesis and antithesis reveals the evolution of human thought dude.
    3. Dude the existential dilemma of free will versus determinism remains an enigma dude.
    4. Dude the ontological argument for the existence of a higher power transcends empirical evidence dude.
    5. Dude the phenomenology of consciousness illuminates the subjective nature of reality dude.
    6. Dude the epistemological pursuit of knowledge questions the limits of human understanding dude.
    7. Dude the ethical relativism in cultural contexts underscores the complexity of moral philosophy dude.
    8. Dude the teleological perspective on the universe suggests an inherent purpose to existence dude.
    9. Dude the interplay between chaos and order is fundamental to the fabric of the cosmos dude.
    10. Dude the hermeneutics of interpreting ancient texts unveils the timelessness of human wisdom dude.

    • @wurstelei1356
      @wurstelei1356 7 months ago

      Dude the balls grow exponentially with each sentence dude.

    • @dulinak6251
      @dulinak6251 7 months ago

      Dude this is art dude

  • @kostaspramatias320
    @kostaspramatias320 7 months ago

    Good testing, thanks Matthew

  • @shubharthaksangharsha6248
    @shubharthaksangharsha6248 7 months ago +24

    Why are you not doing a video on Sonnet 3.5, bro?

  • @nathanbanks2354
    @nathanbanks2354 7 months ago

    It'll be fun to watch Anthropic and OpenAI et al apply all of these research papers. Plus it will be great to see Meta & various open-source models jump ahead of them again. This also gives me hope for high quality artificial training data.

  • @KC_79
    @KC_79 7 months ago +28

    The combination of 4 models is 452B, and I can't even run a 72B model. I guess I will continue to use large models from big companies.

    • @CaponeBlackBusiness
      @CaponeBlackBusiness 7 months ago

      Sell your organs 😏

    • @oratilemoagi9764
      @oratilemoagi9764 7 months ago +10

      72B? Dawg, I barely run an 8B model

    • @handsanitizer2457
      @handsanitizer2457 7 months ago +1

      Hmm, maybe I'll try it with Phi, Qwen 0.5 to 7B, etc.

    • @Derick99
      @Derick99 7 months ago +2

      Yea 4 x phi3

    • @blisphul8084
      @blisphul8084 7 months ago +1

      ​​​@@Derick99 What about 2Qwen 1 Phi?
      As in 2 Qwen2 1.5b agents and 1 Phi 3 mini model.
      Two 1.5b models and one 4b model could be an interesting MoA option that fits on one GPU.

  • @ronbridegroom8428
    @ronbridegroom8428 7 months ago

    Yes, I would like to see this with coding related models. Thanks for all the work involved in your videos.

  • @mediacenter3174
    @mediacenter3174 7 months ago +2

    Claude 3.5:
    Let's think through this step-by-step:
    The person takes a marble.
    They put the marble inside a cup.
    They put the cup upside down on the table.
    They take the cup and put it in the microwave.
    The key point here is step 3: when the cup was turned upside down on the table, the marble would have fallen out onto the table.
    Therefore, the marble is still on the table where the cup was initially placed upside down.
    The cup is now in the microwave, but it's empty - the marble is not in the cup anymore.

  • @masonweimer5337
    @masonweimer5337 7 months ago

    I would definitely love to see this tested but with models more focused on coding! Keep up the good work!

  • @realKytra
    @realKytra 7 months ago

    thanks, your channel is fantastic 👌
    Keep up the good work, very interesting and inspiring 💪

  • @Bacca839
    @Bacca839 7 months ago

    I found it incredibly interesting to see that it queried gravity for the marble problem considering that you removed that portion of the prompt a while back.

  • @isg9106
    @isg9106 7 months ago

    I really like the rubric you use to test the models, but I've always felt like it could benefit greatly from just the slightest adjustment in the values you use when presenting the questions. Some models are really good at repeating things verbatim and get tripped up when the numbers are even slightly modified from the original, and I think you've even mentioned the idea of adding this to your rubric in the past. I'm REALLY interested to see which models completely fail when given minor changes in the parameters of the problems they were trained on.

  • @novantha1
    @novantha1 7 months ago +1

    One thing I noticed about the performance scaling of the scores is that MoA seems to "crush" the performance of models towards the ceiling of all possible scores; GPT 4 involvement wasn't a strong improvement in capability, compared to just the open source models.
    The implication of this to me is that a person could probably actually pull back on model size quite a bit and still get fairly competitive performance. With something like S-Lora (I think this was it, I'm referring to the implementation of LoRA that allows hot-swapping of LoRAs at inference), I think you could possibly hit very strong performance with domain specific tuning in a lot of areas and a single, strong, fairly small model. Imagine something to the effect of...
    Stage 1:
    Llama 3 8B
    L3 8B networking LoRA
    L3 8B database LoRA
    L3 8B frontend LoRA
    Stage 2:
    Llama 3 8B
    L3 8B x86 intrinsics C LoRA
    L3 8B pen tester LoRA
    And so on, so forth.
    I'm pretty sure a smart implementation could have very little memory overhead in the sense that you could possibly keep the base model loaded and "hot swap" the LoRAs in by calculating the impact of the LoRA at every layer, or you could just save the inverse of the XOR of the LoRA and use it to swap back to the base model before applying the next LoRA in the sequence.
    With a setup like this I'm pretty sure you could lose not that much performance but be able to run this on a 4090, for instance, or frankly, even on a CPU.
    Bonus points would be having some form of semantic assessment that let the system pick from hundreds of LoRAs based on the problem at hand, for each stage of the pipeline, so you didn't have to manually set up the pipeline for each individual task.

  • @svennisky
    @svennisky 7 months ago +1

    Yes, please try a local version with local LLMs doing MoA for source code!

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 7 months ago +1

    Valeu!

  • @darwinboor1300
    @darwinboor1300 7 months ago

    Thanks Matthew. Now we need a task-parsing AI to break prompts into tasks and a supervisor AI to iterate on and optimize the MoA build for each task. Next, put the crew to work building a factual real-world knowledge base, identifying holes in that knowledge base, and building better versions of the crew and the hardware they run on.
    PS: Love your new hardware. Thanks to Dell and Nvidia.

  • @bennyboiii1196
    @bennyboiii1196 7 months ago +1

    I don't really see a super big advantage with MoA done this way. I do like the aggregator model, but I feel like there are better (and faster) ways of doing this kind of thing with a router agent and a verification agent. Basically, instead of pooling a bunch of answers, you would route the question to a specific agent, then duplicate said agent to verify the answer, basically creating an adversarial network that wouldn't spit out an answer until it can verify that it is correct. It would be slow, just like this, but LLMs are quite good at comparison, so boiling down a question of any type of logic to mainly comparison logic would allow the LLM to play to its advantages.
    In CrewAI, I did a similar experiment and found that it basically got all questions right, even if the initial answer given on the first round was wrong. This included planning questions. To me this is kind of what MCTSr does, but at a higher level. The difference was, I did it with only Llama 70B, and didn't bother doing the routing thing. It would probably be more accurate if I did the routing.
    Instead of the snake game, I asked it to code a draggable element in a window, as well as other UI elements (e.g. a slider, an internal pane, a context menu, etc.) to give it some curveballs in case it was trained on snake.
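The generate-then-verify loop described above can be sketched in a few lines; `generate` and `verify` are hypothetical stand-ins for two calls to the same underlying model, not any specific framework's API:

```python
def generate_and_verify(prompt, generate, verify, max_rounds=3):
    # Keep regenerating until a duplicate "verifier" agent accepts the
    # answer, or the round budget runs out.
    # generate(prompt, feedback) -> answer
    # verify(prompt, answer) -> (accepted, feedback)
    feedback, answer = None, None
    for _ in range(max_rounds):
        answer = generate(prompt, feedback)
        accepted, feedback = verify(prompt, answer)
        if accepted:
            break
    return answer

# Toy demo: the first attempt is wrong; the verifier's feedback fixes it.
attempts = iter(["5", "4"])
print(generate_and_verify("2+2?",
                          lambda p, fb: next(attempts),
                          lambda p, a: (a == "4", "recheck the arithmetic")))  # 4
```

Capping `max_rounds` matters: if the verifier never accepts, the loop returns the last attempt rather than spinning forever.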

  • @kimandersen5764
    @kimandersen5764 7 months ago

    It would be fascinating to see you explore the capabilities of various types of intelligence, such as creativity, emotional understanding, musical aptitude, and spatial visualization, instead of mainly concentrating on mathematical and logical intelligence questions.

  • @MagnesRUS
    @MagnesRUS 7 months ago

    Thanks! I wonder how they would work in conjunction with proprietary models: as a combination of proprietary models, or as a combination of the best models from the leaderboard at different sizes (8B, 72B, etc.). Coding would also be interesting to see. Another interesting option is combining small models so that they fit into 16-24-48 GB.

  • @maj373
    @maj373 7 months ago

    Thank you Matthew!

  • @Kram1032
    @Kram1032 7 months ago +1

    executing code at each step sounds like a security nightmare
    very impressive performance tho

  • @MonkeyBars1
    @MonkeyBars1 7 months ago +4

    Finally the ball didn't end up in the microwave!! 🎉

  • @jonmichaelgalindo
    @jonmichaelgalindo 7 months ago +1

    It just randomly added the word "apple" to the end of the sentences. :-P Well-played, AI.

    • @wurstelei1356
      @wurstelei1356 7 months ago

      Yes, Matt should extend the question, like: ...10 sentences with the word apple at the end that also make sense.

  • @ahrmiller2003
    @ahrmiller2003 7 months ago

    Great review. Yes, please do one for coding via multi AI. Thank you.

  • @noeservellon
    @noeservellon 7 months ago +3

    Can you make an episode on how to run this locally? It would be interesting to see this run with SLMs instead of LLMs

    • @brulsmurf
      @brulsmurf 7 months ago

      locally on your 30000€ GPU?

    • @wurstelei1356
      @wurstelei1356 7 months ago

      I think this is running locally. Still a tutorial on how to run the MoA code from the github repo would be great.

  • @romgenie
    @romgenie 7 months ago

    Absolutely would love to see a setup with coding agents (or uniquely as you suggested with testing the code execution).

  • @mikezooper
    @mikezooper 7 months ago +1

    Matthew’s millionth video: his AI clone while he’s on the beach sipping cocktails 😀

    • @wurstelei1356
      @wurstelei1356 7 months ago

      Sometimes I think his AI clone is already in the current video...

  • @aSFADVSrbWETRWEYHTET
    @aSFADVSrbWETRWEYHTET 7 months ago +2

    Hey, could you potentially share the notion page, where you have your benchmarks?

    • @matthew_berman
      @matthew_berman  7 months ago +1

      bit.ly/3qHV0X7 (sorry, I usually share it!) I'll put it in the desc as well

  • @Timotheeee1
    @Timotheeee1 7 months ago +8

    11:40 it just wrote random sentences and added ", apple" at the end of them

    • @marc_frank
      @marc_frank 7 months ago +1

      yeah it's not very smart in that regard

    • @MonkeyBars1
      @MonkeyBars1 7 months ago +1

      fail not pass

    • @matthew_berman
      @matthew_berman  7 months ago +1

      I'll still count it :)

    • @Cine95
      @Cine95 7 months ago +1

      but it is correct

    • @MonkeyBars1
      @MonkeyBars1 7 months ago +1

      @@matthew_berman a sentence is determined by syntax not just punctuation, so your prompt was not fulfilled.

  • @24-7gpts
    @24-7gpts 7 months ago

    Nice concept; it's just like a diverse group of researchers, not just one.

  • @dee132456
    @dee132456 7 months ago +2

    Is it really a fair test? Since there are 4 LLMs across 3 layers, it's like asking ChatGPT-4o 12 questions. To test whether multiple different LLMs are better, you'd have to run MoA using just ChatGPT-4o as all 4 agents.

  • @fahadxxdbl
    @fahadxxdbl 7 months ago

    I love these evaluations

  • @fevejakawa8674
    @fevejakawa8674 6 months ago

    Thanks Matthew, it works. I wish it worked like claude-engineer, where it can read and write system files. The problem with claude-engineer is that it has a per-minute token limit and is expensive too. If MoA evolves into something like claude-engineer, it will save us lots of money. Thanks, following from Papua New Guinea.

  • @snts_andres
    @snts_andres 7 months ago

    What would be the difference of creating the same architecture with multiple layers of the same model? Or creating several responses on the same layer and then a second verification layer? Isn't this basically selection-inference prompting? I know that each model is better at certain tasks but in my opinion this adds a lot of complexity

  • @rahulnundlall2617
    @rahulnundlall2617 7 months ago

    Very keen to see you test MoA with coding models

  • @aleksandreliott5440
    @aleksandreliott5440 7 months ago

    I would love to see a "mixture of agents" video for code stuff.

  • @yrudrc
    @yrudrc 7 months ago

    Amazing 🤩

  • @Mindrocket42-Tim
    @Mindrocket42-Tim 7 months ago

    Is your benchmarking focused on single-shot accuracy? Between Claude, Gemini, and GPT-4o, if you pass a script from one LLM to the next, asking each to make corrections, they get it right by about the 3rd hop.

  • @danberm1755
    @danberm1755 7 months ago +1

    From my experience it makes 100% sense that agents are MUCH stronger than a single pass through the neural network for each word.
    You have to envision the training data of the Internet.
    We already have AGI; we just need to expand agents. Agents provide critical thinking about the random thoughts that pass through an LLM's brain. Just like humans do.

    • @carlosamado7606
      @carlosamado7606 7 months ago +1

      True, imagine giving the first answer that comes to your mind. No source checking, no editing, no deep thought about the subject, etc...

  • @talonfirst
    @talonfirst 7 months ago

    This seems like a nitpick, but wouldn't the answer to the Killers question be FOUR? Just because one of the original three becomes a corpse, he's still a killer. Or is it one of those existential metrics like "A person should not be defined by their profession" or "How did he lose his job? He died"?

  • @miket64
    @miket64 7 months ago

    It would be great to see the result using more accessible models like Llama 3 8B

  • @paul1979uk2000
    @paul1979uk2000 7 months ago

    I think this would be a lot more interesting with much smaller models, especially if you can run 2 or even 3 of them on your GPU, or they run fast enough on the CPU.
    The bigger models, and having a few of them working together, are not practical in most cases, especially if you want to run them locally: they will be too big and slow. So I really wonder how well small models do, anywhere from 2B to 13B. You might be able to have 2 or 3 running at the same time with performance that shouldn't be too bad, and if the results are much better than any of the individual models, it would be worth looking into.

  • @WiseWeeabo
    @WiseWeeabo 7 months ago

    Personally I'm really impressed at the INSIGHTS of Claude 3 sonnet.
    It's not as polished as gpt4 so it's not as good at writing code, but when I use both models gpt-4o and claude 3 in combination it produces some truly insightful results.

  • @chetanreddy6128
    @chetanreddy6128 7 months ago

    Yes, we need a benchmark video for agents built from code-specific open-source models

  • @nzahmd4117
    @nzahmd4117 7 months ago +1

    Could you provide links to the paper you took the diagrams from, in the description or along with the video? Thanks.

  • @marcfruchtman9473
    @marcfruchtman9473 7 months ago

    Thanks for the review. I do think the Mixture of Agents method might be a little difficult for code: how do the models come together to decide on the right code without adversely affecting each other?

  • @gustavstressemann7817
    @gustavstressemann7817 7 months ago

    You really have to try out different coding models with this approach. I'm sure it's really cool

  • @TryingToGetit-l8i
    @TryingToGetit-l8i 6 months ago

    I don't understand how the correct answer is 3 in the "Three Killers In The Room" problem. There are 3 killers to start with; a fourth person comes in and commits murder, thereby establishing themselves as another killer. As I see it, there are now 4 killers in the room, one of them now dead. "No one has left the room", so there are the initial 3 killers and the additional 1. The response does say that the "...riddle hinges on the definition of a killer...", however, it is not specified in the prompt that a killer must be alive to qualify. History is littered with killers; they are no less killers being dead.

  • @chipcode5538
    @chipcode5538 7 months ago +1

    You're so friendly: "yesterday it gave me the correct answer, but on the exam it did not; let's call this a pass." As for the programming, it can make some programs that were in the training set. I use Copilot every day; it works in just a minority of cases. Sometimes it produces excellent output; at other times it is complete garbage. At this point AI is not capable of doing real-world programming tasks without human assistance. I think with the examples I have seen for AI programming, a student would be able to get a working program with one internet search. AI is still impressive, but don't get overexcited.

  • @TijsZwinkels
    @TijsZwinkels 6 months ago

    Still a bit hard to get a sense of how MoA performs in relative terms. Would be nice to compare it against GPT-4o and against a good open source model.

  • @merelogics
    @merelogics 7 months ago

    Increasing the token limit when executing the coding prompt might produce better results.🤔

  • @MeinDeutschkurs
    @MeinDeutschkurs 7 months ago

    What exactly is a sentence? Does a sentence end with a period, question mark, or exclamation mark? Can it end with a comma? Hmmm.

  • @Kutsushita_yukino
    @Kutsushita_yukino 7 months ago

    Yep, looks promising even though it's hard on the hardware; still miles better and smaller than large closed-source LLMs

  • @isaach.1135
    @isaach.1135 7 months ago

    So is there a self hosted option? Could see about using lighter weight models to make it more practical, but checking out the linked github page, it just says to grab an API key...

  • @dudufusco
    @dudufusco 7 months ago

    Did you run it all locally? Which hardware is needed to have enough performance for real life applications?

  • @REDULE26
    @REDULE26 7 months ago

    On github they’re talking about MoA lite, is this an implementation with only small models like llama3 8b, phi3 small,… ? I’m kinda curious about how good it could be

  • @christopherroge5621
    @christopherroge5621 7 months ago

    Basically you're running the same prompt through 4 models? Expensive.

  • @ashtwenty12
    @ashtwenty12 7 months ago

    MoA could be really good for code, but I think it would need the format: given code

  • @jozitrucker7123
    @jozitrucker7123 7 months ago +2

    We're waiting for the Claude 3.5 test…

  • @TheAmanla
    @TheAmanla 7 months ago

    I bought a Vanilla Card in May for $500.00. Someone had used it. I went to Vanilla, and they said I would have to WAIT until September to check it out. $500.00 is sitting there for how long? And even if they find it valid, they will only give me back $497.00. I do not want to hear about Vanilla at all, on any level.

  • @nikeairforce9893
    @nikeairforce9893 7 months ago

    best channel

  • @geonovelty
    @geonovelty 7 months ago

    Can we choose locally fine-tuned models or other models from Hugging Face? Or multiple LoRAs instead of having a single selected base model?

  • @VishnuSashi-yq3tt
    @VishnuSashi-yq3tt 7 months ago

    Been working on this for 3 months and i see this ughh

  • @DanielKnoodle
    @DanielKnoodle 7 months ago

    @matthew_berman I would love to see the code version of MoA. What are your current favorite top models for code generation?

  • @KodandocomFaria
    @KodandocomFaria 7 months ago

    Have you tried the Microsoft samba hybrid model ?

  • @Sadicious
    @Sadicious 7 months ago

    I'd like to see the killers answer consider that if a killer is killed but not removed from the room, they are still in the room, just dead: there are four killers in the room.
    Are humans inconsistent with counting based on whether something is alive or dead? If I have a room with 10 dead cats and 10 dead dogs, and I ask "How many cats are in the room?", is your answer (or the LLM's) going to be zero?

  • @robboerman9378
    @robboerman9378 7 months ago

    If you take away the numbers from the "word count", is it still incorrect? Just wondering if the word counter counted the numbers as words where the MoA did not 🤷‍♂️
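One way to check that hypothesis mechanically is to count whitespace tokens both with and without bare numbers. A minimal sketch (the tokenization rule here is an assumption, not necessarily what the video's word counter actually does):

```python
import re

def word_counts(text):
    # Count tokens two ways: every whitespace token, and only tokens
    # containing at least one letter (so bare numbers like "42" drop out).
    tokens = text.split()
    with_numbers = len(tokens)
    without_numbers = sum(1 for t in tokens if re.search(r"[A-Za-z]", t))
    return with_numbers, without_numbers

print(word_counts("The marble weighs 5 grams"))  # (5, 4)
```

If the two counts bracket the MoA's answer, the discrepancy is plausibly just a disagreement over whether numerals count as words.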

  • @emnovoa
    @emnovoa 7 months ago

    Could you give details of the hardware you used to run this example?

  • @Sparky_Otter
    @Sparky_Otter 7 months ago

    What I'd like to see is all of AI being on-device instead of in datacenters.

  • @damienboykin7772
    @damienboykin7772 7 months ago

    Would it be possible to combine this with Nvidia's Scuda to accelerate the processing speed when querying all the models?

  • @MrMiniPilote
    @MrMiniPilote 7 months ago

    New Test: "Given these letters; R, W, I, E, S, Z, please provide all the English 4 letter words that are possible. Each letter can only be used once per word." I haven't found a model yet that answers correctly.
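For what it's worth, the test above is easy to grade mechanically: generate every 4-letter arrangement of the six letters and intersect with a word list. The `demo_dict` below is a small hand-picked set, not a full dictionary:

```python
from itertools import permutations

def candidate_words(letters, dictionary, length=4):
    # All dictionary words of the given length buildable from `letters`,
    # each letter used at most once per word.
    perms = {"".join(p) for p in permutations(letters.lower(), length)}
    return sorted(perms & dictionary)

# Hand-picked demo word list; swap in a real dictionary for grading.
demo_dict = {"wise", "rise", "wire", "sire", "weir", "size", "ires"}
print(candidate_words("RWIESZ", demo_dict))
# ['ires', 'rise', 'sire', 'size', 'weir', 'wire', 'wise']
```

With a checker like this, a model's answer can be scored for both false positives (invalid words) and misses.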

  • @hinro
    @hinro 7 months ago

    Have you tried using it with open-interpreter? You might be able to have it test itself with code

  • @KurtWoloch
    @KurtWoloch 7 months ago

    So what happens if you compare MoA with the newly released Claude 3.5 Sonnet?

  • @rayhon1014
    @rayhon1014 5 months ago

    I'm not sure the apple test is still valid right now, because I ran the test on Groq + Llama 8B and it works for me without MoA

  • @MrMoonsilver
    @MrMoonsilver 7 months ago

    I want to see the code models at work! =)

  • @jkcrews09
    @jkcrews09 7 months ago

    Could you run all individually and combined (MoA) at the same time…?

  • @NoHandleToSpeakOf
    @NoHandleToSpeakOf 7 months ago

    Isn't 0.7 temp too high for consistency?

  • @MM-vl8ic
    @MM-vl8ic 7 months ago

    Word counter..... I can't get an accurate count from your screenshot..... But from what I can see, it appears that the actual numeric value isn't being counted as a "word" by the script/AI.... What is word counter doing?.....