Best LLM for Parallel Function Calling: 14 LLM, 420 Prompt, 1 Winner Benchmark

  • Published Dec 22, 2024

Comments • 97

  • @sd5853
    @sd5853 a month ago +18

    I don't usually comment, but I want the YouTube algorithm to know I want more stuff like this

  • @techfren
    @techfren a month ago +46

    would love to see qwen 2.5 coder on these videos

    • @kora5
      @kora5 a month ago +5

      Agree. Qwen2.5-Coder 32B should work well too.

    • @Techonsapevole
      @Techonsapevole a month ago +5

      me too. I'm more interested in local LLMs

    • @andyjm2k
      @andyjm2k a month ago

      Have been using qwen2.5 coder 7B for tool calls with my assistant and it works great

    • @lancerben4551
      @lancerben4551 a month ago +3

      Agreed, looks promising for a local LLM.

    • @lokeshart3340
      @lokeshart3340 a month ago +1

      Suiii u here also

  • @MuhammadFaisal_Iqbal
    @MuhammadFaisal_Iqbal a month ago +13

    I am SOOO excited for the AI coding course!!! 🎉🤖

  • @solyarisoftware
    @solyarisoftware a month ago +11

    Hi Dan,
    Thanks for the interesting benchmark.
    As you mentioned during the video, it could be interesting to see the same benchmarks comparing small-size LLMs on Ollama!
    Giorgio

    • @loudsquad2324
      @loudsquad2324 a month ago

      yes please, would be excited to see open-source models and SLMs!

    • @senecalouck2335
      @senecalouck2335 a month ago +1

      Came here to say the same. LM Studio now supports function calling in its latest beta as well.

    • @indydevdan
      @indydevdan a month ago +1

      YW! You got it. I'll cover locals in future videos.

  • @callumarul6322
    @callumarul6322 a month ago +2

    You are a legend, can't wait for the course

  • @audioreworkvisions
    @audioreworkvisions a month ago +1

    Thank you IndyDevDan! You build such great things n stuff, almost every time with great value to me. Thanxxx and we gonna rock...

  • @TheAIBlueprint
    @TheAIBlueprint 27 days ago

    whoa.... this is so insanely tense... really good research, and testing... would love to watch a tutorial on how you built this.
    Also, for the video, those hands moving in the background and no background music has me glued and psychologically stressing out over who is going to win... awesome work on the video.

  • @puneet1977
    @puneet1977 a month ago

    What an interesting way to benchmark. Thank you for doing this. It matches my personal experience with all of these LLMs when it comes to tool calling, although the failure rate usually goes up when you have a complicated/longer list of parameters.

  • @BillBaran
    @BillBaran a month ago

    Thank you! THIS is the benchmark that really matters!

  • @MacS7n
    @MacS7n a month ago

    I feel like I subscribed to the wrong channel. You really know what you are talking about, with great understanding. I wish I was smart like you. I love your channel, but I'm not a software engineer, so sometimes I don't really understand, but at the end of the video I'm definitely much smarter. I'm part of the new wave of prompt engineers coding with prompts.

  • @brianmorin5547
    @brianmorin5547 a month ago +1

    You had me in the first 60 sec. Precisely. At the moment all my long chains take a structured output from a model, then pass it to a function acting as a traffic controller to determine which agent to call next

  • @IslandDave007
    @IslandDave007 a month ago +1

    Love to see the new Mistral model included too - big updates today

  • @k22marie
    @k22marie a month ago +1

    content so good I watch it at 1x speed

  • @vincentjean6756
    @vincentjean6756 a month ago +5

    The failure of the new Sonnet is very surprising. I always use flash now, it's super fast, super cheap with an epic context size. Good job Dan!🎉

    • @indydevdan
      @indydevdan a month ago

      ikr I was shocked. Flash is so underrated.

  • @captaincode6241
    @captaincode6241 a month ago

    Thanks for doing this benchmarking. You saved me tens of $ that I was spending on Sonnet. NOW, what I would love to see is different combinations of that 15 step process. ;)

  • @johannes-johannsen
    @johannes-johannsen a month ago

    great stuff, would be interesting to see what sorts of things break the models, and how much more complex tool calls impact results

  • @techfren
    @techfren a month ago +7

    I never used function calling; I always ask for JSON format, then do my own 'function calling' or whatever with that JSON

    • @lokeshart3340
      @lokeshart3340 a month ago

      Pls give the src code for this
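
The approach described in this thread (ask the model for plain JSON output, then do the dispatching yourself) can be sketched roughly like this; the tool names and JSON shape are invented for illustration, not taken from the video's benchy repo:

```python
import json

# Hypothetical tools; real ones would do actual work.
def open_browser(url: str) -> str:
    return f"opened {url}"

def create_file(path: str) -> str:
    return f"created {path}"

# Registry mapping the names the model may emit to Python callables.
TOOLS = {"open_browser": open_browser, "create_file": create_file}

def dispatch(raw_model_output: str) -> list:
    """Parse the model's JSON output and run each requested tool in order."""
    calls = json.loads(raw_model_output)  # expected: [{"tool": ..., "args": {...}}, ...]
    results = []
    for call in calls:
        fn = TOOLS.get(call["tool"])
        if fn is None:
            results.append(f"unknown tool: {call['tool']}")
            continue
        results.append(fn(**call["args"]))
    return results

# What a model prompted for raw JSON (instead of native function calling)
# might return:
raw = '[{"tool": "open_browser", "args": {"url": "https://example.com"}}]'
print(dispatch(raw))  # ['opened https://example.com']
```

The upside of this style is provider independence; the downside is that you must handle malformed JSON yourself, which native function-calling APIs largely handle for you.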

  • @jameshizon4861
    @jameshizon4861 28 days ago

    Great LLM analysis. I am looking forward to applying Gemini soon for building AI applications.

  • @ScottLahteine
    @ScottLahteine 24 days ago

    Very useful information, thanks! As we’re just beginning to get into agents and tool-calling it will be very important to know which models are trustworthy so we can tell when to blame our code or the model. It would definitely be helpful to see a followup that tests all the most highly rated models we can run in Ollama. Two or three new models dropped just this week, including a new Qwen model called QwQ.

    • @indydevdan
      @indydevdan 21 days ago

      QwQ video is going live - this model is insane. More local LLM live benchmarks coming on the channel.

  • @dustineagar1999
    @dustineagar1999 a month ago

    Awesome video, I'm diving into building around tool calls and loving your channel and the resources you've been putting out there.
    I noticed in benchy/server/modules/tools.py that the tool descriptions for the gemeni_tools_list are somewhat more detailed and expressive than those for openai_tools_list and anthropic_tools_list.
    New to the tools concept and might be way off here, but my understanding is that those descriptions act as kind of a semantic surface for queries to "connect" with and trigger the tool call, and I wonder if the differences in descriptions might have had a bearing on your test results.

  • @DemetriusZhomir
    @DemetriusZhomir a month ago +1

    This benchmark doesn't feel perfect, but gives some surprising results. Didn't expect Gemini to perform this well!
    And oh boy, I'm so curious about the results that open-source models would bring!

  • @HassanAllaham
    @HassanAllaham a month ago +2

    Since one of the main targets is a PERSONAL assistant, and since we are talking about agents and function calling, the main targeted LLMs should be those which can be used on the edge, not via APIs; no one will like the idea of giving away and sharing personal data ... Anyway, thanks for the good content 🌹

  • @stephenterry6372
    @stephenterry6372 a month ago

    Great stuff. Using claude-sonnet-3-5 with Cline has been problematic recently, but I'm wondering if Anthropic varies the model when it's busy. It would be good to see results at different times of day.

  • @perschistence2651
    @perschistence2651 a month ago +2

    Flash 1.5 in my experience is great for small contexts, but as soon as you get into the 10k+ tokens context lengths, its performance plummets.

    • @lancerben4551
      @lancerben4551 a month ago +3

      My experience as well. It gets confused very easily, and even the big model is like that. Once Google figures out how to keep it from hallucinating and improves accuracy and coherence, it will be much better. For now I'm stuck paying the big price for Claude and o1 models for any complex task.

  • @vladrm1
    @vladrm1 a month ago +1

    Great benchmark and great video, thank you for sharing!
    What temperature setting did the LLMs run at in this benchmark? Are there any (other) parameters you found relevant for function calling? In my experience, I found that temperature 0 makes a big difference.
    Also, do you have a way to benchmark the quality of the tool input parameters? This is where I found that smaller models struggle and become impractical for function calling in some cases, in scenarios where tool input params require some reasoning.

    • @indydevdan
      @indydevdan a month ago

      YW! Checking the quality of the 'prompt' input param is a great future direction. Intentionally left out for simplicity.
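
On the temperature question raised in this thread: with OpenAI-style chat APIs, temperature is a single request parameter. The sketch below only builds the request arguments without sending them; the model name and empty tool list are placeholders:

```python
# Build (but do not send) OpenAI-style chat-completion arguments with
# temperature pinned to 0, which many people find improves tool-call
# consistency. Model name and tool schemas are placeholders.
request_args = {
    "model": "gpt-4o-mini",
    "temperature": 0,  # greedy-leaning decoding for more repeatable calls
    "messages": [{"role": "user", "content": "open example.com in my browser"}],
    "tools": [],  # provider-specific tool schemas would go here
}

# With the real SDK this would be sent via:
# client.chat.completions.create(**request_args)
print(request_args["temperature"])  # 0
```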

  • @caseystar_
    @caseystar_ a month ago

    What are a couple of use cases for this? Would this be used in a multi-agent system? Or how do you use this most effectively?

  • @WenRolland
    @WenRolland a month ago

    This is a great tool. Would love to see how Llama models do, and also how smaller models like 1B, 3B, and 8B would do on local systems, which is a likely scenario for privacy purposes.

  • @saabirmohamed636
    @saabirmohamed636 a month ago

    this is soo good.
    You can check first ... check performance and price, then commit. I am struggling with tool calls; I feel like I'm really having to talk my cheap models into it.

  • @TheAIBlueprint
    @TheAIBlueprint 27 days ago

    I also wonder if you would consider using DSPy for building the system of prompts, as each system reacts differently to the way the prompts are written, so it's hard to get accurate benchmarks with the same prompt that isn't optimized for each LLM... would that be a possibility for future videos?

  • @joshuafadiji8253
    @joshuafadiji8253 a month ago

    Great work 👍🏽, love it.
    Could you also add xAI's Grok? It is OpenAI API compatible, and currently in free beta testing

  • @bukitsorrento
    @bukitsorrento a month ago

    Oh and for the AI course, would love to suggest purchasing power parity pricing, gumroad has a feature for this.

  • @cashvo
    @cashvo a month ago

    Cool benchmarks, but I was hoping to learn how to do tool calling in my own agentic code. Do you have a video on how to do that?

  • @watchdog163
    @watchdog163 17 days ago

    Is there any Visual Studio Code extension like Cline that supports Gemini 1.5 Flash?

  • @extremelylucky999
    @extremelylucky999 a month ago +1

    The ironic thing is, the thumbnail has a typo. Talk about accuracy! 😂

  • @vermitsu
    @vermitsu a month ago

    Perfect Accurac-t-y. Nice play on words

  • @arekkusub6877
    @arekkusub6877 a month ago

    How were those tools/functions defined? Model-agnostic, in some sort of ad-hoc JSON format?

    • @indydevdan
      @indydevdan a month ago

      One set of functions and roughly one json schema per model provider. See server/modules/tools.py. LID
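
For readers new to the format: an OpenAI-style function schema typically looks like the sketch below. This is a generic illustration, not the actual definitions in server/modules/tools.py:

```python
# Generic OpenAI-style tool schema (illustrative only). Anthropic and
# Gemini accept the same idea with slightly different keys, e.g. Anthropic
# uses "input_schema" where OpenAI uses "parameters", which is why one
# schema per provider is a common setup.
open_browser_tool = {
    "type": "function",
    "function": {
        "name": "open_browser",
        "description": "Open the given URL in the default browser.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "Fully qualified URL to open.",
                },
            },
            "required": ["url"],
        },
    },
}

print(open_browser_tool["function"]["name"])  # open_browser
```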

  • @jimmc448
    @jimmc448 6 days ago

    I love your content, but you have to admit the typo in your thumbnail is pretty funny

  • @mrpocock
    @mrpocock a month ago

    Really nice benchmark. Can you do the same with models that can be run on a single 12 GB card locally through Ollama?

  • @phanquochung3924
    @phanquochung3924 a month ago

    awesome work

  • @toxy805
    @toxy805 a month ago

    hey, can you make an introductory video, from the basics, on the North Star goal of this channel? For benchmarking and everything you're doing

  • @NLPprompter
    @NLPprompter a month ago

    Dan, can you share your audio recording setup? Your audio quality is so cool, and I like it so much that I want to make all my presentation audio sound like yours. Please?

  • @MuhammadFaisal_Iqbal
    @MuhammadFaisal_Iqbal a month ago

    Sir, please make a video on the Software Engineer Roadmap, providing guidance on how to become proficient engineers in the age of AI.

  • @paulyflynn
    @paulyflynn 20 days ago

    DeepSeek with DeepThink. Can it run agents?

  • @mikew2883
    @mikew2883 a month ago

    Very very cool! 👏

  • @toxy805
    @toxy805 a month ago

    excited for the AI coding course

  • @andyb4828
    @andyb4828 a month ago

    Great! Thanks.

  • @MagagnaJayzxui
    @MagagnaJayzxui a month ago +1

    Qwen models and Mistral please

  • @brennan123
    @brennan123 a month ago

    @IndyDevDan, please add Groq and some of the Llama models to your tests.

    • @indydevdan
      @indydevdan a month ago

      almost added groq in this video, will add into next benchy vid

  • @AnansiTrading
    @AnansiTrading a month ago

    Bravo🎉

  • @lokeshart3340
    @lokeshart3340 a month ago

    Where is the src code of this pls?

  • @toxy805
    @toxy805 a month ago

    I'd also like a course on agentic engineering with tests and evals, so we can learn how to run them on our own.

  • @samizdat_eth
    @samizdat_eth a month ago

    Perfect accuracty?

  • @jaysonp9426
    @jaysonp9426 a month ago +1

    GPT-4o mini is my favorite model right now. It's mostly perfect for this kind of stuff and basically free. I'll never trust a Gemini model lol

    • @orthodox_gentleman
      @orthodox_gentleman a month ago

      Why wouldn’t you trust a Gemini model?

    • @indydevdan
      @indydevdan a month ago +1

      4o-mini is insane - solid choice

    • @jaysonp9426
      @jaysonp9426 a month ago

      @@orthodox_gentleman more of a joke. His tests showed Flash is really good, and Gemini 1.5 is at the top of LMSYS right now. Until this, though, their models have been trash and they keep acting like they're amazing (looking at you, Gemma)

  • @TryingThink
    @TryingThink a month ago

    I wonder about Gemini Flash 1.5-8B

  • @MekMoney79
    @MekMoney79 a month ago

    Outstanding!

  • @MaJetiGizzle
    @MaJetiGizzle a month ago

    Exporting the results would be useful for reporting purposes as well.

  • @tomaszzielinski4521
    @tomaszzielinski4521 a month ago

    The order is not enough. In my experience, only GPT-4o was able to call tools with multiple arguments (such as filter options for a search query).
    At the same time, order doesn't always matter. For instance, if I want my Notion agent to fetch a number of pages or blocks, it doesn't matter in which order it gets them, as long as all of them end up in the context window.
    Also, I see no need to call that many tools in a row; you'd probably be better off running one or two, validating the output, then proceeding with the prompt chain. Calling 15 tools in a row is no good if you get some hallucination or incorrect call halfway through.
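
The run-one-or-two-then-validate idea in this comment can be sketched as a loop that stops at the first bad result instead of blindly executing the whole chain; the tool runner and validator here are hypothetical stand-ins:

```python
# Sketch of a validate-then-proceed tool chain. run_tool and valid are
# stand-ins for real tool execution and app-specific checks.
def run_tool(name: str, args: dict) -> dict:
    # Pretend the call fails when no arguments were produced.
    return {"tool": name, "ok": bool(args)}

def valid(result: dict) -> bool:
    return result.get("ok", False)

def run_chain(steps: list) -> list:
    """Execute steps one at a time; stop at the first invalid result so a
    hallucinated or incorrect call halfway through cannot poison the rest."""
    results = []
    for name, args in steps:
        result = run_tool(name, args)
        if not valid(result):
            break  # here you would re-prompt or repair before continuing
        results.append(result)
    return results

chain = [("fetch_page", {"id": "abc"}), ("summarize", {})]
print(run_chain(chain))  # only the first step survives validation
```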

  • @drowningpenguin1588
    @drowningpenguin1588 a month ago

    Given how poorly Haiku 3 and 4o were both doing, it seems valuable to include haiku-3-json for comparison. It's still not as inexpensive as Flash or 4o mini, so maybe still not cost-effective, but Haiku 3 was used for a lot of aider tools, so I'm surprised it performs so poorly

  • @NooSpheere
    @NooSpheere a month ago

    Thx a lot.

  • @bukitsorrento
    @bukitsorrento a month ago

    Multi-modal benchmark, audio, image, video(images) as input.

  • @lancerben4551
    @lancerben4551 a month ago

    My experience with the Google models is that they hallucinate a lot, are very mistake-prone in their answers, very forgetful, and not very good with instructions. My go-to models are Claude 3.5 and o1-mini for planning or more complex coding. But it's nice to see the Flash model being good at running the tools; I will integrate it in my app for certain simpler repetitive tasks. But the rate of hallucination for me is a big problem, as with more complex tool calls good reasoning will be essential. Another worry of mine is their service agreement. It is extremely strict.

    • @indydevdan
      @indydevdan a month ago +1

      "...I will integrate it in my app for certain simpler repetitive tasks." - I think you're spot on here with how to best use flash / 4o-mini like models. Simple repetitive tasks where you can create simple prompts.

  • @davidcampos9768
    @davidcampos9768 a month ago

    Please test local LLMs. Obviously these could be cheaper (free) and faster hosted on a local beefy machine.

  • @techfren
    @techfren a month ago +2

    Firstt lesgoo new idd vid

    • @adriangpuiu
      @adriangpuiu a month ago

      that model is expensive bro :D

  • @nonefvnfvnjnjnjevjenjvonej3384
    @nonefvnfvnjnjnjevjenjvonej3384 13 days ago

    wow wow wow

  • @Aristocle
    @Aristocle a month ago

    Maybe Sonnet uses XML.

  • @asi_karel
    @asi_karel a month ago

    best

  • @stevensexton5801
    @stevensexton5801 a month ago

    Hmmm, looks like you need a benchmarking agent.

  • @SC-ck8pb
    @SC-ck8pb a month ago

    Hey dude, don't disable transcripts on your video. I really can't spend 23 minutes right now to figure out which model performs better, so I'm trying to summarize the transcript of your video, but you have transcripts disabled.

  • @bolte5987
    @bolte5987 23 days ago

    Hate to be that guy, but "perfect accuracty"?
    Hint: Look at your thumbnail

  • @tom-et-jerry
    @tom-et-jerry a month ago

    blablabla and we never see any result of your prompts... weird