I don't usually comment, but I want the YouTube algorithm to know I want more stuff like this.
Ditto
would love to see qwen 2.5 coder on these videos
Agree. Qwen2.5-Coder 32B should work well too.
me too. I'm more interested in local LLMs
Have been using qwen2.5 coder 7B for tool calls with my assistant and it works great
Agreed looks promising for a local llm.
Suiii u here also
I am SOOO excited for the AI coding course!!! 🎉🤖
Hi Dan,
Thanks for the interesting benchmark.
As you mentioned during the video, it could be interesting to see the same benchmarks comparing small-size LLMs on Ollama!
Giorgio
Yes please, would be excited to see open-source models and SLMs!
Came here to say the same. LM Studio now supports function calling in its latest beta as well.
YW! You got it. I'll cover locals in future videos.
You are a legend, can't wait for the course
Thank you IndyDevDan! You build such great things and stuff, and almost every time there's great value in it for me. Thanks, and we're gonna rock...
whoa.... this is so insanely tense... really good research, and testing... would love to watch a tutorial on how you built this.
Also, for the video, those hands moving in the background and no background music has me glued and psychologically stressing out over who is going to win... awesome work on the video.
What an interesting way to benchmark. Thank you for doing this. It matches my personal experience with all of these LLMs when it comes to tool calling, although the failure rate usually goes up when you have a complicated/longer list of parameters.
Thank you! THIS is the benchmark that really matters!
I feel like I subscribed to the wrong channel. You really know what you are talking about, with great understanding. I wish I was smart like you. I love your channel, but I'm not a software engineer, so sometimes I don't really understand, but by the end of the video I'm definitely much smarter. I'm part of the new wave of prompt engineers coding with prompts.
You had me in the first 60 seconds. Precisely. At the moment, all my long chains take a structured output from a model and pass it to a function acting as a traffic controller to determine which agent to call next.
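A minimal sketch of that traffic-controller pattern, assuming the model returns a small JSON routing decision; the agent names and helper functions below are hypothetical, not anything from the video:

```python
import json
from typing import Callable

# Hypothetical downstream agents keyed by the name the model is asked to emit.
AGENTS: dict[str, Callable[[str], str]] = {
    "research": lambda task: f"researching: {task}",
    "summarize": lambda task: f"summarizing: {task}",
    "finish": lambda task: f"done: {task}",
}

def traffic_controller(structured_output: str, task: str) -> str:
    """Parse the model's structured output and dispatch to the next agent."""
    decision = json.loads(structured_output)  # e.g. {"next_agent": "research"}
    agent = AGENTS.get(decision.get("next_agent"), AGENTS["finish"])
    return agent(task)

# Usage: feed in the raw JSON the model returned plus the current task.
print(traffic_controller('{"next_agent": "research"}', "compare tool-call accuracy"))
```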
Love to see the new Mistral model included too - big updates today
content so good I watch it at 1x speed
The failure of the new Sonnet is very surprising. I always use flash now, it's super fast, super cheap with an epic context size. Good job Dan!🎉
ikr I was shocked. Flash is so underrated.
Thanks for doing this benchmarking. You saved me tens of dollars that I was spending on Sonnet. NOW, what I would love to see is different combinations of that 15-step process. ;)
great stuff, would be interesting to see what sorts of things break the models, and how much more complex tool calls impact results
I never used native function calling; I always ask for JSON output and then do my own 'function calling' or whatever with that JSON.
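For anyone curious, a rough sketch of that DIY approach, assuming the model is prompted to reply only with JSON like {"tool": ..., "args": ...}; the tool names here are made up:

```python
import json

def get_weather(city: str) -> str:
    return f"(stub) weather for {city}"

def create_note(title: str) -> str:
    return f"(stub) created note '{title}'"

TOOLS = {"get_weather": get_weather, "create_note": create_note}

def dispatch(raw_model_reply: str) -> str:
    """Parse the model's JSON reply and call the matching local function."""
    call = json.loads(raw_model_reply)
    return TOOLS[call["tool"]](**call.get("args", {}))

print(dispatch('{"tool": "get_weather", "args": {"city": "Tokyo"}}'))
```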
Pls give the src code for this
Great LLM analysis. I am looking forward to applying gemini soon for building AI applications.
Very useful information, thanks! As we’re just beginning to get into agents and tool-calling it will be very important to know which models are trustworthy so we can tell when to blame our code or the model. It would definitely be helpful to see a followup that tests all the most highly rated models we can run in Ollama. Two or three new models dropped just this week, including a new Qwen model called QwQ.
QwQ video is going live - this model is insane. More local LLM live benchmarks coming on the channel.
Awesome video, I'm diving into building around tool calls and loving your channel and the resources you've been putting out there.
I noticed in benchy/server/modules/tools.py that the tool descriptions for the gemeni_tools_list are somewhat more detailed and expressive than those for openai_tools_list and anthropic_tools_list.
New to the tools concept and might be way off here, but my understanding is that those descriptions act as kind of a semantic surface for queries to "connect" with and trigger the tool call, and I wonder if the differences in descriptions might have had a bearing on your test results.
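To make the concern concrete, here is an illustrative pair of OpenAI-style definitions for the same made-up tool, one with a terse description and one with a more expressive description; these are not the actual entries in benchy's tools.py:

```python
# Terse description: less "semantic surface" for the query to match against.
terse_tool = {
    "type": "function",
    "function": {
        "name": "open_browser",
        "description": "Opens a browser.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

# More expressive description of the same hypothetical tool.
detailed_tool = {
    "type": "function",
    "function": {
        "name": "open_browser",
        "description": (
            "Open the user's default web browser. Call this whenever the user "
            "asks to view, open, or browse a web page or URL."
        ),
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}
```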
This benchmark doesn't feel perfect, but it gives some surprising results. I didn't expect Gemini to perform this well!
And oh boy, I'm so curious about the results that open-source models would bring!
Since one of the main targets is a PERSONAL assistant, and since we are talking about agents and function calling, the main LLMs to target should be those that can run on the edge, not behind APIs; no one likes the idea of giving away and sharing personal data... Anyway, thanks for the good content 🌹
Great stuff. Using claude-sonnet-3-5 with Cline has been problematic recently, but I'm wondering if Anthropic varies the model when it's busy. It would be good to see results at different times of day.
Flash 1.5 in my experience is great for small contexts, but as soon as you get into the 10k+ tokens context lengths, its performance plummets.
My experience as well. It gets confused very easily. Even the big model is like that. Once Google figures out how to keep it from hallucinating and improves accuracy and coherence, it will be much better. For now I'm stuck paying the big price for Claude and o1 models for any complex task.
Great benchmark and great video, thank you for sharing!
What temperature setting did the LLMs run at in this benchmark? Are there any (other) parameters you found relevant for function calling? In my experience, I found that temperature 0 makes a big difference.
Also, do you have a way to benchmark the quality of the tool input parameters? This is where I found that smaller models struggle and become impractical for function calling in some cases - in scenarios where tool input params require some reasoning.
YW! Checking the quality of the 'prompt' input param is a great future direction. Intentionally left out for simplicity.
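On the temperature question above, a minimal sketch of pinning temperature to 0 on a tool-calling request with the OpenAI Python SDK; the tool list is a placeholder and 0 is not necessarily what the benchmark used:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

placeholder_tools = [{
    "type": "function",
    "function": {
        "name": "open_browser",
        "description": "Open the default web browser.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "open the browser"}],
    tools=placeholder_tools,
    temperature=0,  # greedy-ish decoding; often steadier for structured/tool output
)
print(response.choices[0].message.tool_calls)
```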
What are a couple of use cases for this? Would this be used in a multi-agent system? Or how do you use this most effectively?
This is a great tool. Would love to see how Llama models do and also how smaller models like 1b, 3b and 8b would do on local systems which is a likely scenario for privacy purposes.
This is so good.
You can check first... check performance and price, then commit. I am struggling with tool calls; I feel like I'm really having to talk my cheap models into it.
I also wonder if you would consider using DSPy for building the system of prompts, as each model reacts differently to the way the prompts are written, so it's hard to get accurate benchmarks with the same prompt that isn't optimized for each LLM... would that be a possibility for future videos?
Great work 👍🏽, love it.
Could you also add xAI's Grok? It is OpenAI API compatible and currently in free beta testing.
Oh, and for the AI course, I would love to suggest purchasing power parity pricing; Gumroad has a feature for this.
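Since the Grok API is OpenAI-compatible, wiring it up is mostly a base-URL swap with the OpenAI SDK; the endpoint and model name below are assumptions taken from xAI's docs at the time, so double-check them:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_XAI_API_KEY",      # issued from the xAI console
    base_url="https://api.x.ai/v1",  # xAI's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="grok-beta",               # beta model name; may change after the beta
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```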
Cool benchmarks, but I was hoping to learn how to do tool calling in my own agentic code. Do you have a video on how to do that?
Is there any Visual Studio Code extension like Cline that supports Gemini 1.5 Flash?
The ironic thing is, the thumbnail has a typo. Talk about accuracy! 😂
;)
Perfect Accurac-t-y. Nice play on words
XD
How were those tools/functions defined? Model-agnostic, in some sort of ad-hoc JSON format?
One set of functions and roughly one json schema per model provider. See server/modules/tools.py. LID
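For anyone who hasn't seen the provider formats side by side, this is roughly what "one JSON schema per provider" means for a single made-up tool; illustrative only, not copied from tools.py:

```python
# OpenAI-style definition of a hypothetical create_note tool.
openai_style = {
    "type": "function",
    "function": {
        "name": "create_note",
        "description": "Create a note with the given title.",
        "parameters": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
            "required": ["title"],
        },
    },
}

# Anthropic-style definition of the same tool (input_schema instead of parameters).
anthropic_style = {
    "name": "create_note",
    "description": "Create a note with the given title.",
    "input_schema": {
        "type": "object",
        "properties": {"title": {"type": "string"}},
        "required": ["title"],
    },
}
```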
I love your content, but you have to admit the typo in your thumbnail is pretty funny
Really nice benchmark. Can you do the same with models that can run locally on a single 12 GB card through Ollama?
awesome work
Hey, can you make an introductory video, starting from the basics, about the North Star goal of this channel, the benchmarking, and everything you're doing?
Dan, can you share your audio recording setup? Your audio quality is so good; I like it a lot and want to make all my presentation audio sound like yours. Please?
Sir, please make a video on a Software Engineer roadmap, providing guidance on how to become a proficient engineer in the age of AI.
DeepSeek with DeepThink. Can it run agents?
Very very cool! 👏
excited for the AI coding course
Great! Thanks.
Qwen models and Mistral please
@IndyDevDan, please add Groq and some of the Llama models to your tests.
Almost added Groq in this video; will add it to the next benchy vid.
Bravo🎉
Where is the src code of this pls?
I'd also like a course on agentic engineering with tests and evals, so we can learn how to run them on our own.
Perfect accuracty?
GPT-4o mini is my favorite model right now. It's mostly perfect for this kind of stuff and basically free. I'll never trust a Gemini model lol
Why wouldn’t you trust a Gemini model?
4o-mini is insane - solid choice
@orthodox_gentleman More of a joke. His tests showed Flash is really good, and Gemini 1.5 is at the top of LMSYS right now. Until now, though, their models have been trash and they keep acting like they're amazing (looking at you, Gemma).
I wonder about Gemini Flash 1.5-8B
Outstanding!
Exporting the results would be useful for reporting purposes as well.
100%
The order is not enough. From my experience, only GPT-4o was able to call tools with multiple arguments (such as filter options for a search query).
At the same time, order doesn't always matter. For instance, if I want my Notion agent to fetch a number of pages or blocks, it doesn't matter in which order it gets them as long as all of them end up in the context window.
Also, I see no need to call that many tools in a row; you'd probably be better off running one or two, validating the output, then proceeding with the prompt chain. Calling 15 tools in a row is no good if you get hallucinations or an incorrect call halfway through.
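A rough sketch of that call-validate-continue loop; the runner and validator below are placeholders, not anything from the benchmark code:

```python
def run_tool(call: dict) -> dict:
    """Placeholder: execute a single tool call and return its result."""
    return {"tool": call["tool"], "ok": True, "data": "..."}

def looks_valid(result: dict) -> bool:
    """Placeholder validation, e.g. schema checks or a cheap LLM-as-judge pass."""
    return result.get("ok", False)

def run_chain(planned_calls: list[dict]) -> list[dict]:
    """Run planned tool calls one at a time, stopping at the first bad result."""
    results = []
    for call in planned_calls:
        result = run_tool(call)
        if not looks_valid(result):
            break  # re-prompt or re-plan instead of blindly finishing all 15 steps
        results.append(result)
    return results

print(run_chain([{"tool": "fetch_page"}, {"tool": "summarize"}]))
```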
Given how poorly Haiku 3 and 4o were both doing, it seems valuable to include haiku-3-json for comparison. It's still not as inexpensive as Flash or 4o mini, so maybe not cost-effective, but Haiku 3 was used for a lot of Aider tooling, so I'm surprised it performs so poorly.
Thanks a lot.
Multi-modal benchmark: audio, image, and video (as images) as input.
My experience with the Google models is that they hallucinate a lot, are very mistake-prone in their answers, very forgetful, and not very good with instructions. My go-to models are Claude 3.5 and o1-mini for planning or more complex coding. But it's nice to see the Flash model being good at running tools; I will integrate it in my app for certain simpler repetitive tasks. The rate of hallucination is still a big problem for me, though, as good reasoning will be essential for more complex tool calls. Another worry of mine is their service agreement; it is extremely strict.
"...I will integrate it in my app for certain simpler repetitive tasks." - I think you're spot on here with how to best use flash / 4o-mini like models. Simple repetitive tasks where you can create simple prompts.
Please test local LLMs. Obviously these could be cheaper (free) and faster hosted on a local beefy machine.
First! Let's go, new IDD vid!
that model is expensive bro :D
wow wow wow
Maybe Sonnet uses XML.
best
Hmmm, looks like you need a benchmarking agent.
Hey dude, don't disable transcripts on your video. I really can't spend 23 minutes right now to figure out which model performs better, so I'm trying to summarize the transcript of your video, but you have transcripts disabled.
Hate to be that guy, but "perfect accuracty"?
Hint: Look at your thumbnail
blablabla and we never see any result of your prompts... weird