If this video is a true reflection of its capabilities, benchmarks aren't just bad, they are broken.
this 100%
I like your channel; it's straight to the point.
There is likely a lot of competition among the AI with tons of hype.
We will see how many of them will survive the next 5 years.
Thank you for saving our time! :)
i know right. really the AI king
yup saved me a big fat download today
yup, saved me a chunk of my time this week.
Wrong, that's a great model according to other youtubers and the open source community.
@@Quitcool are they just saying that for clicks though? if I see a video with this model not sucking ass then I'll try it
the maximum achievable with Qwen2.5-Coder 32b (131k context window) was around 100k tokens. Then it slowed down to a timeout. But impressive...
true, just tested, and that's with 24GB of GPU offload on a machine with 192GB of RAM. A 131k context wants too much memory
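For anyone wondering where the memory goes, here's a rough back-of-the-envelope sketch. The config numbers (64 layers, 8 KV heads, head_dim 128) are my assumption based on Qwen2.5-32B's published specs, with an fp16 cache:

```python
# Back-of-the-envelope KV-cache estimate for a 131k context (a sketch, not
# exact; assumes Qwen2.5-32B config: 64 layers, 8 KV heads, head_dim 128,
# and an fp16 cache).
layers, kv_heads, head_dim = 64, 8, 128
context_tokens = 131_072
bytes_per_value = 2  # fp16

# K and V each store layers * kv_heads * head_dim values per token.
kv_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
print(f"KV cache at 131k context: ~{kv_bytes / 2**30:.0f} GiB")  # -> ~32 GiB
```

That's on top of the weights themselves, which is why a full 131k context is heavy even with aggressive GPU offload.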
Thanks for sharing this with us, your content is gold!
I tried Qwen 2.5 Coder yesterday on my Intel Core i7, 16GB DDR4 RAM, RTX 3050 (4GB VRAM) and it struggled with Bolt.
So I guess that I should only use Open-Source Local AI models for generating text, for now...
YOU NEED AT LEAST A MAC WITH 32GB RAM (M3/M4 I THINK), BUT BETTER 2-3x 3090s MINIMUM FOR HALFWAY DECENT WORK. OPENROUTER IS ALSO CHEAP
@@aleksanderspiridonov7251 WOULD THE NEW MACBOOK PRO WITH 40 GPU CORES AND 48GB RAM WORK WELL ENOUGH OR SHOULD I OPT FOR MORE RAM?
@@aleksanderspiridonov7251 y u screamin son
BTW, you had me laughing so hard at the whole "why the hell am I using it then!" comment. Truly priceless.
Others: It answers the benchmark questions well so no need to run it.
AICodeKing: Hold my beer.👑
AICodeKing is actually a deity
Those dancing Pokémon clearly stole the spotlight of the video
Looking at your output, it almost seems as if you, or the model provider you're using, have the wrong chat template plus inference parameters that aren't configured for coding tasks.
What about the temperature - it should be set to 0 for coding, and you should use a top_p of no higher than about 0.85.
Did you set the context size to something reasonable?
I've found the 32b model to be really impressive, certainly the best open weight model out there by far.
In my experience, Cline especially is not very good with any models other than Claude, which it was originally written for.
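A minimal sketch of the settings I mean, against any OpenAI-compatible endpoint. The base URL and model tag below are placeholders (a local Ollama server here); swap in whatever your provider uses:

```python
# A minimal sketch of the suggested sampling settings against an
# OpenAI-compatible endpoint. Base URL and model tag are placeholders
# (here: a local Ollama server); swap in your provider's values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

resp = client.chat.completions.create(
    model="qwen2.5-coder:32b",  # whatever your provider calls the model
    temperature=0,              # deterministic decoding for coding tasks
    top_p=0.85,                 # cap nucleus sampling as suggested above
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
)
print(resp.choices[0].message.content)
```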
I did try it with Fireworks and got the same results. It might be that Cline is not okay with the model. But even if you consider the Aider results... it's too buggy and not good at all if you're working on a bigger application with multiple files in context.
@@AICodeKing thanks for the extra info. I might try a couple of your common prompts running the model directly, without Aider or Cline in the mix, to see if it's a templating issue. It could be something like them using the default ChatML template and not the proper updated Qwen 2.5 tool-calling template, or something along those lines.
@@sammcj2000 would love to see what analysis you come up with - thanks for double checking. super helpful
He didn't even download it, it seems. This video is about some crap online service
I love your style. Go on like this. AI coding is a great use case for LLMs. I'm learning a lot from your videos
I also did a test before seeing your video and my conclusion was "trash", at least for my use case. After seeing your video, I see that I am not the only one! It's not worth the hype.
hahahah, "man if i have to implement it myself, why the hell am I using this". This made me laugh (9:50)
same 😂😂 man I'd be mad too if AI is asking me to do something I asked it to do for me in the first place. It's like who is the master and who is the slave here goddamnit?
Does the model on Hyperbolic use the 128k context window?
Strange, Cole Medin got great results, and so did Simon Willison. Both were extremely impressed.
I am guessing that the benchmarking uses carefully engineered prompting to beat other models. I have always questioned the validity of each model's benchmark claims. There should be a formal body with standard test sets to run the benchmarks.
Is Hyperbolic using the Instruct model or the Base one?
Instruct and unquantized as well.
Thank you very much. Well explained and informative as always, and in this case it has definitely “separated the wheat from the chaff” … Qwen 2.5 Coder seems very disappointing.
I find Cline just doesn't work very well with local Ollama models. Their developer appears to blame it on these Ollama models being heavily quantized, which I do agree with, but I run Q8 and FP16 models and still get the same shitty result
Interesting....thanks for the video.
thanks for this Hyperbolic website.. it helped me
Thank you for being honest! I wanted to love Qwen 2.5 Coder as well, but it just can't actually do anything useful beyond VERY simple applications.
Hi there, in your first prompt Qwen was trying to generate the build files and node_modules. Maybe if you had the project set up already, it wouldn't try to generate that much code? Can you try?
Ok after seeing the whole video I understand that it wouldn’t matter.
I had created the NextJS app beforehand.
Ahahah your hate for cursor is hilarious 😂
I run 32b on 6GB VRAM; it's slow, about a token/s, but it works.
But even at a token/sec, the quality of the output is not the same as on adequate hardware, is it?
@@amit4rou good question. I didn't test it too much as it's too slow.
@@tomwawer5714 I have 8GB vram, I am not sure if it can run 14b or 7b...
I run 23b without any issues, quite fast, with Q4 quantisation on 6GB. You can even run Flux medium for image gen.
@@tomwawer5714 I want to run a good coding model like Qwen 2.5 Coder.. maybe I can run the 14b variant or 7b; I don't wanna run a heavily quantized version...
Also, Google Colab gives 12+ GB VRAM, so maybe I can somehow run it on Colab; there are a few videos showing how to do that...
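For rough sizing before downloading anything, a quick sketch. These are weights-only estimates (parameter count times an assumed bits-per-weight); real GGUF files vary, and the KV cache needs headroom on top:

```python
# Weights-only VRAM estimate for quantized GGUF builds (a sketch; real
# file sizes vary, and the KV cache and runtime need headroom on top).
def approx_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

for params in (7, 14, 32):
    q4 = approx_weights_gb(params, 4.5)  # roughly Q4_K_M
    q8 = approx_weights_gb(params, 8.5)  # roughly Q8_0
    print(f"{params}b: ~{q4:.1f} GB at Q4, ~{q8:.1f} GB at Q8")

# 7b  at Q4 (~3.9 GB) should fit in 8 GB VRAM with room for context;
# 14b at Q4 (~7.9 GB) is tight on 8 GB but fine on a 12 GB Colab GPU;
# 32b at Q4 (~18 GB) wants a 24 GB card or a slow CPU/GPU split.
```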
What is the smaller local LLM that you think is better than Qwen 2.5 Coder 32b? Thanks. You didn't mention which video I should take a look at.
1) Qwen 2.5 coder 32b
2) Deepseek v2.5 205b
3) Nope
Truth testing = reality
Great job, as usual.
Congratulations 🎉
hey is it possible you can add Qwen2.5 32B to OpenHands? I tried a million different ways with the help of Claude and Copilot and ChatGPT but couldn't get it running
OpenRouter?
lmao good testing king! will you change pokemon one day?
could you make a guide to using Cline with the local Qwen?
I have it running on a single 3090, how do i check how much context window it has?
I think it should be mentioned on Hugging Face.
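If you're serving it through Ollama, one way to check is its /api/show endpoint. The exact key names in the response vary by model family and Ollama version, so this sketch just scans the model info for anything named context_length:

```python
# Ask a local Ollama server for a model's metadata and scan for its
# context length. Uses Ollama's /api/show REST endpoint; the exact key
# names in "model_info" vary by model family and Ollama version.
import requests

info = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "qwen2.5-coder:32b"},  # adjust to your local tag
).json()

for key, value in info.get("model_info", {}).items():
    if "context_length" in key:
        print(key, "=", value)
```

Note this reports the model's maximum; what Ollama actually allocates is controlled by num_ctx, which defaults to something much smaller.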
here for the low frequency roasts
The point is made: we need better benchmarks 😢😅
So why does it score well in benchmarks if it can't function in these ide or agentic contexts?
You can basically just train models on specific benchmark questions and make them score well in benchmarks but in real life this approach fails.
@@AICodeKing That really sucks. Benchmark chasing should be an immediate disqualification. I wonder if there are ways to structure benchmarks so that they produce a randomised but equivalent task. Or alternatively, flood the market with so many benchmarks that it is not practical to over-fit to them all.
There are actually many benchmarks, but you just need to select 5 or 10 and compare the results across those..
@@AICodeKing It would be better if model publishers were expected to submit their models to 3rd party benchmarking rather than doing it in-house. We used to have this problem with protein 3d reconstructions. People would publish papers on cooked benchmarks. That's why the CASP protein structure prediction competition was set up.
Wow!! This is it!!
Dude. Can you review Blackbox AI? It has Gemini Pro, GPT-4o, Claude Sonnet 3.5 and its own Blackbox model. It's mostly a chat app AI like anything else, but there are also VS Code and JetBrains extensions.
Thank you man!
please answer: how do u get to know the base URL? Hyperbolic?
You can see it by going to the Hyperbolic API script thing
@@AICodeKing alright thanks bro, you good developer
@@AICodeKing bro there is Qwen 2.5 72b too, and I looked all over AI and Google but didn't get the base URL or how to use it exactly. Then I tried Qwen/...instruct, wow, instruct and boom, it works. u good developer
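For anyone else hunting for it: Hyperbolic exposes an OpenAI-compatible endpoint, so something like the sketch below should work. The base URL and model ID are assumptions on my part, so verify both in the Hyperbolic dashboard:

```python
# Pointing the standard OpenAI client at Hyperbolic's OpenAI-compatible
# endpoint. The base URL and model ID below are assumptions; verify both
# in the Hyperbolic dashboard before relying on them.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.hyperbolic.xyz/v1",
    api_key=os.environ["HYPERBOLIC_API_KEY"],
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```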
is this available on Open Bolt?
me @3:50 hell yeah, dancing Pokémon
what a bummer. I had high hopes for this model
Strange, though I think it's a milestone for a local model to be able to even create something using Aider. In my testing it worked properly with Aider; Cline did not work. I have a 3090 and it did run at workable speeds.
Yes, but claiming unbelievable things is never good
VSCode combos: Aider + Qwen Coder; Cline + Claude; Continue + OpenCoder?
I did everything right but I get this error:
# VSCode Visible Files
(No visible files)
# VSCode Open Tabs
(No open tabs)
# Current Working Directory (d:/Mert - Workspace/test-ai-project) Files
No files found.
Thank you!
I like your objectivity; this small-model hype + marketing is pretty annoying.
You made me hate cursor 😅 and to be honest you're right about cline being better 😅
so true bro, i hate it when ppl do that! also aider and cline is way better at everything!
thanks for this 👍
Great video! Real tests in real apps. I would like to see a full workflow test, from Figma design to tested product, done with NextJS, TS, TailwindCSS, and AI-assisted coding all the way from setup to testing, review, and deployment.
It's such a small model, and the hype of trying to compare it with Sonnet is where all of this starts to fail. It should do what a small model should do in some specialized cases, not run a general coding agent. It's also specialized for code generation, while powering Aider is much more demanding on versatile intelligence
bro please make us a tutorial on Electron or Tauri or any open source one
Benchmarks always come out 'pretty,' but in real life, I've found that it's far behind even claude-3-5-haiku and gpt-4o-mini.
benchmarks with smaller models are usually complete BS. They probably distill the bigger models into them, making them memorize benchmark-like questions without actually making them smarter.
So I should not use Qwen2.5 Coder 7B anymore?
Depends on your choice.. I see no use of that model for me as of now.. I just use SmolLM2 which is better and can actually be used locally at great speeds on my machine. There's no one size fits all or anything like that.
To make matters worse, outside of coding Qwen2.5 is far worse than Qwen2. Most notably, it hallucinates far more across all domains of knowledge. I really do think you're right that Qwen is optimizing their LLMs for tests at the expense of overall performance. Qwen2 72b used to be almost as good as Llama 3.1 70b, but now Qwen2.5 72b is far worse despite climbing higher on benchmarks.
Alibaba has the Qwen Max model (not open source) which is far better than the open source version. But.. strangely they don't show it off. I suspect ...
Hyperbolic free or no?
Free $10 credits
can you make the dragons twerk?
16 million tokens uploaded just to generate 3 files??!!! 6:30
I think that it's a bug in cline and that's why it displays that.
@AICodeKing hmm
Dude, make a video about the g4f (gpt4free) API + Cline
Every LLM except Sonnet disappoints
I have had some issues with Cline using models that are not Claude/GPT, since I think Cline requires a model with proper agentic features. It could be a reason why the performance was so poor with it. I think testing Qwen using a chat interface could change the results.
I tried Qwen 2.5 for math because I am taking part in the AIMO Kaggle competition. I can't say it with certainty, but I feel they train their models on the benchmarks. In one weird case it did a function call but also provided me the result (without actually performing the function call).
That’s common for LLMs, try using a wider variety of them and you’ll have an intuition for how LLMs behave.
thank you! fully free??
It doesn't follow instructions
The same with Python: garbage produced without end. Maybe it's a problem with Ollama?
Time saving !!!
I'm pretty sure that Qwen is Chinese, right? That may explain the questionable benchmarking.
test it with cursor
Very powerful model!!!
EE
very bad at instruction following. it has something in common with my wife there.
🙄
Thanks!
Thanks a lot for the support!