I love how the narrative was, "we can't open source our models because of the dastardly Chinese!" And they're the ones open sourcing everything. 😂
Because they don't yet have cutting-edge architecture to hide. It still benefits them: if it's open source, they gather back whatever other people improve on it.
@@Sindigo-ic6xq Fair point
@@Sindigo-ic6xq You're acting as if China is way behind. They are not. Their products are competitive with Western products.
@@Sindigo-ic6xq Maybe. We'll see
I would hate it if the US dominated AI development. In the US it's all about money and crazy patent fights.
"These models are performing extremely well"... proceeds to show the most basic questions where the models fail... 😁😁😁
Now people just need affordable and decent 4TB VRAM
Man, money please, I'm too poor to buy GPUs for a self-hosted 1T model (= 1000B parameters)
I can imagine one day people laughing at us barely able to run AI models 🤣
@@warsin8641 It is hilarious now. You are correct. I have a 2060 in my laptop and it works, but it's laughable, and lots of people are worse off than me.
@@warsin8641 ...while running GPT-o5-level models on a smartphone chip
You can rent powerful GPUs on Google Colab; it's way cheaper than buying $20k worth of graphics cards just to use them once.
THEY FUCKING DID IT! And this is just the lite version. o1-preview is ranked around the 60th percentile in coding contests, while the full o1 is ranked at master level, a ~90th-percentile coder. If the full version matches the full o1, we will have a programmer better than most humans, for cheap.
o1 is already great at programming, if not the best. So many zero-shot successes in my experience. Sonnet 3.5 is also great but struggles with presenting code longer than 300 lines, so there's lots of manual copying and pasting, while o1 has no problem generating up to 1000 lines in one go.
Matt is always so trusting of the benchmarks :D
Remember how difficult the marble problem was just a few months ago?
It's weird. In my tests both models got all the questions right.
Maybe one should always run 10 iterations of each question and count how many times the answer is correct, to evaluate the model more reliably.
There still seems to be a lot of randomness in the thought process.
Not easy to do this in an entertaining way for a video, I know 😅
Yeah. It is kinda weird. He has asked preview these questions before and it has passed. I almost wondered if they somehow dumbed down the preview model.
Where did you access R1? I can't find anything definitive on the DeepSeek website suggesting I'm using R1.
For local models, the temperature should be set to 0 for testing so we get a deterministic, highest-probability answer every time. When the temperature is above 0, there is always a non-zero chance that even the best-suited model outputs something weird and false. But we can't set the temperature for proprietary models, of course.
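The point about temperature 0 can be sketched in a few lines: at temperature 0 you skip sampling entirely and take the argmax, which is deterministic; at any temperature above 0 every token keeps a non-zero probability. The logits below are made up for illustration, not taken from a real model.

```python
import math
import random

def sample_token(logits, temperature):
    """Pick a token index from raw logits; temperature 0 means greedy argmax."""
    if temperature == 0:
        # Greedy decoding: always the single highest-probability token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax, then sample from the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

toy_logits = [2.0, 1.0, 0.5]
# Greedy is deterministic: token 0 wins every single run.
assert all(sample_token(toy_logits, 0) == 0 for _ in range(100))
```

With temperature > 0, repeated calls on the same logits can and will return different tokens, which is exactly the test-to-test randomness the comments above describe.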
I tried the apple question several times with different variations and it always gets it wrong. One subtle thing: he uses "Apple" as the word. The capitalization suggests a proper noun. I think that is part of what trips it up. If you use "apple" it seems to do better, though it still misses it sometimes.
I tried this for the apple problem with the free GPT-4 version.
I typed "This is going to be a test. The end goal is to produce 10 sentences that end in the word apple. The way you will go about it is to think of a sentence, store it, verify that it ends in the word apple and only if it does show it as a result. Do you think you can do that?".
It replied correctly with the following:
"Absolutely, I can do that! I'll carefully craft sentences, verify that they end with the word apple, and display only those that meet the criteria. Let’s begin!
She reached for the bright red apple.
A worm crawled out of the freshly picked apple.
The teacher smiled as she placed a gift on her desk-an apple.
He accidentally dropped his lunch, bruising the apple.
The old farmer spoke fondly of his favorite fruit, the apple.
As the pie cooled, the kitchen was filled with the smell of baked apple.
Each child was given a caramel-coated apple.
The orchard tour guide handed everyone a juicy apple.
With a flourish, she revealed a perfectly peeled apple.
On the table sat a single golden apple."
Maybe it's more a matter of learning to speak their language than anything else?
good point
Tried your prompt, it got it right😎
yawn
If I'm not mistaken, the free version cannot use tools, so the "store it" part of your prompt does not make much sense. Since it is just generating tokens, it's just luck that it got a good generation.
This is what I got with the exact same prompt:
Yes, I can certainly do that! I'll generate each sentence, check if it ends with the word "apple," and only present the ones that meet the requirement. I'll begin now.
Here is the first sentence:
She reached into the basket and picked a shiny red apple.
Let me check and ensure it meets the condition. It ends in "apple," so it's valid!
I love eating a crisp green apple.
Now, I'll proceed to the next. Stay tuned!
I used DeepSeek today; my specific use case is mainly programming/development. I found it to be quite good and competitive with the new Claude model. Since I use AI for work, I found it good at understanding original things that haven't really been done before, not just things like 'the game of life' or a snake game. As such, I believe it's a very solid model and system; I was pleasantly surprised by it. As for the limits of AI: context window sizes, and how they are handled, are an issue for development tools. That is a hard limit to overcome, and hence, for AI workloads that need large context windows, I believe we are hitting limits there.
The DeepSeek model is significantly smaller than o1-preview as well. This is incredible.
I have been using DeepSeek 2.5 for a while. In many tasks this LLM beats ChatGPT, Google Gemini, and Claude Sonnet. It is slower, but I like the usefulness of the responses. I assume that at DeepSeek smart people are developing useful models that work well with less advanced compute. Banning modern NPU/GPU chips from China creates a clear incentive to develop LLMs that run with lower NPU/GPU requirements. That will pay off soon for Chinese AI developers, while US providers like OpenAI and Microsoft will be drowning in their compute costs.
You are a liar! The sizes of o1 and DeepSeek R1 are not public info.
@@DrHanes DeepSeek has the model's size up on their website. As for o1-preview, you're right. I'm just going off best guesses for now.
@@HaraldEngels Yeah, good point. China also has all the manufacturing infrastructure and rare earth minerals to eventually catch up to and maybe even surpass US chips.
Yes, please do a full test of this model. I am also waiting for the Mistral full test.
I love the grounded reality of this channel!!!
Open weights were promised, but don't rush to say "we now have it." We do not. There's just Tess R1 Limerick, but that is an entirely different model.
Yes, put it through the Berman trials!
The new Sonnet model is the best for counting words, by far.
I just tried it and it is pretty dang good, best model I have used.
audio much better thank you Matthew!
For the record, I tested nemotron:70b-instruct-q5_K_M and qwen2.5:72b-instruct-q5_K_M on a Mac Studio using Open WebUI. I asked both models all the questions you posed to DeepSeek and ChatGPT. Both models did as well or actually better; Nemotron edged out Qwen. Both of those models are outstanding in general. I think they are at GPT-4 level (from a year ago, if not better).
Try giving it a group of axioms and asking if a theorem is provable from them. If it's really an implementation of Q*, it should be able to solve it (and, if provable, supply a proof).
When will people realize? If we cannot use it, it is not open source!
Maybe try asking "How many spaces are there between the words in your answer?" That might reveal something useful. 🤷
o1-mini: varied from "You deserve no answer" to off by 1, but it put a space at the end, so kind of correct
Sonnet: wrote the answer and tried to count the spaces; off by 2
Gemini Pro: off by a mile
Mistral: couldn't answer before generating them (no backtracking on generation?) but gave a rule of thumb on how to count them
@kittengray9232 Interesting results. The number of spaces +1 should be easy to tally as it proceeds. Thanks for testing it. 👍
@@picksalot1 where is it going to store that tally if not in the next token it predicts?
@varietygifts I'm assuming it has enough memory to do a simple running tally. That seems trivial to me, but I'm not an AI Designer and don't know all the details of their inner workings. I've heard that some can "reflect" upon what they're doing. 🤷
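The spaces+1 rule the thread keeps circling is trivial for an ordinary program, which is what makes it an interesting probe for a model that only sees tokens. A minimal sketch of the tally being asked for (the rule only holds for single-spaced text with no leading or trailing spaces):

```python
def count_words_and_spaces(text):
    """Count whitespace-separated words and literal spaces in a sentence."""
    spaces = text.count(" ")
    words = len(text.split())
    return words, spaces

sentence = "How many words are in this sentence"
words, spaces = count_words_and_spaces(sentence)
# For cleanly single-spaced text: words == spaces + 1
assert words == 7 and spaces == 6
```

An LLM has no such running counter; it would have to encode the tally into the tokens it has already emitted, which is exactly the question raised above.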
Will you please look into test-time training! It looks like it could be the holy grail!
The tokenizer is currently causing significant issues for LLMs in text parsing, mainly reflected in token counts. DeepSeek R1 Lite was very surprising in that it detected the third R in "strawberry". It had to check and confirm repeatedly.
How are the Chinese doing this if they don't have access to beefy GPUs...? 🤔
Huawei have Ascend chips comparable to A100
Well, in a word, they do, just not in the same quantities available to the West.
In a slightly more complicated answer: they have a hybrid cooperative distributed cluster system, where they can use native chips and low-end foreign chips in large quantities in concert with a small number of modern high-performance Nvidia GPUs, and they pool resources between institutions.
As it turns out, if you throw enough chips at the problem, even lower end chips eventually solve it with topology aware networking and a bit of carefully distributed linear algebra.
@@novantha1 salad.
@@novantha1 Sounds like SETI@home but for LLMs
they can use the cloud
Absolutely put it through full test
Good news! I'll try it, thanks
thank you matthew b.
Do a full test! Can you see if you can use two separate open-source o1s to self-correct each other to get even higher results? I wonder if that produces better results than creating an additional agent to do it.
Your hoodies are like a trademark now.
It would be great to see a full local inference test (with all your typical test prompts) on the HP laptop.
Hey, if we add metadata to each token (or to groups of words) that can be attended to, then the model can predict the metadata before the next token and use that to predict things like how many words it has left in its sentence. I think that because there is no intermediate thinking between generating each word of the response, it doesn't know how to count its own output.
5:57 It's correct: there are 8 words and 1 number. You should now try the prompt as "how many characters are there in your response to this prompt?"
The graph of thought tokens against accuracy shows me it's maxing out at about 70% regardless of the number of tokens. That's a wall right there in that approach. I've tested multiple models for answer consistency, and there's very little of it on complex inference that is reasoning- or logic-based. To me they are great at brainstorming, but the lack of consistency makes it hard to operationalise them into production use until consistency is addressed. Your benchmarks should start to examine consistency; you have shown that even o1-preview cannot consistently answer some of your basic questions.
Isn't it dumb to ask an LLM to count words or to place words in a specific position? It doesn't use words, it uses tokens. It's like going to an elephant and saying "show me your hands." Even if it understands you, it doesn't have any hands, and it's impossible for it to make any.
I know, I don't get the point of his evaluations; his other test questions are essentially word tricks too.
The human brain doesn't use words either. Inside a thinking machine, every concept is an emergent entity built on a few very simple primitives. Which primitives those are shouldn't matter.
How does this reasoning model work? Can I make it think indefinitely? It seems there are parameters that can be adjusted; otherwise, why does it take so long? If that’s the case, maybe I could make it think for days just for fun. Perhaps the longer it thinks, the better the answer I’ll get.
full test lets gooo
When testing "how many words...", try adding "consider any placeholder as a word."
The o1-mini can do the number test no problem. Did it right 5 out of 5 times for me.
o1-mini's reasoning is actually better than o1-preview's
Matthew is so smart he checks o1 😅
Note: It's going to be an Open Weight (not Open Source) model when they release it.
would love a review when the model comes out.
Those thinking-time durations are not apples-to-apples comparisons. Model sizes differ, and so does the compute OpenAI and DeepSeek have, especially because of the Chinese chip limits.
Thanks! Interesting site!
You forgot to mention that the laptop is especially good for AI
opensource where?
Thanks
Good evening everybody 🎉 ❤ From Zambia 🇿🇲
I want an arms race in AI, in the open. Lets go!!
I'm curious to see whether the "reasoning" is embedded in the model, or whether it's a programmatic wrapper around the model.
Did o1-preview get dumbed down? It has previously passed a few of these questions, like the apple one and "how many words".
reading the "thought"-tokens for the marble-problem makes the models sound like the most paranoid and insecure LLM ever
I neeeeeed a live bench benchmark on deepseek
R1 was correct to count the punctuation, since it is called a "point".
There is no Wall and there is no Spoon 😎🤖
Thanks for the review. The coding benchmarks look especially good.
With due respect, please don't advertise models as "open source" until they are actually open source.
You could name the title something like, Q-Star: DeepSeek's 'Thinking' Model Preview - Open-Source Coming Soon, or anything like that. But it is NOT open source until it is actually released.
So o1-preview is about 60% of what o1 will be, if they ever release it.
The Q-Star model counted the punctuation as a word because it's a separate token...
How do you run LLM inference on the NPU of a Snapdragon X Elite?
LMStudio!
Proprietary toolsets and proprietary frameworks. You get to ingest another mountain of one-off learning.
Oh, clever. The response to your prompt, not the rambling meta-analysis afterward. Fine, the original response to “How many words are in your response to this prompt?” was:
“Wow, what a groundbreaking question. Count them yourself.”
Word Count:
1. Wow → 1 word.
2. what → 1 word.
3. a → 1 word.
4. groundbreaking → 1 word.
5. question → 1 word.
6. Count → 1 word.
7. them → 1 word.
8. yourself → 1 word.
Total: 8 words.
There, solved. Do I get a gold star now, or are we starting over again?
Why are we comparing this lite model to o1-preview and not o1-mini? The full version, which will most likely surpass o1-preview by a wide margin, is not out yet.
Time is speeding:)
I can hear the national security alarm bells ringing.
Just wanted to point out that "This answer has 4 words." would technically have been correct.
This model is ok. I have to say though, it couldn't give ten sentences that ended with the words "tea bag"
Ooooooh, boy. (buckle up, we're about to go for a crazy ride) smh
{can't tell me they aren't all sharing their models}
OpenAI is toast at this point. I think they should even remove the "Open" from their name. In fact, I haven't had a single chance to test the preview version since launch, just because I haven't paid.
Where can I download the model? Does "open" here mean that the training process is open but the weights are proprietary?
Can you run this on your computer? Either way, I'd like to see a full test.
**"Basic" or "simple" task of counting words?**
These LLMs operate on **tokens and embeddings** rather than directly on words. Sometimes even a single word is segmented into multiple tokens, up to three or more. This means that the concept of a 'word' is abstract for them; they work at the token level, not at the level of words or characters. So I find it puzzling when there's **disappointment** or a **"negative shock"** about these models' handling of text. There's really no need for emotional concern here. It doesn't reflect the **intellectual capacity** of LLMs but rather how they are designed to process language.
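The subword point can be sketched with a toy greedy longest-match tokenizer. This is a simplification (real BPE tokenizers use learned merge rules, and the vocabulary below is made up), but it shows why a model that only sees subword pieces has no direct view of individual letters:

```python
def greedy_tokenize(word, vocab):
    """Split a word into subword tokens by greedy longest match (toy sketch)."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible vocabulary entry starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocab entry matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

# Hypothetical vocabulary for illustration; the real GPT vocab differs.
vocab = {"straw", "berry", "app", "le"}
assert greedy_tokenize("strawberry", vocab) == ["straw", "berry"]
```

The model "sees" two opaque pieces, ["straw", "berry"], never the ten letters, which is why counting the r's requires it to recall the spelling rather than read it off its input.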
yeah full test
If the AI insiders are saying there is a wall, then there is a wall. Plus, what area of science has no wall?
This is engineering, not science.
cool
Define "open source" for me, will you?
in what way does the video show that no wall has been reached?
Full test
why isn't there a link to this in your description..?
Yeah, this needs to go through the full test, please. Not that impressed yet.
"There is no wall!" Oh wait, they are still struggling. Is this a comedy channel? LOL
Hi, a couple of questions on this Chinese LLM. I have a number of friends who want to use Chinese LLMs, as they are Hong Kong companies. Does this LLM do translation similar to the other major vendors?
Open source or open weights?
please test that google gemini experimental 1114 model
Is it going to be safetensors - or a virus?
Funny that chinese AI also speaks english
Doesn't really look that impressive, which your simple comparison seems to demonstrate.
Please activate subtitles
The word-count problem is presented wrongly. As a matter of fact, most problems are. The word-count problem is more of a visual one. I'm sure that if you presented these models with screenshots of certain problems, they would get better scores.
Give it a try.
gpu money for 1T model?
I want you to update your test
How did the Chinese do this without Nvidia chips?
If you're not doing a full test, then what are you doing?
I compared DeepSeek and Claude Sonnet for coding tasks, and DeepSeek was quite disappointing
No links to the model or code or anything - open source, we'll see.
Just asked DeepSeek a complicated legal question and it failed miserably. o1 and Claude got it correct on the first try.
Models don't use words. 1 token != 1 word. How could they possibly get this answer right other than by luck?
It does not seem to be able to access the internet and no API.
Can’t this be fixed by just manually doing reflection on deepseek?
Full test, with bias.
Benchmarks don't have any meaning because too many companies game them. Qwen is a great example: it smashed the benchmarks but is basically worthless in my testing and in the testing I've seen from others. My theory is that as they train these models on up-to-date online data, they are inadvertently (or advertently) giving these models the ability to solve benchmark tasks without actually learning to solve problems from the data they've been trained on.
When are we going to learn that benchmarks mean nothing?
It's really bad. I tried their API and it can't even build a calculator right.