This open model is so good, hard to believe that this is MIT license.
Well, with TikTok getting regulated, there needs to be a new hole.
When you read the paper, DeepSeek themselves say there is a lot more meat left on the bone. Expect a follow-up model pretty quickly.
This is the greatest gift for the upcoming Chinese New Year holiday.
That’s why there is a discount for API. I am going to use it during the holiday.
You mean Lunar New Year
@JH-bb8in I'm specifically referring to the Chinese start date; not every lunar calendar is the same. The Indian one starts on March 22, for example.
I always like your assessments. No hype
Thanks this is exactly what I am going for
The most useful video about DS R1 on YouTube. I enjoy the concise and approachable technical details in your videos. Please never stop posting.
Nice deep dive. These models are great, and are actually doing something I wasn't sure was possible. Now that I see it, I'm not sure why I thought this would be difficult. 🤷
You make a really good point: when you actually see what they're doing, it's not as complicated as a lot of people would think.
Thank you and greets from Germany! love your videos
Curious about the multilingual capability here, will definitely play around soon! Also, for testing reasoning I would suggest a large, complex task, treating it like a one-shot solver, not a chat model. At least that seems to be the trick and strength of the OpenAI o models right now. Best!
Reasoning combined with test-time training would be killer for local OSS models. We need models with these techniques combined somehow. I believe at that point we'd be beyond AGI, probably even at ASI.
Always concise explanation and right to the point. Thank you Sam :D Great video!
Thanks, much appreciated
Do you know if they released the distillation procedure?
So that we can, for instance, distill it onto qwen2.5-coder
AFAIK they haven't released the data, but I talked about the distillation in the video. They basically just do a fine-tune on 800k examples sampled from R1, plus DeepSeek-V3 outputs for non-reasoning tasks.
@samwitteveenai oh yeah I could reproduce that in a hot minute! I'll get on it
I expect they may end up doing this. In the paper they said they did not do RL on reasoning for engineering/coding tasks, thus R1 doesn't have a huge improvement over V3 for coding. Once they do the RL for coding, I suspect they may release something like this.
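Roughly, the distillation recipe described above could be sketched like this. This is a hypothetical illustration, not DeepSeek's actual pipeline: `sample_from_teacher` is a stand-in for real API calls to R1, and the answer check is a toy version of rejection sampling on verifiable answers; the real recipe fine-tunes a smaller model (e.g. a Qwen2.5 base) on the ~800k kept samples.

```python
# Sketch of distillation-by-SFT: sample reasoning traces from a teacher
# model, keep only traces whose final answer verifies (rejection sampling),
# and collect them as fine-tuning records for a smaller student model.

def sample_from_teacher(prompt, n=4):
    """Placeholder: in practice this would call the R1 API for n candidates."""
    return [f"<think>...reasoning about {prompt}...</think> 4" for _ in range(n)]

def is_correct(completion, reference_answer):
    # Toy verifiable check: compare the text after the </think> tag.
    answer = completion.split("</think>")[-1].strip()
    return answer == reference_answer

def build_sft_dataset(tasks):
    """tasks: list of (prompt, reference_answer) pairs -> SFT records."""
    records = []
    for prompt, ref in tasks:
        for completion in sample_from_teacher(prompt):
            if is_correct(completion, ref):
                records.append({"prompt": prompt, "completion": completion})
                break  # keep one verified trace per prompt
    return records

dataset = build_sft_dataset([("What is 2 + 2?", "4")])
```

The resulting `dataset` would then be fed to a standard supervised fine-tuning loop; no RL is involved in the distilled models, which is why the paper calls it pure SFT.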
Dude, we already passed the point where benchmarks mean nothing!
I wouldn't say they mean "nothing". A model that performs middling or badly on benchmarks is usually not good. Actually, most of the time it's not good.
However, I agree that when we're comparing SOTA models, they become less useful.
We need some empirical metrics, like benchmarks, but we also have to know they don't tell the whole story.
The benchmarks that are really interesting here are DeepSeek-R1 compared to DeepSeek-V3, as they share the exact same base model, so the difference shows the strength of the new post-training compared to a more standard post-training regime.
Most of my tests of the 70B model resulted in a chain of vomited text. It's easy to say it's the wrong model to prompt with "Please write an overview of the German tense Plusquamperfekt." There is a lot to think about there, and yes, the output is far from correct. But there shouldn't be a wrong question, or a wrong model for a certain question.
R for Remarkable
Thank you.
I’ve read on LinkedIn that DeepSeek’s terms & conditions say they hold copyright on applications developed using their models. Is that true? Then it’s not really an MIT license, is it?
Conspiracy theory crap; other labs are panicking and spreading BS all over the net.
If the context length were 2 million+, it would destroy the competition.
And it'll cost a small fortune to run (at that scale)...
Conspiracy theory time! Put on your foil hats!
I don't actually know anything, but I gave DS3 and Claude 3.5 a prompt asking for a paragraph of corporate jargon that uses cliché, catchy business phrases without actually saying anything useful. There were slight variations in the words, but the paragraph structure and phrases were beat-for-beat the same. Same phrases, same order. Wouldn't it be hilarious if DS3 were a slightly modified wrapper around Claude?
A single data point is all you need for a conspiracy theory, right?
OK, but if it were and they sold it this cheap, they'd be losing a ton of money.