DeepSeek R1 Theory Overview | GRPO + RL + SFT

  • Published Feb 7, 2025

Comments • 103

  • @danielhemmati 8 days ago +89

    I like this part of the internet

  • @adityavipradas3252 4 days ago +8

    Read the paper and then saw your flowchart. This really helped a lot in understanding the workflow. Thanks.

    • @deeplearningexplained 3 days ago +2

      Glad it helped! Don't forget to check out the other videos in the description for full context!

    • @adityavipradas3252 1 day ago +1

      @@deeplearningexplained Yannic's video really helps with the RL part. Thanks for the recommendation.

  • @sheldonsebastian7232 8 days ago +13

    That map is lit! It's easy to follow the big picture.

    • @deeplearningexplained 8 days ago +3

      Yes, the map should have been included directly in the paper.
      It would have made this already great paper awesome.

  • @Maicolacola 2 days ago +1

    This was very well explained from a layperson’s perspective. I’m not an expert in this field, but very curious about it. You did a great job breaking things down, and I’m excited to go back to the paper and read it, and maybe understand more of it. Like others have mentioned, I’d love to see you explain the formula with code, so I could follow along at home. Cheers!

    • @deeplearningexplained 2 days ago

      I'm glad you found it useful!
      I've finished a formula and code walkthrough of GRPO over here:
      th-cam.com/video/Yi1UCrAsf4o/w-d-xo.html
      I'm using Hugging Face's implementation of it with their GRPOTrainer!
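
      As a rough illustration of what that walkthrough builds on, here is a minimal sketch of TRL's GRPOTrainer; the dataset, reward function, model checkpoint, and hyperparameters below are placeholders for illustration, not the exact setup used in the video:

        # pip install trl datasets
        from datasets import load_dataset
        from trl import GRPOConfig, GRPOTrainer

        # Toy rule-based reward: favor completions that wrap their reasoning in <think> tags.
        def format_reward(completions, **kwargs):
            return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

        # Any dataset with a "prompt" column works; this one is a small public example set.
        dataset = load_dataset("trl-lib/tldr", split="train")

        trainer = GRPOTrainer(
            model="Qwen/Qwen2-0.5B-Instruct",  # any causal LM checkpoint
            reward_funcs=format_reward,
            args=GRPOConfig(output_dir="grpo-demo", num_generations=4),
            train_dataset=dataset,
        )
        trainer.train()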

  • @ansitun 2 days ago +1

    Thank you! This is extremely helpful. Now binging through all of your videos!

  • @Dresstosweatdottv 6 days ago +4

    This was awesome, I'm learning ML and you break things down so well, thank you.

    • @deeplearningexplained 6 days ago

      Hey thanks for the kind feedback! I’m glad the content was useful :)

  • @dheocahyo7721 4 days ago +1

    Valar Morghulis. Thanks for explaining this paper! I will have a paper reading session about DeepSeek R1 in my office this coming Friday. This really helps to understand it better.

    • @deeplearningexplained 3 days ago

      Hope your reading session is fruitful!
      Valar morghulis!

  • @jiaxinkou5654 6 days ago +7

    The road map is neat!

  • @fsaudm 5 days ago +3

    Amazing video 👏🗣️ just subscribed, and totally looking forward to watching many many more of your videos!!

    • @deeplearningexplained 5 days ago

      Hey there, thanks for the kind words and glad to have you as a subscriber! 🌹

  • @kylezou7040 2 days ago +2

    Precious analysis material from the early days of humans taming LLMs.

  • @AshaVishwanathan-u9o 2 days ago +1

    Great and simple explanation!

  • @ElianHerby 7 days ago +4

    I don't often comment on YouTube, but I just discovered your channel and it's impeccable. Subscribed right away, keep it up, it's great.

    • @deeplearningexplained 6 days ago

      Ah thanks, that's really kind. I'm glad the videos are useful!

  • @stephanembatchou5300 4 days ago +1

    Excellent breakdown

  • @continuallearning0 5 days ago +6

    Great explainer video! Thanks! The reason the smaller models don't gain as much from RL compared to larger models is that they lack the "capacity" (the number of parameters needed) to model reasoning.

    • @deeplearningexplained 5 days ago +3

      Very interesting thought! The weird bit is that they do gain this capacity with the same number of parameters when they are fine-tuned through distillation!

    • @continuallearning0 5 days ago +3

      @deeplearningexplained Good point! Almost like the small models have trouble discovering the reasoning by themselves but can easily replicate it once discovered. I think it has to do with the fact that bigger, overparameterized models have a higher probability of developing additional subnetwork representations, extra capacity for discovery. The smaller model can then use heuristics or simpler principles to replicate it.

    • @deeplearningexplained 4 days ago +1

      I like that interpretation. Generally, larger models seem to behave a tad differently than smaller models in terms of emergent capabilities. They even do in-context learning differently than smaller models and are able to "learn on the fly".

  • @MrMoonsilver 6 days ago +3

    Best discovery in terms of LLM channels in a while! Great content!

  • @ben8718 6 days ago +2

    You are the only one on YouTube explaining the maths behind this monster AI, what in the world??

    • @deeplearningexplained 4 days ago +1

      There are actually two others I found that do this quite well, check them out:
      📌 th-cam.com/video/XMnxKGVnEUc/w-d-xo.html&ab_channel=UmarJamil
      📌 th-cam.com/video/bAWV_yrqx4w/w-d-xo.html&ab_channel=YannicKilcher

  • @GradientChunk 6 days ago +2

    Thank you so much!! Super helpful

  • @philtrem 6 days ago +3

    Excellent breakdown.

  • @clipstok788 8 days ago +7

    a big W for Yachine

  • @albitaulla1448 5 days ago +4

    Great video! Any editor/tool recommendations for reading papers? The one you have here looks great!

    • @deeplearningexplained 5 days ago

      Hey thanks!
      The one I'm using is TLDRAW, it's a very simple whiteboard that I can draw on.
      Other than that I'm using the Firefox reader.

  • @ELum6perML-d4e 8 days ago +4

    I was waiting for it

    • @deeplearningexplained 8 days ago

      Don't forget to check out these two other videos for complementary understanding:
      📌 th-cam.com/video/XMnxKGVnEUc/w-d-xo.html&ab_channel=UmarJamil
      📌 th-cam.com/video/bAWV_yrqx4w/w-d-xo.html&ab_channel=YannicKilcher

    • @ELum6perML-d4e 8 days ago +1

      @ Actually I'm halfway through the second video you suggested 😂😂

    • @deeplearningexplained 8 days ago

      @ haha keep watching it! :)

  • @eagle43257 6 days ago +2

    Thank you brother... we enjoyed it.

  • @AntonioMartinezRamirez85 6 days ago +1

    Thanks for the explanation! Great video!

  • @teddyperera8531 7 days ago +2

    Amazing explanation. Thank you

  • @DigitalAlligator 5 days ago +2

    @10:02 Are you sure? All other terms are positive, and this KL divergence is negative, so when minimizing the loss, this divergence actually goes up, so it seems to me that it encourages the model to be different from the reference model.

    • @deeplearningexplained 5 days ago +1

      Great question, it's maximizing the objective function, not minimizing it: "[...] optimizes the policy model 𝜋𝜃 by maximizing the following objective."
      The min is for choosing either the normal policy*advantage or the clipped policy*advantage (the full objective is written out just after this thread).

    • @DigitalAlligator 4 days ago +1

      @@deeplearningexplained Sorry, yes you are right, this is not minimizing a loss function, it is maximizing the objective function. I was wrong, please correct my words.

    • @deeplearningexplained 4 days ago

      @@DigitalAlligator No worries, thank you for your question because I also got confused the first time I read it.
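
      For reference, here is the objective being discussed, transcribed from the paper's GRPO formulation in lightly simplified notation (A_i is the group-normalized reward). The whole expression is maximized, so the clipped ratio term pushes up the advantage while the β-weighted KL term keeps the policy close to the reference model:

        \[
        J_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(
        \min\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}\,A_i,\;
        \operatorname{clip}\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,1-\varepsilon,\,1+\varepsilon\right)A_i\right)
        -\beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\right)\right)\right],
        \qquad
        A_i=\frac{r_i-\operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}
        \]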

  • @jairguedesferreira 7 days ago +1

    Congratulations!

  • @fintech1378 7 days ago +3

    best explanation

  • @lojian 4 days ago +1

    Thanks a lot!! Good explanation.

  • @mostinho7 8 days ago +11

    I have a comp Eng undergrad degree, but what kind of math do I need to learn to be able to make sense of these math formulas? They are complete Greek to me. What am I missing :(

    • @deeplearningexplained 8 days ago +11

      Awesome question, they are Greek to you because you say the Greek letters in your head, not what they actually mean.
      Check out this video I made on how to read deep learning math (or other very dense math) easily: th-cam.com/video/YXWxVxQ6AeY/w-d-xo.html

  • @JohnNauman 3 days ago +1

    I assume the harmlessness weights are how it censors certain topics, like a certain place on a specific date.

    • @deeplearningexplained 3 days ago

      Yes, this is strongly implied in the paper where the harmlessness is hurting a Chinese benchmark.

  • @krpcannon123 1 day ago

    Are the policy updates updating a separate policy NN or directly the parameters of the underlying pretrained model?

  • @owenbianchi6729 6 days ago +1

    great video, thanks!

  • @jasper4803 5 days ago +3

    Thanks for the wonderful explanation. What paper reader are you using?

  • @Clipaholick 8 days ago +2

    Sick!

  • @祖国翔 7 days ago +3

    I have a doubt related to "We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples." At this step, is the model being fine-tuned the original DeepSeek-V3-Base, or the DeepSeek-V3-Base after Cold Start and Reasoning-oriented Reinforcement Learning?
    I asked DeepSeek the question above, and it replied as below:
    At the step where it mentions "We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples," the model being fine-tuned is not the original DeepSeek-V3-Base. Instead, it is the DeepSeek-V3-Base after the Cold Start and Reasoning-oriented Reinforcement Learning stages.
    I don't know if the answer is correct. Could you help double-check? Thanks a lot!

    • @deeplearningexplained 6 days ago +1

      Very good question, it's the original DeepSeek-V3-Base! It's quite confusing.
      At this point, all the other models used in the R1 path were used, in one way or another, to create the 800K dataset.

  • @SevenErhan 6 days ago +1

    Thank you very much

  • @caseyyeow1649 2 days ago +1

    Your map is superb! Where can I download it? Thanks.

    • @deeplearningexplained 2 days ago

      It is really useful. It's not mine though, I found it here: www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/

  • @enlyly7510 2 days ago +1

    Could I get your picture of the R1 training workflow graph?

    • @deeplearningexplained 2 days ago

      For sure, I found it over here: www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/

  • @quippy8402 6 days ago +1

    Thank you for the explanation. So, one can't do distillation from OpenAI without the availability of their models' logits.

    • @deeplearningexplained 6 days ago +2

      No they can't, but they can use OpenAI to generate reasoning and non-reasoning data, which, as we have seen in the paper, is an important step in the pipeline for R1.
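
      To make the distinction concrete, here is a rough sketch of the two notions of distillation; the function names and tensor shapes are illustrative, not from the paper. Logit-based distillation needs the teacher's full token distribution (which closed APIs don't expose), while the data distillation used for R1's small models only needs teacher-generated text and ordinary supervised fine-tuning:

        import torch.nn.functional as F

        # (1) Logit-based distillation: requires the teacher's full distribution over the
        #     vocabulary at every position, so it can't be done against a closed API.
        def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
            # both logits: (batch, seq_len, vocab_size)
            t = temperature
            teacher_probs = F.softmax(teacher_logits / t, dim=-1)
            student_log_probs = F.log_softmax(student_logits / t, dim=-1)
            return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

        # (2) Data distillation: the teacher only produces text; the student is
        #     fine-tuned on it with plain cross-entropy (SFT).
        def sft_loss(student_logits, target_token_ids):
            # student_logits: (batch, seq_len, vocab_size); target_token_ids: (batch, seq_len)
            return F.cross_entropy(
                student_logits.reshape(-1, student_logits.size(-1)),
                target_token_ids.reshape(-1),
            )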

  • @SuperSoliton 3 days ago +1

    Great video, thanks.
    Now I believe people in China are indeed good at math and very generous.

  • @johngrabner 7 days ago +2

    Is the GRPO reward for any text between the tags? Is this done after the whole sequence is generated, or as soon as text appears in that area?

    • @deeplearningexplained 7 days ago

      Great question, the details are a bit vague on that front for the formatting, but I believe it's for a full reward loop.
      That is, you need an answer that is verifiable for the reward signal to be propagated back to the whole sequence at the same time.
      Some bits of the reward pertain to what's in the think tag (like the consistency and formatting rewards); others, like accuracy, check the answer only.
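
      A minimal sketch of how such rule-based rewards are often implemented; the tag structure matches the paper's template, but the exact checks and the 0.5 weighting below are made up for illustration:

        import re

        THINK_ANSWER = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

        def format_reward(completion: str) -> float:
            # Reward the expected <think>...</think><answer>...</answer> structure.
            return 1.0 if THINK_ANSWER.search(completion) else 0.0

        def accuracy_reward(completion: str, ground_truth: str) -> float:
            # Only the final answer is checked against the verifiable ground truth.
            match = THINK_ANSWER.search(completion)
            if match is None:
                return 0.0
            return 1.0 if match.group(2).strip() == ground_truth.strip() else 0.0

        def total_reward(completion: str, ground_truth: str) -> float:
            # Computed once the full sequence is generated, then GRPO propagates the
            # single scalar back over the whole sampled output.
            return accuracy_reward(completion, ground_truth) + 0.5 * format_reward(completion)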

  • @kushkaptain4205 1 day ago

    How good is the 1.5B version at, let's say, top high school math?

  • @larjunmnath 7 days ago +1

    Thanks for the overview, I was too lazy to read it 😝

    • @deeplearningexplained 6 days ago

      Glad it was useful! Do read the paper though, it’s quite well written!

  • @BizRid 4 days ago +1

    🐐

  • @jebprime 7 days ago +1

    I wonder if mixing languages allows it to think in subtle, slightly different ways belonging to different cultures/languages? And that's why aligning it to stick to one language resulted in a slight degradation in performance.

    • @deeplearningexplained 6 days ago

      That part is one of the most fascinating. I think it has to do with how the knowledge is encoded within its weights.
      A concept might be easier for the model to reach with a token sequence belonging to Chinese, while others might be easier in English.
      It wouldn't surprise me if some of the token sequences aren't even readable, but are broader ideas stitched together.

  • @fintech1378 7 days ago +1

    Please do the detailed math from the paper, like you suggested.

    • @deeplearningexplained 6 days ago +1

      Yes, I’m preparing a detailed breakdown of GRPO and I’ll try to get some code to follow along too.

  • @xinformatics 7 days ago +3

    Kache - yacineMTB?

  • @fintech1378 7 days ago +2

    Is this kache on X?

  • @xiakj 7 days ago +1

    Can you share the chart in the video?

    • @deeplearningexplained 6 days ago +1

      Yes for sure, it’s here:
      www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/

  • @ben8718 6 days ago +1

    Your channel's name fits quite well with the new AI model, coincidence?

  • @loicndo8469 8 days ago +1

    Thanks so much! Don't forget to make some nice videos in French for us too.

    • @deeplearningexplained 7 days ago +1

      Haha I'll try! Most of my audience is English-speaking, but I haven't forgotten my French and Québécois viewers!

  • @amunif_ 7 days ago +1

    Yacine, would it be possible to get on a Zoom call with you to discuss AI research?

    • @deeplearningexplained 7 days ago +1

      Hey there, for sure.
      Shoot me an email at mail@yacinemahdid.com and I'll organize it.

    • @amunif_ 7 days ago

      @ Thanks, will do.

  • @DigitalAlligator 5 days ago +1

    Wow, I didn't know John Snow was also an AI expert.

  • @beaniegamer9163 6 days ago +1

    Well, eh... I can only read a b c. 😅

    • @deeplearningexplained 6 days ago

      Haha yeah, it's a bit difficult to read the GRPO formula.
      If you are interested in improving your math reading skills, I've got a video that covers the technique I use for complicated formulas:
      m.th-cam.com/video/YXWxVxQ6AeY/w-d-xo.html

  • @ps3301 7 days ago +1

    Lazy video: if you want to teach, translate the formula into code to demonstrate it.