Optimization for Deep Learning (Momentum, RMSprop, AdaGrad, Adam)

  • Published on Jun 1, 2024
  • Here we cover six optimization schemes for deep neural networks: stochastic gradient descent (SGD), SGD with momentum, SGD with Nesterov momentum, RMSprop, AdaGrad, and Adam. A brief reference sketch of the update rules follows the chapter list below.
    Chapters
    ---------------
    Introduction 00:00
    Brief refresher 00:27
    Stochastic gradient descent (SGD) 03:16
    SGD with momentum 05:01
    SGD with Nesterov momentum 07:02
    AdaGrad 09:46
    RMSprop 12:20
    Adam 13:23
    SGD vs Adam 15:03
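
For reference, here is a minimal NumPy sketch of the six update rules listed above, written per parameter. This is an editorial aid rather than the video's own code; names and default values such as lr, rho, beta, beta1, beta2, and eps are illustrative choices, not necessarily the video's notation.

```python
import numpy as np

def sgd(w, grad, lr=0.01):
    # Plain SGD: step against the (stochastic) gradient.
    return w - lr * grad

def sgd_momentum(w, v, grad, lr=0.01, rho=0.9):
    # Momentum: accumulate a velocity that smooths successive gradients.
    v = rho * v - lr * grad
    return w + v, v

def sgd_nesterov(w, v, grad_at_lookahead, lr=0.01, rho=0.9):
    # Nesterov: same update, but the gradient is evaluated at the
    # look-ahead point w + rho * v rather than at w itself.
    v = rho * v - lr * grad_at_lookahead
    return w + v, v

def adagrad(w, s, grad, lr=0.01, eps=1e-8):
    # AdaGrad: scale each coordinate by the root of its accumulated
    # squared gradients (the accumulator only grows).
    s = s + grad**2
    return w - lr * grad / (np.sqrt(s) + eps), s

def rmsprop(w, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    # RMSprop: like AdaGrad, but with an exponentially decaying average
    # of squared gradients, so old history is forgotten.
    s = beta * s + (1 - beta) * grad**2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: momentum-style first moment plus RMSprop-style second moment,
    # both bias-corrected for their zero initialisation (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```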

Comments • 33

  • @rhugvedchaudhari4584 · 6 months ago +9

    The best explanation I've seen till now!

  • @AkhilKrishnaatg · 2 months ago +1

    Beautifully explained. Thank you!

  • @idiosinkrazijske.rutine · 10 months ago +2

    Very nice explanation!

  • @dongthinh2001 · 5 months ago

    Clearly explained indeed! Great video!

  • @zhang_han · 7 months ago +7

    Most mind blowing thing in this video was what Cauchy did in 1847.

  • @markr9640 · 4 months ago

    Fantastic video and graphics. Please find time to make more. Subscribed 👍

  • @saqibsarwarkhan5549 · 29 days ago

    That's a great video with clear explanations in such a short time. Thanks a lot.

  • @luiskraker807 · 4 months ago

    Many thanks, clear explanation!!!

  • @Justin-zw1hx · 10 months ago +2

    keep doing the awesome work, you deserve more subs

  • @rasha8541 · 5 months ago

    really well explained

  • @benwinstanleymusic · 2 months ago

    Great video thank you!

  • @makgaiduk · 6 months ago

    Well explained!

  • @physis6356 · 1 month ago

    great video, thanks!

  • @leohuang-sz2rf · 1 month ago

    I love your explanation

  • @TheTimtimtimtam · 11 months ago +1

    Thank you, this is really well put together and presented!

  • @tempetedecafe7416 · 5 months ago +1

    Very good explanation!
    15:03 Arguably, I would say that it's not the responsibility of the optimization algorithm to ensure good generalization. I feel like it would be more fair to judge optimizers only on their fit of the training data, and leave the responsibility of generalization out of their benchmark. In your example, I think it would be the responsibility of model architecture design to get rid of this sharp minimum (by having dropout, fewer parameters, etc...), rather than the responsibility of Adam not to fall inside of it.

  • @wishIKnewHowToLove · 1 year ago

    thank you so much :)

  • @MikeSieko17 · 2 months ago

    Why didn't you explain the (1-\beta_1) term?
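
As an editorial note (not the uploader's reply): the (1-\beta_1) factor makes the first moment an exponential moving average whose weights sum to one, and the later division by (1-\beta_1^t) corrects the bias that comes from initialising m_0 = 0:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t,
\qquad
\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}
```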

  • @Stopinvadingmyhardware · 1 year ago +1

    nom nom nom learn to program.

  • @wishIKnewHowToLove · 1 year ago

    Really? I didn't know SGD generalized better than Adam.

    • @deepbean · 1 year ago

      Thank you for your comments, Sebastian! This result doesn't seem completely clear-cut, so it may be open to refutation in some cases. For instance, one Medium article concludes that "fine-tuned Adam is always better than SGD, while there exists a performance gap between Adam and SGD when using default hyperparameters", which suggests the problem is one of hyperparameter optimization, which can be more difficult with Adam. Let me know what you think!
      medium.com/geekculture/a-2021-guide-to-improving-cnns-optimizers-adam-vs-sgd-495848ac6008

    • @wishIKnewHowToLove · 1 year ago

      @@deepbean it's sebastiEn with an E. Learn how to read carefully :)

    • @deepbean · 1 year ago +3

      🤣

    • @deepbean · 1 year ago

      @@wishIKnewHowToLove my bad

    • @dgnu · 11 months ago +8

      @@wishIKnewHowToLove bruh cmon the man is being nice enough to u just by replying jesus

  • @donmiguel4848 · 2 months ago

    Nesterov is silly. You have the gradient g(w(t)) because the weight w computes the neuron's activation in the forward pass and so contributes to the loss. You don't have the gradient g(w(t)+pV(t)), because at this fictive weight position no inference was run, so you have no information about what the loss contribution at that position would have been. It's PURE NONSENSE. But it only costs a few extra calculations without doing much damage, so no one really seems to complain about it.

    • @Nerdimo · 1 month ago

      This does not make sense…at all. The intuition is that you're making an educated guess about the gradient in the near future: you're already going to compute g(w(t) + pV(t)) anyway, so why not correct for that and move in that direction on the current step instead? (A sketch of this look-ahead step follows the thread below.)

    • @donmiguel4848 · 1 month ago

      @@Nerdimo Let's remember that the actual correct gradient of w is the average gradient over ALL samples. So for runtime-complexity reasons we already make an "educated guess", or rather a stochastic approximation, with our per-sample or per-batch gradient, by using a running gradient or a batch gradient. But those approximations are based on inference we have actually calculated. Adding to that uncertainty some guessing about what will happen in the future is not a correction based on facts; it's pure fiction. Of course, for every training process you will find hyperparameter configurations with which this fiction is beneficial, just as you will find configurations with which it is not. But you get this knowledge only by experiment, instead of having an algorithm that is beneficial in general.

    • @Nerdimo · 1 month ago

      @@donmiguel4848 Starting to wonder if this is AI generated “pure fiction” 😂.

    • @Nerdimo · 1 month ago

      @@donmiguel4848 I understand your point; however, I think it's unfair to discount it as "fiction". My main argument is just that there are intuitions for why doing this could help take better steps in the direction of a local minimum of the loss function.

    • @donmiguel4848 · 1 month ago

      @@Nerdimo These "intuitions" are based on assumptions about the NN that don't match reality. We humans understand a hill and a sink, or a mountain and a canyon, and we assume the loss function is like that, but the real power of neural networks is the non-linearity of the activations and the flexibility of many interacting non-linear components. If our intuition matched what is actually going on in the NN, we could write an algorithm that would be much faster than the NN. But NNs are far more complex and beyond human imagination, so I think we have to be very careful with our assumptions and "intuitions", even though that may seem "unfair". 😉
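
On the look-ahead gradient g(w(t) + pV(t)) debated in this thread: a minimal sketch, assuming a hypothetical loss_and_grad(w) callable that runs a forward and backward pass at whatever weight vector it is given, of how that gradient is obtained in practice. The look-ahead point is just another weight vector, so evaluating the loss there is no different from evaluating it at w(t); the only cost is one shifted gradient evaluation per step.

```python
import numpy as np

def nesterov_step(w, v, loss_and_grad, lr=0.01, rho=0.9):
    # One SGD-with-Nesterov-momentum step.
    # `loss_and_grad` is a hypothetical callable: given any weight vector,
    # it runs the forward and backward pass on the current batch and
    # returns (loss, gradient).
    lookahead = w + rho * v              # the "fictive" position from the thread
    _, grad = loss_and_grad(lookahead)   # gradient actually evaluated there
    v = rho * v - lr * grad              # velocity update uses the look-ahead gradient
    return w + v, v
```

In practice, many implementations use an algebraically equivalent rearrangement that keeps the stored parameters at the look-ahead point, so no separate shifted forward pass is needed.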