25. Stochastic Gradient Descent

  • Published 29 Jun 2024
  • MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018
    Instructor: Suvrit Sra
    View the complete course: ocw.mit.edu/18-065S18
    YouTube Playlist: • MIT 18.065 Matrix Meth...
    Professor Suvrit Sra gives this guest lecture on stochastic gradient descent (SGD), which randomly selects a minibatch of data at each step. SGD is still the primary method for training large-scale machine learning systems (a minimal minibatch sketch follows after this description).
    License: Creative Commons BY-NC-SA
    More information at ocw.mit.edu/terms
    More courses at ocw.mit.edu
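
A minimal sketch of the minibatch idea described above (illustrative only; the data, sizes, and step size below are invented and not from the lecture):

```python
import numpy as np

# Minimal minibatch SGD sketch for least squares f(x) = (1/2n) * sum_i (a_i . x - b_i)^2.
# Sizes, step size, and data are made up for illustration.
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

x = np.zeros(d)
batch_size, lr = 32, 0.01
for epoch in range(20):
    order = rng.permutation(n)                 # shuffle the data once per epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]  # the randomly selected minibatch
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
        x -= lr * grad                         # step against the minibatch gradient

print("distance to x_true:", np.linalg.norm(x - x_true))
```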

Comments • 71

  • @elyepes19
    @elyepes19 3 years ago +18

    For those of us who are newcomers to ML, it's most enlightening to learn that, unlike "pure optimization," which aims to find the most exact minimum possible, ML instead aims to get "close enough" to the minimum in order to train the ML engine; if you get "too close" to the minimum, an over-fit of your training data might occur. Thank you so much for the clarification.

  • @rogiervdw
    @rogiervdw 4 years ago +33

    This is truly remarkable teaching. It greatly helps understanding and intuition of what SGD actually does. Prof. Sra's proof of SGD convergence for non-convex optimization is in Prof. Strang's excellent book "Linear Algebra and Learning from Data", p. 365.

  • @BananthahallyVijay
    @BananthahallyVijay 2 years ago

    Wow! That was one great talk. Prof. Suvrit Sra has done a great job of giving examples just light enough to drive home the key ideas of SGD.

  • @schobihh2703
    @schobihh2703 9 months ago +1

    MIT is simply the best teaching around. Really deep insights again. Thank you.

  • @Vikram-wx4hg
    @Vikram-wx4hg 3 years ago +4

    What a beautiful beautiful lecture!
    Thank you Prof. Suvrit!

  • @JatinThakur-dv7mt
    @JatinThakur-dv7mt 1 year ago +8

    Sir, you are a student from Lalpani school, Shimla. You were the topper in +2. I am very happy for you. You have reached a level where you truly belong. I wish you more and more success.

    • @ASHISHDHIMAN1610
      @ASHISHDHIMAN1610 1 year ago

      I am from Nahan, and I’m watching this from Ga Tech :)

  • @rembautimes8808
    @rembautimes8808 2 years ago +4

    Amazing for MIT to make such high-quality lectures available worldwide. Well worth the time investment to go through these lectures. Thanks Prof. Strang, Prof. Suvrit, and MIT.

  • @cobrasetup703
    @cobrasetup703 2 years ago +1

    Amazing lecture, I am delighted by the smooth explanation of this complex topic! Thanks.

  • @sukhjinderkumar2723
    @sukhjinderkumar2723 2 years ago +2

    Hands down one of the most interesting lectures. The way the professor showed research ideas here and there and almost everywhere just blows me away. It was very, very interesting, and the best part is that it is accessible to non-math people too (though this is coming from a maths guy; the math part felt quite small, and it leaned more toward the intuitive side of SGD).

  • @minimumlikelihood6552
    @minimumlikelihood6552 1 year ago

    That was the kind of lecture that deserved applause!

  • @nayanvats3424
    @nayanvats3424 4 years ago +1

    couldn't have been better....great lecture.... :)

  • @scorpio19771111
    @scorpio19771111 2 years ago

    Good lecture. Intuitive explanations with specific illustrations

  • @tmusic99
    @tmusic99 1 year ago

    Thank you for an excellent lecture! It gives me a clear track for development.

  • @RAJIBLOCHANDAS
    @RAJIBLOCHANDAS 2 years ago +1

    Really extraordinary lecture. Very lucid and highly interesting. My research is on adaptive signal processing; even so, I enjoyed this lecture the most. Thank you.

  • @jfjfcjcjchcjcjcj9947
    @jfjfcjcjchcjcjcj9947 4 years ago +1

    Very clear and nice, to the point.

  • @rababmaroc3354
    @rababmaroc3354 4 years ago

    well explained, thank you very much professor

  • @holographicsol2747
    @holographicsol2747 2 years ago

    Thank you, you are an excellent teacher and I learned a lot. Thank you.

  • @NinjaNJH
    @NinjaNJH 4 years ago +2

    Very helpful, thanks! ✌️

  • @KumarHemjeet
    @KumarHemjeet 3 years ago

    What an amazing lecture !!

  • @georgesadler7830
    @georgesadler7830 2 years ago +1

    Professor Suvrit Sra, thank you for a beautiful lecture on stochastic gradient descent and its impact on machine learning. This powerful lecture helped me understand something about machine learning and its overall impact on large companies.

  • @taasgiova8190
    @taasgiova8190 2 years ago

    Fantastic, excellent lecture thank you.

  • @benjaminw.2838
    @benjaminw.2838 7 months ago

    Amazing class!!! Not only for ML researchers but also for ML practitioners.

  • @josemariagarcia9322
    @josemariagarcia9322 4 years ago

    Simply brilliant

  • @anadianBaconator
    @anadianBaconator 3 years ago +1

    this guy is fantastic!

  • @hj-core
    @hj-core 8 months ago

    An amazing lecture!

  • @notgabby604
    @notgabby604 1 year ago

    Very nice lecture. I will seemingly go off topic here and say that an electrical switch is one-to-one when on and zero out when off. When on, 1 volt in gives 1 volt out, 2 volts in gives 2 volts out, etc.
    ReLU is one-to-one when its input x is >= 0 and zero out otherwise.
    To convert a switch to a ReLU you just need an attached switching decision x >= 0.
    Then a ReLU neural network is composed of weighted sums that are connected to and disconnected from each other by the switch decisions. Once the switch states are known, you can simplify the composed weighted sums using simple linear algebra: each neuron output anywhere in the net is some simple weighted sum of the input vector (see the sketch below).
    AI462 blog.
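
A small numerical check of the commenter's point (my own sketch, not from the lecture or the linked blog): freezing the ReLU on/off pattern for a given input collapses a two-layer ReLU net into one weighted sum of the input.

```python
import numpy as np

# Sketch: freeze the ReLU on/off pattern ("switch states") for one input and
# check that the network output equals a single collapsed weighted sum.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4))   # first layer of weighted sums
W2 = rng.normal(size=(3, 8))   # second layer of weighted sums
x = rng.normal(size=4)

h = W1 @ x
switch = (h >= 0).astype(float)        # switching decisions: 1 where the input is >= 0
y = W2 @ (switch * h)                  # ordinary ReLU forward pass

W_eff = W2 @ (np.diag(switch) @ W1)    # weighted sums composed through the open switches
print(np.allclose(y, W_eff @ x))       # True: each output is a weighted sum of the input
```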

  • @TrinhPham-um6tl
    @TrinhPham-um6tl 3 years ago

    Just a little typo that I came across throughout this perfect lecture is in the "confusion region": min(a_i/b_i) and max(a_i/b_i) should be min(b_i/a_i) and max(b_i/a_i) (see the sketch below).
    Generally speaking, this lecture is the best explanation of SGD I have ever seen. Again, thank you Prof. Sra and thank you MIT OpenCourseWare so, so much 👍👏
    P.S.: Every other resource I've read explained SGD so complicatedly 😔
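
A quick numeric illustration of that bound (my own sketch; it assumes each a_i > 0): the per-sample minima of (a_i*x - b_i)^2 sit at b_i/a_i, and the full least-squares minimizer falls between the smallest and largest of them.

```python
import numpy as np

# Each term (a_i*x - b_i)^2 is minimized at x = b_i/a_i, so the region of
# confusion runs from min(b_i/a_i) to max(b_i/a_i); the full minimizer sits inside it.
rng = np.random.default_rng(2)
a = rng.uniform(0.5, 2.0, size=10)   # keep a_i > 0 so the ratios are well defined
b = rng.uniform(-1.0, 1.0, size=10)

per_sample_minima = b / a
x_star = (a @ b) / (a @ a)           # closed-form minimizer of sum_i (a_i*x - b_i)^2

print(per_sample_minima.min() <= x_star <= per_sample_minima.max())  # True
```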

  • @gwonchanyoon7748
    @gwonchanyoon7748 1 month ago

    Beautiful classroom!

  • @haru-1788
    @haru-1788 2 years ago

    Marvellous!!!

  • @cevic2191
    @cevic2191 2 years ago

    Many thanks Great!!!

  • @pbawa2003
    @pbawa2003 2 years ago

    This is a great lecture, though it took me a little time to prove that the full gradient lies in the range between the min and max of the individual sample gradients (the region of confusion).

  • @fishermen708
    @fishermen708 4 years ago +1

    Great.

  • @fatmaharman3842
    @fatmaharman3842 4 years ago

    excellent

  • @BorrWick
    @BorrWick 4 years ago +2

    I think there is a very small mistake in the graph of (a_i*x - b_i)^2: the confusion area is bounded not by a_i/b_i but by b_i/a_i.

  • @xiangyx
    @xiangyx 3 years ago

    fantastic

  • @3g1991
    @3g1991 4 years ago +7

    Does anyone have the proof he didn't have time for, regarding stochastic gradient descent in the non-convex case?

  • @vinayreddy8683
    @vinayreddy8683 4 years ago

    The professor assumed all the variables are scalars, so while moving the loss downhill toward a local minimum, how is the loss function guided to the minimum without any direction (scalars have no direction)?

  • @neoneo1503
    @neoneo1503 2 years ago

    "Shuffle" in practice vs. "random pick" in theory, at 42:00 (see the sketch below).

  • @grjesus9979
    @grjesus9979 1 year ago

    So, when using TensorFlow or Keras, if you set batch size = 1 there are as many iterations as samples in the entire training dataset. So my question is: where is the randomness in "stochastic" gradient descent coming from?

  • @Tevas25
    @Tevas25 4 years ago

    A link to the MATLAB simulation Prof. Suvrit shows would be great.

    • @techdo6563
      @techdo6563 4 years ago +14

      fa.bianp.net/teaching/2018/COMP-652/
      found it

    • @SaikSaketh
      @SaikSaketh 4 years ago

      @@techdo6563 Awesome

    • @medad5413
      @medad5413 3 years ago

      @@techdo6563 thank you

  • @MohanLal-of8io
    @MohanLal-of8io 4 years ago +3

    What GUI software is Professor Suvrit using to change the step size instantly?

    • @brendawilliams8062
      @brendawilliams8062 2 years ago

      I don’t know but it would have to transpose numbers of a certain limit it seems to me.

  • @watcharakietewongcharoenbh6963
    @watcharakietewongcharoenbh6963 2 years ago

    Where can we find his five-line proof of why SGD works? It is fascinating.

  • @JTFOREVER26
    @JTFOREVER26 2 years ago

    Would anyone here care to explain how, in the one-dimensional example, choosing a scalar outside R guarantees that the stochastic gradient and the full gradient have the same sign? (Corresponding to 30:30-31:00-ish in the video.) Thanks in advance!

    • @ashrithjacob4701
      @ashrithjacob4701 1 year ago

      f(x) can be thought of as a sum of quadratic functions (each corresponding to one data point), each with a minimum at b_i/a_i. When we are outside the region R, the minima of all the functions lie on the same side of where we are, and as a result all their gradients have the same sign (see the check below).
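
A small numeric check of this explanation (my own sketch, assuming each a_i > 0): pick a point to the right of max(b_i/a_i) and confirm every per-sample gradient shares the sign of the full gradient.

```python
import numpy as np

# Check: outside the region of confusion [min b_i/a_i, max b_i/a_i], every
# per-sample gradient 2*a_i*(a_i*x - b_i) has the same sign as the full gradient.
rng = np.random.default_rng(4)
a = rng.uniform(0.5, 2.0, size=10)   # a_i > 0 keeps each parabola's minimum at b_i/a_i
b = rng.uniform(-1.0, 1.0, size=10)

x = np.max(b / a) + 1.0              # a point strictly to the right of the region
per_sample_grads = 2 * a * (a * x - b)
full_grad = per_sample_grads.mean()

print(np.all(np.sign(per_sample_grads) == np.sign(full_grad)))  # True
```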

  • @kethanchauhan9418
    @kethanchauhan9418 4 years ago +1

    What is the best book or resource for learning the whole mathematics behind stochastic gradient descent?

    • @mitocw
      @mitocw 4 years ago +4

      The textbook listed in the course is: Strang, Gilbert. Linear Algebra and Learning from Data. Wellesley-Cambridge Press, 2019. ISBN: 9780692196380. See the course on MIT OpenCourseWare for more information at: ocw.mit.edu/18-065S18.

    • @brendawilliams8062
      @brendawilliams8062 2 years ago

      Does this view and leg of math believe there is an unanswered Riemann hypothesis?

  • @SHASHANKRUSTAGII
    @SHASHANKRUSTAGII 3 years ago

    Andrew Ng didn't explain it in this detail.
    That is why MIT is MIT.
    Thanks, professor.

  • @akilarasan3288
    @akilarasan3288 8 months ago

    I would use MCMC to compute the n-term sum to answer the question at 14:00 (see the sketch below).
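
For what it's worth, a plain Monte Carlo sketch of that idea (uniform index sampling rather than a full MCMC chain; the setup below is invented for illustration):

```python
import numpy as np

# Estimate an n-term average gradient by sampling a small random subset of
# indices uniformly; this is an unbiased estimate of the full sum divided by n.
rng = np.random.default_rng(5)
n = 100_000
a = rng.normal(size=n)
b = rng.normal(size=n)
x = 0.3

exact = np.mean(2 * a * (a * x - b))        # the full n-term average
idx = rng.integers(0, n, size=1_000)        # sample 1,000 indices with replacement
estimate = np.mean(2 * a[idx] * (a[idx] * x - b[idx]))

print(exact, estimate)
```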

  • @sadeghadelkhah6310
    @sadeghadelkhah6310 2 years ago

    10:31 the [INAUDIBLE] thing is "Weight".

    • @mitocw
      @mitocw 2 years ago

      Thanks for the feedback! The caption has been updated.

  • @robmarks6800
    @robmarks6800 2 years ago

    Leaving the proof as a cliffhanger, almost worse than Fermat…

    • @papalau6931
      @papalau6931 1 year ago

      You can find the proof by Prof. Suvrit Sra in Prof. Gilbert Strang's book titled "Linear Algebra and Learning from Data".

  • @shivamsharma8874
    @shivamsharma8874 4 years ago

    Please share the slides of this lecture.

    • @mitocw
      @mitocw 4 years ago +2

      It doesn't look like there are slides available. I see a syllabus, instructor insights, problem sets, readings, and a final project. Visit the course on MIT OpenCourseWare to see what materials we have at: ocw.mit.edu/18-065S18.

    • @vinayreddy8683
      @vinayreddy8683 4 years ago +3

      Take screenshots and prepare them yourself!!!

  • @tuongnguyen9391
    @tuongnguyen9391 1 year ago

    Where can I obtain professor sra's slide ?

    • @mitocw
      @mitocw 1 year ago +1

      The course does not have slides of the presentations. The materials that we do have (problem sets, readings) are available on MIT OpenCourseWare at: ocw.mit.edu/18-065S18. Best wishes on your studies!

    • @tuongnguyen9391
      @tuongnguyen9391 1 year ago +1

      @@mitocw Thank you, I guess I just noted everything down.

  • @brendawilliams8062
    @brendawilliams8062 2 years ago

    It appears that from engineering math view that there’s the problem.

  • @ac2italy
    @ac2italy 3 years ago +1

    He cited images as an example of a large feature set: nobody uses standard ML for images; we use convolutions.

    • @elyepes19
      @elyepes19 3 years ago +1

      I understand he is referring to convolutional neural networks, as a tool for image analysis, as a generalized example.

  • @jasonandrewismail2029
    @jasonandrewismail2029 10 months ago

    DISAPPOINTING LECTURE. BRING BACK THE PROFESSOR