Tutorial 15- Adagrad Optimizers in Neural Network

  • Published on Dec 13, 2024

Comments • 110

  • @winviki123
    @winviki123 5 years ago +120

    6:30 just a small correction
    when alpha t becomes zero, eta dash t doesn't become zero (because alpha t is in the denominator).
    I think the point of using epsilon along with alpha in the denominator is to avoid a division-by-zero error while calculating eta dash t.

    • @mdejazuddin4939
      @mdejazuddin4939 4 years ago +5

      He corrected it in the next video.

    • @alakshendrasingh3425
      @alakshendrasingh3425 4 years ago +9

      You are right. I came to the comment section after noticing that mistake and found your comment at the top....

    • @shashwatdev2371
      @shashwatdev2371 4 years ago +1

      I was just going to comment on this.
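
As the thread above points out, epsilon only guards against a zero denominator. A minimal single-weight sketch of the update being discussed, assuming the usual AdaGrad form (the function and variable names are mine, chosen to mirror the video's alpha / eta / eta-dash notation, not taken from the video itself):

```python
import math

def adagrad_step(w, grad, alpha, eta=0.01, eps=1e-8):
    """One AdaGrad-style update for a single weight.

    alpha is the running sum of squared gradients; eps only keeps the
    denominator non-zero when alpha is 0, it does not drive eta_dash to 0.
    """
    alpha = alpha + grad ** 2                # accumulate squared gradient
    eta_dash = eta / math.sqrt(alpha + eps)  # adjusted learning rate
    w = w - eta_dash * grad                  # gradient-descent step
    return w, alpha

w, alpha = 0.3, 0.0
w, alpha = adagrad_step(w, grad=0.2, alpha=alpha)
print(w, alpha)  # eta_dash here is 0.05, i.e. large, not zero
```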

  • @monasaleh1452
    @monasaleh1452 3 years ago +3

    I've tried a lot before to understand these concepts and I couldn't. No one managed to simplify them before, but your way of explaining the concept is unique. I couldn't stop watching your videos. Thank you so much.

  • @adityavardhan6498
    @adityavardhan6498 2 months ago

    01:40 Understanding AdaGrad optimizer for neural networks.
    03:20 Introduction to Adagrad Optimizer in Neural Network
    05:00 Dense feature vs Sparse feature in neural networks
    06:40 Adaptive optimizer equation for AdaGrad with respect to different weights and iterations.
    08:20 Adagrad optimizers prevent weights from staying the same
    10:00 Adagrad optimizer adjusts the learning rate based on the features and iteration.
    11:40 Adagrad Optimizer adjusts learning rate based on features and iterations.
    13:17 Adagrad optimizer adjusts learning rates based on the frequency of features.
    Crafted by Merlin AI.

  • @deepansh.shakya.18
    @deepansh.shakya.18 9 months ago

    The first teacher on YouTube whose videos have never once given me a problem understanding.

  • @mdejazuddin4939
    @mdejazuddin4939 4 years ago +4

    CORRECTION @12:05 - squaring a number does not necessarily produce a large value (if the number is less than 1, squaring decreases it); it is the summation that keeps increasing alpha-t.
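
A quick numeric check of this correction, using made-up gradient values below 1 (purely for illustration):

```python
# Each squared gradient is smaller than the gradient itself (squaring a value
# below 1 shrinks it), yet the running sum alpha_t still grows at every step.
grads = [0.9, 0.5, 0.3, 0.2, 0.1]   # made-up gradient magnitudes, all < 1

alpha_t = 0.0
for t, g in enumerate(grads, start=1):
    alpha_t += g ** 2
    print(f"t={t}: g^2={g ** 2:.4f}, alpha_t={alpha_t:.4f}")
# g^2 shrinks (0.81, 0.25, 0.09, 0.04, 0.01) but alpha_t climbs to 1.20
```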

  • @goramnikitha5491
    @goramnikitha5491 10 months ago

    Not just understanding it but loving it. Thank you.

  • @VVV-wx3ui
    @VVV-wx3ui 5 years ago +1

    These sessions are great, with simple explanations of the maths behind it. Cannot thank you enough, Krish. God bless you. Thank you so very much.

  • @dipenduchoudhury9984
    @dipenduchoudhury9984 5 years ago +7

    I think the biggest drawback of using the Adagrad optimizer is that if alpha becomes very big, the learning rate (eta) becomes very small. Then, whenever you multiply the gradient by such a small learning rate, you face a vanishing-gradient-like problem, similar to what occurs with the sigmoid activation function.

  • @vaibhavjain1124
    @vaibhavjain1124 5 years ago +4

    Seriously, these are so amazing, keep going. For the past 3 days I have just been watching your videos for ML.

  • @praveenkuthuru7439
    @praveenkuthuru7439 4 months ago

    wow....amazing explanation and making my recap so smooth

  • @shaiksuleman3191
    @shaiksuleman3191 4 years ago

    Simply Super - No More Questions

  • @sunilkumar-pp6eq
    @sunilkumar-pp6eq 3 years ago

    Krish!! You are great at explaining the theory part of the math involved in these algorithms. It helped me understand. Keep up the great work. I can see the enthusiasm in you while explaining and it is very infectious. Thank you.

  • @claudiusdsouza2379
    @claudiusdsouza2379 5 years ago

    Hi Krish, I have been searching for complete deep learning videos, but I haven't found any as informative as yours. Keep up the good job.

  • @shakyamunghate8382
    @shakyamunghate8382 4 years ago

    This is quality content! Hats off to you Krish! These tutorials make our learning more insightful and easy. Thank you!

  • @rakeshacharjya8512
    @rakeshacharjya8512 5 years ago +10

    The main disadvantage of Adagrad is that the learning rate keeps on decreasing as the iterations go on, so it takes a very long time to reach the global minima. E.g. eta = 0.01; after 100 iterations eta might be something like 0.000000001, and it keeps decreasing so much on the way to the global minima that it might effectively keep decreasing forever.

    • @trend_dindia6714
      @trend_dindia6714 4 years ago +1

      Rakesh Acharjya vanishing gradient....
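
To see the disadvantage described above, here is a rough sketch with an assumed constant gradient magnitude (not realistic, but it keeps the arithmetic obvious) showing how the effective learning rate keeps shrinking as alpha_t accumulates:

```python
import math

eta, eps = 0.01, 1e-8
alpha_t = 0.0
for t in range(1, 10_001):
    grad = 0.5                      # assumed constant gradient, illustration only
    alpha_t += grad ** 2
    if t in (1, 10, 100, 1_000, 10_000):
        print(t, eta / math.sqrt(alpha_t + eps))
# The effective learning rate decays like eta / (|grad| * sqrt(t)):
# 0.02 at t=1, ~0.0063 at t=10, 0.002 at t=100, ~0.00063 at t=1000, 0.0002 at t=10000
```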

  • @thedataguyfromB
    @thedataguyfromB 4 years ago +1

    Amazing Krish sir

  • @rithwikshetty
    @rithwikshetty 4 years ago +2

    Correction: The equation for alpha is supposed to be a summation from i=1 to i=(t-1).
    Since the derivative of the loss function w.r.t. the weights at time t can be found only when we know the weights at time t, which are not yet found, alpha is calculated by adding the squared gradients up to time (t-1).
    Let me know if I am wrong.
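
Written out, the accumulation this comment proposes (using the alpha / eta-dash symbols from the video; the exact indexing is the commenter's reading, which I cannot verify against the video itself) would be:

```latex
\alpha_t = \sum_{i=1}^{t-1} \left( \frac{\partial L}{\partial w_i} \right)^{2},
\qquad
\eta'_t = \frac{\eta}{\sqrt{\alpha_t + \epsilon}},
\qquad
w_t = w_{t-1} - \eta'_t \, \frac{\partial L}{\partial w_{t-1}}
```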

  • @rampravesh4065
    @rampravesh4065 4 years ago

    Very Good Explanation of Adagrad.

  • @AdityaSingh-yp9jn
    @AdityaSingh-yp9jn 8 months ago

    7:18 needs a small correction. Since we can compute the gradient only up to (t-1), the looping should happen up to (t-1). Thanks for the content.

  • @anushreerungta5035
    @anushreerungta5035 4 years ago

    wonderfully explained. hats off.

  • @vgaurav3011
    @vgaurav3011 4 years ago +1

    Loved it, and studying along with Sebastian Ruder's blog cleared my concepts.

  • @jagpreetsingh9730
    @jagpreetsingh9730 5 years ago +8

    Sir, would the derivative part in the alpha equation always be greater than one? Because if we use the sigmoid activation function, its
    derivative would be between 0 and 0.25, and if you square it you get an even smaller value.

    • @rahulm774
      @rahulm774 4 years ago +2

      But if you use ReLU and the input to that neuron is negative, then the output will be 0. Hence the derivative will also become 0.

  • @vishalgupta3175
    @vishalgupta3175 4 years ago

    Sir, Your Skills are Very Good

  • @abhishekdodda3584
    @abhishekdodda3584 4 months ago

    nice. Really helpful. thank you Sir

  • @iliasouzaro3462
    @iliasouzaro3462 4 years ago

    It's my first time commenting on a YouTube video, but I have to say that you are the best, bro. Continue!!!!!!

  • @shaiquemustafa7609
    @shaiquemustafa7609 5 years ago +1

    You are uploading extremely helpful videos. Keep doing that and you'll get popular on YouTube very soon.

  • @JacoboPancorboBianquetti
    @JacoboPancorboBianquetti 2 months ago

    Very good video, really love it

  • @varshitamurthy8864
    @varshitamurthy8864 4 years ago +8

    Hello Sir,
    Your tutorials are very insightful. Thank you so much :)
    Also, I have a query regarding the Adagrad optimizer: how is the learning rate changing w.r.t. dense/sparse features? I can see it changing only w.r.t. iterations.

    • @meghanajoshi4231
      @meghanajoshi4231 4 years ago +10

      Hi! Assume that we have sparse feature vectors, so most of the values are "0". As a result, when they are multiplied with the weights, they will become "0". That's why we want the learning rate to change w.r.t. iterations only and NOT weights. You see, we are working here with iterations and the learning rate value, and not weights as you said, because of the above reason!
      Hope this helps!
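
One way to see the dense-vs-sparse behaviour discussed in this thread (a toy sketch with invented gradient streams, not the video's example): AdaGrad keeps a separate accumulator per weight, so a weight tied to a rarely active sparse feature accumulates less and therefore keeps a larger effective learning rate than a weight tied to a dense feature.

```python
import math

eta, eps = 0.01, 1e-8

# Toy gradient streams: the dense feature's weight sees a non-zero gradient at
# every step, the sparse feature's weight only occasionally (zero elsewhere).
dense_grads = [0.4] * 100
sparse_grads = [0.4 if t % 20 == 0 else 0.0 for t in range(100)]

def effective_lr(grads):
    alpha = sum(g ** 2 for g in grads)        # per-weight accumulator
    return eta / math.sqrt(alpha + eps)

print("dense :", effective_lr(dense_grads))   # ~0.0025  (alpha grew every step)
print("sparse:", effective_lr(sparse_grads))  # ~0.0112  (alpha grew only 5 times)
```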

  • @neil007sal
    @neil007sal 4 years ago

    Hi Krish, every video is a masterpiece as long as there are no ads in it. Please remove the ads in the middle of the videos so that I can focus on the content. If you want, you can add ads at the start or end, not in the middle of the lecture. Hope you will do the needful. Cheers

  • @arunmehta8234
    @arunmehta8234 3 years ago +6

    Sir, your video is very knowledgeable. BUT THE CAMERA TAKES A LOT OF ITERATIONS TO FOCUS ON YOU.

    • @Jjhvh860
      @Jjhvh860 3 years ago +1

      The weights of the camera's AI-trained model are not updating properly during backprop.

  • @VarunSharma-ym2ns
    @VarunSharma-ym2ns 4 years ago +1

    10:03 I didn't get what Krish is saying about the learning rate when more and more iterations are going on. Could anyone please explain it?

    • @gowthamelangovan5967
      @gowthamelangovan5967 4 years ago +1

      If there are more hidden layers, there is more backpropagation; that's what he means by more iterations.

    • @priyanakakohli7248
      @priyanakakohli7248 4 years ago +1

      Epochs

  • @arpitcruz
    @arpitcruz 4 years ago

    Very Good explanation

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    Thanks Krish

  • @souvikpal8436
    @souvikpal8436 5 years ago +2

    The term alpha + epsilon is in the denominator. So something/0 won't be zero; it will give a "division by zero" error. The epsilon is there to prevent that error in case alpha becomes zero.

  • @olegsilkin329
    @olegsilkin329 several months ago

    Great video sir, subscribed

  • @rajaramk1993
    @rajaramk1993 5 years ago +1

    Keep up the good work..

  • @680551121
    @680551121 4 years ago +1

    Thank you Krish for the wonderful playlist.
    You said alpha-t becomes huge with the number of iterations because the derivatives keep adding up. However, the loss becomes smaller with each iteration, so this doesn't always happen. Please correct me if I am wrong.

  • @alessandrofinoro1990
    @alessandrofinoro1990 5 years ago +1

    Go ahead!

  • @amanjangid6375
    @amanjangid6375 4 years ago

    Really good 👌👌👌

  • @divyarajvs8890
    @divyarajvs8890 4 years ago

    The ADAM optimizer is not available in the playlist. Can you include it?

  • @meanuj1
    @meanuj1 5 years ago +1

    Hey Krish, thanks a lot for the video.

  • @khanwaqar7703
    @khanwaqar7703 4 years ago

    So nice of you sir thank you so much

  • @jaythakur2137
    @jaythakur2137 4 years ago

    One correction: for the alpha(t) calculation, it should be summed over t-1, not t, as you don't have w(t) yet, because you are calculating alpha(t) first in order to get w(t).

  • @smurtiranjansahu5657
    @smurtiranjansahu5657 5 years ago

    Thanks a lot, these videos are so useful for us.......

  • @piriyaie
    @piriyaie 4 years ago

    Thank you very much sir for this video

  • @ashishsangwan5925
    @ashishsangwan5925 5 years ago

    Awesome explanation bro

  • @vipindube5439
    @vipindube5439 5 years ago +1

    We are missing a case study; you only include sparse and dense, but there are other cases as well. Please provide one demo on this.

  • @bhavikdudhrejiya4478
    @bhavikdudhrejiya4478 3 years ago

    Excellent

  • @fitriwabula5237
    @fitriwabula5237 4 years ago

    Sir, you're amazing, thanks for giving the best knowledge.

  • @topChart_Music
    @topChart_Music 4 years ago

    Hi, in the alpha-t equation, is the square applied after summing all the values (the whole sum squared), or is each individual value squared and then added?

  • @dhruvgangwani469
    @dhruvgangwani469 2 years ago

    If the change in weights keeps decreasing every epoch, does it result in the vanishing gradient problem?

  • @BiranchiNarayanNayak
    @BiranchiNarayanNayak 5 years ago

    Nice explanation.

  • @sargun_narula
    @sargun_narula 4 years ago

    Just a doubt about the Adagrad optimizer: in the formula for alpha we are doing a summation of squared derivatives, but how do the derivatives sum up to such a large value if each derivative becomes smaller and smaller?

  • @vikashjyoti320
    @vikashjyoti320 5 years ago

    Where can I find the live project and description?

  • @shan_singh
    @shan_singh 3 years ago +1

    Hahaha,
    every time we have a fix for the previous problem, the fix comes with some disadvantage of its own,
    and then we come up with a solution for that, which we will learn in the next video.
    This is fun, haha.

  • @huannguyentrong1249
    @huannguyentrong1249 3 years ago

    Thank you

  • @16876
    @16876 4 years ago

    Great tutorial!

  • @trend_dindia6714
    @trend_dindia6714 4 years ago

    Main disadvantage is it may lead to a vanishing gradient .....?

  • @prassaad8189
    @prassaad8189 4 years ago

    Hi Krish, excellent videos on ML and DL. Can you please add NLP videos to the playlist? Thanks in advance.

  • @abhishekkaushik9154
    @abhishekkaushik9154 5 years ago +1

    Awesome

  • @Bunny-yy6fo
    @Bunny-yy6fo 3 years ago

    Can anyone explain the intuition behind the learning rates, i.e. why we take those particular learning rates?

  • @sandipansarkar9211
    @sandipansarkar9211 4 years ago

    Great video. Hope to crack the calculations and mathematics when it comes to solving interview-based problems. Thanks.

  • @erraviv
    @erraviv 3 years ago

    I think there is a problem with the subscripts. Partial derivative with respect to w_i does not make sense. W is a vector of say D different weights. Didn't you use the subscript i for the time index already?

  • @tanvishinde805
    @tanvishinde805 4 years ago

    In the equation the learning rate is only changing per iteration, i.e. 't'. Why does Krish say it is changing for every feature or neuron as well?

    • @tanvishinde805
      @tanvishinde805 4 years ago

      Is it because of alpha-t, as that depends on w-i (where i = 1 to t)?

  • @pnachtwey
    @pnachtwey 6 months ago

    I have AdaGrad implemented in Python using real data. I have 5 variables for a differential equation that needs to estimate the temperature response to a control output. I need to estimate a gain, two time constants, a dead time and ambient temperature. The data is not perfect and neither is my model, but it is close. AdaGrad seems to work well compared to other gradient descent techniques, but a line search is still better. Better yet are algorithms such as Nelder-Mead or Levenberg-Marquardt found in scipy's optimize package. What is worse is that my cost function is the sum of squared errors between the actual response and my estimated response. There is no formula for this cost function, so the gradient is computed by taking a small step in each direction for each parameter and finding the slopes. Finding the gradient this way is noisy. The path "downhill" has lots of zigzags. I find it hard to believe people use gradient descent. The only advantage I see is that it is super easy to implement.
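
For readers curious about the setup described above, here is a rough, generic sketch of that approach (the toy cost function and every name in it are invented stand-ins, not the commenter's actual model or code): a central-difference gradient of a black-box sum-of-squared-errors cost, fed into an AdaGrad update.

```python
import numpy as np

def cost(params):
    # Stand-in black-box cost: sum of squared errors of a toy linear model.
    # The commenter's real cost comes from simulating a differential equation.
    x = np.linspace(0.0, 1.0, 50)
    target = 2.0 * x + 1.0
    pred = params[0] * x + params[1]
    return np.sum((pred - target) ** 2)

def numerical_gradient(f, params, h=1e-5):
    # Central differences: perturb each parameter in turn ("a small step in
    # each direction"); this estimate is noisy whenever the cost itself is.
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = h
        grad[i] = (f(params + step) - f(params - step)) / (2 * h)
    return grad

params = np.array([0.0, 0.0])
alpha = np.zeros_like(params)
eta, eps = 0.5, 1e-8
for _ in range(500):
    g = numerical_gradient(cost, params)
    alpha += g ** 2                            # per-parameter accumulator
    params -= eta / np.sqrt(alpha + eps) * g   # AdaGrad step
print(params)  # slowly moves close to the true [2.0, 1.0]
```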

  • @bhargavasavi
    @bhargavasavi 4 years ago

    I think weight updates generally happen for every mini-batch, so there will be multiple steps within an epoch.
    I have a question: will the learning rate be changed for each step (i.e. mini-batch) or each epoch (or iteration)?

    • @krishnaik06
      @krishnaik06  4 years ago +1

      Just see the next video; we can also change the learning rate.

    • @bhargavasavi
      @bhargavasavi 4 years ago

      @@krishnaik06 Thanks...I have understood the concept now.

  • @satyaprakashpandey6911
    @satyaprakashpandey6911 4 years ago

    Can anybody provide the proper link to the NLP playlist by Krish?

  • @miteshmohite7829
    @miteshmohite7829 3 years ago

    alpha_t = summation of (dL/dW_i)^2 for i = 1 to t-1

  • @praneethcj6544
    @praneethcj6544 4 years ago

    Sir, could you please put up a coding session where we deal with an image classification problem using deep learning?
    The concepts are very clear, and we need some coding knowledge as well.

  • @piyalikarmakar5979
    @piyalikarmakar5979 2 years ago

    Sir, does iteration mean each epoch or each batch??

  • @satyaprakashpandey6911
    @satyaprakashpandey6911 4 years ago

    Can anybody help me understand why I am not finding the NLP playlist by Krish sir?

    • @krishnaik06
      @krishnaik06  4 years ago +1

      Hi, I am revamping my playlist... it will be available soon.

  • @akshatsingh6036
    @akshatsingh6036 4 years ago

    Sir, you didn't tell us how to differentiate between, or find, local minima and global minima.

  • @DineshBabu-gn8cm
    @DineshBabu-gn8cm 4 years ago

    If we use the sigmoid activation function, the value of the derivative would be between 0 and 0.25, and if we square that it will always be less than 0.25. So how will the alpha-t value increase? Please, someone explain.

  • @etc-b-28atharvapatil2
    @etc-b-28atharvapatil2 3 years ago

    Hey Krish... can you provide us DL, ML, and NLP certification courses, as the certificates could help us in our resumes?

  • @nitayg1326
    @nitayg1326 5 years ago

    Can we combine adagrad with momentum?

    • @akashkewar
      @akashkewar 5 years ago +4

      Adagrad has a problem with slow convergence, which can be solved using AdaDelta. And the combination of AdaDelta and momentum is the ADAM optimizer.

  • @vatsalshingala3225
    @vatsalshingala3225 1 year ago

    ❤❤❤❤❤❤❤❤❤❤

  • @subhashishbt06658
    @subhashishbt06658 4 years ago

    When alpha becomes a higher number, then eta at time t becomes a smaller number, and this, I think, is not a disadvantage...

  • @MP-wq5xe
    @MP-wq5xe 4 years ago

    The videos are really good, but the number of ads per lecture is spoiling the quality.

  • @dosdospoa
    @dosdospoa 4 years ago

    This is a good video, but the bad thing is there are MORE THAN 2 ADS DURING THE EXPLANATION!

  • @anandkumartm7058
    @anandkumartm7058 4 years ago

    Can anyone tell me how alpha-t increases as the number of iterations increases? According to the formula for alpha-t, as the number of iterations increases the loss should decrease, and so alpha-t should decrease.

    • @jagdishjazzy
      @jagdishjazzy 4 years ago

      Yes, the loss decreases with each iteration, but when you look at the formula it is a kind of cumulative sum from the 1st iteration to the t-th, so with every iteration alpha will increase from the (t-1)-th alpha.

    • @narendersingh2851
      @narendersingh2851 4 years ago

      Hey @Anand thanks for asking the question and @jagdish thanks for the reply... very well explained. Cleared my doubt.

  • @xiaoweidu4667
    @xiaoweidu4667 3 years ago

    good.

  • @zaladhaval4419
    @zaladhaval4419 5 years ago +1

    Can you make videos about gradient checking???

  • @karthikeyansakthivel5308
    @karthikeyansakthivel5308 4 years ago

    which is not the best learning rate

  • @shashwatdev2371
    @shashwatdev2371 4 years ago

    Please add subtitles in all your videos.

  • @akashgayakwad9550
    @akashgayakwad9550 5 years ago +1

    What is the use of going more slowly with each iteration? Could we not go at the same rate each iteration?
    Can anyone answer me?

    • @sarthakchauhan640
      @sarthakchauhan640 4 years ago

      As we reach near global minima, we want to take smaller steps so that we don't diverge. But if we are far away from minima we can take bigger steps towards minima.

    • @narendersingh2851
      @narendersingh2851 4 years ago

      @@sarthakchauhan640 Hi, I know you are right, but I don't understand why there is any need to decrease the learning rate if dL/dw is already getting smaller as it approaches the minima.

  • @shubhangiagrawal336
    @shubhangiagrawal336 4 years ago

    Too many ads in such short videos.

  • @pankajkar2008
    @pankajkar2008 5 years ago

    My dear sir... kindly do some programming so that we can learn how it can be done in real life.

  • @anandhasrivi
    @anandhasrivi 4 years ago

    The explanation is good with a convex loss function. But Adagrad is not meant for convex functions; it is for deep learning, where we have multiple local minima and a single global minimum. Our main aim is to come out of a local minimum. You have completely ignored that concept. The intuition is good.

    • @manojsamal7248
      @manojsamal7248 3 years ago

      Could you please tell me how the points come out? From the video I can see that over a number of iterations it will reach a local minimum; how will it jump out toward the global one?

  • @scarfacej1361
    @scarfacej1361 4 years ago

    Why is the video like this, focusing on the board and then on your face? Please correct this.