Neural Networks Part 7: Cross Entropy Derivatives and Backpropagation

  • Published Dec 20, 2024

Comments • 327

  • @statquest
    @statquest  3 ปีที่แล้ว +9

    The full Neural Networks playlist, from the basics to deep learning, is here: th-cam.com/video/CqOfi41LfDw/w-d-xo.html
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @bigbangdata
    @bigbangdata 3 ปีที่แล้ว +121

    Your talent for explaining these difficult concepts and organizing the topics in didactic, bite-sized, and visually compelling videos is astounding. Your channel is a great resource for beginners and advanced practitioners who need a refresher on a particular concept. Thank you for all that you do!

    • @statquest
      @statquest  3 ปีที่แล้ว +8

      Wow, thank you!

  • @Rationalist-Forever
    @Rationalist-Forever 2 ปีที่แล้ว +21

    Right now I am reading the ML book "An Introduction to Statistical Learning" by James, Witten, Hastie and Tibshirani. Many times I got stuck on the mathematical details, could not follow them, and stopped reading. I love that book a lot, but I felt frustrated. Now I use your videos and read the book side by side, and everything in the book is starting to make sense. You are such a great storyteller. The way you explain things in the videos with examples, it feels like I am listening to a story that begins "There was a king ...". It is so soothing, and complex topics become easy. I feel you are a friend and teacher on my ML journey who understands my pain and explains the hard things with ease. BTW, I have done a Master's in Data Science at Northwestern University and got a good ML foundation from that course, but I can tell you I only feel complete after going through most of your videos. Mr. Starmer, we are lucky to have you as such a great teacher and mentor. You are gifted at teaching people. I pledge to support your channel from my heart. Thank you.

    • @statquest
      @statquest  2 ปีที่แล้ว

      Wow! Thank you very much!!! :)

  • @naf7540
    @naf7540 ปีที่แล้ว +16

    Dear Josh, how is it at all possible to deconstruct so clearly all these concepts, just incredible, thank you very much, your videos are addictive!!

    • @statquest
      @statquest  ปีที่แล้ว

      Thank you very much! :)

  • @wennie2939
    @wennie2939 3 ปีที่แล้ว +17

    Josh Starmer is THE BEST! I really appreciate your patience in explaining the concepts step-by-step!

    • @statquest
      @statquest  3 ปีที่แล้ว

      Thank you very much! :)

  • @simhasankar3311
    @simhasankar3311 ปีที่แล้ว +2

    Imagine the leaps and bounds we could achieve in global education if this teaching method was implemented universally. We would have a plethora of students equipped with the analytical skills to tackle complex issues. Your contributions are invaluable. Thank you!

    • @statquest
      @statquest  ปีที่แล้ว

      Thank you so much!

  • @RubenMartinezCuella
    @RubenMartinezCuella 3 ปีที่แล้ว +24

    Even though there are many other youtube channels that also explain NN, your videos are unique in the sense that you break down every single process into small operations easy to understand by anyone. Keep up the great work Josh, everyone here appreciates so much your effort!! :D

    • @statquest
      @statquest  3 ปีที่แล้ว

      Thank you very much! :)

  • @AbdulWahab-mp4vn
    @AbdulWahab-mp4vn ปีที่แล้ว +2

    WOW ! I have never seen anyone explaining topics in such minute level detail. You are an angel to us Data Science Students ! Love from Pakistan

    • @statquest
      @statquest  ปีที่แล้ว

      Thank you very much!

  • @positive_freedom
    @positive_freedom 2 ปีที่แล้ว +11

    Your videos are truly astounding. I've gone through so many youtube playlists looking to understand Neural Networks, and none of them can come close to yours in terms of simplicity & content! Please keep up this amazing work for beginners like me :)

    • @statquest
      @statquest  2 ปีที่แล้ว

      Glad you like them!

  • @YLprime
    @YLprime 10 หลายเดือนก่อน +3

    This channel is awesome, my deep learning knowledge is skyrocketing every day.

    • @statquest
      @statquest  10 หลายเดือนก่อน

      bam!

  • @iZapz98
    @iZapz98 3 ปีที่แล้ว +13

    All your videos have helped me tremendously in studying for my ML exam, thank you.

    • @statquest
      @statquest  3 ปีที่แล้ว

      Great to hear!

  • @Lucas-Camargos
    @Lucas-Camargos ปีที่แล้ว +1

    This is the best Neural Networks example video I've ever seen.

    • @statquest
      @statquest  ปีที่แล้ว +1

      Thank you very much! :)

  • @abhishekjadia1703
    @abhishekjadia1703 2 ปีที่แล้ว +1

    Incredible !! ...You are not teaching, You are revealing !!

    • @statquest
      @statquest  2 ปีที่แล้ว

      Wow, thank you!

  • @salahaldeen1751
    @salahaldeen1751 2 ปีที่แล้ว +1

    I don't know where else I could understand that like this. Thanks, you're talented!!!

  • @saurabhdeshmane8714
    @saurabhdeshmane8714 ปีที่แล้ว +1

    Incredibly done... it doesn't even feel like we are learning such complex topics... it keeps me engaged enough to go through the entire playlist... thank you for such content!!

    • @statquest
      @statquest  ปีที่แล้ว

      Glad you liked it!

  • @yourfavouritebubbletea5683
    @yourfavouritebubbletea5683 ปีที่แล้ว +3

    Incredibly well done. I'm astonished and thank you for letting me not have a traumatic start with ML

  • @farrukhzamir
    @farrukhzamir 8 หลายเดือนก่อน +2

    Brilliantly explained. You explain the concept in such a manner that it becomes very easy to understand. God bless you. I don't know how to thank you really. Nobody explains like you.❤

    • @statquest
      @statquest  8 หลายเดือนก่อน

      Thank you!

  • @vishnukumar4531
    @vishnukumar4531 2 ปีที่แล้ว +3

    0 comments left unreplied!
    Josh, you are truly one of a kind! ❣❣❣

  • @anisrabahbekhoukhe3652
    @anisrabahbekhoukhe3652 ปีที่แล้ว +3

    I literally can't stop watching these vids, help me

  • @johannesweber9410
    @johannesweber9410 7 หลายเดือนก่อน +1

    Nice video! At first I was a little confused (like always), but then I plugged your values and the exact structure of your neural network into my own small framework and compared the results. After that, I followed your instructions and implemented the backpropagation step by step. Thanks for the nice video!

    • @statquest
      @statquest  7 หลายเดือนก่อน +1

      BAM!

  • @ligezhang4735
    @ligezhang4735 ปีที่แล้ว +1

    This is so impressive! Especially for the visualization of the whole process. It really makes things very easy and clear!

  • @susmitvengurlekar
    @susmitvengurlekar 2 ปีที่แล้ว +2

    "I want to remind you" helped me understand why in the world is P(setosa) involved in output of versicolor and virginica.
    Great explanation!

    • @statquest
      @statquest  2 ปีที่แล้ว +1

      Hooray!!! I'm glad the video was helpful.

  • @pietrucc1
    @pietrucc1 3 ปีที่แล้ว +1

    I started using machine learning techniques a little less than a month ago. I found this site and it has helped me a lot, thank you very much!!

  • @samerrkhann
    @samerrkhann 3 ปีที่แล้ว +3

    Huge appreciation for all the effort you put in. Thank you Josh!

    • @statquest
      @statquest  3 ปีที่แล้ว +1

      Thank you! :)

  • @rajpulapakura001
    @rajpulapakura001 ปีที่แล้ว +2

    Clearly and concisely explained! Thanks Josh! P.S. If you know your calculus, I would highly recommend trying to compute the derivatives yourself before seeing the solution - it helps a lot!

  • @tejaspatil3978
    @tejaspatil3978 2 ปีที่แล้ว +1

    Your way of teaching is on the next level. Thanks for giving us these great sessions.

  • @nabeelhasan6593
    @nabeelhasan6593 3 ปีที่แล้ว +1

    Lastly, I am really thankful for all the hard effort you put into these videos; they immensely helped me build a strong foundation in deep learning.

    • @statquest
      @statquest  3 ปีที่แล้ว +1

      Thank you very much! :)

  • @donfeto7636
    @donfeto7636 ปีที่แล้ว +1

    You are a national treasure, BAAAM. Keep making these videos, they are great.

  • @GamTinjintJiang
    @GamTinjintJiang 2 ปีที่แล้ว +1

    Wow~ your videos are so intuitive to me. What a precious resource!

  • @Meditator80
    @Meditator80 3 ปีที่แล้ว +1

    Thank you so much! It explains the calculation of the cross entropy derivative and how to use it in backpropagation so clearly.

    • @statquest
      @statquest  3 ปีที่แล้ว

      Thank you very much! :)

  • @Recordingization
    @Recordingization 3 ปีที่แล้ว +1

    Thanks for the nice lecture! I finally understand the derivative of cross entropy and the optimization of the bias.

  • @RC4boumboum
    @RC4boumboum 2 ปีที่แล้ว +2

    Your courses are so good! Thanks a lot for your time :)

    • @statquest
      @statquest  2 ปีที่แล้ว

      You're very welcome!

  • @KayYesYouTuber
    @KayYesYouTuber ปีที่แล้ว +1

    So beautiful. Never seen anything like this!!!

  • @chethanjjj
    @chethanjjj 3 ปีที่แล้ว

    @18:20 is what I've been looking for for a while. Thank you!

  • @samore11
    @samore11 ปีที่แล้ว +1

    These videos are so good - the explanations and quality of production are elite. My only nitpick was it is hard for me to see "x" and not think the letter "x" as opposed to a multiplication sign - but that's a small nitpick.

    • @statquest
      @statquest  ปีที่แล้ว +1

      After years of using confusing 'x's in my videos, I've finally figured out how to get a proper multiplication sign.

  • @sergeyryabov2200
    @sergeyryabov2200 11 หลายเดือนก่อน +1

    Thanks!

    • @statquest
      @statquest  11 หลายเดือนก่อน

      TRIPLE BAM!!! Thank you so much for supporting StatQuest!!! :)

  • @bonadio60
    @bonadio60 3 ปีที่แล้ว +1

    Your explanation is fantastic!! Thanks

    • @statquest
      @statquest  3 ปีที่แล้ว

      Thank you! :)

  • @arielcohen2280
    @arielcohen2280 ปีที่แล้ว

    I hate all the songs and the meaningless sound effects, but damn, I have been trying to understand this concept for a hell of a long time and you made it clear.

  • @susmitvengurlekar
    @susmitvengurlekar 2 ปีที่แล้ว +2

    There is nothing wrong with self-promotion and, frankly, you don't need promotion. Anyone who watches even one of your videos will prefer your videos over any others from then on.

    • @statquest
      @statquest  2 ปีที่แล้ว +1

      Wow! Thank you!

  • @madankhatri7727
    @madankhatri7727 11 หลายเดือนก่อน

    Your explanations of hard concepts are pretty amazing. I have been stuck on a very difficult concept called the Adam optimizer. Please explain it. You are my last hope.

  • @jamasica5839
    @jamasica5839 3 ปีที่แล้ว +1

    This is even more bonkers than Backpropagation Details Pt. 2 :O

    • @statquest
      @statquest  3 ปีที่แล้ว

      double bam! :)

  • @charliemcgowan598
    @charliemcgowan598 3 ปีที่แล้ว +2

    Thank you so much for all your videos, they're actually amazing!

    • @statquest
      @statquest  3 ปีที่แล้ว

      Glad you like them!

  • @r0cketRacoon
    @r0cketRacoon 9 หลายเดือนก่อน

    Thank you very much for the video
    Backpropagation with multiple outputs is not that hard for me conceptually, but it's really a mess when doing the computations.

    • @statquest
      @statquest  9 หลายเดือนก่อน

      Yep. The good news is that PyTorch will do all that for us.

  • @shreeshdhavle25
    @shreeshdhavle25 3 ปีที่แล้ว +1

    Finally! I was waiting for a new video for so long...

    • @statquest
      @statquest  3 ปีที่แล้ว +1

      Thanks!

    • @shreeshdhavle25
      @shreeshdhavle25 3 ปีที่แล้ว +1

      @@statquest Thanks to you Josh..... Best content in the whole world.... Also thanks to you and your content I am working in Deloitte now.

    • @statquest
      @statquest  3 ปีที่แล้ว

      @@shreeshdhavle25 Wow! That is awesome news! Congratulations!!!

  • @osamahabdullah3715
    @osamahabdullah3715 3 ปีที่แล้ว +1

    I really can't get enough of your videos, what an amazing way of explaining. Thanks for sharing your knowledge with us. When is your next video coming out, please?

    • @statquest
      @statquest  3 ปีที่แล้ว +1

      My next video should come out in about 24 hours.

    • @osamahabdullah3715
      @osamahabdullah3715 3 ปีที่แล้ว +1

      @@statquest what a wonderful news, thank you sir

  • @dianaayt
    @dianaayt 9 หลายเดือนก่อน

    20:14 If we have a lot more training data, would we just add up the terms for all the training data we have to do the backpropagation?

    • @statquest
      @statquest  9 หลายเดือนก่อน +1

      Yes, or we can put the data into smaller "batches" and process the data with batches (so, if we had 10 batches, each with 50 samples each, we would only add up the 50 values in a batch before updating the parameters).
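
      A minimal sketch of that in Python (the function name, inputs, and batch size here are made up for illustration; in a real training loop the probabilities would also be recomputed after every update): within each batch, the per-sample derivatives are summed before the bias is changed once.

      def update_b3_in_batches(p_setosa, is_setosa, b3, batch_size=50, learning_rate=1.0):
          # p_setosa[i] is the current predicted probability of setosa for sample i,
          # and is_setosa[i] says whether sample i really is a setosa.
          for start in range(0, len(p_setosa), batch_size):
              d_b3 = 0.0
              for p, true_setosa in zip(p_setosa[start:start + batch_size],
                                        is_setosa[start:start + batch_size]):
                  d_b3 += (p - 1.0) if true_setosa else p   # dCE/db3 for one sample
              b3 -= learning_rate * d_b3                    # one update per batch
          return b3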

    • @r0cketRacoon
      @r0cketRacoon 9 หลายเดือนก่อน +1

      There are methods like mini-batch gradient descent and stochastic gradient descent; you should do some digging into them.

  • @epiccabbage6530
    @epiccabbage6530 ปีที่แล้ว +1

    This has been extremely helpful, this series is great. I am a little confused though as to why we repeat the calculations for p.setosa, i.e. why we can't simply run through the calculations once and use the same p.setosa value 3 times (so like, x-1 + x + x) and use that for the bias recalculation. But either way this has cleared up a lot for me.

    • @statquest
      @statquest  ปีที่แล้ว

      What time point, minutes and seconds, are you asking about?(unfortunately I can't remember all of the details in all of my videos)

    • @epiccabbage6530
      @epiccabbage6530 ปีที่แล้ว +1

      @@statquest Starting at 18:50, you go through three different observations and solve for the cross entropy. I'm curious as to why you need to look at three different observations, i.e. why you need to plug in values 3 times instead of just doing it once. If we want to solve for psetosa twice and psetosa-1 once, why do we need to do the equation three times instead of just doing it once? Why can't we just do 0.15-1 + 0.15 + 0.15?

    • @statquest
      @statquest  ปีที่แล้ว

      @@epiccabbage6530 Because each time the predictions are made using different values for the petal and sepal widths. So we take that into account for each prediction and each derivative relative to that prediction.

    • @epiccabbage6530
      @epiccabbage6530 ปีที่แล้ว

      @@statquest Right, but why do we look at multiple predictions in the context of changing the bias once? Is it just a matter of batch size?

    • @statquest
      @statquest  ปีที่แล้ว +2

      @@epiccabbage6530 Yes, in this example, we use the entire dataset (3 rows) as a "batch". You can either look at them all at once, or you can look at them one at a time, but either way, you end up looking at all of them.

  • @rahulkumarjha2404
    @rahulkumarjha2404 2 ปีที่แล้ว +2

    Thank you for such an awesome video!!!
    I just have one doubt.
    At 18:12 of the video.
    The summation has 3 values because there are 3 items in the dataset.
    Let's say we have 4 items in the dataset, i.e. 2 items of setosa, 1 for virginica and 1 for versicolor.
    So our summation will look like
    {(psetosa - 1) + (psetosa - 1) + psetosa + psetosa}
    i.e. the summation is over the data setosadata_row1, setosadata_row2, versicolordata_row3, virginicadata_row4
    Am I right?

    • @statquest
      @statquest  2 ปีที่แล้ว

      yep
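
      (In general, then, with t_i = 1 for setosa rows and t_i = 0 for the others, the sum is dCE_total/db3 = Σ_i (psetosa_i − t_i), whatever mix of species the batch contains.)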

    • @rahulkumarjha2404
      @rahulkumarjha2404 2 ปีที่แล้ว

      @@statquest
      Thank You!!
      Your entire neural network playlist is awesome.

    • @statquest
      @statquest  2 ปีที่แล้ว

      @@rahulkumarjha2404 Hooray! Thank you!

  • @Waffano
    @Waffano 2 ปีที่แล้ว

    Watching these videos makes me wonder how in the world someone came up with this in the first place. I guess it slowly evolved from something more simple, but still, would be cool to learn more about the history of neural networks :O If anyone knows of any documentaries or books please do share ;)

    • @statquest
      @statquest  2 ปีที่แล้ว +1

      A history would be nice.

  • @ferdinandwehle2165
    @ferdinandwehle2165 2 ปีที่แล้ว

    Hello Josh, your videos inspired me so much that I am trying to replicate the classification of the iris dataset.
    For my understanding, are the following statements true:
    1) The weights between the blue/orange nodes and the three categorization outputs are calculated in the same fashion as the biases (B3, B4, B5) in the video, as there is only one chain rule “path”.
    2) For weights and biases before the nodes there are multiple chain rule differentiation “paths” to the output: e.g. W1 can be linked to the output Setosa via the blue node, but could also be linked to the output Versicolour via the orange node; the path is irrelevant as long as the correct derivatives are used (especially concerning the SoftMax function).
    3) Hence, this chain rule path is correct given a Setosa input: dCEsetosa/dW1 = (dCEsetosa/d”Psetosa”) x (d”Psetosa”/dRAWsetosa) x (dRAWsetosa/dY1) x (dY1/dX1) x (dX1/dW1)
    Thank you very much for your assistance and the more than helpful video.
    Ferdinand

    • @statquest
      @statquest  2 ปีที่แล้ว

      I wish I had time to think about your question - but today is crazy busy so, unfortunately I can't help you. :(

    • @ferdinandwehle2165
      @ferdinandwehle2165 2 ปีที่แล้ว +1

      @@statquest No worries. The essence of the question is: how to optimize W1? Maybe you could have a think about it on a calmer day (:

    • @statquest
      @statquest  2 ปีที่แล้ว

      @@ferdinandwehle2165 Regardless of the details, I think you are on the right track. The w1 can be influenced by a lot more than b3 is.

  • @gabrielsantos19
    @gabrielsantos19 3 หลายเดือนก่อน +1

    Thank you, Josh! 👍

    • @statquest
      @statquest  3 หลายเดือนก่อน

      My pleasure!

  • @wuzecorporation6441
    @wuzecorporation6441 ปีที่แล้ว

    18:04 Why are we taking the sum of the gradients of the cross entropy across different data points? Wouldn't it be better to take the gradient for one data point and do backpropagation, and then take the gradient of another data point and do backpropagation again?

    • @statquest
      @statquest  ปีที่แล้ว +1

      You can certainly do backpropagation using one data point at a time. However, in practice, it's usually much more efficient to do it in batches, which is what we do here.

    • @sanjanamishra3684
      @sanjanamishra3684 11 หลายเดือนก่อน

      @@statquest Thanks for the great series! I had a similar doubt regarding this. I understand the point of processing in batches and taking a batch-wise loss, but what I can't wrap my head around is why we need to have data points covering all three categories, i.e. setosa, virginica and versicolor. Does this mean that in practice we have to ensure that each batch covers all the classes, i.e. a classic data imbalance problem? I normally thought that ensuring the overall dataset is balanced is enough. Please clarify this, thanks!

    • @statquest
      @statquest  11 หลายเดือนก่อน

      @@sanjanamishra3684 Who said you needed data points that predict all 3 species?

  • @GLORYWAVE.
    @GLORYWAVE. 11 หลายเดือนก่อน

    Thanks Josh for an incredibly well put together video.
    I have two quick questions:
    1) When you initially get that new b3 value of -1.23, and then say to repeat the process, I am assuming the process is repeated with a new 'batch' of 3 training samples, correct? i.e. you wouldn't use the same 3 that were just used?
    2) Are these multi-classification models always structured in such a way that each 'batch' or 'iteration' includes 1 actual observed sample from each class like in this example? It appears that the Total Cross Entropy calculation and derivatives would not make sense otherwise.
    Thanks again!

    • @statquest
      @statquest  11 หลายเดือนก่อน +1

      1) In this case, the 3 samples is all the data we have, so we reuse them for every iteration. If we had more data, we might have different samples in different batches, but we would eventually reuse these samples at some later iteration.
      2) No. You just add up the cross entropy, regardless of how the samples are distributed, to get the total.

  • @marahakermi-nt7lc
    @marahakermi-nt7lc 3 หลายเดือนก่อน

    Hey Josh, I think there is a mistake in the video at 18:54: if the predicted value is setosa, I think the corresponding raw output for setosa and also its probability should be the biggest, isn't that right?

    • @statquest
      @statquest  3 หลายเดือนก่อน

      The video is correct. At that time point the weights in the model are not yet fully trained - so the predictions are not great, as you see. The goal of this example is to use backpropagation to improve the predictions.

    • @marahakermi-nt7lc
      @marahakermi-nt7lc 3 หลายเดือนก่อน +1

      @@statquest I'm sorry Josh, my bad, you are brilliant man, baaaaaam

  • @stan-15
    @stan-15 2 ปีที่แล้ว +1

    Since you used 3 data samples to get the values of the three cross-entropy derivatives, does this mean we must use multiple inputs for one gradient descent step when using cross entropy? (More precisely, does this mean we have to use n input samples, which each light up all n features of the outputs, in order to compute the appropriate derivative of the bias, and thus to perform one single gradient descent step?)

    • @statquest
      @statquest  2 ปีที่แล้ว +1

      No. You can use 1 input if you want. I just wanted to illustrate all 3 cases.

  • @Pedritox0953
    @Pedritox0953 3 ปีที่แล้ว +1

    Great explanation

  • @nonalcoho
    @nonalcoho 3 ปีที่แล้ว +1

    It is really easy to understand even though I am not good at calculus.
    And I got the answer to the question I asked you in the last video about the meaning of the derivative of softmax. I am really happy!
    Btw, will you make more programming lessons like the ones you made before~?
    Thank you very much!

    • @statquest
      @statquest  3 ปีที่แล้ว +1

      I hope to do a "hands on" webinar for neural networks soon.

    • @nonalcoho
      @nonalcoho 3 ปีที่แล้ว +1

      @@statquest looking forward to it!

  • @pedrojulianmirsky1153
    @pedrojulianmirsky1153 2 ปีที่แล้ว +1

    Thank you for all your videos, you are the best!
    I have one question though. Let's suppose you have the worst possible fit for your model, where it predicts pSetosa = 0 for instances labeled Setosa, and pSetosa = 1 for those labeled either Virginica or Versicolor.
    Then, for each Setosa-labeled instance, you would get dCESetosa/db3 = pSetosa - 1 = -1, and for each non-Setosa-labeled instance dCEVersiOrVirg/db3 = pSetosa = +1.
    In summary, the total dCE/db3 would accumulate -1 for each Setosa instance and +1 for each non-Setosa. So, if you have for example a dataset with 5 Setosa, 2 Versicolor and 3 Virginica:
    dCE(total)/db3 = (-1 -1 -1 -1 -1) + (1 + 1) + (1 + 1 + 1) = -5 + 2 + 3 = 0.
    The total dCE/db3 would be 0, as if the model had the best fit for b3.
    Because of this compensation between the opposite signs (+) and (-), the weight (b3) wouldn't be adjusted by gradient descent, even though the model classifies badly.
    Or maybe I misunderstood something haha.
    Anyways, I got into ML and DL mainly because of your videos, can't thank you enough!!!!!!!

    • @statquest
      @statquest  2 ปีที่แล้ว +2

      To be honest, I don't think that is possible because of how the softmax function works. For example, if it was known that the sample was setosa, but the output value was 0, then we would have e^0 / (e^0 + e^versi + e^virg) = 1 / (1 + e^versi + e^virg) > 0.
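
      A quick numeric check of that point (the raw output values below are made up): even when setosa's raw output is 0 and the others are large, the softmax output is tiny but never exactly 0.

      import math

      raws = {"setosa": 0.0, "versicolor": 5.0, "virginica": 5.0}   # hypothetical raw outputs
      total = sum(math.exp(r) for r in raws.values())
      print(math.exp(raws["setosa"]) / total)   # about 0.0034 -- small, but greater than 0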

  • @АлександраРыбинская-п3л
    @АлександраРыбинская-п3л ปีที่แล้ว

    Dear Josh, I adore your lessons! They make everything so clear! I have a small question regarding this video. Why do you say that the predicted species is setosa when the predicted probability for setosa is only 0.15 (17:13 - 17:20)? There is a larger value (0.46) for virginica in this case (17:14). Why don't we say it's virginica?

    • @statquest
      @statquest  ปีที่แล้ว

      You are correct that virginica has the largest output value - however, because we know that the first row of data is for setosa, for that row, we are only interested in the predicted probability for setosa. This gives us the "loss" (the difference between the known value for setosa, 1, and the predicted value for setosa, 0.15 (except in this case we're using logs)) for that first row. For the second row, the known value is virginica, so, for that row, we are only interested in the predicted probability for virginica.
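
      A small sketch of that bookkeeping in Python (the 0.15 and 0.46 come from the discussion above; the other probabilities are made up to round things out): for each row, only the predicted probability of that row's known species enters the loss.

      import math

      rows = [("setosa",    {"setosa": 0.15, "versicolor": 0.39, "virginica": 0.46}),
              ("virginica", {"setosa": 0.20, "versicolor": 0.30, "virginica": 0.50})]

      for true_species, p in rows:
          loss = -math.log(p[true_species])     # cross entropy for this one row
          print(true_species, round(loss, 2))   # setosa row: -log(0.15) is about 1.9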

    • @АлександраРыбинская-п3л
      @АлександраРыбинская-п3л ปีที่แล้ว +1

      Thanks@@statquest

  • @ariq_baze4725
    @ariq_baze4725 2 ปีที่แล้ว +1

    Thank you, you are the best

  • @hangchen
    @hangchen 10 หลายเดือนก่อน

    Awesome explanation! Now I understand neural networks in more depth! Just one question - shouldn't the output of the softmax values sum to 1? @18:57

    • @statquest
      @statquest  10 หลายเดือนก่อน

      Thanks! And yes, the output of the softmax should sum to 1. However, I rounded the numbers to the nearest 100th and, as a result, it appears like they don't sum to 1. This is just a rounding issue.
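
      A quick way to see the rounding effect (the raw output values are made up):

      import math

      raws = [1.2, 0.4, 0.9]
      exps = [math.exp(r) for r in raws]
      probs = [e / sum(exps) for e in exps]
      print(sum(probs))                    # 1.0 (up to floating-point error)
      print([round(p, 2) for p in probs])  # [0.46, 0.21, 0.34] -- rounded, these sum to 1.01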

    • @hangchen
      @hangchen 10 หลายเดือนก่อน +1

      Oh got it! Right if I add them up they are 1.01, which is basically 1. I just eyeballed it. Should have done a quick mind calc haha! By the way, I am so honored to have your reply!! Thanks for making my day (again, BAM!)!@@statquest

    • @statquest
      @statquest  10 หลายเดือนก่อน

      @@hangchen :)

  • @michaelyang3414
    @michaelyang3414 6 หลายเดือนก่อน

    Excellent work!!! Could you make one more video showing how to optimize all the parameters at the same time?

    • @statquest
      @statquest  6 หลายเดือนก่อน

      I show that for a simple neural network in this video: th-cam.com/video/GKZoOHXGcLo/w-d-xo.html

    • @michaelyang3414
      @michaelyang3414 6 หลายเดือนก่อน +1

      @@statquest Yes, I watched that video several times. Actually, I watched all 28 videos in your neural network/deep learning series several times. I am also a member and have bought your books. Thank you for your excellent work! But that video is just for one input and one output. Would you make another video to show how to handle multiple inputs and outputs, similar to the video you recommended?

    • @statquest
      @statquest  6 หลายเดือนก่อน

      @@michaelyang3414 Thank you very much for your support! I really appreciate it. I'll keep that topic in mind.

  • @Tapsthequant
    @Tapsthequant 3 ปีที่แล้ว

    So much gold in this one video. How did you select the learning rate of 1? In general, how do you select learning rates? Do you have ways to dynamically alter the learning rate in gradient descent? Taking recommendations.

    • @statquest
      @statquest  3 ปีที่แล้ว +1

      For this video I coded everything by hand and setting the learning rate to 1 worked fine and was super easy. However, in general, most implementations of gradient descent will dynamically change the learning rate for you - so it should not be something you have to worry about in practice.
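
      One common pattern for changing the rate over time (just a sketch; it is not what the video does, which keeps the rate fixed at 1) is to shrink it as training goes on:

      base_rate = 1.0
      decay = 0.1
      for epoch in range(5):
          learning_rate = base_rate / (1.0 + decay * epoch)
          print(epoch, round(learning_rate, 3))   # 1.0, 0.909, 0.833, 0.769, 0.714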

    • @Tapsthequant
      @Tapsthequant 3 ปีที่แล้ว

      Thank you 😊, you know I have been following this series and taking notes. I literally have a notebook.
      I also have Excel workbooks with implementations of the examples. I'm now at this video of CE, taking notes again.
      This is the softest landing I have had to a subject. Thank you 😊.
      Now, how do I take this subject of neural networks further after this series? I am learning informally.
      Thank you Josh Starmer,

    • @statquest
      @statquest  3 ปีที่แล้ว

      @@Tapsthequant I think the next step is to learn about RNNs and LSTMs (types of neural networks). I'll have videos on those soon.

  • @evilone1351
    @evilone1351 2 ปีที่แล้ว

    Excellent series! Enjoyed every one of them so far, but that's the one where I lost it :) Too many subscripts and quotes in formulas.. Math has been abstracted too much here I guess, sometimes just a formula makes it easier to comprehend :D

  • @MADaniel717
    @MADaniel717 3 ปีที่แล้ว +1

    If I want to find the biases of other nodes, do I just take the derivative with respect to them? What about the weights? Just became a member, you convinced me with these videos lol, congrats and thanks

    • @statquest
      @statquest  3 ปีที่แล้ว

      Wow! Thank you for your support. For a demo of backpropagation, we start with one bias: th-cam.com/video/IN2XmBhILt4/w-d-xo.html then we extend that to one bias and 2 weights: th-cam.com/video/iyn2zdALii8/w-d-xo.html then we extend that to all biases and weights: th-cam.com/video/GKZoOHXGcLo/w-d-xo.html

    • @MADaniel717
      @MADaniel717 3 ปีที่แล้ว

      @@statquest Thanks Josh! Maybe I just missed it. I meant the hidden layers' weights and biases.

    • @statquest
      @statquest  3 ปีที่แล้ว +1

      @@MADaniel717 Yes, those are covered in the links I provided in the last comment.

  • @yuewang3962
    @yuewang3962 3 ปีที่แล้ว +2

    Caught a fresh one

  • @zedchouZ2ed
    @zedchouZ2ed 2 หลายเดือนก่อน

    At the end of this video, the backpropagation algorithm uses batch gradient descent to update b3, which means using the whole dataset to update one weight or bias. If we only used one sample, it would be SGD, and if we had more data and split it into mini-batches and fed them in one by one, it would be mini-batch gradient descent. Am I right about this training strategy?

    • @statquest
      @statquest  2 หลายเดือนก่อน

      yep
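
      A minimal sketch of those three strategies, assuming a hypothetical helper grad(sample, b3) that returns dCE/db3 for one sample at the current b3:

      def step(samples, b3, lr, grad):
          return b3 - lr * sum(grad(s, b3) for s in samples)

      # Batch gradient descent: one update per pass, using every sample.
      #     b3 = step(all_samples, b3, lr, grad)
      # Stochastic gradient descent: one update per sample.
      #     for s in all_samples: b3 = step([s], b3, lr, grad)
      # Mini-batch gradient descent: one update per small batch.
      #     for i in range(0, len(all_samples), 32): b3 = step(all_samples[i:i+32], b3, lr, grad)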

  • @minerodo
    @minerodo ปีที่แล้ว

    Thank you!! I understood everything, but just one question: here you explain how to modify a single bias, and now I understand how to do it for each one of the biases. My question is, how do you backpropagate to the biases that are in the hidden layer? At what point? After you finish with b3, b4 and b5? Thanks!!

    • @statquest
      @statquest  ปีที่แล้ว

      I show how to backpropagate through the hidden layer in this video: th-cam.com/video/GKZoOHXGcLo/w-d-xo.html

  • @praveerparmar8157
    @praveerparmar8157 3 ปีที่แล้ว +2

    Waiting for "Neural Networks in Python: from Start to Finish" :)

    • @statquest
      @statquest  3 ปีที่แล้ว +4

      I'll start working on that soon.

    • @xian2708
      @xian2708 3 ปีที่แล้ว +1

      Legend!

  • @saibalaji99
    @saibalaji99 2 ปีที่แล้ว

    Do we use the same training data until all the biases are optimised?

  • @sachinK-k5q
    @sachinK-k5q 9 หลายเดือนก่อน

    Please create a series like this for the single-layer perceptron as well, and show the derivatives too.

    • @statquest
      @statquest  9 หลายเดือนก่อน

      I'll keep that in mind.

  • @shubhamtalks9718
    @shubhamtalks9718 3 ปีที่แล้ว +1

    BAM! Clearly explained.

  • @user-rt6wc9vt1p
    @user-rt6wc9vt1p 3 ปีที่แล้ว

    Are we calculating the derivative of the total cost function (e.g. -log(a) - log(b) - log(c)), or just the loss for that respective weight's output?

    • @statquest
      @statquest  3 ปีที่แล้ว

      We are calculating the derivative of the total cross entropy with respect to the bias, b3.

  • @borishjha5700
    @borishjha5700 2 ปีที่แล้ว

    The word "probability" is spelled wrong at the timestamp 18:55

  • @a909ym0u7
    @a909ym0u7 หลายเดือนก่อน

    I have a question: when we optimized b3 we held b4 and b5 constant, so when we try to optimize b4 and b5, will this in turn affect b3, since their values are now changing?

    • @statquest
      @statquest  หลายเดือนก่อน

      Regardless of the number of parameters you are estimating, you evaluate the derivatives with the current state of the neural network before updating all of the parameters. For more details, see: th-cam.com/video/GKZoOHXGcLo/w-d-xo.html
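
      A tiny sketch of that bookkeeping with a made-up loss, L(b3, b4) = (b3 - 1)^2 + (b4 + 2)^2, just to show that both derivatives are evaluated at the old values before either parameter changes:

      def d_b3(b3, b4): return 2 * (b3 - 1)
      def d_b4(b3, b4): return 2 * (b4 + 2)

      b3, b4, lr = 0.0, 0.0, 0.1
      for _ in range(3):
          g3, g4 = d_b3(b3, b4), d_b4(b3, b4)   # evaluate both at the current state...
          b3, b4 = b3 - lr * g3, b4 - lr * g4   # ...then update both together
      print(b3, b4)                             # heading toward b3 = 1, b4 = -2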

  • @_epe2590
    @_epe2590 3 ปีที่แล้ว +1

    Could you please do videos on classification, specifically gradient descent for classification?

    • @statquest
      @statquest  3 ปีที่แล้ว

      Can you explain how that would be different from what is in this video? In this video, we use gradient descent to optimize the bias term. In neural network circles, they call this "backpropagation" because of how the derivatives are calculated, but it is still just gradient descent.

    • @_epe2590
      @_epe2590 3 ปีที่แล้ว +1

      @@statquest Well, when I see others explaining it, it's usually with a 3-dimensional non-linear graph. When you demo it, the graph always looks like a parabola. Am I missing something important?

    • @statquest
      @statquest  3 ปีที่แล้ว

      @@_epe2590 When I demo it, I try to make it as simple as possible by focusing on just one variable at a time. When you do that, you can often draw the loss function as a parabola. However, when you focus on more than one variable, the graphs get much more complicated.

    • @_epe2590
      @_epe2590 3 ปีที่แล้ว +1

      @@statquest Ok. And I love your videos by the way. They are easy to understand and absorb. BAM!

  • @aritahalder9397
    @aritahalder9397 2 ปีที่แล้ว

    Hi, do we have to consider the inputs as batches of setosa, versicolor and virginica?? What if, while calculating the derivative of the total CE, we had setosa in the 1st row as well as in the 2nd row?? What would the value of dCE(pred2)/db3 be?

    • @statquest
      @statquest  2 ปีที่แล้ว

      We don't have to consider batches - we should be able to add up the losses from each sample for setosa.

  • @lokeshbansal2726
    @lokeshbansal2726 3 ปีที่แล้ว

    Thank you so much! You are making some amazing content.
    Can you please suggest a good book on neural networks in which the mathematics of the algorithms is explained, or tell us where you are learning about machine learning and neural networks?
    Again, thank you for these precious videos.

    • @statquest
      @statquest  3 ปีที่แล้ว

      Here's where I learned about the math behind cross entropy: www.mldawn.com/back-propagation-with-cross-entropy-and-softmax/ (by the way, I didn't watch the video - I just read the web page).

  • @sonoVR
    @sonoVR ปีที่แล้ว

    This is really helpful!
    So am I right to assume that in the end, when using one-hot encoding, we can simplify it to d/dBn = Pn - Tn and d/dWni = (Pn - Tn)Xi?
    Here n indexes the outputs, P is the prediction, T is the one-hot encoded target, i indexes the inputs, Wni is the weight from that input to the respective output, and X is the input.
    Then, when backpropagating, we can transpose the weights, multiply them by the respective errors Pn - Tn in the output layer, and sum them to get an error for each hidden node, if I'm correct.

    • @statquest
      @statquest  ปีที่แล้ว

      For the Weight, things are a little more complicated because the input is modified by previous weights and biases and the activation function. For more details, see: th-cam.com/video/iyn2zdALii8/w-d-xo.html

  • @콘충이
    @콘충이 3 ปีที่แล้ว +1

    Appreciate it so much!

  • @danielsimion3021
    @danielsimion3021 4 หลายเดือนก่อน

    What about the derivatives with respect to the inner weights like w1 or w2, before the ReLU function? Because, for example, w1 affects all 3 raw output values, unlike b3, which affects only the first raw output.

    • @statquest
      @statquest  4 หลายเดือนก่อน +1

      See: th-cam.com/video/GKZoOHXGcLo/w-d-xo.html

    • @danielsimion3021
      @danielsimion3021 4 หลายเดือนก่อน

      @@statquest Thanks for your answer, I've already seen that video; my problem is that w1 affects all 3 raw outputs, so when you take the derivative of the predicted probability with respect to the raw output, which raw output should you use: setosa, virginica or versicolor?
      Whichever you choose, you get back to w1, because the setosa, virginica and versicolor raw outputs all have w1 in their expressions.

    • @statquest
      @statquest  4 หลายเดือนก่อน +1

      @@danielsimion3021 You use them all.
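
      A small numeric sketch of "use them all" (every number below is made up): because the hidden node feeds all three raw outputs, the chain rule sums over the three outputs, and for softmax plus cross entropy each output contributes (predicted - observed) times the weight from the hidden node to that output.

      p      = {"setosa": 0.15, "versicolor": 0.39, "virginica": 0.46}  # current predictions
      t      = {"setosa": 1.0,  "versicolor": 0.0,  "virginica": 0.0}   # this row is a setosa
      w_out  = {"setosa": 0.5,  "versicolor": -1.2, "virginica": 0.8}   # hidden node -> output weights
      dy_dw1 = 0.33  # how the hidden node's output changes with w1 (depends on the input and ReLU)

      d_ce_dw1 = sum((p[s] - t[s]) * w_out[s] for s in p) * dy_dw1
      print(round(d_ce_dw1, 3))   # -0.173 for these made-up numbers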

    • @danielsimion3021
      @danielsimion3021 4 หลายเดือนก่อน +1

      @@statquest Ok; I did it with pen and paper and finally understood. Thank you very much.

    • @statquest
      @statquest  4 หลายเดือนก่อน +1

      @@danielsimion3021 bam! :)

  • @ecotrix132
    @ecotrix132 11 หลายเดือนก่อน

    Thanks so much for posting these videos! I am curious about this: when using gradient descent for SSR, one could get stuck at a local minimum. One shouldn't face this problem with cross entropy, right?

    • @statquest
      @statquest  11 หลายเดือนก่อน +1

      No, you can always get stuck in a local minimum.

  • @TheTehnigga
    @TheTehnigga 2 ปีที่แล้ว

    Is cross entropy backpropagation done on a test set?

    • @statquest
      @statquest  2 ปีที่แล้ว +1

      Just the training data.

  • @Xayuap
    @Xayuap ปีที่แล้ว

    Yo, Josh,
    in my example, with two outputs,
    if I repeatedly adjust one b, then the other b hardly needs any adjustment.
    Should I adjust both in parallel?

  • @ΓάκηςΓεώργιος
    @ΓάκηςΓεώργιος 3 ปีที่แล้ว

    Nice video!
    I only have one question:
    how do I do it when there are more than 3 data points (for example, n for setosa, m for virginica, k for versicolor)?

    • @statquest
      @statquest  3 ปีที่แล้ว +1

      You just run all the data through the neural network, as shown at 17:04, to calculate the cross entropy etc.

    • @ΓάκηςΓεώργιος
      @ΓάκηςΓεώργιος 3 ปีที่แล้ว +1

      Thank you a lot for your help Josh

  • @Xayuap
    @Xayuap ปีที่แล้ว

    Double Bam,
    can we use 2 instead of e for the base?
    I mean, it would fit the architecture better.

    • @statquest
      @statquest  ปีที่แล้ว +1

      As long as you are consistent, you can use whatever base you want. But, generally speaking, log base 'e' is the easiest one to work with.
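
      A quick illustration that the base only changes things by a constant factor (0.15 here is just an example probability):

      import math

      p = 0.15
      print(-math.log(p))                 # natural log: about 1.897
      print(-math.log2(p))                # base 2: about 2.737
      print(-math.log(p) / math.log(2))   # same as the base-2 value -- a constant rescaling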

    • @Xayuap
      @Xayuap ปีที่แล้ว +1

      @@statquest Yep, on the whiteboard it would be the most elegant.
      But for the processor, using base 2 in the log or exp would mean just shifting the significand left or right, which I'd hope would be as fast as using ReLU.

  • @dr.osamahabdullah1390
    @dr.osamahabdullah1390 3 ปีที่แล้ว

    Is there any chance you could talk about deep learning or compressive sensing please? Your videos are so awesome.

    • @statquest
      @statquest  3 ปีที่แล้ว

      Deep learning is a pretty vague term. For some, deep learning just means a neural network with 3 or more hidden layers. For others, deep learning refers to a convolutional neural network. I explain CNNs in this video: th-cam.com/video/HGwBXDKFk9I/w-d-xo.html

  • @environmentalchemist1812
    @environmentalchemist1812 3 ปีที่แล้ว

    Some topic suggestions: Could you go over the distinction between PCA and Factor Analysis, and describe the different factor rotations (orthogonal vs oblique, varimax, quartimax, equimax, oblimin, etc)?

    • @statquest
      @statquest  3 ปีที่แล้ว

      I'll keep that in mind.

  • @Xayuap
    @Xayuap ปีที่แล้ว

    Hi, serious question:
    can I do the same with the final w weights?
    Something is not converging in my tests.

    • @statquest
      @statquest  ปีที่แล้ว

      What time point, minutes and seconds, are you asking about?

    • @Xayuap
      @Xayuap ปีที่แล้ว

      ​@@statquest
      I mean the cross entropy adjustment for the b bias.
      Can I do the same for the w weights?
      I understand the cross entropy derivatives with respect to the final weights to be
      dCe/dWyi = Psetosa × Yi
      and
      dCe/dWyi = (Psetosa - 1) × Yi when the measured sample is setosa,
      where Yi is the Y output of the previous node.

    • @statquest
      @statquest  ปีที่แล้ว

      @@Xayuap I believe that is correct.
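
      A tiny numeric sketch of that pattern for one final-layer weight (the probability and hidden output below are made up): the bias-style derivative just gets multiplied by the hidden node's output.

      p_setosa = 0.15   # current predicted probability for setosa
      y_hidden = 0.7    # output of the hidden node this weight comes from

      print((p_setosa - 1.0) * y_hidden)  # -0.595, for a row whose known species is setosa
      print(p_setosa * y_hidden)          #  0.105, for a row whose known species is not setosa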

    • @Xayuap
      @Xayuap ปีที่แล้ว

      Thanks. Well, if that is correct then maybe my implementation is off:
      when I try to adjust both Ws, the derivatives converge to integer numbers other than 0.
      I'm not adjusting the B bias, only the final Ws.

  • @user-rt6wc9vt1p
    @user-rt6wc9vt1p 3 ปีที่แล้ว

    Is the process for calculating derivatives with respect to weights and biases the same for each layer we backpropagate through? Or would the derivative chain be made up of more parts for certain layers?

    • @statquest
      @statquest  3 ปีที่แล้ว

      If each layer is the same, then the process is the same.

    • @user-rt6wc9vt1p
      @user-rt6wc9vt1p 3 ปีที่แล้ว +1

      great, thanks!

  • @TempName-x1o
    @TempName-x1o ปีที่แล้ว

    Thank you very much, the video is awesome.
    I have a question about one point:
    when you take the derivative of CE(setosa, virginica, versicolor) with respect to b3, you used the raw output for setosa in all of them (for setosa, virginica and versicolor),
    * b3 goes with the setosa raw output
    * b4 goes with the virginica raw output
    * b5 goes with the versicolor raw output
    but what happens for w1 or w5? [Should I sum all of them and apply the chain rule? (I guess that's not the right way.)] Because
    * the derivative with respect to b3 of the virginica or versicolor raw output is 0; only the setosa raw output gives 1,
    but
    * the derivative with respect to w1 of the virginica, versicolor or setosa raw output is 1 if (petal width) > 0 else 0 (from the derivative of ReLU [1 if x > 0 else 0]).
    Thanks for reading my very long question, BAM :)

    • @statquest
      @statquest  ปีที่แล้ว

      The further back (closer to the input) we go, the more stuff we have to take into account because the change in w1 or w5 can affect multiple outputs, not just the one.

  • @kamshwuchin6907
    @kamshwuchin6907 3 ปีที่แล้ว

    Thank you for the effort you put into making these amazing videos!! It helps me a lot in visualising the concepts. Can you make a video about information gain too? Thank you!!

    • @statquest
      @statquest  3 ปีที่แล้ว +2

      I'll keep that in mind.

    • @raminmdn
      @raminmdn 3 ปีที่แล้ว +1

      @@statquest I think videos on general concepts of information theory (such as information gain) would be greatly beneficial for many, many people out there, and a very nice addition to the machine learning series. I have not been able to find videos as comprehensive (and at the same time clearly explained) as yours anywhere on TH-cam or in online courses, especially when it comes to concepts that usually seem so complicated.

  • @zahari_s_stoyanov
    @zahari_s_stoyanov 2 ปีที่แล้ว

    I wonder what dCE would be for the corresponding weights, though.

    • @statquest
      @statquest  2 ปีที่แล้ว

      For the other weights, it's just more backpropagation. To see the concept of how this might work, see: th-cam.com/video/iyn2zdALii8/w-d-xo.html and th-cam.com/video/GKZoOHXGcLo/w-d-xo.html

    • @zahari_s_stoyanov
      @zahari_s_stoyanov 2 ปีที่แล้ว

      @@statquest I haven't done maths for years, and even when I studied it at uni, derivatives were my kryptonite :D BUT I was able to find this for a one-hot output vector:
      dCE/dw[ij] = y[i](p[j] − o[j]), where "i" is the neuron from the previous layer (right before the raw output layer) and "j" is the neuron from the output (softmax) layer. y[i] is the output of neuron "i" before multiplying it by w[ij]. p is "predicted" and o is "observed".

    • @zahari_s_stoyanov
      @zahari_s_stoyanov 2 ปีที่แล้ว +1

      @@statquest I watched the whole series and it's really helpful! Just couldn't figure out this final step in order to train my network :D

    • @zahari_s_stoyanov
      @zahari_s_stoyanov 2 ปีที่แล้ว

      Now that I think about it, it basically means that dCE/dw[ij] = y[i] * dCEb[j]

    • @zahari_s_stoyanov
      @zahari_s_stoyanov 2 ปีที่แล้ว

      Ah, but of course - it is dCEb * dRAWw ! RAW is the sum of all (Node * w). So, with respect to a particular weight, RAW = Node * w + C, therefore dRAWw = Node. Hence, dCEw = dCEb * Node

  • @beshosamir8978
    @beshosamir8978 2 ปีที่แล้ว

    Hi Josh,
    I have a quick question. I saw a video on TH-cam where the person explaining the concept said they use the sigmoid function in the output layer for binary classification and ReLU for the hidden layers. So I think we run into the same problem here, which is that the gradient of the sigmoid function is too small, which makes us end up taking a small step. So I thought that we could also use cross entropy in this situation, right?

    • @statquest
      @statquest  2 ปีที่แล้ว

      I'm not sure I fully understand your question; any time you have more than one category, you can use cross entropy.

    • @beshosamir8978
      @beshosamir8978 2 ปีที่แล้ว

      @@statquest
      I mean, can I use cross entropy for binary classification?

    • @statquest
      @statquest  2 ปีที่แล้ว +1

      @@beshosamir8978 Yes.

    • @beshosamir8978
      @beshosamir8978 2 ปีที่แล้ว

      @@statquest
      So, is it smart to use it in a binary classification problem? Or is it better to just use the sigmoid function in the output layer?

  • @jaheimwoo866
    @jaheimwoo866 ปีที่แล้ว +2

    Save my university life!

  • @harshchoudhary2817
    @harshchoudhary2817 2 ปีที่แล้ว

    What I see here is that gradient descent optimizes on the basis of the total cross entropy and tries to minimize it.
    Suppose for some data the actual output is setosa, but the neural net predicts versicolor with a very high probability, say close to 1, so the loss would still be minimized and gradient descent won't optimize it. So we would get a wrong output with very high probability.
    Is that so, or am I missing something here?

    • @statquest
      @statquest  2 ปีที่แล้ว

      See 17:05. For the first row of data, the observed species is "setosa", but setosa gets the lowest predicted probability (0.15) and thus the Cross Entropy for that row is 1.89. Now, if, instead, the neural net predicted Versicolor for the first row with a probability of 0.98 and the prediction for Setosa was 0.01, then the Cross Entropy would be greater, it would be -log(0.01) = 4.6, and, as a result, the total cross entropy would also be greater (-log(0.01) + -log(0.98) + -log(0.01) = 9.23). So the loss would be significantly greater and gradient descent would optimize it.
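
      A quick check of those numbers:

      import math

      print(-math.log(0.15))                                      # about 1.9 (shown as 1.89 in the video)
      print(-math.log(0.01))                                      # about 4.6
      print(-math.log(0.01) + -math.log(0.98) + -math.log(0.01))  # about 9.23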

  • @andredahlinger6943
    @andredahlinger6943 2 ปีที่แล้ว +1

    Hey Josh, awesome videos

    • @statquest
      @statquest  2 ปีที่แล้ว

      I think the idea is to optimize for whatever your output ultimately ends up being.

    • @zahari_s_stoyanov
      @zahari_s_stoyanov 2 ปีที่แล้ว

      I think he said that this optimization is done instead of, not after SSR. Rather than calculating SSR and dSSR , we go another step further by using softMax, then calculate CE and dCE, which puts the final answers between 0.0 and 1.0 and also provides simpler calculations for backprop :)

  • @muntedme203
    @muntedme203 2 ปีที่แล้ว +1

    Awesome vid.

  • @tulikashrivastava2905
    @tulikashrivastava2905 3 ปีที่แล้ว

    Thanks for posting the NN video series. It came just in time when I needed it 😊 You have the knack of splitting complex topics into logical parts and explaining them like a breeze 😀😀
    Can I request some videos on gradient descent optimization and regularization?

    • @statquest
      @statquest  3 ปีที่แล้ว +1

      I have two videos on Gradient Descent and five on Regularization. You can find all of my videos here: statquest.org/video-index/

    • @tulikashrivastava2905
      @tulikashrivastava2905 3 ปีที่แล้ว

      @@statquest Thanks for your quick reply! I have seen those videos and they are great as usual 👍👍
      I was asking about gradient descent optimization for deep networks, like Momentum, NAG, Adagrad, Adadelta, RMSProp and Adam, and regularization techniques for deep networks like weight decay, dropout, early stopping, data augmentation and batch normalization.

    • @statquest
      @statquest  3 ปีที่แล้ว

      @@tulikashrivastava2905 Noted.

  • @مهیارجهانینسب
    @مهیارجهانینسب 2 ปีที่แล้ว

    Awesome video. I really appreciate how you explain all these concepts in a fun way.
    I have a question: in the previous video, on softmax, you said the predicted probabilities for the classes are not reliable, even when they correctly classify the input data, because of our random initial values for the weights and biases. Now, by using cross entropy, we basically multiply the observed probability in the dataset by log p and then optimize it. So are the predicted probabilities for the different classes of an input reliable now?

    • @statquest
      @statquest  2 ปีที่แล้ว

      To be clear, I didn't say that the output from softmax was not reliable, I just said that it should not be treated as a "probability" when interpreting the output.