Batch Normalization - EXPLAINED!

  • Published Sep 15, 2024

Comments • 130

  • @ssshukla26
    @ssshukla26 4 years ago +48

    Shouldn't it be that gamma should approximate the true variance of the neuron activation and beta should approximate the true mean of the neuron activation? I am just confused...

    • @CodeEmporium
      @CodeEmporium  4 years ago +25

      You're right. Misspoke there. Nice catch!

    • @ssshukla26
      @ssshukla26 4 years ago

      @@CodeEmporium Cool

    • @dhananjaysonawane1996
      @dhananjaysonawane1996 3 years ago +1

      How is this approximation happening?
      And how do we use beta, gamma at test time? We have only one example at a time during testing.

    • @FMAdestroyer
      @FMAdestroyer 2 years ago +1

      @@dhananjaysonawane1996 In most frameworks, when you create a BN layer, the mean and variance (beta and gamma) are both learnable parameters, usually represented as the layer's weight and bias (see the short sketch after this thread). You can deduce that from the Torch BatchNorm2d layer's description below:
      "The mean and standard-deviation are calculated per-dimension over the mini-batches and γ and β are learnable parameter vectors of size C (where C is the input size)."

    • @AndyLee-xq8wq
      @AndyLee-xq8wq 2 years ago

      Thanks for clarification!
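
      A minimal PyTorch sketch of the point made in this thread (an illustration, not code from the video): gamma and beta are stored in the BatchNorm layer's weight and bias, and the layer separately tracks running statistics, which are what it falls back on at test time when only a single example is available.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=3)        # one gamma/beta pair per channel

bn.train()
_ = bn(torch.randn(8, 3, 4, 4))            # training mode: batch statistics are used,
                                           # running_mean / running_var are updated

print(bn.weight.shape, bn.bias.shape)                # gamma and beta -> torch.Size([3])
print(bn.running_mean.shape, bn.running_var.shape)   # running stats  -> torch.Size([3])
```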

  • @efaustmann
    @efaustmann 4 years ago +23

    Exactly what I was looking for. Very well researched and explained in a simple way, with visualizations. Thank you very much!

  • @sumanthbalaji1768
    @sumanthbalaji1768 4 years ago +9

    Just found your channel and binged through all your videos, so here's a general review. As a student, I assure you your content is on point and goes in depth, unlike other channels that just skim the surface. Keep it up and don't be afraid to go more in depth on concepts. We love it. Keep it up brother, you have earned a supporter till your channel's end.

    • @CodeEmporium
      @CodeEmporium  4 years ago +2

      Thanks ma guy. I'll keep pushing up content. Good to know my audience loves the details ;)

    • @sumanthbalaji1768
      @sumanthbalaji1768 4 years ago

      @@CodeEmporium damn did not actually expect you to reply lol. Maybe let me throw a topic suggestion then. More NLP please, take a look at summarisation tasks as a topic. Would be damn interesting.

  • @angusbarr7952
    @angusbarr7952 4 years ago +16

    Hey! Just cited you in my undergrad project because your example finally made me understand batch norm. Thanks a lot!

    • @CodeEmporium
      @CodeEmporium  4 years ago +4

      Sweet! Glad it was helpful homie

  • @oheldad
    @oheldad 4 years ago +6

    Hey there. I'm on my way to becoming a data scientist, and your videos help me a lot! Keep going, I'm sure I am not the only one you inspired :) thank you!!

    • @CodeEmporium
      @CodeEmporium  4 years ago +1

      Awesome! Glad these videos help! Good luck with your Data science ventures :)

    • @ccuuttww
      @ccuuttww 4 years ago +2

      Your aim should not be to become a data scientist to fit other people's expectations; you should become a person who can deal with data and estimate any unknown parameter by your own standards.

    • @oheldad
      @oheldad 4 years ago

      @@ccuuttww Don't know why you decided that I'm fulfilling others' expectations of me - it's not true. I'm in the last semester of my electrical engineering degree, and decided to change path a little :)

    • @ccuuttww
      @ccuuttww 4 years ago

      because most people think in the following pattern: finish all the exam semesters and graduate with good marks, send out CVs en masse, and try to get a job titled "Data Scientist",
      then try to fit what they learned at university to the job like a trained monkey. However, you are not dealing with a real-world situation; you are just trying to deal with your customer or your boss. Since this topic never has a standard answer, you can only define it yourself, and your client only trusts your title.
      I feel this is really bad

  • @dragonman101
    @dragonman101 3 years ago +1

    Quick note: at 6:50 there should be brackets after 1/3 (see below)
    Yours: 1/3 (4 - 5.33)^2 + (5 - 5.33)^2 + (7 - 5.33)^2
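
    The corrected expression, with the brackets applied to the whole sum (a quick arithmetic check of the same numbers, not taken from the video):

```latex
\mu = \tfrac{1}{3}(4 + 5 + 7) \approx 5.33, \qquad
\sigma^2 = \tfrac{1}{3}\left[(4-5.33)^2 + (5-5.33)^2 + (7-5.33)^2\right] \approx 1.56
```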

  • @seyyedpooyahekmatiathar624
    @seyyedpooyahekmatiathar624 4 years ago +2

    Subtracting the mean and dividing by std is standardization. Normalization is when you change the range of the dataset to be [0,1].

  • @parthshastri2451
    @parthshastri2451 4 years ago +9

    Why did you plot the cost against height and age? Isn't it supposed to be a function of the weights in a neural network?

  • @EB3103
    @EB3103 3 years ago +2

    The loss is not a function of the features but a function of the weights

  • @jodumagpi
    @jodumagpi 4 years ago

    This is good! I think that giving an example as well as the use cases (advantages) before diving into the details always gets the job done.

  • @ryanchen6147
    @ryanchen6147 2 years ago +2

    At 3:27, I think your axes should be the *weight* for the height feature and the *weight* for the age feature, if that is a contour plot of the cost function.

    • @mohameddjilani4109
      @mohameddjilani4109 1 year ago +1

      Yes, that was an error that persisted across a long stretch of the video.

  • @superghettoindian01
    @superghettoindian01 1 year ago

    I see you are checking all these comments - so I will try to comment on all the videos I watch going forward, and say how I'm using them.
    Currently using this video as supplement to Andrej Karpathy’s makemore series pt 3.
    The other video has a more detailed implementation of batch normalization but you do a great job of summarizing the key concepts. I hope one day you and Andrej can create a video together 😊.

    • @CodeEmporium
      @CodeEmporium  1 year ago +1

      Thanks a ton for the comment. Honestly, any critical feedback is appreciated. So thank you. It would certainly be a privilege to collaborate with Andrej for sure. Maybe in the future :)

  • @maxb5560
    @maxb5560 4 years ago +1

    Love your videos. They help me a lot in understanding machine learning more and more.

  • @yeripark1135
    @yeripark1135 2 years ago

    I clearly understand the need for batch normalization and its advantages! Thanks!!

  • @Slisus
    @Slisus 2 years ago

    Awesome video. I really like how you go into the actual papers behind it.

  • @taghyeertaghyeer5974
    @taghyeertaghyeer5974 1 year ago +3

    Hello, thank you for your video.
    I am wondering about batch normalisation speeding up training: you showed at 2:42 the contour plot of the loss as a function of height and age. However, the loss function contours should be plotted against the weights (the optimization is performed in weight space, not the input space). In other words, why did you base your argument on the loss function with height and age as the variables (they should be held constant during optimization)?
    Thank you! Lana

    • @marcinstrzesak346
      @marcinstrzesak346 1 year ago

      For me, it also seemed quite confusing. I'm glad someone else noticed it too.

    • @atuldivekar
      @atuldivekar 7 months ago

      The contour plot is being shown as a function of height and age to show the dependence of the loss on the input distribution, not the weights

  • @ultrasgreen1349
    @ultrasgreen1349 2 years ago

    That's actually a very, very good and intuitive video. Honestly, thank you.

  • @ahmedshehata9522
    @ahmedshehata9522 2 years ago

    You are really, really good, because you reference the papers and introduce the idea.

  • @erich_l4644
    @erich_l4644 4 years ago +1

    This was so well put together- why less than 10k views? Oh... it's batch normalization

  • @MaralSheikhzadeh
    @MaralSheikhzadeh 2 years ago

    Thanks, this video helped me understand BN better. And I liked your sense of humor; it made watching more fun. :)

  • @lamnguyentrong275
    @lamnguyentrong275 4 years ago +3

    Wow, easy to understand, and a clear accent. Thank you, sir. You've done a great job.

  • @Inzurrekto1
    @Inzurrekto1 3 days ago

    Thank you for this video. Very well explained.

  • @balthiertsk8596
    @balthiertsk8596 2 years ago

    Hey man, thank you.
    I really appreciate this quality content!

  • @thoughte2432
    @thoughte2432 3 years ago +4

    I found this a really good and intuitive explanation, thanks for that. But there was one thing that confused me: isn't the effect of batch normalization the smoothing of the loss function? I found it difficult to associate the loss function directly to the graph shown at 2:50.

    • @Paivren
      @Paivren 1 year ago

      yes, the graph is a bit weird in the sense that the loss function is not a function of the features but of the model parameters.

  • @aaronk839
    @aaronk839 4 years ago +26

    Good explanation until 7:17 after which, I think, you miss the point which makes the whole thing very confusing. You say: "Gamma should approximate to the true mean of the neuron activation and beta should approximate to the true variance of the neuron activation." Apart from the fact that this should be the other way around, as you acknowledge in the comments, you don't say what you mean by "true mean" and "true variance".
    I learned from Andrew Ng's video (th-cam.com/video/tNIpEZLv_eg/w-d-xo.html) that the actual reason for introducing two learnable parameters is that you actually don't necessarily want all batch data to be normalized to mean 0 and variance 1. Instead, shifting and scaling all normalized data at one neuron to obtain a different mean (beta) and variance (gamma) might be advantageous in order to exploit the non-linearity of your activation functions.
    Please don't skip over important parts like this one with sloppy explanations in future videos. This gives people the impression that they understand what's going on, when they actually don't.

    • @dragonman101
      @dragonman101 3 years ago +3

      Thank you very much for this explanation. The link and the correction are very helpful and do provide some clarity to a question I had.
      That being said, I don't think it's fair to call his explanation sloppy. He broke down complicated material in a fantastic and clear way for the most part. He even linked to research so we could do further reading, which is great because now I have a solid foundation to understand what I read in the papers. He should be encouraged to fix his few mistakes rather than slapped on the wrist.

    • @sachinkun21
      @sachinkun21 2 years ago

      Thanks a ton!! I was actually looking for this comment, as I had the same question as to why we even need to approximate!
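
      For reference on the point debated above, the transform in the batch normalization paper (Ioffe & Szegedy, 2015) is, for each activation over a mini-batch B of size m:

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \quad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad
y_i = \gamma\,\hat{x}_i + \beta
```

      Setting gamma to sqrt(sigma_B^2 + eps) and beta to mu_B would recover the original activations, so the learnable pair lets the network choose whatever mean and scale best feed the next non-linearity, rather than being forced to mean 0 and variance 1.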

  • @danieldeychakiwsky1928
    @danieldeychakiwsky1928 4 years ago +7

    Thanks for the video. I wanted to add that there's debate in the community over whether to normalize pre vs. post non-linearity within the layers, i.e., for a given neuron in some layer, do you normalize the result of the linear function that gets piped through non-linearity or do you pipe the linear combination through non-linearity and then apply normalization, in both cases, over the mini-batch.

    • @kennethleung4487
      @kennethleung4487 3 years ago +3

      Here's what I found from MachineLearningMastery:
      o Batch normalization may be used on inputs to the layer before or after the activation function in the previous layer
      o It may be more appropriate after the activation function for S-shaped functions like the hyperbolic tangent and logistic function
      o It may be appropriate before the activation function for activations that may result in non-Gaussian distributions like the rectified linear activation function, the modern default for most network types
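
      A small sketch of the two orderings being debated (my own illustration with arbitrary layer sizes, not a recommendation from the video):

```python
import torch.nn as nn

# Option A: normalize the pre-activation, then apply the non-linearity
pre_activation_bn = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
)

# Option B: apply the non-linearity first, then normalize its output
post_activation_bn = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.BatchNorm1d(64),
)
```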

  • @SaifMohamed-de8uo
    @SaifMohamed-de8uo 3 months ago

    Great explanation thank you!

  • @pranavjangir8338
    @pranavjangir8338 4 years ago +1

    Isn't Batch Normalization also used to counter the exploding gradient problem? Would have loved some explanation on that too.

  • @priyankakaswan7528
    @priyankakaswan7528 3 years ago

    The real magic starts at 6:07; this video was exactly what I needed.

  • @pupfer
    @pupfer 2 years ago

    The only difficult part of batch norm, namely the backprop, isn't explained.

  • @akremgomri9085
    @akremgomri9085 4 months ago

    Very good explanation. However, there is something I didn't understand. Doesn't batch normalisation modify the input data so that m=0 and v=1, as explained in the beginning? So how the heck did we move from normalisation being applied to the inputs, to normalisation affecting the activation function? 😅😅

  • @99dynasty
    @99dynasty 2 years ago

    BatchNorm reparametrizes the underlying optimization problem to make it more stable (in the sense of loss Lipschitzness) and smooth (in the sense of “effective” β-smoothness of the loss).
    Not my words

  • @enveraaa8414
    @enveraaa8414 3 years ago

    Bro you have made the perfect video

  • @chandnimaria9748
    @chandnimaria9748 1 year ago

    Just what I was looking for, thanks.

  • @hervebenganga8561
    @hervebenganga8561 2 years ago

    This is beautiful. Thank you

  • @sriharihumbarwadi5981
    @sriharihumbarwadi5981 4 years ago +1

    Can you please make a video on how batch normalization and l1/l2 regularization interact with each other ?

  • @ayandogra2952
    @ayandogra2952 3 years ago

    Amazing work
    really liked it

  • @sanjaykrish8719
    @sanjaykrish8719 4 years ago

    Fantastic explanation using contour plots.

    • @CodeEmporium
      @CodeEmporium  4 years ago +1

      Thanks! Contour plots are the best!

  • @sevfx
    @sevfx 2 years ago

    Great explanation, but missing parentheses at 6:52 :p

  • @iliasaarab7922
    @iliasaarab7922 3 years ago

    Great explanation, thanks!

  • @Hard_Online
    @Hard_Online 4 years ago

    The best I have seen so far

  • @themightyquinn100
    @themightyquinn100 2 years ago

    Wasn't there an episode where Peter was playing against Larry Bird?

  • @manthanladva6547
    @manthanladva6547 4 years ago

    Thanks for the awesome video.
    Got many ideas about Batch Norm.

  • @PavanTripathi-rj7bd
    @PavanTripathi-rj7bd 1 year ago

    great explanation

    • @CodeEmporium
      @CodeEmporium  1 year ago

      Thank you! Enjoy your stay on the channel :)

  • @ccuuttww
    @ccuuttww 4 years ago +1

    I wonder, is it suitable to use the population estimator?
    I think nowadays most machine learning learners/students/fans spend very little time on statistics. After several years of study, I find that model selection and statistical theory are the most important parts, especially Bayesian learning, the most underrated topic today.

  • @nyri0
    @nyri0 2 years ago

    Your visualizations are misleading. Normalization doesn't turn the shape on the left into the circle seen on the right. It will be less elongated but still keep a diagonal ellipse shape.

  • @God-vl5uz
    @God-vl5uz 3 months ago

    Thank you!

  • @lazarus8011
    @lazarus8011 3 months ago

    Good video
    here's a comment for the algorithm

  • @strateeg32
    @strateeg32 2 years ago

    Awesome thank you!

  • @חייםזיסמן-ש8ב
    @חייםזיסמן-ש8ב 3 years ago

    Awesome explanation.

  • @nobelyhacker
    @nobelyhacker 3 years ago

    Nice video, but I guess there is a little error at 6:57? You have to multiply the whole sum by 1/3, not only the first term.

  • @aminmw5258
    @aminmw5258 2 years ago

    Thank you bro.

  • @mizzonimirko
    @mizzonimirko 1 year ago

    I do not understand properly how this is going to be implemented. We actually perform those operations at the end of an epoch, right? At that point, the layer where I have applied it is normalized, right?

  • @luisfraga3281
    @luisfraga3281 4 years ago

    Hello, I wonder: what if we don't normalize the image input data (RGB 0-255) and then we use batch normalization? Is it going to work smoothly, or is it going to mess up the learning?

  • @user-nx8ux5ls7q
    @user-nx8ux5ls7q 2 years ago

    Do we calculate the mean and SD across a mini-batch for a given neuron, or across all the neurons in a layer? Andrew Ng says it's across each layer. Thanks.
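
    In the standard formulation the statistics are per neuron (per feature), computed across the examples of the mini-batch; a small sketch of that computation (my own illustration, not code from the video):

```python
import torch

batch = torch.randn(32, 10)              # 32 examples, 10 neurons in the layer
mean = batch.mean(dim=0)                 # one mean per neuron     -> shape (10,)
var = batch.var(dim=0, unbiased=False)   # one variance per neuron -> shape (10,)
normalized = (batch - mean) / torch.sqrt(var + 1e-5)
```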

  • @shaz-z506
    @shaz-z506 4 years ago

    Good video, could you please make a video on capsule networks?

  • @JapiSandhu
    @JapiSandhu 2 years ago

    this is a great video

  • @QuickTechNow
    @QuickTechNow 1 month ago

    Thanks

  • @samratkorupolu
    @samratkorupolu 3 years ago

    wow, you explained pretty clearly

  • @SillyMakesVids
    @SillyMakesVids 4 years ago

    Sorry, but where did gamma and beta come from, and how are they used?

  • @JapiSandhu
    @JapiSandhu 2 years ago

    Can I add a Batch Normalization layer after an LSTM layer in PyTorch?

  • @adosar7261
    @adosar7261 1 year ago

    And why not just normalize the whole training set instead of using batch normalization?

    • @CodeEmporium
      @CodeEmporium  1 year ago

      Batch normalization will normalize through different steps of the network. If we want to “normalize the whole training set”, we need to pass all training examples at once to the network as a single batch. This is what we see in “batch gradient descent”, but isn’t super common for large datasets because of memory constraints.

  • @user-nx8ux5ls7q
    @user-nx8ux5ls7q 2 years ago

    Also, can someone say how to make gamma and beta learnable? Gamma can be thought of as an additional weight attached to the activation, but how about beta? How do we train that?
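
    Both are trained by gradient descent just like any other weight or bias. With y_i = gamma * x_hat_i + beta, the batch normalization paper gives the gradients over a mini-batch of size m as:

```latex
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}\,\hat{x}_i,
\qquad
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}
```

    so beta behaves exactly like a per-neuron bias term.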

  • @pranaysingh3950
    @pranaysingh3950 2 years ago

    Thanks!

  • @mohammadkaramisheykhlan9
    @mohammadkaramisheykhlan9 2 years ago

    How can we use batch normalization on the test set?
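
    At test time the layer switches from batch statistics to the running averages accumulated during training; a minimal PyTorch sketch of that switch (my own illustration, not code from the video):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(10)

bn.train()
_ = bn(torch.randn(32, 10))    # training: batch statistics used, running averages updated

bn.eval()
out = bn(torch.randn(1, 10))   # testing: stored running averages are used,
                               # so even a single example works
```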

  • @hemaswaroop7970
    @hemaswaroop7970 4 years ago

    Thanks, Man!

  • @abhishekp4818
    @abhishekp4818 4 years ago

    @CodeEmporium, could you please tell me why we need to normalize the outputs of the activation function when they are already within a small range (for example, sigmoid ranges from 0 to 1)?
    And if we do normalize them, then how do we compute the updates of its parameters during backpropagation?
    Please answer.

    • @boke6184
      @boke6184 4 years ago

      The activation function should be modifying the predictability of the error or the learning too

  • @elyasmoshirpanahi7184
    @elyasmoshirpanahi7184 1 year ago

    Nice content

  • @sealivezentrum
    @sealivezentrum 3 years ago +1

    fuck me, you explained way better than my prof did

  • @rodi4850
    @rodi4850 4 years ago +2

    Sorry to say, but a very poor video. The intro was way too long, and explaining more of the math and why BN works was left to the last 1-2 minutes.

    • @CodeEmporium
      @CodeEmporium  4 years ago +5

      Thanks for watching till the end. I tried going for a layered approach to the explanation - get the big picture. Then the applications. Then details. I wasn't sure how much more math was necessary. This was the main math in the paper, so I thought that was adequate. Always open to suggestions if you have any. If you've looked at my recent videos, you can tell the delivery is not consistent. Trying to see what works

    • @PhilbertLin
      @PhilbertLin 4 years ago

      I think the intro with the samples in the first few minutes was a little drawn out but the majority of the video spent on intuition and visuals without math was nice. Didn’t go through the paper so can’t comment on how much more math detail is needed.

  • @SAINIVEDH
    @SAINIVEDH 3 years ago

    For RNNs, Batch Normalisation should be avoided; use Layer Normalisation instead.

  • @ai__76
    @ai__76 3 years ago

    Nice animations

  • @abheerchrome
    @abheerchrome 4 years ago

    Great video bro, keep it up.

  • @PierreH1968
    @PierreH1968 3 years ago

    Great explanation, very helpful!

  • @kriz1718
    @kriz1718 4 years ago

    Very helpful!!

  • @akhileshpandey123
    @akhileshpandey123 3 years ago

    Nice explanation :+1

  • @roeeorland
    @roeeorland 1 year ago

    Peter is most definitely not 1.9m
    That’s 6’3

  • @SetoAjiNugroho
    @SetoAjiNugroho 4 years ago

    what about layer norm ?

  • @ajayvishwakarma6943
    @ajayvishwakarma6943 4 years ago

    Thanks buddy

  • @Acampandoconfrikis
    @Acampandoconfrikis 3 years ago

    Hey 🅱eter, did you make it to the NBA?

  • @anishjain8096
    @anishjain8096 4 years ago

    Hey brother, can you please tell me how on-the-fly data augmentation increases the image dataset? In every blog and video they say it increases the data size, but how?

    • @CodeEmporium
      @CodeEmporium  4 years ago

      For images, you would need to make minor distortions (rotation, crop, scale, blur) in an image such that the result is a realistic input. This way, you have more training data for your model to generalize

  • @novinnouri764
    @novinnouri764 2 years ago

    Thanks

  • @SunnySingh-tp6nt
    @SunnySingh-tp6nt 4 months ago

    can I get these slides?

  • @rockzzstartzz2339
    @rockzzstartzz2339 4 years ago

    Why use beta and gamma?

  • @xuantungnguyen9719
    @xuantungnguyen9719 3 years ago

    good visualization

  • @boke6184
    @boke6184 4 years ago

    This is good for ghost box

  • @gyanendradas
    @gyanendradas 4 years ago

    Can you make a video on all types of pooling layers?

    • @CodeEmporium
      @CodeEmporium  4 years ago +1

      Interesting. I'll look into this. Thanks for the idea

  • @GauravSharma-ui4yd
    @GauravSharma-ui4yd 4 years ago

    Awesome, keep going like this

    • @CodeEmporium
      @CodeEmporium  4 years ago +1

      Thanks for watching every video Gaurav :)

  • @alexdalton4535
    @alexdalton4535 3 years ago

    Why didn't Peter make it...

    • @CodeEmporium
      @CodeEmporium  3 years ago

      Clearly the model was wrong

  • @its_azmii
    @its_azmii 4 years ago

    Hey, can you link the graph that you used, please?

  • @eniolaajiboye4399
    @eniolaajiboye4399 3 years ago

    🤯

  • @ahmedelsabagh6990
    @ahmedelsabagh6990 3 years ago

    55555 you get it :) HaHa

  • @sultanatasnimjahan5114
    @sultanatasnimjahan5114 10 months ago

    thanks