Thanks for posting these! With this, you reach a very wide audience and help anyone who does not have access to such teachers and universities! 👏
Yup, that's the plan! 😎😎😎
Hello Ms Coffee beans
@@Navhkrin Hello! ☕
You’re doing a massive favor to the community who want access to high-quality content without paying a huge amount of money. Thank you so much!
You're welcome 😇😇😇
From seeing Yann's name in a research paper during a literature survey in my internship program to attending his lectures is really a thrill. Quite enriching and mathematically profound stuff here. Thanks for sharing it for free!
You're welcome 😊😊😊
Wow! Yann is such a great teacher. I thought I knew this material fairly well, but Yann is enriching my understanding with every slide. It seems to me that his teaching method is extremely efficient. I suppose that's because he has such a deep understanding of the material.
🤓🤓🤓
Thanks very much for the content. What a time to be alive. To hear from the master himself.
💜💜💜
You are a great man. Thanks to you someone even in a third world country can learn DL from one of the inventors himself. THIS IS CRAZY!
😇😇😇
I just can't believe this content is free. Amazing! Long live Open Source! Thank you, Alfredo :)
❤️❤️❤️
Mehn!! these are gold.. especially for people who don't have access to these types of teachers, and methods of teaching, plus the material etc (that's a lot of people actually).
🤗🤗🤗
I have watched this lecture twice in the last year. Mister LeCun is great! :)
Professor / doctor LeCun 😜
At 1:05:40 Yann is explaining the two Jacobians, but I was having trouble getting the intuition. Then I realized that the first Jacobian was getting the gradient to modify the weights w[k+1] for function z[k+1] and the second Jacobian was backpropagating the gradient to function z[k], which can then be used to calculate the gradient at k for yet another Jacobian to adjust weights w[k]. So one Jacobian is for the parameters and the other is for the state, since both the parameter variable and state variable are column vectors. Yann explains it really well. I'm amazed that I seem to be understanding this complicated mix of symbols and logic. Thank you.
👍🏻👍🏻👍🏻
I can totally see how a quantum computer could be used to perform gradient descent in all directions simultaneously, helping to find the true global minimum across all valleys in one go! 😲 It's mind-blowing to think about the potential for quantum computing to revolutionize optimization problems like this!
That is my honor to learn from you and Sir...
Don't forget to subscribe to the channel and like the video to manifest your appreciation.
@@alfcnz Ya, I did that 🤞
🥰🥰🥰
The discussions on stochastic gradient descent (12:23) and on Adam (1:16:15) are great. They address a general misconception.
🥳🥳🥳
Thank you so much for sharing this 🥰 This was the best video for learning gradient descent and backpropagation.
🥳🥳🥳
Thank you so much, Alfredo, for organizing the material in such a nice and compact way for us! The insights of Yann and your examples, explanations and visualization are an awesome tool for anybody willing to learn (or to remember stuff) about deep learning. Greetings from Greece and I owe you a coffee, for your tireless effort.
PS. Sorry for my bad English. I am not a native speaker.
I'm glad the content is of any help.
Looking forward to getting that coffee in Greece. I've never visited… 🥺🥺🥺 Hopefully I'll fix that soon. 🥳🥳🥳
@@alfcnz Easy fix, I'll send a Pull Request in no time!
For coming to Greece? 🤔🤔🤔
I really love that discussion about solving non convex problems.... finally we get out of the books ! At least we unleash our mind.
🧠🧠🧠
This intimate atmosphere allows for a better understanding of the subject matter. Great questions 【ツ】 and of course great answers. Thank you
You're welcome 😁😁😁
Alfredo Canziani ... drinks are on me if you ever visit India ... this is extremely high quality content!
Thanks! I prefer food, though 😅
And yes, I'm planning to come over soon-ish.
Thank you so much for this valuable content. This teaching method is extremely amazing.
Yay! I'm glad you fancy it! 😊😊😊
Just FYI, at 1:01:00 Yann correctly says dc/dzg, but the diagram has dc/zg. Should that also be dc/dwg and dc/dwf?
Great content! It’s just great to have this quality information available
You're welcome 🐱🐱🐱
Thank you for sharing; these help me more and more.
🤓🤓🤓
1:02:09 The "Einstein summation convention" is being used here. The student asking the question is not familiar with this convention, and Yann doesn't seem to realize that the student is unfamiliar with this convention
It’s not. It’s just a vector matrix multiplication.
@@alfcnz Ohhh I see. I was reading ∂c/∂z_f as "the f-th entry of the vector", but it actually denotes the entire vector. Similarly, I was reading ∂z_g/∂z_f as "the (g,f)-th entry of the Jacobian", whereas it actually denotes the entire Jacobian matrix. Sorry, I misread.
Yann's notation for the (i,j)-th entry of the Jacobian matrix is given in the last line of the same slide.
Thank you so much Alfredo for your quick reply above! And thank you so so much for putting these videos on YouTube for everyone!
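To make the "vector times Jacobian matrix" reading in this thread concrete, here is a minimal sketch in plain Python. The map from z_f to z_g and the cost c (the sum of the entries of z_g) are made up for illustration and are not taken from the lecture; the helper names are hypothetical too.

```python
# Illustrative assumption: z_f has 2 entries, z_g has 3 entries, with
#   z_g = (z_f0 * z_f1, z_f0 + z_f1, z_f0 ** 2)
# and the cost is c = sum(z_g), so dc/dz_g is a row vector of ones.

def jacobian(z_f):
    """Return the 3x2 Jacobian dz_g/dz_f; entry (i, j) is dz_g[i]/dz_f[j]."""
    z0, z1 = z_f
    return [
        [z1, z0],         # derivatives of z_f0 * z_f1
        [1.0, 1.0],       # derivatives of z_f0 + z_f1
        [2.0 * z0, 0.0],  # derivatives of z_f0 ** 2
    ]

def vector_times_matrix(row, matrix):
    """Multiply a row vector (length m) by an m x n matrix."""
    n_cols = len(matrix[0])
    return [
        sum(row[i] * matrix[i][j] for i in range(len(row)))
        for j in range(n_cols)
    ]

z_f = [2.0, 3.0]
dc_dzg = [1.0, 1.0, 1.0]  # gradient of c with respect to z_g
dc_dzf = vector_times_matrix(dc_dzg, jacobian(z_f))
print(dc_dzf)  # gradient of c with respect to z_f
```

Differentiating c = z_f0 * z_f1 + z_f0 + z_f1 + z_f0 ** 2 by hand gives the same two numbers, which is the whole point of the slide: one row vector times one matrix, no per-entry bookkeeping needed.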
How do you perturb the output and backprop? Earlier the derivative of cost function was 1. (around 1:50:00)
I've listened to it and there's no mention of backprop at that timestamp.
@@alfcnz Thank you Alfredo. I probably messed up. It's where Yann mentions Q-learning and Deep Mind. I imagine he will cover all this in a later lecture. Thank you for doing all this. Sorry for all the comments. I'm just enjoying this challenging material a lot. I just forked your repo, and I'm starting the first notebook. Cheers!
I see what I did. I gave you the end of video timestamp. My bad. LOL!
It's just after 1:27:00.
Hello. Isn't
ds[0] * dc / ds[0] + ds[1] * dc / ds[1] + ds[2] * dc / ds[2] = 3dc
instead of dc? (At time 41:00)
22:27 how do we ensure convexity...
We don't.
@@alfcnz Yes😅, I just wanted to mention the question which you asked sir and he answered at that time slot. Thanks 🙏
This is wonderful. Thank you ❤️
😃😃😃
Thank you very much
You're very welcome 🐱🐱🐱
Love from India, sir. I really like the discussion & doubt clearing part. Hope to join NYU for my MS in 2023. :)
💜💜💜
51:05 shouldn't it be self.m0(z0) as it takes in the flattened input?
Of course.
I thought that haar-like features were not that recognizable. (1:48:00)
Hello Alfredo, at 1:11:50, where do we have the loops in gradient graph? Is there any prime example? Thanks
That would be a system that we don't know how to handle. Every other connection is permitted.
@@alfcnz 🙏🌿
Does the trick explained in normalizing training samples (01:20:00) also apply to convolutional neural networks?
Indeed.
At the 40:41 section - is the purpose of using back propagation to find the derivative of the cost function wrt z to find the best direction to "move"? I've only gotten through half of the lecture so forgive me if this is answered later
Say z = f(wᵀx). If you know ∂C/∂z, then you can compute ∂C/∂w = ∂C/∂z ∂f/∂w.
Why can't we use counters for the loops in neural nets? Would a loop not make the network more robust in the sense of stabilizing the output?
You need to add a timestamp if you’re expecting an answer to a specific part of the video. Otherwise it’s impossible for me to understand what you’re talking about.
@@alfcnz Sorry, around 34:39 - thanks for replying
About the code in PyTorch... (51:00 in the video)... the code instantiates the mynet class and stores the reference in the model variable... but nowhere does it call the "forward" method... so how does the out variable receive any output from the model object? Is there some PyTorch magic which is not explained here?
Yup. When you call a nn.Module the forward function is called after and before some other stuff.
@@alfcnz Oh yes! My bad... I was distracted.... indeed mynet inherits from nn.Module and I suppose that forward is the implementation of an abstract method.
Correct. 🙂🙂🙂
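For anyone else puzzled by this thread, here is a simplified sketch of the dispatch mechanism in plain Python. It is not PyTorch's actual implementation: the Module and MyNet classes below are hypothetical stand-ins, and the real nn.Module also runs hooks and other bookkeeping around the call (the "some other stuff" mentioned above).

```python
class Module:
    """Hypothetical stand-in for nn.Module, for illustration only."""

    def __call__(self, *args, **kwargs):
        # Calling the object like a function ends up here.
        # The real nn.Module does extra work before and after this line.
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError("subclasses must implement forward")


class MyNet(Module):
    """Toy 'network' that just scales its input."""

    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return [self.scale * value for value in x]


model = MyNet(scale=2.0)
out = model([1.0, 2.0, 3.0])  # no explicit .forward(): __call__ dispatches to it
print(out)
```

So writing model(x) is the usual way to run the network; forward is the method you override, not the one you call directly.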
I don't know if this helps anyone, but it might. Weighted sums like s[0] are always to the first power. There are no squared or cubed weighted sums. So, by the power rule, the derivative of nx is n. The derivative of ws[0] with respect to s[0] is always the weight w. That's why the application of the chain rule is so simple. Here's some more help. If y=2x, y'=2. If q=3y, q'=3; so q(y(x))' = 3 * 2. Picture the graph of q(y(x)). What is the slope? It's 6. And however many layers you add in a neural net, the partial slopes will be products of the weights.
Things get a little more fussy when moving away from the 1D case, though. 😬😬😬
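The 1D case from the comment above (y = 2x, q = 3y) can be checked numerically. This is a small sketch in plain Python; the slope helper is a hypothetical central finite-difference estimate, used here only as a sanity check.

```python
def y(x):
    return 2.0 * x


def q(y_value):
    return 3.0 * y_value


def composite(x):
    # q applied to y(x), i.e. two stacked linear "layers"
    return q(y(x))


def slope(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


analytic = 3.0 * 2.0             # chain rule: q'(y) * y'(x)
numeric = slope(composite, 1.5)  # the point 1.5 is arbitrary
print(analytic, numeric)
```

With vectors instead of scalars, the plain products become matrix products with Jacobians, and a nonlinearity between layers contributes its own derivative factor, which is where things get fussier.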
Following the contours, there are infinitely many values of w that have the same loss. So, do we get the same prediction for all these parameters for which the loss is equal?
So, how can we find out whether there is at least some pattern in our distribution that a model could find? Suppose we want to find the MD5 hash of a string. For this one, we ourselves may know that there isn't any pattern in it, but how can we find that out for any other problem? Thanks
About the notebooks... are there corrections? Or can we send them to you?
Thanks
What notebooks would you want to send to me? 😮😮😮
@Alfredo, This content is amazing. Although I have 2 questions. It would be great if you can help me with it:
Does this mean that in SGD, we are going to compute weight update steps for all the samples(randomly)?
If we perform it on all the samples individually, how is it going to affect the training time? Is it going to increase/decrease as compared to batch GD?
SGD _is_ mini-batch GD
where to get the slides from?
The course website. 😇
Great videos. May I know what L stands for in the video title, e.g. 01L?
Lecture.
@@alfcnz what about the videos which do not have L? It might sound silly but I am so confused 😂
Those are my sessions, the practica. So, they should have a P, if I would want to be super precise.
At the beginning there were only my videos. Yann's videos were not initially going to come online. It's too much work…
Thank you Alfredo. ☺️🤗
Just to clarify, the first code you show defines a model's graph, but it is untrained; so it can't be used yet for inference.
You need to tell me minutes:seconds, or I have no clue what you're asking about.
50:00
Is this the paper to use in order to better understand backprop (the way it is explained in this video)? Or should we read some other work from Yann?
What paper? You need to point out minutes:seconds if you want me to address a specific question regarding the video.
@@alfcnz I forgot to paste the link… I’ll do it later. It's from 1988… I will review the link.
Hi Alfredo, at 20:58 Yann mentioned "objective function need to be Continuous mostly and differentiable almost everywhere". What does he mean? Isn't a differentiable function always continuous? Also, is there a function that is differentiable only in some parts? Can someone give me one example among deep learning functions? Please help me out.
And thanks for these amazing videos!!!
I think he meant that the function has to be continuous everywhere, but it only needs to be differentiable "almost" everywhere. Take the ReLU function max(0,x): it is non-differentiable at x = 0, but it is differentiable elsewhere and continuous everywhere. So if the function is differentiable everywhere, that is awesome, but that is not a necessary condition.
The thing is, it should be continuous so that we can estimate the gradients and there's no break in the function. If there's a break somewhere in your objective function, you can't estimate gradient and your network has no way of knowing what to do.
If I am wrong please do correct.
Yup, Aniket's correct.
Hey, Thanks for the explanation !!
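The ReLU example from this thread can be checked numerically. Here is a rough sketch in plain Python; the one_sided_slope helper and the step size h are illustrative choices, not anything from the lecture.

```python
def relu(x):
    return max(0.0, x)


def one_sided_slope(f, x, h):
    """Difference quotient with a signed step h (h < 0 looks to the left)."""
    return (f(x + h) - f(x)) / h


h = 1e-6

# Continuity at 0: the values just left and just right of 0 are close.
gap = abs(relu(h) - relu(-h))

# Differentiability fails at 0: the two one-sided slopes disagree.
right = one_sided_slope(relu, 0.0, h)
left = one_sided_slope(relu, 0.0, -h)

# Away from 0 the two one-sided slopes agree, so the derivative exists.
right_at_2 = one_sided_slope(relu, 2.0, h)
left_at_2 = one_sided_slope(relu, 2.0, -h)

print(gap, left, right, left_at_2, right_at_2)
```

The kink sits at a single point, which is what "differentiable almost everywhere" allows; in practice frameworks simply pick a value for the slope at exactly x = 0.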
Hi Alfredo! Thank you so much for posting these lectures here! I wanted to know if there's any textbook for this course that I could refer to, along with following the lectures. Thanks :)
Yes, I'm writing it. Hopefully a draft will be available by December. 🤓🤓🤓
@@alfcnz The eagernessssssssssss
Not sure anything will come out _this_ December, though…
I’m hanging in there until the day it does come out. Alfredo, can I mail you about the possibility of PhD supervision?
Uh… are you an NYU student?
Can you please link the reinforcement learning course Yann mentioned? Or at least the name of the author, I couldn't fully understand.
Without telling me minute:second I have no clue what you're talking about.
@@alfcnz Oh right! sorry. It's at 1:18
@@alfcnz Actually after re-listening to it, it sounded a lot clearer. It's the NYU reinforcement learning course by Lerrel Pinto.
Yes, that's correct. 😇😇😇
How can I press the like button more than once?
One of my questions was overlooked ... "What is the difference between lesson x and lesson xL ?" So what is the difference between 01 and 01L for example ?
Lecture and practica. This used to be a playlist of only practica. Then it turned into a full course.
How do libraries like PyTorch or TensorFlow calculate the derivative of a function? Do they calculate the lim (f(x+dx) - f(x))/(dx), or do they just have pre-defined derivatives?
Each function f comes with its analytical derivative f'. Forward calls f, while backward calls f'.
@@alfcnz Actually I asked that before I watched it at 54:30, Regards 🤞
If you remove ' and ", that becomes a link. 🔗🔗🔗
@@alfcnz Cool !
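The reply in this thread ("forward calls f, while backward calls f'") can be illustrated with a toy example in plain Python. This is only a sketch of the idea, not how PyTorch or TensorFlow are implemented; the Square and Sin classes and the example function c = sin(x²) are made up for illustration.

```python
import math


class Square:
    """f(x) = x**2 paired with its analytical derivative f'(x) = 2x."""

    def forward(self, x):
        self.x = x  # remember the input for the backward pass
        return x * x

    def backward(self, grad_output):
        return grad_output * 2.0 * self.x


class Sin:
    """f(x) = sin(x) paired with its analytical derivative f'(x) = cos(x)."""

    def forward(self, x):
        self.x = x
        return math.sin(x)

    def backward(self, grad_output):
        return grad_output * math.cos(self.x)


ops = [Square(), Sin()]  # computes c = sin(x**2)
x = 0.7

# Forward pass: call each f in order.
value = x
for op in ops:
    value = op.forward(value)

# Backward pass: call each f' in reverse order, starting from dc/dc = 1.
grad = 1.0
for op in reversed(ops):
    grad = op.backward(grad)

# Finite-difference check, i.e. the limit formula from the question,
# used here only to verify the result.
h = 1e-6
numeric = (math.sin((x + h) ** 2) - math.sin((x - h) ** 2)) / (2.0 * h)
print(grad, numeric)
```

The finite-difference quotient is handy for checking gradients, but it needs extra function evaluations per input and is only approximate, whereas the stored analytical derivatives give the gradient in one backward sweep.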
What is the difference between lesson x and lesson xL ?
L stands for lecture. Initially I was going to publish only my sessions. Then I added Yann's.
This might be the Lerrel Pinto course he references at 1:26 th-cam.com/video/sKqz9T_F_EU/w-d-xo.html
I didn't get "if batch-size >> num_classes then we are wasting computation". Could someone explain?
You need to add minutes:seconds, or I cannot figure out what you're talking about.
@@alfcnz wow I didn't expect a reply this soon 💜.
My question was from 30:17
Let's say you have exactly 10 images, one per digit. Now clone them 6k times, so you have a data set of size 60k samples (same size as MNIST). Now, if your batch is anything larger than 10, say 20 (you pick two images per digit), for example, you're computing the same gradient twice for no good reason.
Now take the real MNIST. It is certainly not as bad as the toy data set described above, but most images for a given digit look very similar (hopefully so, otherwise it would be impossible to recognise)! So, you're in a very very similar situation.
@@alfcnz oh got it. Thanks for explaining with the intuitive example 🙏
That's Yann's 😅😅😅
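The toy data set argument from this thread can be reproduced in a few lines of plain Python. This sketch assumes a 1D linear model with a squared-error loss and ten made-up (input, target) pairs; none of that comes from the lecture, it only mirrors the "ten images, each cloned" setup.

```python
def grad_sample(w, x, target):
    """Gradient of 0.5 * (w * x - target)**2 with respect to w."""
    return (w * x - target) * x


def mean_grad(w, batch):
    """Average per-sample gradient over a batch of (x, target) pairs."""
    return sum(grad_sample(w, x, t) for x, t in batch) / len(batch)


# Ten distinct samples, standing in for "one image per digit".
unique = [(float(d), 2.0 * d) for d in range(10)]

# A batch of 20 in which every sample appears twice.
doubled = unique + unique

w = 0.5
g_unique = mean_grad(w, unique)
g_doubled = mean_grad(w, doubled)
print(g_unique, g_doubled)
```

The batch of 20 costs twice the computation and yields the same update direction as the batch of 10. Real data sets are not exact clones, so the redundancy is partial rather than total, but the same reasoning applies.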
One reason I agree it's better not to call a unit a "neuron" is the growing acceptance that single neurons in the brain are capable of complex computation via dendritic compartment computation
If this is a question or note about the content, you need to add minutes:seconds, or I have no clue what you're referring to.
@@alfcnz ah, sorry. Was just to add on, at ~ 31:25 when Prof. LeCun explains why people don't like to refer to the units as 'neurons' per se
Cool! 😇😇😇
20:03 Doesn't he contradict himself? First he mentions that smaller batches are better (I assume that by "better" he meant model quality) in most cases, and a few seconds later he says that it's just a hardware matter.
We use mini-batches because we use GPU or other accelerators. Learning wise, we would prefer purely stochastic gradient descent (batch size of 1).
12:55
?
@@alfcnz A timestamp for myself to visit later :)
🤣🤣🤣
I found Yann's boring face when he tried to explain chain rule at 38:43 lol
🤣🤣🤣
2 years later.... in the video...At 54:30.... x still not fixed ... must be s0 .... 🤣🤣🤣🤣🤣🤣🤣🤣🤣 Just a joke ;-)
😭😭😭
@@alfcnz hahaaaaa... Computer experts... What can you do? That's just how we are 😂😂😂😂😂😂😂