Hi, thank you for such a great explanation. I understood the core idea of what you explain. I am not familiar with matrix calculus and derivatives. 12:41 Here I don't really understand what rule are you using for expanding the sum out. If you could point me to some resource online where I can learn this I would be grateful.
Hi. We're able to do this because E is a function of all the y variables. Let's take a simple example:
X = Y1 + Y2
Y1 = 3Z1
Y2 = 2Z1 + Z2
Then,
∂X/∂Z1 = ∂X/∂Y1 * ∂Y1/∂Z1 + ∂X/∂Y2 * ∂Y2/∂Z1 = 1 * 3 + 1 * 2 = 5
Note that it is exactly the same as first expanding the expression of X and then differentiating with respect to Z1:
X = 3Z1 + 2Z1 + Z2 = 5Z1 + Z2
∂X/∂Z1 = 5
It's called the chain rule.
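The small example above can also be checked numerically. This is only a sketch: the helper name `x_of` and the evaluation point are made up for illustration.

```python
# X = Y1 + Y2, with Y1 = 3*Z1 and Y2 = 2*Z1 + Z2 (the example from the reply).
def x_of(z1, z2):
    y1 = 3 * z1
    y2 = 2 * z1 + z2
    return y1 + y2

# Chain rule: dX/dZ1 = dX/dY1 * dY1/dZ1 + dX/dY2 * dY2/dZ1
chain_rule = 1 * 3 + 1 * 2

# Central finite difference at an arbitrary point. X is linear in Z1,
# so the estimate matches up to floating-point rounding.
h = 1e-6
z1, z2 = 0.7, -1.3
finite_diff = (x_of(z1 + h, z2) - x_of(z1 - h, z2)) / (2 * h)

print(chain_rule, round(finite_diff, 6))
```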
This is an amazing video which explains so perfectly how neural networks work. I appreciate and thank you for all the effort and energy you put into this video, and it is a shame that your work has not received the views it deserves. I believe you use manim to make animations like 3b1b, don't you?
I'm guessing what you implemented is stochastic gradient descent, where every epoch, you update parameters for every observation, rather than for the set of all observations? Would your implementation work when back & forward prop take X and Y as arguments, instead of x and y?
You could implement something other than stochastic gradient descent by not updating directly after each sample, but by averaging over many samples and updating then. However, it wouldn't change what you mentioned first, that is, we still have to loop through each data point and we don't take advantage of vectorization. If we wanted to do so, I think we'd need to make each layer accept a batch of inputs instead of a single one, and make sure the layer processes it all at once. But it would have made the video more complicated, and the goal here was to have something very simple, yet somewhat general :)
Correct me if I'm wrong, but doesn't your implementation handle batches of data as well?
def forward(self, input):
    self.input = input
    return np.dot(self.weights, self.input) + self.bias
If input is a matrix instead of a vector, wouldn't the dot product just apply to every column? Same with backprop.
@@Djellowman you're correct, it just so happens that this implementation of the dense layer supports batch input. The activation would also support it since it's just applying a function to the input regardless of its size, and mse_prime in our case would also work out since it's just doing Y*-Y. So I guess here it works! But in the next video where I implement a CNN, I don't think it will, at least I haven't done it intentionally :)
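A quick way to check the batch claim for the forward pass is the sketch below. It assumes samples are stacked as columns and that the bias has shape (output_size, 1), as in the forward method quoted above; the sizes and random values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, output_size, batch_size = 3, 2, 4

weights = rng.standard_normal((output_size, input_size))
bias = rng.standard_normal((output_size, 1))

# One column per sample: shape (input_size, batch_size).
batch = rng.standard_normal((input_size, batch_size))

# Same expression as the quoted forward method; the bias broadcasts over columns.
batch_output = np.dot(weights, batch) + bias

# Feeding each column on its own gives the matching column of the batch output.
for k in range(batch_size):
    column = batch[:, k:k + 1]                      # shape (input_size, 1)
    single_output = np.dot(weights, column) + bias  # shape (output_size, 1)
    assert np.allclose(batch_output[:, k:k + 1], single_output)

print(batch_output.shape)  # (2, 4)
```

The backward pass may need a bit more care: if the bias is updated in place with a gradient of shape (output_size, batch_size), NumPy cannot broadcast that into an (output_size, 1) bias, so the gradient would first have to be summed or averaged over the columns.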
Hi, I'm trying to print the weights after every epoch but I'm not able to do so. Can you help me see what's going wrong with this approach? I simply tried to use the forward method during training:
def predict(network, input, train=True):
    output = input
    for layer in network:
        if layer.__class__.__name__ == 'Dense':
            output = layer.forward(output)
            list_.append(layer.weights)
        else:
            output = layer.forward(output)
However, I get the same corresponding weights all the time.
I think you're getting the same value in the list because layer.weights is a reference. You need to copy it. So just do: list_.append(np.copy(layer.weights))
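A small sketch of why the copy matters, assuming the weights are updated in place (for example with `-=`). The variable names and the update of `1.0` are made up for illustration.

```python
import numpy as np

weights = np.zeros((2, 2))
by_reference = []
by_copy = []

for epoch in range(3):
    weights -= 1.0                    # in-place update: still the same array object
    by_reference.append(weights)      # stores a reference to that one object
    by_copy.append(np.copy(weights))  # stores a snapshot of the current values

print([float(w[0, 0]) for w in by_reference])  # [-3.0, -3.0, -3.0]
print([float(w[0, 0]) for w in by_copy])       # [-1.0, -2.0, -3.0]
```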
In the backward function of the dense class you're returning a matrix which uses the weight parameter of the class after updating it, surely you'd calculate this dE/dX value before updating the weights, and thus dY/dX?
Wow, you are totally right, my mistake! Thank you for noticing (and well caught!). I just updated the code and I'll add a comment on the video :)
I can't add text or some kind of cards on top of the video, so I pinned this comment in the hope that people will notice it!
@@independentcode Why can't you?
Did the youtube developers remove that awesome function too?
No wonder I've felt things have been off for so long!
Can you plz help me with this .. I want a chess ai to teach me what it learnt
th-cam.com/video/O_NglYqPu4c/w-d-xo.html
just curious what happens if we propagate the updated weights backward like in the video? Will it not work? Or will it slowly converge?
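For reference, the ordering discussed in this thread can be sketched as follows. This is a re-creation from the snippets quoted in the comments, not the author's exact file; the attribute names `weights`, `bias` and `input` are assumptions.

```python
import numpy as np

class Dense:
    """Fully connected layer, Y = W X + B (illustrative re-creation)."""

    def __init__(self, input_size, output_size):
        self.weights = np.random.randn(output_size, input_size)
        self.bias = np.random.randn(output_size, 1)

    def forward(self, input):
        self.input = input
        return np.dot(self.weights, self.input) + self.bias

    def backward(self, output_gradient, learning_rate):
        weights_gradient = np.dot(output_gradient, self.input.T)
        # dE/dX must use the weights from the forward pass,
        # so it is computed before the update below.
        input_gradient = np.dot(self.weights.T, output_gradient)
        self.weights -= learning_rate * weights_gradient
        self.bias -= learning_rate * output_gradient
        return input_gradient

layer = Dense(3, 2)
old_weights = layer.weights.copy()
layer.forward(np.ones((3, 1)))
gradient = np.ones((2, 1))
returned = layer.backward(gradient, learning_rate=0.1)
print(np.allclose(returned, old_weights.T @ gradient))  # True
```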
This video, instead of the plethora of other videos on "hOw tO bUiLd A NeUrAl NeTwOrK fRoM sCraTcH", is the literal best. It deserves 84 M views, not 84 k views. It is straight to the point, no 10 minutes explanation of pretty curves with zero math, no 20 minutes introduction on how DL can change the world
I truly mean it, it is a refreshing video.
I appreciate the comment :)
@@independentcode Thank you for the reply! I am a researcher, and I wanted to create my own DL library, using yours as base, but expanding it for different optim algorithms, initializations, regularizations, losses etc (i am now just developing it on my own privately), but one day I'll love to post it on my github. How can I appropriately cite you?
That's a great project! You can mention my name and my GitHub profile: "Omar Aflak, github.com/omaraflak". Thank you!
I like how he said he wouldn’t explain how a neural network works, then proceeds to explain it
This might be the most intuitive explanation of the backpropagation algorithm on the Internet. Amazing!
Probably the best explanation of neural networks on YouTube! The voice and the background music are really soothing!
True
Not only was the math presentation very clear, but the Python class abstraction was elegant.
I love the 3b1b style of animation and also the consistency with his notation, this allows people to learn the matter with multiple explanations while not losing track of the core ideas. Awesome work man
The best tutorial on neural networks I've ever seen! Thanks, you have my subscription!
THANK YOU !
This is exactly the video I was looking for.
I always struggled with making a neural network, but following your video, I made a model that I can generalize and it made me understand exactly the mistakes I made in my previous attempts.
It's easy to find on youtube videos of people explaining singular neurons and backpropagation, but then quickly going over the hard part: how do you compute the error in an actual network, the structural implementation and how it all ties together.
This approach with separating the Dense layer from the activation layer also makes things 100x clearer, and many people end up smacking them both in the same class carelessly.
The visuals make the intuition for numpy also much much easier. It's always a thing I struggled with and this explained why we do every operation perfectly.
even though I was only looking for one video, after seeing such quality, I HAVE to explore the rest of your channel ! Great job.
Thank you so much for taking the time to write this message! I went through the same struggle when I wanted to make my own neural networks, which is exactly why I ended up doing a video about it! I'm really happy to see that it serves as I intended :)
This is an unbelievably clear and concise video. It answers all of the questions that linger after watching dozens of other videos. WELL DONE!!
Thanks for making such great quality videos. I'm working on my Ph.D., and I'm writing a lot of math regarding neural networks. Your nomenclature makes a lot of sense and has served me a lot. I'd love to read some of your publications if you have any.
I have been struggling with backpropagation in MLPs for 2 weeks, and when I was searching for a video that could help me understand the process mathematically, this video grabbed my attention. With it I was able to understand the whole process both conceptually and mathematically. The code you give was actually the same code our mentor gave us, but he was unable to explain it clearly. The animations shown in the video were really great. Finally, thank you for posting this video!!!! 🛐 I CAN ADVANCE IN MY PROJECT FURTHER!!!
by far, the best video of this topic that I saw in the whole platform
Very clean and pedagogical explanation. Thanks a lot!
Best tutorial video about neural networks i've ever watched. You are doing such a great job 👏
This was the best mathematical explanation on YouTube. By far.
This could be 3Blue1Brown for programmers! You got yourself a subscriber! Great video!
I'm very honored you called me that. I'll do my best, thank you !
+1
@@independentcode +1 sub
Thank you so very, very, very much for this video. I have been wanting to do Machine Learning, but without "Magic". It drives me nuts when all the tutorials say "From Scratch" and then proceed to open Tensor Flow. Seriously, THANK you!!!
I feel you :) Thank you for the comment, it makes me genuinely happy.
jesus christ this is a good video and shows clear understanding. no "i've been using neural networks for ten years, so pay attention as i ramble aimlessly for an hour" involved
I think the last row's indices of the W^T matrix at 17:55 must be (w1i, w2i,...,wji).
Still the best explanation I have ever seen btw, thank you so much. I don't know why this channel is still so underrated, looking forward to seeing your new videos in the future
Yeah I know, I messed it up. I've been too lazy to add a caption on that, but I really should. Thank you for the kind words :)
I just watched your CNN video, the next one and I couldn't resist watching this one. Although I knew most things in this video, watching everything work from scratch felt amazing.
This video really saved me. From matrix representation to chain rule and visualisation, everything is clear now.
This is basically ASMR for programmers
I almost agree, the only difference is that I can’t sleep thinking about it
@@nikozdevbruh I fall asleep and allow my self to hallucinate in math lol
I felt relaxed, definitely :D
This is such high quality content. I have only basic knowledge of linear algebra, and even as a non-native speaker I could fully understand this
Man, I love you. So many times I tried to do the multilayer NN on my own, but always faced thousands of problems. But this video explained everything. Thank you
This video is the best on YouTube for Neural Networks Implementation!
Absolutely astonishing quality sir. Literally on the 3b1b level. I hope this will help me pass the uni course. SUB!
This is the best channel for learning deep learning!
One of the best videos I have ever seen.
I struggled a lot to understand this and you have explained it so beautifully.
You made me fall in love with neural networks, which I was intimidated by.
thank you so much.
Thank you for your message, it genuinely makes me happy to know this :)
It is the best one I've seen among the explanation videos available on YouTube!
Well done!
Thank you very much for your videos explaining how to build ANN and CNN from scratch in Python: your explanations of the detailed calculations for forward and backward propagation and for the calculations in the kernel layers of the CNN are very clear, and seeing how you have managed to implement them in only a few lines of code is very helpful in 1. understanding the calculations and processes, 2. demystifying what is a black box in tensorflow / keras.
I've taken inspiration from your code and cited your channel in my neural network paper for a college project. I'm just letting you know this here and hope that you won't mind.
Btw, thank you so much for the video. 3blue1brown's series on neural networks is great and all, but it is your video that makes all the computations really sink in and make actual sense. Representing the gradients as linear algebra operations just ties everything together so neatly, compared to individual derivative formulas for the weights and bias, which is how it's usually written. And the choice of separating the dense layers and the activation layers was, to put it mildly, fucking brilliant.
Of course! Thank you for the kind words :)
This is such an elegant and dynamic solution. Subbed!
There are many solutions on the internet...but i must say this one is the best undoubtedly...👍 cheers man...pls keep posting more.
This is a very good approach to building neural nets from scratch.
Amazing approach ! Very well explained. Thanks!
I loved the background music. It gives peace of mind. I hope you will continue to make videos, very clear explanation
Thank you, that's the best video I have ever seen about neural networks!!!!! 😀
You are the only youtuber I sincerely want to return. We miss you!
this has to be the single best neural network explaining video I have ever watched
best video, very clear-cut. Finally I got the backpropagation and derivatives.
Impressive, lot of information but remains very clear ! Good job on this one ;)
Very well-done. I appreciate the effort you put into this video. Thank you.
Thank you so much, my assignment was so unclear, this definitely helps!
Such a great video. Really helped me to understand the basics.
How is the output gradient calculated and passed into the backward function?
This is the best video i have seen so far ❤
That was incredibly explained and illustrated. Thanks
Thank you! I'm glad you liked it :)
@@independentcode Most welcome!
This is really dope. The best by far. Subscribed right away
Only 4 videos and you have above 1k subs,
Please continue your work 🙏🏼
Finally found the treasure. Please do more video bro. SUBSCRIBED
Whyyyy you don't have 3Million subscriptions you deserve it ♥️♥️
actually,you saved my life, thanks for doing these
This is literally a masterpiece
I developed my first neural network in one night yesterday. that could not learn because of backward propagation, it was only going through std::vectors of std::vectors to get the output. I was setting weights to random values and tried to guess how to apply backward propagation from what i have heard about it.
But it failed to do anything, kept guessing just as I did, giving wrong answers anyway.
This video has a clean comprehensive explanation of the flow and architecture. I am really excited how simple and clean it is.
I am gonna try again.
Thank you.
I did it ! Just now my creature learnt xor =D
Wonderful, informative, and excellent work. Thanks a zillion!!
This is so ASMR and well explained!
you are the best 🥺❤️..wow.. finally I'm able to understand the basics, thanks
your voice is calming and relaxing, sorry if that is weird
Haha thank you for sharing that :) Maybe I should have called the channel JazzMath .. :)
Thank you for really great explanation!
Wish you will make even more 😉
I think most ML PhDs aren't aware of this abstraction. Simply the best.
I don't know about PhDs since I am not a PhD myself, but I never found any simple explanation of how to make such an implementation indeed, so I decided to make that video :)
@@independentcode I think you should keep the video series going and show how capable this type of abstraction is: easily implementing almost every type of neural net.
Thank you for the kind words. I did actually take that a step further, it's all on my GitHub here: github.com/OmarAflak/python-neural-networks
I managed to make CNNs and even GANs from scratch! It supports any optimization method, but since it's all on CPU you get very quickly restricted by computation time. I really want to make series about it, but I'll have to figure out a nice way to explain it without being boring since it involves a lot of code.
@@independentcode GANs would be great also you could try to do RNNs too and maybe even some reinforcement learning stuff :D
Big Fan of you from today !
Amazing explanation!!
I have a question, while backpropagating in Activation Layer, why are we ignoring the learning rate in the implementation? 22:07
The learning rate is used when we update trainable parameters (weights & biases). In the activation layer there is no parameter to update, we simply return the input gradient to the previous layer.
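The point in the reply above can be illustrated with a minimal activation layer. This is a sketch with assumed names (`Activation`, `activation_prime`): `backward` accepts a learning rate only to keep the same signature as trainable layers and never uses it.

```python
import numpy as np

class Activation:
    """Applies a function element-wise; it has no trainable parameters."""

    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    def forward(self, input):
        self.input = input
        return self.activation(self.input)

    def backward(self, output_gradient, learning_rate):
        # Nothing to update here, so learning_rate is ignored.
        # dE/dX = dE/dY * f'(X), element-wise.
        return np.multiply(output_gradient, self.activation_prime(self.input))

tanh = Activation(np.tanh, lambda x: 1 - np.tanh(x) ** 2)
x = np.array([[0.0], [1.0]])
tanh.forward(x)
g = np.ones((2, 1))

# The returned gradient is the same whatever learning rate is passed in.
print(np.allclose(tanh.backward(g, 0.1), tanh.backward(g, 100.0)))  # True
```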
Thank you! Well done! Absolutely wonderful video.
In your code you compute the gradient step for each sample and update immediately. I think that this is called stochastic gradient descent.
To implement full gradient descent where I update after all samples I added a counter in the Dense Layer class to count the samples.
When the counter reached the training size I would average all the stored nudges for the bias and the weights.
Unfortunately when I plot the error over epoch as a graph there are a lot of spikes (less spikes than when using your method) but still some spikes.
My training data has (x,y) and tries to find (x+y).
Would you be able to share the code? This is the part where I'm confused.
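This is not the original commenter's code, but the idea described above (store the per-sample gradients, then apply one averaged update per epoch) might look roughly like the sketch below. It assumes a single linear layer learning x + y with mean squared error; `AccumulatingDense` and `apply_update` are hypothetical names, and the learning rate and data are made up.

```python
import numpy as np

class AccumulatingDense:
    """Hypothetical Dense variant: stores gradients and updates on request."""

    def __init__(self, input_size, output_size, rng):
        self.weights = rng.standard_normal((output_size, input_size))
        self.bias = np.zeros((output_size, 1))
        self.weights_grad_sum = np.zeros_like(self.weights)
        self.bias_grad_sum = np.zeros_like(self.bias)
        self.count = 0

    def forward(self, input):
        self.input = input
        return np.dot(self.weights, input) + self.bias

    def backward(self, output_gradient):
        # Accumulate instead of updating right away.
        self.weights_grad_sum += np.dot(output_gradient, self.input.T)
        self.bias_grad_sum += output_gradient
        self.count += 1
        return np.dot(self.weights.T, output_gradient)

    def apply_update(self, learning_rate):
        # One step with the averaged gradient, then reset the accumulators.
        self.weights -= learning_rate * self.weights_grad_sum / self.count
        self.bias -= learning_rate * self.bias_grad_sum / self.count
        self.weights_grad_sum[:] = 0
        self.bias_grad_sum[:] = 0
        self.count = 0

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2, 1))  # samples (x, y) as column vectors
Y = X.sum(axis=1, keepdims=True)         # targets x + y, shape (50, 1, 1)

layer = AccumulatingDense(2, 1, rng)
errors = []
for epoch in range(200):
    error = 0.0
    for x, y in zip(X, Y):
        output = layer.forward(x)
        error += float(np.mean((y - output) ** 2))
        layer.backward(2 * (output - y) / y.size)  # derivative of the MSE
    layer.apply_update(learning_rate=0.1)
    errors.append(error / len(X))

print(errors[0], errors[-1])
```

On this linear toy problem with a small learning rate, the per-epoch error should go down smoothly. With non-linear layers or a larger learning rate, some spikes can still appear even with full-batch updates.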
Very nice and clean video, keep it up
At 12:42, I didn't understand why you had to take the sum.
We want to calculate dE/dw12, and if I am understanding it correctly, it is the derivative of error wrt our layer's 1st neuron's 2nd weight(w12). So it should be simply dE/dy1 * dy1/dw12, since the output of that neuron is just y1. If we can get it directly, then why did we take the sum to arrive here? Am I missing something?
Hi Biraj. I'm showing the sum as it would be the most general/repeatable way of proceeding for any of the derivatives, but you are right: if you can see immediately that w12 only appears in y1, then don't bother doing the sum. When I say repeatable, I mean what if the derivative was with respect to x2 for instance ? Then you would need to take into account all the y variables. But it might become confusing to some of the viewers why we proceed in one way in one case and in another for some other case. That's why I like to show the sum as a first systematic step. I hope it makes sense!
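The two cases in this exchange can be compared numerically. The sketch below assumes a layer with 3 outputs and 2 inputs and a made-up output gradient: for w12 only one term of the sum is non-zero, while for x2 every output contributes.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 2))      # 3 outputs (y), 2 inputs (x)
x = rng.standard_normal((2, 1))
dE_dY = rng.standard_normal((3, 1))  # assumed to be given by the next layer

# w12 only appears in y1 = w11*x1 + w12*x2 + b1, so a single term survives:
dE_dw12 = dE_dY[0, 0] * x[1, 0]

# x2 appears in every y_i (with coefficient w_i2), so all terms are needed:
dE_dx2 = sum(dE_dY[i, 0] * W[i, 1] for i in range(3))

# Both agree with the matrix forms of the gradients.
print(np.isclose(dE_dw12, (dE_dY @ x.T)[0, 1]))  # True
print(np.isclose(dE_dx2, (W.T @ dE_dY)[1, 0]))   # True
```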
Keep it up .please make a deep learning and ml series for future.
Hi there, great video, super helpful, but at 19:21 line 17 the gradient is computed with the updated weights instead of the original weights which (I believe) caused some exploding/vanishing gradient problems for my test data (iris flower dataset). Fixing that solved all my problems. If I am wrong please let me know.
Note: I used leaky RELU as activation function
Hello, how did you fix this issue?
That was helpful, thank you so much.
Great video! At 17:45, last row of matrix W' (transpose of W), subscript got a bit messed up. w_1j, w_2j and w_ij should be w_1i, w_2i and w_ji, i.e., j rows and i columns.
Without any doubt the best explanation of NNs I've ever seen. Why did you stop producing videos, my friend?
Dude this is amazing
Content at its peak
after 1000 videos watched, i think i get it now, thanks
Thank you so much for your contribution in this field.
Awesome man!!
This is one of the best videos to really understand the vectorized form of neural networks! Really appreciate the effort you've put into this.
Just as a clarification, the video is considering only 1 data point and thereby performing SGD, so during the MSE calculation Y and Y* are in a way depicting multiple responses at the end for 1 data point only right? So for MSE it should not actually be using np.mean to sum them up?
I love u , best ML video ever
amazing video. one thing we could do is to have layers calculate inputs automatically if possible. Like if I give Dense(2,8), then the next layer I dont need to give 8 as input since its obvious that it will be 8. Similar to how keras does this.
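One possible way to get that behaviour is a small helper that chains the sizes, so only the first input size is given explicitly. This is just a sketch: `dense_shapes` is a hypothetical name, not part of the video's code or of Keras.

```python
def dense_shapes(input_size, output_sizes):
    """Return (input_size, output_size) pairs for a chain of dense layers."""
    shapes = []
    previous = input_size
    for size in output_sizes:
        shapes.append((previous, size))
        previous = size  # the next layer's input is this layer's output
    return shapes

print(dense_shapes(2, [8, 3, 1]))  # [(2, 8), (8, 3), (3, 1)]
```

Each pair could then be passed to a layer constructor, for example `Dense(*shape)`, assuming a `Dense(input_size, output_size)` signature like the `Dense(2,8)` call mentioned above.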
why do we use the dot product function for matrix multiplication? i thought that those did different things
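For 2-D arrays, `np.dot` performs matrix multiplication, the same as the `@` operator. It only reduces to the scalar dot product when both arguments are 1-D. A short illustration with made-up values:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
x = np.array([[5], [6]])

# 2-D inputs: np.dot and @ both do matrix multiplication.
print(np.dot(A, x).ravel().tolist())  # [17, 39]
print((A @ x).ravel().tolist())       # [17, 39]

# 1-D inputs: np.dot is the usual scalar (inner) product.
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
print(int(np.dot(u, v)))  # 32
```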
Hi,
I'm naive to math and coding, can anyone explain where in the def backward part of the dense layer code the derivatives are computed? The video explains that the derivatives are there, but I was expecting to see a function to compute it. Where exactly does the derivative appear there?
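One way to see where the derivative "lives": the matrix product in `backward` is the closed-form result of the differentiation, so no separate derivative function is called. The sketch below assumes a dense layer followed directly by a mean squared error and compares that closed form with a brute-force finite-difference estimate; all values are random and for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((2, 3))
B = rng.standard_normal((2, 1))
X = rng.standard_normal((3, 1))
Y_true = rng.standard_normal((2, 1))

def error(weights):
    Y = weights @ X + B
    return np.mean((Y_true - Y) ** 2)

# Closed form, as computed in a dense layer's backward pass.
Y = W @ X + B
dE_dY = 2 * (Y - Y_true) / Y.size  # derivative of the MSE
dE_dW = dE_dY @ X.T                # the "weights gradient"

# Brute force: nudge each weight and measure how the error changes.
h = 1e-6
numeric = np.zeros_like(W)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        W_plus = W.copy()
        W_plus[r, c] += h
        W_minus = W.copy()
        W_minus[r, c] -= h
        numeric[r, c] = (error(W_plus) - error(W_minus)) / (2 * h)

print(np.allclose(dE_dW, numeric, atol=1e-6))  # True
```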
Clear, to the point. Thank you. Like (because there are just 722, and have to be a lot more)
how can we update this to include mini-batch gradient descent? Especially how will the equations change?
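One common way to write the mini-batch version, sketched under the assumption that samples are stacked as columns and that the batch gradient is the average of the per-sample gradients. The names `G`, `dW_batch` and so on are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5                             # batch size
W = rng.standard_normal((2, 3))
X = rng.standard_normal((3, m))   # one input column per sample
G = rng.standard_normal((2, m))   # per-sample dE/dY, one column per sample

# Batched equations.
dW_batch = G @ X.T / m                    # averaged weights gradient
dB_batch = G.mean(axis=1, keepdims=True)  # averaged bias gradient
dX_batch = W.T @ G                        # one input-gradient column per sample

# Same quantities obtained sample by sample.
dW_avg = sum(G[:, k:k + 1] @ X[:, k:k + 1].T for k in range(m)) / m
dB_avg = sum(G[:, k:k + 1] for k in range(m)) / m

print(np.allclose(dW_batch, dW_avg), np.allclose(dB_batch, dB_avg))  # True True
```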
This video is godsend, thank you.
Amazing tutorial!
When I checked the output of the dense layer I was getting an array of size (output size, output size) instead of (output size, 1). Later I was told it's due to broadcasting. I don't know what that is. But when I changed the bias shape from (output size, 1) to (output size) I got the result with shape (output size, 1)
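A likely cause of the (output size, output size) result is a 1-D input combined with a column-vector bias. This is a guess at the situation, shown with made-up sizes:

```python
import numpy as np

output_size, input_size = 3, 2
weights = np.ones((output_size, input_size))
bias = np.zeros((output_size, 1))

flat_input = np.ones(input_size)         # shape (2,)
column_input = np.ones((input_size, 1))  # shape (2, 1)

# (3,) + (3, 1) broadcasts to (3, 3): each bias row is added to the whole vector.
print((np.dot(weights, flat_input) + bias).shape)    # (3, 3)

# (3, 1) + (3, 1) stays (3, 1), which is the intended result.
print((np.dot(weights, column_input) + bias).shape)  # (3, 1)
```

Reshaping each input sample to a column, for example with `x.reshape(-1, 1)`, keeps the bias as (output size, 1) and avoids the surprise.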
I followed the code exactly, and I still get Numpy shape errors.
I would like it a lot if you continued your channel, bro
When looking at the error and its derivative wrt some y[i], intuitively I would expect that if I increased y[i] by 1 the error would increase by dE/dy[i]. But if I do the calculation, the change in the error is 1/n off from the derivative. Does this make sense?
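It does make sense: the derivative is the rate of change for an infinitely small step, and a step of 1 is not small. For the mean squared error, a step of size h in y[i] changes the error by h * dE/dy[i] + h^2 / n, so with h = 1 the extra term is exactly 1/n. A numeric sketch, assuming mse and mse_prime are defined as below (the names and the sample values are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / np.size(y_true)

y_true = np.array([[1.0], [0.0], [2.0], [3.0]])
y_pred = np.array([[0.5], [0.2], [1.0], [3.5]])
n = np.size(y_true)

def change_per_unit_step(h, i=0):
    # (E(y + h * e_i) - E(y)) / h for a step of size h in component i
    bumped = y_pred.copy()
    bumped[i, 0] += h
    return (mse(y_true, bumped) - mse(y_true, y_pred)) / h

grad = mse_prime(y_true, y_pred)[0, 0]
gap_big = change_per_unit_step(1.0) - grad     # equals h / n = 1 / n
gap_small = change_per_unit_step(1e-6) - grad  # shrinks toward 0 with h
```

The gap scales with the step size, which is why gradient descent multiplies the derivative by a small learning rate.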
This video should be the first video you see when you search neural network.
In tensorflow they use weight matrix W dimensions i x j then take transpose in calculation.
18:18, about W transpose: it should be w11, w12, ..., w1i, column-wise. It's an i by j matrix. Am I right?
Wow I messed up the last row! It should have been (W1i, W2i, ..., Wji) !!
The matrix W itself is of size (i, j), the transposed matrix is (j, i).
@@independentcode Makes sense, it's W transpose, j by i, and just a little typo. Thanks again for this great tutorial
Is there a practical reason why the activation functions are implemented as layers, rather than the other layers, such as Dense, taking the activation function as an argument & applying it internally?
Yes, for simplicity. If you apply the activation inside the layer, then that layer will also have to account for the activation during backward propagation. And the Dense layer is not the only layer that might use an activation, so will you implement it in every such layer? That's why it's a separate thing
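To illustrate the point, here is a minimal sketch of what a standalone activation layer could look like. The class name Activation, its constructor arguments, and the tanh example are assumptions for illustration; the interface mirrors the forward/backward methods discussed in this thread:

```python
import numpy as np

class Activation:
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    def forward(self, input):
        self.input = input
        return self.activation(self.input)

    def backward(self, output_gradient, learning_rate):
        # No trainable parameters: scale the incoming gradient by f'(x), element-wise
        return np.multiply(output_gradient, self.activation_prime(self.input))

# Example: a tanh activation built from the generic layer
tanh_layer = Activation(np.tanh, lambda x: 1 - np.tanh(x) ** 2)
```

Because the layer only needs the function and its derivative, it can follow a Dense layer, a convolutional layer, or anything else without those layers knowing about it.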
@@independentcode That's a good point. Although i suppose you could implement the activation function handling for both forward and backwards propagation in the base Layer class, right? I'm asking this because I started working on a project where I build a Dense neural net to classify some data, but I decided I might as well build a little neural net library. Your video made me think about creating a better design. I first passed the architecture of the network as a list of layer_lengths to a DenseNeuralNet class. I prefer your design of making a base Layer class that will function as an abstract base class, and specifying separate layer objects, as it's more modular than my initial design.
Hi, thank you for such a great explanation. I understood the core idea of what you explain. I am not familiar with matrix calculus and derivatives.
12:41 Here I don't really understand what rule are you using for expanding the sum out. If you could point me to some resource online where I can learn this I would be grateful.
Hi. We're able to do this because E is a function of all the y variables. Let's take a simple example:
X=Y1+Y2
Y1=3Z1
Y2=2Z1+Z2
Then,
∂X/∂Z1 = ∂X/∂Y1 * ∂Y1/∂Z1 + ∂X/∂Y2 * ∂Y2/∂Z1
= 1 * 3 + 1 * 2
= 5
Note that it is exactly the same as first expanding the expression for X and then differentiating with respect to Z1:
X=3Z1+2Z1+Z2
=5Z1+Z2
∂X/∂Z1=5
It's called the chain rule.
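The same example can be checked numerically. This sketch compares the chain-rule sum with a finite-difference estimate; the function name X_of_Z and the evaluation point (Z1 = 1, Z2 = 4) are arbitrary choices for illustration:

```python
def X_of_Z(z1, z2):
    y1 = 3 * z1
    y2 = 2 * z1 + z2
    return y1 + y2

# Partial derivatives read off the definitions above
dX_dY1, dX_dY2 = 1.0, 1.0
dY1_dZ1, dY2_dZ1 = 3.0, 2.0

# Chain rule: sum over every path from Z1 to X
chain_rule = dX_dY1 * dY1_dZ1 + dX_dY2 * dY2_dZ1

# Finite-difference estimate of dX/dZ1 at (Z1, Z2) = (1, 4)
h = 1e-6
numeric = (X_of_Z(1.0 + h, 4.0) - X_of_Z(1.0, 4.0)) / h
```

Both values agree up to floating-point error, which is the same reasoning that lets the sum over all y variables be expanded at 12:41.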
@@independentcode Thank you very much. I understand now. Because E is the mean squared error it's a sum of terms that involves y variables.
This is an amazing video which explains so perfectly how neural networks work. I appreciate and thank you for all the effort and energy you put into this video, and it is a shame that your work has not received the views it deserves. I believe you use Manim to make animations like 3b1b, don't you?
Thanks a lot for the kind comment 😌 I'm glad if the video helped you in any way :) Yes it is indeed Manim!
sir please keep up with your videos I learn a lot
would you mind sharing the manim project for this video?
Is there a way to feed it all our data at once, instead of going through the entire forward & backward prop for every datapoint?
I'm guessing what you implemented is stochastic gradient descent, where every epoch, you update parameters for every observation, rather than for the set of all observations? Would your implementation work when back & forward prop take X and Y as arguments, instead of x and y?
You could implement something other than stochastic gradient descent by not updating directly after each sample, but by averaging the gradients over many samples and updating then. However, it wouldn't change what you mentioned first: we would still have to loop through each data point, and we wouldn't take advantage of vectorization. If we wanted to do so, I think we'd need to make each layer accept a batch of inputs instead of a single one, and make sure the layer processes it all at once. But it would have made the video more complicated, and the goal here was to have something very simple, yet somewhat general :)
Correct me if I'm wrong, but doesn't your implementation handle batches of data as well?
def forward(self, input):
    self.input = input
    return np.dot(self.weights, self.input) + self.bias
If input is a matrix instead of a vector, wouldn't the dot product just apply to every column? Same with backprop
@@Djellowman you're correct, it just so happens that this implementation of the dense layer supports batch input. The activation would also support it since it's just applying a function to the input regardless of its size, and mse_prime in our case would also work out since it's just doing Y*-Y. So I guess here it works! But in the next video where I implement a CNN, I don't think it will, at least I haven't done it intentionally :)
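A quick way to check the forward-pass claim, assuming one sample per column: run the same forward computation on a whole batch and on each column separately, then compare. The weights, bias, and batch here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((3, 2))
bias = rng.standard_normal((3, 1))

def forward(x):
    # Same expression as the Dense layer's forward method
    return np.dot(weights, x) + bias

batch = rng.standard_normal((2, 5))  # 5 samples, one per column
batched_out = forward(batch)         # bias broadcasts across the columns

# Feed the columns one at a time and stitch the results back together
one_by_one = np.hstack([forward(batch[:, [k]]) for k in range(5)])

print(np.allclose(batched_out, one_by_one))
```

This only checks the forward pass. In the backward pass the weight and bias updates would receive gradients with a batch dimension, so they may need summing or averaging over the columns first.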
Hi, I'm trying to print the weights after every epoch but I'm not able to do so. Can you help me see what's going wrong with this approach? I simply tried to use the forward method during training:
def predict(network, input, train=True):
    output = input
    for layer in network:
        if layer.__class__.__name__ == 'Dense':
            output = layer.forward(output)
            list_.append(layer.weights)
        else:
            output = layer.forward(output)
However, I get the same weights every time.
I think you're getting the same value in the list because layer.weights is a reference. You need to copy it. So just do: list_.append(np.copy(layer.weights))
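A tiny demonstration of the difference, assuming the weights are updated in place (with -=) during training. The loop below is a stand-in for the training loop, not the actual one:

```python
import numpy as np

weights = np.zeros((2, 2))
by_reference, by_copy = [], []

for epoch in range(3):
    weights -= 0.5                    # in-place update, like a gradient step
    by_reference.append(weights)      # stores the same array object each time
    by_copy.append(np.copy(weights))  # stores a snapshot of the current values

print([w[0, 0] for w in by_reference])
print([w[0, 0] for w in by_copy])
```

Every entry in by_reference points at the one array that keeps changing, so they all show the final values, while by_copy keeps the per-epoch history.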