I was expecting a duplicate of many other neural network videos, but this was a perspective that I have not seen before! Awesome Video!
BRUH. This video gave me that amazing feeling when something clicks in your brain and everything all of a sudden makes sense! Thank you I have never seen neural networks explained in this way before.
Copied🗿
exactly !
This is the best video I have ever seen on the internet that describes what a neural network actually is.
The best and most powerful explanations are those that give you the intuitive meaning behind the math, and this video does it perfectly.
When a video describes a neural network by jumping into matrices and talking about subscript i's and j's, it's just covering the mechanics and does absolutely nothing to help you understand what you're reading.
Unfortunately, this is how most textbooks approach the subject, and it's also how many content creators approach it as well.
This type of video only comes from someone who understands things so deeply that they're able to explain them in a way that involves almost zero math.
I consider this video one of the true treasures of TH-cam involving artificial intelligence education.
“Do neural networks work because they reason like a human? No. They work because they fit the data.” You should have added “boom, mic drop.” Excellent video!
Can't say I agree. I really liked the video as a whole, but that "drop" was the worst part of the video to me, since it's a bit of a strawman, for at least two reasons:
- knowing what a complex system does "at a foundational level" is very far from allowing you to understand the system. After all, Biology is "just" applied Chemistry which in turn is "just" applied Physics, but good luck explaining any complex biological system from physical principles alone.
- much of what humans do doesn't use "reason" at all. A few years back I decided to start learning Japanese. And I recall that for the first few months of listening to random native Japanese speakers I'd have trouble even correctly identifying the syllables of their words. But after some time and more exposure to the sounds, grammar, and speech patterns, that gradually improved. Yet that improvement had little to do with me *reasoning* about the language, and was largely an unconscious process of my brain getting better at pattern recognition in the language.
At least when it comes to "pattern recognition" I see no compelling reason to declare that humans (and animals, for that matter) are doing anything fundamentally different from neural networks.
My comments about neural networks reasoning were in response to some of the recent discussions about large language models being conscious. My impression is that these discussions give people a wildly inaccurate view of what neural networks actually do. I just wanted to make it clear that all neural networks do is curve fitting.
Sure you can say "neural networks are a function that map inputs to outputs" and "humans are a function that map inputs to outputs", therefore they are fundamentally doing the same thing. But there are important differences between humans and neural networks. For one thing, in the human's case the function is not learned by curve fitting. It is learned by Bayesian inference. Humans are born with an incredible amount of prior knowledge about the world, including what types of sounds human language can contain. This is why you were able to learn to recognize Japanese sounds in a few months, where it would take a neural network the equivalent of thousands of years worth of examples.
If you want to say that neural networks are doing the same thing as humans that's fine, but you should equally be comfortable saying that random forests are doing the same thing as humans.
@@algorithmicsimplicity Whatever mechanism underlies human cognition, if it begets the same results as a neural network, then it can be said to also "merely" perform curve fitting. Whether that can also be described in terms of Bayesian inference would not invalidate that. Similarly, it is not helpful stating there's nothing to understand or use as a model in neurobiology since it is just atoms minimizing energy states.
Why ruin a good story with the truth?
@@LuisPereira-bn8jq aren't you making a strawman yourself?
Also wouldn't your language example still count as his "learning abstract hierarchies and concepts"?
This is THE best intuition behind neural networks I have ever seen. Thanks for the great video!
Getting some bit of math history context makes these extra enjoyable. Great video, explanation, and visualization.
I have been working with machine learning models for years and this is the first time i have truly understood through visualisation the use of ReLU activation functions! Great video
This video should be THE introduction to neural networks - well done!
Oh my lord. I've been struggling with neural networks for awhile and I've always felt like I have a decent grasp on them but this video finally brought everything together. Beautiful introduction
Wow. I finally understood the greatness of ReLU through your intuitive animation.
Simply and elegantly explained. The bit at the end was superb.
Except this just described one activation function, and did not show how it generalizes to all neural networks. Being so accessible means it couldn't explain ReLU in context.
Don't get me wrong, it's a good explanation of how some variants of the ReLU activation function work, but it doesn't explain what a neural network really is, nor prove that your brain doesn't work by fitting data in a similar way.
Building an intuitive understanding of the math behind Neural Networks is so important.
Understanding the application of a NN gets the job done; understanding the math behind a NN makes the job fun. This video helps with the latter! Nice video!
One of the best introductory explanations about the foundational principles of neural networks. Well done and keep up the good work!
This might actually be the clearest perspective on neural networks I have seen yet!
Great video!
I think this video finally shows what I was waiting for, namely what is the purpose of multiple neurons / layers in a neural network intuitively.
This is the first time i have actually seen it explained clearly, good job!
This is flat out the best video on neural networks on the internet, provided you are not a complete newbie. Never have I had such an "ahaaa" moment. Clear, concise, easy to follow, going from zero to hero effortlessly. Bravo.
IMO one of the best explanation when it comes the idea/fundamental concept of the NN, Please make more 🙏
Thank you so much! Don't worry, more videos are on the way!
Excellent video. I was flabbergasted when I first heard that ANNs with ReLU activations are basically just piecewise linear functions. It was so obvious, but prior to that I had only seen explanations from CS people that almost regard ANNs as some kind of mysterious, mystical, magical, profoundly complex construction. This initially discouraged me from even thinking about them. I have since decided to stop listening to computer science folk talk about them and listen to math and physics folk, given that I think they tend to "cut the bs" when it comes to discussing things like this. Of course there are really great computer scientists, but it's usually their camp that likes to oversell concepts like this.
This is such a cool way of thinking about it! You did an amazing job discussing a popular topic in a refreshing way. I can't believe I just found your channel - as a video creator myself, I understand how much time this must have taken. Liked and subscribed 💛
Made my brain go boom. Seriously, thanks for sharing this perspective!
Thank you for this beautiful work!
Thank you very much!
This straightforward explaining method can save thousands of kids from dropping out of school "due to math"
Best explanation of parametric calcs ever! Bias & weights have new meaning
Thank you, I'd love a part two getting your take on fitting the line and prediction afterwards
The best video I have seen in giving one an understanding of neural nets. Thank you. Excellent, looking for more from you.
You've put things in a different perspective for me and I loved your explanation! Great job!
Where has this video been all my life! amazing simply amazing! we need more please
Fantastic video! I have been working with econometrics, data science, neural networks, and various kind of ML for 20 years but never thought of the ReLU neural networks as just a series of linear regressions until now!
Amazing viewpoint for an explanation. Would've loved an additional segment using this viewpoint to do MNIST image recognition
I explain how this viewpoint applies in the case of image classification in my video on CNNs: th-cam.com/video/8iIdWHjleIs/w-d-xo.html
And with this single video, you earned my subscription.
THIS was the missing piece of the puzzle I was looking for. This video helped me a lot. Thanks.
Like someone else said, I expected the video to be similar to all the others, but this one gave me so much more, very nice.
This is a really great video!! Love the approach with parametric regression.
Thank you thousands of times, you excellent teacher. Finally, I saw a high-quality and clear explanation of neural networks.
Thank you! That is a really brilliant video!
I have been using regressions often, but never knew that Neural Networks are kinda the same idea.
Very enlightening!
480p in 2022 surely takes me back in time. i love it!!
Careful there! You should explicitly mention that you are taking the absolute values of the errors. (Usually we use squares). Without the squares (or abs), the positive and negative errors will kill each other off, and the simple regression does not have a unique solution. Without the squares (or abs), you can start with any intercept, and find a slope that will give you ZERO total error!!
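A quick sketch of that cancellation, using made-up numbers (just to illustrate the point, not code from the video):

```python
# Toy illustration: with signed errors, ANY intercept b admits a slope a that makes the
# TOTAL signed error exactly zero, even when the fit is terrible.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]

def total_signed_error(a, b):
    return sum(y - (a * x + b) for x, y in zip(xs, ys))

def total_abs_error(a, b):
    return sum(abs(y - (a * x + b)) for x, y in zip(xs, ys))

b = 10.0                                 # an arbitrary (bad) intercept
a = (sum(ys) - len(xs) * b) / sum(xs)    # slope chosen so the signed errors cancel
print(total_signed_error(a, b))          # ~0.0
print(total_abs_error(a, b))             # 14.4: abs (or squared) error exposes the bad fit
```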
wow, ReLU is an unexpected starting point to explain NNs, but it nicely demonstrates the flexibility of summing up weighted non-linear functions. Such a refreshing way!
Great work. This video just updated my definition of awesomeness.
I wish I could like the video more than once..
Great job buddy
Just watched Sebastian Lague's video on Neural Networks the other day, and whilst great as always, it was _such_ a standard method of explaining them. Because mostly I just see this explained in the same way each time. This was such a nice change, and really provided me with a different way to look at this. Seeing 'no lin-alg, no calc, no stats' really concerned me, but, you did a great job, just by trying to explain different parts. Such a great explanation - would recommend to others.
This is the clearest video I've ever seen on TH-cam on what a Neural Network is. Thank you so much... you are a star. Could I perhaps ask or encourage you to create, for many of us keen on learning neural networks on our own, a video practically illustrating the fundamental difference between supervised, unsupervised and reinforcement learning?
Brilliant explanation! Very glad I stumbled on your channel!
Your channel is really amazing! Thanks for making videos.
Hands down the most enlightening ANN series on the net from my perspective, afaik. I'd be happy to pay 5 USD for the next video in the series.
This is EXCELLENT and the best video explaining intuitively what a neural network does. You are seriously brilliant
very original. I learned so much in a few minutes. thank you
Having worked with a simple single-layer 2-synapse neuron in a spreadsheet, I find this video vastly overexplains the topic at a high level, while not going into enough detail. It does, however, go over the linear regression needed for the synapse weight updates. Also it treats the massive regression testing as a benefit instead of a cost.
One synapse per neuron in the layer above, or per input if the top layer.
One neuron per output if the bottom layer.
Middle layers define resolution, from this video at a rate of (neurons per layer)^(layers).
Fun fact: Neural MAC (multiply-accumulate) chips can perform whole racks worth of computation. The efficiency gain here isn't so much in speed as it is reduction of power and space, by rearranging the compute units and using analog accumulation. In this way the MAC units more closely resemble our own neurons too.
Another explanation that's needed is the concept of gradient descent (GD), the general method used to figure out the best fit. Lots of systems use GD, including natural evolution; it's basically trial and error with adjustments, although there are various ways to make it work more efficiently, which can become quite complicated. You can even use GD to figure out better forms of the GD algorithm, that is, it can be used recursively on itself.
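A minimal sketch of a gradient descent loop on a toy one-parameter model (made-up data, just to make the idea concrete):

```python
# Minimal gradient descent on a toy model y ≈ w*x with a single parameter w.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w = 0.0      # the single parameter
lr = 0.01    # step size (learning rate)
for _ in range(200):
    # gradient of the squared-error loss sum((w*x - y)^2) with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad    # adjust the parameter a little in the downhill direction
print(round(w, 3))    # ≈ 2.036, the least-squares slope for this data
```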
I was wondering myself exactly what a simulated NN is actually doing (not what it is, but what it is doing) and this explanation is the best by far, if not THE answer. One adjustment I will suggest: at the end, explain that a simulated NN is not required at all, and that alternative systems can also perform the same function, which raises the question, what exactly are the fundamental requirements needed for line fitting to occur? Yes, I like to generalize and get to the fundamentals.
Good one! Nice video. In the regression line of Gauss one is not taking the perpendicular distances, though. But very cool video!
Very well made video. Thanks.
This makes total sense, thank you. With the last observation of the video, how does that reconcile with statements from the OpenAI team regarding emergent properties of GPT-4 that they didn't expect, or don't comprehend? I might be mixing apples and oranges, but if it's just curve fitting then why has something substantially changed?
Oh wow. This was so much more than I was expecting. And then it all clicked right in at about 9:45
Another great concise visual explanation!
Thank you!👍
Thanks for such amazing video! It resolves the long standing question in my heart
Amazing explanations!! Thank you!!
beautiful! could you do this type of videos on other machine learning models such as convolution?
Yep, I am planning to do CNN and transformer videos next.
Thank you!
Really beautiful point about layers and the exponential growth of the number of segments one can make!
I love to get the history lesson first always, Excellent.
most intuitive explanation I've seen
8:47 --> 9:21 It's like watching my brain while I predict some trades. 🤣🤣🤣 "The reason why neural networks work... is that they fit the data" sweet stuff.
I've loved all three of your videos, looking forward to more!
I'm positive I only got this recommended because of Veritasium's FFT video, but thank you youtube algorithm nonetheless. What a brilliant explanation!
Great video, this is an explanation I have not heard before. Also I don't know if that abrupt ending was purposefully sarcastic, but I thoroughly enjoyed it lol
Thanks! That was a really excellent explanation.
Thank you so much!
Excellent! First time I've seen it explained like this.
I wrote my own CUDA-based neural network implementation a while ago, but I used a sigmoid instead of RELU. Although the "training" covered in this video works, it was kinda odd (usually neural net backpropagation is done with gradient descent of the cost function). You probably cover this in another video, though, haven't gotten there yet. Good video nonetheless, never really thought about it this way.
The training method described in this video IS gradient descent. Backprop is just one way of computing gradients (and it is usually used because it is the fastest). It will yield the same results as this training procedure.
@@algorithmicsimplicity I should have figured it was doing much the same thing, just a lot slower and more 'random.' Your videos are awesome though, I'm trying to go into machine learning after doing aerospace embedded systems.
I loved the video, would you please make another one explaining the back propagation?
Hopefully I will get around to making a back propagation video sometime, but my immediate plans are to make videos for CNNs and transformers.
@@algorithmicsimplicity just don't stop man!
Wow, this is a really interesting view of neural networks and what role the layers play in them.
I like this video so much
It really shows that ANNs are just, at the end of the day, glorified multivariate regression models.
Not that I know much about any of this, but as you were explaining things I thought of the idea of a bezier curve. Or at least, thinking about how the idea is to fit the data as well as possible made me think of the bezier curve being able to create seemingly any complex line graphic. Have people applied bezier curves to neural networks, or is it just not at all the right way to think about these things? I get that it's a very high dimensional space we're dealing with, but I'd imagine a bezier curve could handle any number of dimensions.
So it depends a bit on how you define bezier curves in higher dimensions. Technically speaking, in order to define a cubic bezier curve in n dimensions you need n^3 parameters, which isn't practical for large n. You could apply a bezier curve independently to each dimension, which would basically amount to using x^3 as the activation function instead of ReLU. There has been some experimentation using polynomials as activation functions, but ReLU (or ReLU-like functions) always seem to perform better.
Thank you. This is far more intuitive than the usual interpretation with nodes and edges graph with the inputs bouncing back and forth between layers of the graph until it finally gets an output.
What are the advantages and disadvantages between this method of approximating a function and polynomial interpolation?
For 1-dimensional inputs and outputs, there isn't much difference between them. For higher dimensional inputs, polynomials become infeasible, since a polynomial would need coefficients for all of the interaction terms between the input variables (of which there are exponentially many). For this reason, neural nets are preferred when the input is high dimensional, as they simply apply a linear function to the input variables, and then an activation function to the result of that.
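To put rough numbers on that blow-up (the input dimension and polynomial degree below are just illustrative choices):

```python
# Rough size comparison: a full cubic polynomial in n variables needs C(n + 3, 3)
# coefficients, versus roughly n^2 weights for one dense layer of width n.
from math import comb

n = 1000
print(comb(n + 3, 3))   # 167,668,501 polynomial coefficients
print(n * n + n)        # 1,001,000 weights + biases in a single n-by-n linear layer
```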
I understand that by adding layers to our neural network we are gaining one extra "bend" for each segment crossing the x axis, but I don't understand how n^L is the number of line segments we can get from doing that
Neural network "training" is just model fitting. It is just that the proposed structure of it is just quite versatile.
The reason why they work is that they fit the data... brilliant
This is a really cool explanation I haven't seen before.
But I have two questions:
Where does overfitting fit in here? more neurons would mean higher risk of overfitting? do layers help or are they unrelated?
And where would co-activation of multiple neurons fit in this explanation? e.g., combination of information from multiple sensory sources?
My video on CNNs talks about overfitting and how neural networks avoid it (th-cam.com/video/8iIdWHjleIs/w-d-xo.html ) . It turns out that actually the more neurons and layers there are, the LESS neural nets overfit, but the reason is pretty unintuitive.
From the neural nets perspective, there is no such thing as multiple sensory sources. Even if your input to the NN combines images and text, the neural net still just sees a vector as input, and it is still doing curve fitting just in a higher dimensional space (dimensions from image + dimensions from text).
@@algorithmicsimplicity Thank you! I had read that more neurons lead to less overfitting and thought it was counterintuitive, but I guess that must have carried over from the regular modeling approach where variables remain (or should remain) interpretable.
I'll have a look at the other videos! Thanks
I guess my confusion stems from what you address at the end. We can fairly simply imitate some things via principles like Hebbian learning but the fact that in actual brains it involves different interconnected systems makes me stumble. (and it shouldn't because obviously these models are not actually like real brains)
Brilliant! Lots to unpack from the concluding sentence: neural networks work because they fit the data. Sounds like an even deeper issue than misalignment due to proxy-based training.
Actually, adding a bit of math to this video wouldn't hurt, alongside the visual representations of graphs and formulas. But anyway, one of the most accessible explanations I have ever seen.
I find this video odd for two reasons. ReLU isn't the only choice of function. And when we talk about curve fitting, that's a pretty dramatic misrepresentation of the sheer dimensionality of the system and, in the case of a deep network, of the implicit meaning being captured by each node... which is where the interesting insights about the nature of information itself occur, and where higher level structures and feedback loops play their roles in our minds.
Gauss: Just fit a line to a set of points. Anybody could have come up with this.
200 years later: Guys, what if we just fit multiple lines to a set of points :D :D :D
Nice interpretation 😊, please can you make a video explaining how neural networks are used in, for example, digit recognition
wow. very good. keep posting great content like this. you have a real talent for explaining complex topics in simpler terms. and there are people who just post content for the sake of posting and minting money. we need more people like you.
good video! very easy to follow.
Might have been useful to explain why a nonlinear function like the ReLU is used as the transfer function, or that it's not the only common transfer function, which is kind of implied since you use the identity in the beginning, though that isn't nonlinear.
Also, the link to the human brain is in the structure of neurons, the conceptual foundation of a weighted sum transformed into a single value by a nonlinear process: dendritic signals combined into a single signal down the axon, at the end of which lie new connections to further neurons. Furthermore, in the case of computer vision, the structure of the visual cortex served as an inspiration for the neural networks in that field. If memory serves, the neocognitron was one such network, not a learning network as its parameters were tweaked by humans till it did as desired, but foundational to convolutional neural networks.
Otherwise interesting enough, stressing the behind the scenes nature of neural networks, though maybe mentioning how classification relates to those regressions would have been cool too.
Btw, what learning scheme were you using? As far as I could tell it was some small random jump, first added and then subtracted if the result got worse? I assume if the subtraction is even worse it rolls back the entire thing and rolls a new jump? I ask as it neither sounded nor looked like back propagation was used.
The thumbnail still bugs me, the graph representation of the network isn't wrong it just shows something different, the nature of the interconnections between neurons in the network that are hard to see in the graph representation of the resulting regression. It's like saying the map of the route is wrong because it's not a photo of the destination.
All commonly used activations are either ReLU or smooth approximations to ReLU, so I disagree that it is useful to talk about alternatives in an introductory video.
It is not helpful to think of modern neural networks as having anything to do with real brains. Yes early neural networks were inspired by neuroscience experiments from the 1950s. But our modern understanding is that the way the brain actually works is vastly more complex than those 1950s experiments let on. Not to mention modern neural networks architectures are even less inspired by those experiments (e.g. transformers).
The learning scheme I used is iterate through parameters, increase parameter value by 0.1, if new loss is worse than old loss then decrease parameter value by 0.2. That's it. I just repeated that operation. No backpropagation was used, as the purpose of backprop is only to speed up the computation (which was unnecessary for these toy examples).
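In pseudocode, that scheme is roughly the following (a sketch of the loop as described, not the actual script used for the video):

```python
# For each parameter: nudge it up by 0.1; if the loss got worse, nudge it down by 0.2
# instead (net effect: -0.1 from the starting value). Repeat this over and over.
def train_step(params, loss_fn):
    for i in range(len(params)):
        old_loss = loss_fn(params)
        params[i] += 0.1
        if loss_fn(params) > old_loss:
            params[i] -= 0.2
    return params
```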
The graph representation of the network is absolutely wrong, as it misleads people into thinking the interconnections between neurons are relevant at all, when they have nothing to do with how or why neural nets work.
@@algorithmicsimplicity I would argue that modern neural networks developed from the more primitive neural networks based on the then-current understanding of the brain. Also, I didn't write that neural networks work like the brain, just that their basic building blocks were inspired by the basic building blocks of the brain, and that some structural aspects, as understood at the time, inspired the structure of NNs.
So you are claiming that neither sigmoid, hyperbolic, nor any of the transfer functions common to RBFNN are used? Or are so rarely used as to be negligible?
Yes ReLU and related functions are currently popular as the transfer function of hidden layers in deep learning and CNNs. Although some of the related functions start to depart quite a bit from the original, mostly keeping linearity in the positive domain, approximations not withstanding. Looking through them it feels like calling sigmoid and tanh functions related to the step function, which they kinda are, similarly solving issues with differentiability as the approximations to ReLU do.
So you kind of used a grid search, on a 0.1 sized grid to discretize the parameter space.
Pretty sure a NN without connections isn't a network. Especially for back propagation they are inherent in its functioning; after all, it needs to propagate over those connections.
I don't see what you mean by it misleading people, or how the interconnectedness is meaningless. The fact that a single layer wasn't enough to solve a non-linearly separable problem is what kept the field inactive for at least a decade.
"So you are claiming that neither sigmoid, hyperbolic, nor any of the transfer functions common to RBFNN are used? Or are so rarely used as to be negligible?"
@@algorithmicsimplicity if we didn't need performance then we wouldn't have made so much progress in the last few years. Stars aligned with performant hardware and efficient algorithms.
Great video! I am left with a question though. If the number of straight line segments in an NN with n neurons in each of the L layers is n^L, then why would we ever use n > 2? If we are constrained by the total number of neurons n*L, then n = 2 maximizes n^L. I have two guesses why use n > 2:
1. (Hardware) Linear algebra is fast, especially on a GPU. We want to use vectors of larger sizes to make use of the parallelism.
2. (Math) Maybe the number of gradient descent steps needed to fit a deeper neural network is larger than to fit a shallower NN with wider layers?
If you plan to make any more videos about this, this question would be great to address. If not, maybe you can reply here with your thoughts? Thank you!
Really good question. There are 2 reasons why neural networks tend to use very large n (usually several thousand) even though this means less representation capacity. The first is, as you guessed, it makes better use of GPU accelerators. You can't parallelize computation across layers, but you can parallelize computation across neurons within the same layer.
The second, and more important reason, is that in practice we don't care that much about representation power. Realistically, as soon as you have 10 layers with a few hundred neurons in each, you already have enough representation power to fit any function in the universe.
What we actually care about is generalization performance. Just because your network has the capacity to represent the target function, doesn't mean that it will learn the correct target function from the training data. It is much more likely that the network will just overfit to the training dataset.
It turns out to be the case that the wider a neural network is, the better it generalizes. It is still an open area of research why this is the case, but there are a few hypotheses floating around. My other video on convolutional neural networks actually goes into one of the hypotheses a bit, in it I explain that the more neurons you have the more likely it is that they have good initializations, but it was a bit hand-wavy. I was planning to do a more in-depth video on this topic at some point.
@@algorithmicsimplicity thank you for such a thoughtful reply! I'll watch your CNN video and all other videos you'll produce on this topic!
I thought that the whole overfitting business is kind of obsolete nowadays, with LLMs having more neurons than the number of training data samples. This is only a rough understanding I've gained from some random articles, and would love to learn more about it. Do you have any suggestions for what to read or watch in this direction? As you noted in the video, there is lots of low-quality content about NNs out there, which makes it hard to find answers to even rather straightforward questions like whether overfitting is "still a thing" in large models.
@@vasilin97 The reason why we use such large neural networks is precisely because larger neural networks overfit less than smaller neural networks. This is a pretty counter-intuitive result, and is contrary to what traditional statistical learning theory predicts, but it is empirically observed over and over again. This phenomenon is known as "double descent", you should be able to find some good resources on this topic searching for that term, for example www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent , medium.com/mlearning-ai/double-descent-8f92dfdc442f . The Wikipedia page on double descent is pretty good too.
I love this revisionist perspective! Let’s forget all the decades we spent using other activation functions than ReLU
expound plz without sarcasm
@@xt3708 I didn’t mean to be entirely sarcastic as I truly love this video’s perspective. It’s my understanding that it wasn’t until 2011 that we realized that the ReLU activation function works so well, and that it was by experimentation so I assume we didn’t get the concepts in this video for many years prior, and only understand them now (I am not an expert though so don’t quote me please)
That's a great explanation
Great video. Special thanks for the historical background.
This is a really good explanation
Damn you need and deserve more subscribers!
This is the second video of yours that I watch that gives me an eureka moment. Fantastic content. One thing I don't get is, people used to use the sigmoid function before ReLU, right? Was it just because natural neurons work like that and artificial ones were inspired by them?
Yes sigmoid was the most common activation function up until around 2010. The very earliest neural networks back in the 1950s all used sigmoid, supposedly to better model real neurons, and nobody questioned this choice for a long time. Interestingly, the very first convolutional neural network paper in 1980 used ReLU, and even though it was already clear that ReLU performed better than sigmoid back then, it still took another 30 years for ReLU to catch on and become the most popular choice.
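For reference, the two activations being compared (standard definitions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # squashes everything into (0, 1), saturates at both ends

def relu(x):
    return max(0.0, x)                   # identity for positive inputs, zero otherwise
```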
This is such a good explanation
Nicely explained!
Great concise explanation, and it does work: it fits at least my brain's data like a glove! Not that I have a head shaped like a hand (or do I?), but you did light up some bulbs in there after watching those line animations fitting better and better.
However, what happens when the neural network fits too well?
If you can briefly mention the overfitting problem in one of your next episodes, I'd greatly appreciate it. Looking forward to the CNNs and transformer ones! 🦾🤖
Thanks for the vid! At about 10:30 you say a NN with n neurons in each of L layers expresses ~ n^L linear segments. Could this be a mistake? I think it's more like n^2 * L
The number of different linear segments is definitely at least exponential in the number of layers, e.g. proceedings.neurips.cc/paper_files/paper/2014/file/109d2dd3608f669ca17920c511c2a41e-Paper.pdf
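One way to check this empirically is to count the linear pieces of a small random 1D ReLU network by looking for slope changes on a fine grid; a rough numpy sketch (the sizes and tolerance here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_relu_net(n, L):
    sizes = [1] + [n] * L + [1]          # 1D input, L hidden layers of n neurons, 1D output
    return [(rng.standard_normal((a, b)), rng.standard_normal(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(net, x):
    h = x.reshape(-1, 1)
    for i, (W, b) in enumerate(net):
        h = h @ W + b
        if i < len(net) - 1:             # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h.ravel()

xs = np.linspace(-10.0, 10.0, 100001)
ys = forward(random_relu_net(n=8, L=3), xs)
slopes = np.diff(ys) / np.diff(xs)
kinks = np.sum(np.abs(np.diff(slopes)) > 1e-6)
print(kinks + 1)   # number of linear pieces; for a random (untrained) net this is
                   # usually well below the n^L upper bound
```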
god bless you for this work
Good, though maybe it would be a good idea to explain why these piecewise linear functions are able to generalize over unseen data
excellent video
cool explanation