Why do Convolutional Neural Networks work so well?

Algorithmic Simplicity

มุมมอง 47 777

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 27 พ.ย. 2024

ความคิดเห็น • 103

@algorithmicsimplicity 2 ปีที่แล้ว ⁺⁵⁵
Transformer video coming next! I'm still getting the hang of animating, but the transformer video probably won't take as long to make as this one. I haven't decided what I will do after that, so if you have any suggestions/requests for computer science, mathematics or physics topics let me know.
@bassemmansour3163 2 ปีที่แล้ว ⁺¹
what program are you using for animation? thanks!
@algorithmicsimplicity 2 ปีที่แล้ว ⁺⁴
I'm using the python package manim: github.com/ManimCommunity/manim
@davidmurphy563 2 ปีที่แล้ว ⁺⁶
I'd say probably RNNs would flow nicely from this [excellent] video. GANs too I guess. Autoencoders for sure. Oh, LSTMs, the memory problem is a fascinating one. Oh and Deep Q-Networks.
Meh, the field is so broad you can't help but hit. I'd say RNNs first as going from images to text seems a natural progression.
@wissemrouin4814 2 ปีที่แล้ว ⁺²
@@davidmurphy563 yess please, I guess RNNs needs to be presented even before transformers
@davidmurphy563 2 ปีที่แล้ว ⁺²
@@wissemrouin4814 Yeah, I would agree with you there. RNNs serve as a good introduction to a lot of the approaches you'll see for sequence-vector problems and its drawbacks explains the development of transformers.
I'd suggest RNNs then LSTNs then transformers.
That said, this channel has done sterling work explaining everything so far so I'm sure he'll do a great job even if he dives straight into the deep end.
@ozachar ปีที่แล้ว ⁺¹⁷
As a physicist, I recognize this process as "real space renormalization group" procedure in statistical mechanics. So each layer is equivalent to a renormalization step (a coarse graining). The renormalization flows are then the gradual flow towards a resolution decision of the neural net. It makes the whole "magic" very clear conceptually, and also automatically points the way for less trivial renormalization procedures known in theoretical physics (not just simple real space coarse graining). The clarity of videos like yours is so stimulating! Thanks
@warpdrive9229 ปีที่แล้ว
Bingo!
@dradic9452 ปีที่แล้ว ⁺⁵¹
Please make more videos. I've been watching countless neural networks videos and until I saw your two videos I was still lost. You explained it so clearly and concisely. I hope you make more videos.
@algorithmicsimplicity ปีที่แล้ว ⁺¹¹
Thanks for the comment, it's great to hear you found the videos useful. I was unexpectedly busy with my job the past few months, but rest assured I am still working on the transformer video.
@IllIl ปีที่แล้ว ⁺⁴³
Dude, your teaching style is absolutely superb! Thank you so much for these. This surpasses any of the explanations I've come across in online courses. Please make more! The way you demystify these concepts is just in a league of its own!
@rohithpokala ปีที่แล้ว ⁺²
Bro ,you are real super man.This video gave so many deep insights in just 15 mintues providing so much strong foundation. I can confidently say,this video single handedly throwed 1000's of neural networks videos present on the internet.You raised the bar so high for others to compete.Thanks.
@Number_Cruncher 2 ปีที่แล้ว ⁺¹⁰
This was a very cool twist in the end with the rearranged pixels. Thx, for this nice experiment.
@joshlevine4221 3 หลายเดือนก่อน ⁺⁴
3:02 _Strictly_ speaking, there are only a finite number of images for any given image size and pixel depth, so each on can be uniquely described by a single number (and it is even an integer!). These "image numbers" cover a very, very, very wide and sparsely-filled range, but the "image number" still only has a single dimension. Thank you for the great video!
@dillxn554 หลายเดือนก่อน
Great point. At 4:47, my thought is that there are 8 bits needed to cover values 0-255, so the actual volume of the space would be `3,072 color vals x 8 bits ea. = 24,576 bits` to represent an image, so `2^24576` potential images? I don't know where the 9 came from.
@nananou1687 6 หลายเดือนก่อน ⁺²
This is genuinely one of the best videos I have ever seen! No matter the type of content. You have somehow made one of the most complicated topic, and simply distilled it to this. Brilliant!
@thomassynths ปีที่แล้ว ⁺¹
This is by far the best explanation of CNNs I have ever come across. The motivational examples and the presentation are superb.
@benjamindilorenzo 8 หลายเดือนก่อน ⁺³
The best video on CNN´s. Please make a video about V-Jepa, the proposed SSL Architecture from Yann LeCun.
Also it would be nice to have a deeper look at Diffusion Transformers or Diffusion in general.
Really really good work man!
@j.j.maverick9252 2 ปีที่แล้ว ⁺⁷
another superb summary and visualisation, thank you!
@bassemmansour3163 2 ปีที่แล้ว ⁺⁶
best illustrations in the subject. thank you for your work!
@khoakirokun217 7 หลายเดือนก่อน ⁺³
I love that you point out that we have "super human capability" because we are pre trained with assumption about the spatial information :D TLDR: "we are sucked" :D
@illeto 6 หลายเดือนก่อน ⁺³
Fantastic videos.
Here before you inevitably hit 100k subscribers.
@connorgoosen2468 ปีที่แล้ว ⁺²
How has the TH-cam Algorithm not suggested you sooner? This is such a great video, just subscribed and keen to see how the channel explodes!
@djenning90 ปีที่แล้ว ⁺²
Both this and the transformers video are outstanding. I find your teaching style very interesting to learn from. And the visuals and animations you include are very descriptive and illustrative! I’m your newest fan. Thank you!
@jcorey333 9 หลายเดือนก่อน ⁺¹
This is one of the best explanations I've seen! Thanks for making videos
@jollyrogererVF84 ปีที่แล้ว ⁺¹
A brilliant introduction to the subject. Very clear and informative. A good base for further investigation.👍
@terjeoseberg990 ปีที่แล้ว ⁺²
I believe that the main advantage to convolutional neural networks over fully connected neural networks is the computational savings and the increased training data.
A convolutional neural network is basically a tiny fully connected network that’s being trained on every NxN square on every imaginable. This means that a 256x256 image is effectively turned into 254x254 or 64,516 tiny images. If you start with 1 million images in your training data, you now have 64.5 billion 3x3 images that you’re going to train the tiny neural network on.
You can then create 100 of these tiny neural networks for the first layer, another 100 for the second layer, and another 100 for the third layer, and so on for 10 to 20 layers.
@algorithmicsimplicity ปีที่แล้ว
I think that these 2 reasons are the most commonly cited reasons for the success of CNNs (along with translation invariance, which is absolutely incorrect), but I don't think that these 2 things are sufficient to explain the success of the CNN.
It is true that CNN uses much less computation than fully connected neural networks, but there are other ways to make deep neural networks which are just as computationally efficient as CNNs. For example, using a MLP-Mixer style architecture in which a linear transform is first applied independently across channels to all spatial locations, and then a linear transform is applied independently across spatial locations to all channels. In fact, this is exactly what I used when making this video! The "Deep Neural Network" I used was precisely this, it would have taken too long to train a deep fully connected neural network. This MLP-Mixer variant uses the same computation as CNN, but allows each layer to see the entire input. Which is why it achieves less accuracy than a CNN.
As for the increased training data size, it is possible this helps but even if you multiply your dataset size by 100,000, it is still nowhere near the amount of data you would expect to need to learn in 256*256 dimensional space. Also, if it was merely the increased training data, then I would expect CNNs to perform better than DNNs even on shuffled data (after all, having more data should still help in this case). But in fact we observe the opposite, CNNs perform worse than DNN when the spatial structure is destroyed.
For these reasons I believe that the fact that each layer sees a low effective dimensional input is necessary and sufficient to explain the success of CNNs.
@terjeoseberg990 ปีที่แล้ว
@@algorithmicsimplicity, It’s a combination of multiplying the dataset size by 64,500 and reducing the network size from 256x256 to 3x3. In fact it’s the reduction of the network size to 3x3 that’s allowing the effective 64,500 times increase in dataset size. It’s not one or the other, but both. Each weight gets a whole lot more training/gradient following.
You should do a video on the MLP-Mixer, and how it compares to CNN.
@BenjaminDorra 3 หลายเดือนก่อน ⁺¹
Thank you for this fascinating video !
It is a very original angle on the effectiveness of CNNs. I have never seen this approach, most articles and videos focus on the reduction in parameters and computation compared with the base MLP or the image compression.
Interestingly you don't talk about pooling, a staple of CNNs architectures. Arguably it is mostly for computational efficiency but I have seen a bit of debate on the subject (max pooling being especially polarizing).
@algorithmicsimplicity 3 หลายเดือนก่อน ⁺¹
My goal in this video is to explain why CNNs generalize better than other architectures. It is true that CNNs are more computationally efficient than MLPs, but there are other ways to improve the efficiency of MLPs. In particular, in this video the "Deep neural network" that I am comparing to is not a MLP, but a MLP-mixer. This MLP-mixer is just as parameter and compute efficient as the CNN (using an almost identical architecture), the only difference between them is that in the CNN each neuron sees a 3x3 patch, and in the MLP-mixer each neuron sees information from the entire image. This difference, and this difference alone, results in the ~20% point accuracy increase.
Max-pooling has generally been used to improve efficiency. Sometimes max-pooling can improve accuracy, but only by about 1-2%. In other cases, max-pooling can actually reduce accuracy. The main reason to use it is to just reduce computation. Because of this I don't consider max-pooling to be fundamental to the success of CNNs, you can built CNNs without max-pooling, they work fine.
@nadaelnokaly4950 8 หลายเดือนก่อน ⁺²
wow!! ur channel is a treasure
@ZetaReticulli ปีที่แล้ว ⁺⁵
@4:17 Why is it 9^N points required to densely fill N dimensions? Where is 9 being derived from? Is it for the purpose of the example given - or a more general constraint?
@algorithmicsimplicity ปีที่แล้ว ⁺⁴
It is a completely arbitrary number just for demonstration purposes. In general, in order to fill a 1d interval of length 1 to a desired density d you need d evenly spaced points. To maintain that density for n-d volume you need d^n points. I just chose d=9 for the example. And the more densely filled the input space is with training examples, the lower the test error of a model will be.
@senurahansaja3287 ปีที่แล้ว
@@algorithmicsimplicity thank you for ur explanation but in here th-cam.com/video/8iIdWHjleIs/w-d-xo.html dimensional points mean the input dimension right ?
@algorithmicsimplicity ปีที่แล้ว ⁺¹
@@senurahansaja3287 Yes that's correct.
@manthanpatki146 11 หลายเดือนก่อน ⁺¹
Man, keep making more videos, this is a brilliant video
@montanacaleb 3 หลายเดือนก่อน ⁺³
You are the 3blue1brown of ml
@panizzutti หลายเดือนก่อน ⁺¹
Bro your videos make me understand so well wtf
@PotatoMan1491 5 หลายเดือนก่อน ⁺¹
Best video I found for explaining this topic
@anangelsdiaries 9 หลายเดือนก่อน ⁺²
Fam, your videos are absolutely amazing. I finally understand what the heck a CNN is. Thanks a lot!
@yoavtamir7707 2 หลายเดือนก่อน ⁺¹
You are explaining so so so well. Thanks and keep going!!!!!!!!!!!’
@pedromartins9889 10 หลายเดือนก่อน
Great video. You explain things really well. My only complain is that you don't cite references. Citing references (which can be made simply as a list in the description) makes your less obvious statements more sound (like the fact that the quantity of significant outputs of a layer is more or less constant and small, I understand it would be very hard to explain it maintaining the flow of the video, but if there was in the description some link to that explanations or at least to a practical demonstration, the viewer could, if wanted, understand it better or at least be more sure that it is really true). Citing references also helps the viewer a lot if he wants to further study the topic (and this is fair, since you already made the rersearch for the video, so it costs you way less to show your sources than to the viewer to rediscover them). In summary: citing references gives you more credibility (in a digital world filled with so much bullshit) and gives a great deal of help to interested viewers to go deeper on the topic. Don't be mistaken, I really like your channel.
@neithanm ปีที่แล้ว ⁺¹
I feel like I missed a step. The layers on top of the horse looked like an homogeneous color. Where's the information? I was expecting to see features from small parts to recognizing the horse, but ...
@escesc1 6 หลายเดือนก่อน ⁺¹
This channel is top notch quality. Congratulations!
@KarlyVelez-u2k ปีที่แล้ว ⁺¹
Your videos are extremely good, especially for such a small channel. Great video! Can do one in Recurrent Neural Networks please .
@sergiysergiy8875 ปีที่แล้ว ⁺¹
This was great. Please, continue your content
@jorgesolorio620 2 ปีที่แล้ว ⁺³
Great video! Can do one in Recurrent Neural Networks please 🙏🏽
@justchary ปีที่แล้ว
I do not know who you are, but please continue! You definitely have a wast knowledge on the subject, because you can explain complex things simply.
@VictorWinter-n2i ปีที่แล้ว
Really nice! What tool did you use to do those awesome animations?
@algorithmicsimplicity ปีที่แล้ว
This was done using Manim ( www.manim.community/ )
@GaryBernstein ปีที่แล้ว
Can you explain how the NN produces the important-word-pair information-scores method described after 12:15 from the sentence problem raised at 10:17? Can you recommend any tg groups for this Q & topic?
@thetntsheep4075 3 หลายเดือนก่อน ⁺¹
At 14:00 with the rearranged pixels, do you mean every image in the dataset has the pixels rearranged in the same way? If they were rearranged in a different random way for each image I dont see how you could learn classification well at all
@algorithmicsimplicity 3 หลายเดือนก่อน
Yes I do mean in the same way. The same permutation is applied to every image. This is equivalent to shuffling the columns of a tabular dataset. Has no effect on fully connected neural networks, but severely impacts CNNs.
@joshmouch ปีที่แล้ว
Yeah. Jaw dropped. This is an amazing explanation. More please.
@Emma2-cg5jh 6 หลายเดือนก่อน
Where does the Performance value for the rearranged images come from? Did you made Them By Yourself or is there a paper for that?
@algorithmicsimplicity 6 หลายเดือนก่อน
All of the accuracy scores in this video are from models I trained myself on CIFAR10.
@5_inchc594 ปีที่แล้ว ⁺²
amazing content thanks for sharing!
@ThankYouESM ปีที่แล้ว
Seems like the bag-of-words algorithm can do a faster job at image recognition since it doesn't need to read a pixel more than once.
@mrfurious60 3 หลายเดือนก่อน ⁺¹
How we do go from 3 by 3 feature map to a 5 by 5 image?
@algorithmicsimplicity 3 หลายเดือนก่อน
In the second layer, the input is a 3x3 grid of outputs from the first layer. In the first layer, each output is computed from a different 3 by 3 grid of pixels. Therefore, the input to the second layer contains information from 9 different overlapping 3x3 patches, which means it sees information from a 5x5 patch of pixels.
@mrfurious60 3 หลายเดือนก่อน
@@algorithmicsimplicity Thanks man. I guess I'm a little too slow because while I get how the first layer gives us what it does, the second layer is still a problem 😅.
@solaokusanya955 ปีที่แล้ว ⁺¹
So technically, what the computer sees or not is high dependent on "whatever" we has humans dictate it to be...
@HD-Grand-Scheme-Unfolds ปีที่แล้ว
@AlgorithmicSimplicity greetings, may I ask: in you video presentation could you please specify in what sense do you mean by "Randomly re-order the pixels" (13:55)? let me explain my question. Although I know you mean reshuffling the permutation order of the set of input pixels; when I said in what sense I meant: is it (A-> a unique random re-order seed for each training example (as in for every picture) |OR| (B-> the same random re-order seed for each training example?
if you meant in the sense of "A" I would be amazed the convolution-net can get that 62.9% accuracy you mentioned earlier. That 62.9% would be more believable for me if you meant in the sense of "B".
@algorithmicsimplicity ปีที่แล้ว ⁺³
I meant in the B sense, same shuffle applied to every image in the dataset (training and test). If it was a different random shuffle for each input then no machine learning model (or human) would ever get above 10% accuracy. If you have some experience with machine learning, this operation is equivalent to shuffling the columns of a tabular dataset which of course all standard machine learning algorithms are invariant to.
@HD-Grand-Scheme-Unfolds ปีที่แล้ว
@@algorithmicsimplicity lol, speaking from in hindsight, your point in now taken. Dwl lmao😄🤣. Which human or person.... but let me be the devil's advocate for entertainment and curiosity purposes a bit: I it was somehow in sense "A", then I'd imagine that imply a phenomenon we all may call pure memorization at its finest.
But to go back on main track, I love that you went out of the way to make that clear in you presentation, yours is the second video that mentioned, but you were to first to settle the question the big question (that I already asked you, thanks again).
by the way in the name of opportunity sake I would like to ask: Do you know where a non-programmer person may source a intuitive interactive GUI based executable program that simulate and implement recurrent neural networks (especially if its the simple RNN, prefer against but will accept LSTMs or GRUs)? Github for example mostly accommodates those who meet coding knowledge prerequisite. "MemBrain" (meets concept but its RNN is still puzzling for me to figure out, and train test etc) (but its the most promising one to try work with so far) and "Neuroph Studio" (meets the concept but have no RNN support) and "Knime Analytics Platform" is likened onto coding skills, in disguise as GUI with click and parameter controls. rules for arrangments are too complex, and counter intuitive. IBM Watson studio seems similar and matlab is a puzzlebox too.
@algorithmicsimplicity ปีที่แล้ว ⁺¹
I'm afraid I don't know of any GUI programs that simulate RNNs explicitly, but I do know that RNNs are a subset of feedforward NNs. That is, it should be possible to implement an RNN in any of those programs you suggested. All you would need to do is have a bunch of neurons in each layer that copy the input directly (i.e. the i'th copy neuron should have a weight of 1 connected to the i'th input and 0 for all other connections), and then force all neuron weights to be the same in every layer. That will be equivalent to an RNN.
I would also recommend you just try and program such an app yourself. Even if you have no experience programming, you can just ask ChatGPT to write the code for you 😄.
@reubenkuhnert6870 ปีที่แล้ว
Excellent content!
@Isaacmellojr 10 หลายเดือนก่อน
Mais videos por favor! Vc tem o dom!!
@jamespogg ปีที่แล้ว
amazing vid good job man
@bobuilder4444 6 หลายเดือนก่อน
13:09 How would you know which numbers to remove?
@algorithmicsimplicity 6 หลายเดือนก่อน
You can simply order the weights by absolute value and remove the smallest weights (the ones closest to 0). This probably isn't the best way to prune weights, but it already allows you to prune about 90% of them without any loss in accuracy: arxiv.org/abs/1803.03635
@bobuilder4444 6 หลายเดือนก่อน
@@algorithmicsimplicity Thank you
@uplink-on-yt ปีที่แล้ว
12:58 Wait a minute... Did you just describe neural pruning, which has been observed in young human brains?
@nageswarkv ปีที่แล้ว
definitely good video, not fluff video
@pypypy4228 ปีที่แล้ว
It's brilliant!
@aydink7739 ปีที่แล้ว ⁺¹
This is wow, finally understand the „magic“ behind CNNs. Bravo, please continue 👍🏽
@Walczyk 10 หลายเดือนก่อน
7:04 this is just like the boost library from microsoft
@peki_ooooooo ปีที่แล้ว
Hi, how's the next video?
@scarletsence ปีที่แล้ว
This god like visualizations thanks.
@blonkasnootch7850 ปีที่แล้ว
Thank you for the video. I am not sure if it is right to say that humans have build knowledge into the brain about how the world works from birth.. accepting vision input for data processing, detecting objects or separating regions of interest is something every baby has to clearly learn. I have seen that with my children it is remarkable but not there from beginning.
@algorithmicsimplicity ปีที่แล้ว ⁺¹
Of course children still need to learn how to do visual processing, but the fact that children can learn to do visual processing implies that the brain already has some structure about the physical world built into it. It is quite literally impossible to learn from visual inputs alone, without any prior knowledge.
@yourfutureself4327 ปีที่แล้ว
💚
@lilep666 หลายเดือนก่อน
2:36 you just put images of dogs and cats arbitrarily on the x axis?? what? what does x even mean in this example and how do you decide what "x" value every image gets? x has no meaning here so what does it matter what curve you fit??
@jameswustaken3862 3 หลายเดือนก่อน ⁺¹
@lolikobob ปีที่แล้ว
Make more good videos!
@thechoosen4240 ปีที่แล้ว
Good job bro, JESUS IS COMING BACK VERY SOON;WATCH AND PREPARE

ต่อไป

เล่นอัตโนมัติ

Transformer Neural Networks Derived from Scratch