You should've started with the typical 3-channel RGB input image and animated convolutions on that; that's where most people start to get lost as to how the weights match up with the inputs when translating from the 2D mental model to 3D.
Bang on right. So he made a video about how others weren't doing it right, but then didn't start from the beginning to explain what actually goes on. I mean, what good is this new one then :/
All these wrong illustration and animation trends have been among the many problems that make you think, "why the hell have we been doing this all wrong, all the time, everywhere?" Finally, someone came along and did the obvious. Thank you!
They are not wrong. They just show a special case. They use the special case because the focus is on things like stride, dilation, padding etc. It's good to make the 3D tensor animations, but don't call the existing ones wrong. I think I would have still found it easier to understand the existing ones first and then move on to the 3D animations.
A major thing that feels missing to me in the animations is clear textual labeling. It's fine that you label things out loud, but on-screen labels would also make the video more accessible for folks with hearing or cognitive challenges. My crit aside, this animation is lovely, and I'm very impressed with what you've done. You've earned yourself a new subscriber :)
The animation is just meant as an abstraction of the spatial convolution operation itself. A spatial CNN layer consists of spatial convolution operations across multiple input and output channels (which is what you are referring to)
Forget the animation itself (even though it's great). I just appreciate a non-moving camera. It bothers me so much when people spin the camera around a nice animation in a circle. It makes me feel like I'm on a carnival ride.
Wow, really great, thanks for your work! I was struggling with the very problem you mentioned in the video - bringing together the 2D conv visualizations with the multi-channel 3x3 convolutions that are common in modern CNNs. Thanks to your work, I now understand it.
The use of all these misleading animations is the primary cause of misconceptions about convolutional neural networks; you have finally provided a good visualization. I am happy to share this content with my colleagues.
Oh man, I'm so glad someone took a direct approach to this problem. When I was learning, I was so confused by all these 2D animations and explanations, and then seeing the resulting tensor shapes got me super confused: where did the depth go, and where did it appear? Thanks for bringing this video to the world!
So in the case of a feature-map input, does 2D conv just replicate each 2D filter along the feature dimension and multiply element-wise? In the video, are the filters really just 2D, replicated to fill in the number of features? Or is each 2D filter in reality a 3D tensor matching the feature dimension?
These videos are outstanding! Finally, true visualisations that get it right. I'm sharing these with my ML Masters students. Thank you for your considerable effort putting these together.
Thank you for this, recently I tried to explain why the input and output shapes behave the way they do, and what gets combined with what. These animations will make it sooo much easier!!
Thank you for making this video! I have been trying to visualize this using all the horrible diagrams from papers. I immediately understood what they were trying to convey after watching your video!
Well, not to speak for all the existing animations/figures, but I won't say they are wrong. They have some issues, but they are essentially correct. When talking about 2D convolution, we should know the input and output are 3D, since the input is a picture and the output is also a picture/feature map.
The first animation you say is wrong shows the contribution of one filter operation, which is quite accurate. If you consider the number of input channels to be 1 and the number of output channels to be 1, that is the right figure for the whole operation. The conv2d operation is all element-wise matrix multiplications with shifting windows. The 3D animation you did looks great but lacks that notion. That is my opinion; I stick with the 2D.
Thank you! He never explains why his animation is "correct" and in my opinion it simply isn't. 2D convolutions act 2D on 2D data. The fact that we have *multiple data* and *multiple filters* leads to us frequently blocking things in 3D, but the convolution itself is still fundamentally 2D. And if someone doesn't understand that, it's not because of bad animations.
Lol, I literally learned this the hard way just about 2 months ago, when the shape for my 2d convolution required 3 parameters, and this made me super confused :,)
"A 2D convolution actually takes in a 3D tensor as input and has a 3D tensor as output" - well, it depends, right? If you have a single-channel/grayscale image then the input is in fact a 2D tensor, and each feature outputs a 2D tensor that is joined with all the others in the feature map. So if you have a grayscale image with a single feature, the animations would in fact be correct. I think the animations are perfectly fine, as they simplify a concept to its most basic form for easy understanding. But it is true that after you understand the basic concept, a 3D-to-3D representation is also nice for understanding more common and complex examples. Disclaimer that I could be wrong, as I am by no means an expert, but this is my take from my current understanding of convolutions :)
I'm not aware of a library where it depends, i.e., where the depth dimension is optional. PyTorch's Conv2d will accept a 3D tensor or a 4D tensor (batched 3D tensors); the functional interface only accepts 4D. Keras's Conv2D layer will only accept a 4D tensor. TensorFlow's conv2d operation will accept anything with at least 4 dimensions (the last three are treated as the height, width, and channels, and all the others before that are treated as batch dimensions). NVIDIA's cuDNN implementation of 2D convolution takes a 4D tensor. And in all of these cases the weights will be 4D, which can be thought of as a 3D weight for each filter corresponding to the size of the 3D patch that the filters operate on in the input. So as far as the industry standard goes, there just isn't a 2D convolution where you don't have a depth dimension in the input. Your grayscale case will only have a depth of 1 but will in fact be a 3D tensor. If you're able to find a mainstream library where the depth dimension in the input is optional, let me know.
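To make the PyTorch case above concrete, here's a minimal shape check - a sketch assuming a recent PyTorch version (where `nn.Conv2d` accepts unbatched 3D input); the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3)

# grayscale image: depth 1, but still a 3D tensor (channels, height, width)
gray = torch.randn(1, 28, 28)
print(conv(gray).shape)      # torch.Size([4, 26, 26])

# batched: a 4D tensor (batch, channels, height, width)
batch = torch.randn(8, 1, 28, 28)
print(conv(batch).shape)     # torch.Size([8, 4, 26, 26])

# a truly 2D (height, width) tensor with no depth dimension is rejected
try:
    conv(torch.randn(28, 28))
except RuntimeError:
    print("plain 2D input rejected")
```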
@@animatedai Still, the concept doesn't need the 3D implementation, as the different features are worked on independently anyway. I definitely think it's a stretch, and it comes off as really condescending, to call all other animations wrong.
With those shapes of input and filter, is there even any reason to have them 3D over 2D? I get the output as you layer filters, but if the input and filter are just the same thing all the way through, representing them by single cubes is not wrong.
Why does the input shape have so many layers in this animation? Wouldn't it have a shape equal to the image shape, then 3 layers, one for each of R, G, and B?
I'm glad you asked. In general, convolution takes a feature map as input, which can have any number of features (depth). An image is a special kind of feature map with 3 features: red, green, and blue. However, this typically only applies for the first layer of convolution in a neural network and the other convolutional layers will have more features for input, e.g., 32, 64, 128, ... 1024, 2048. So to better represent the general case and to encourage viewers to consider more than just the special case of an image, I chose to use 8 for the animations. Although 3 would also be perfectly valid.
Thank you! On my GitHub page (animatedai.github.io), you can see a few different variations, and I'm also working on an interactive webgl app where you can pick the parameters.
@@animatedai A better way to ask my question is this: how is the first feature map created in a convolutional network? Surely the first time the 2D image is convolved we have a 2D tensor and 2D filters, just like the typical animation, right? I get that the output of this convolution will be a 3D feature map, and thus all further convolutions will look like your 3D-to-3D animation.
@@animatedai You may take a grayscale image as an input because for many cases it is sufficient, and for learning it's a good simplification. I would not consider this a wrong animation; it just assumes you have a grayscale image as an input. I get the point of your video, but the title is clickbait.
I feel silly for asking, but the different colored blocks (in the middle) correspond to convolutions over different channels of the original matrix right?
I have been struggling to mentally visualize convolutions, especially going from one dimension to others. I was reading the book Understanding Deep Learning by Simon Prince and I realized what I thought it looked like was wrong (the 2D-to-2D animations from the beginning). I wish I had stumbled upon yours before having to imagine what was explained in the book XD (good book, though).
I'm glad you noticed. I picked an even column number specifically to demonstrate that. This happens because there's a stride of 2 and an even number of columns (8 in this case counting padding) and an odd sized filter (3x3). So after it's taken 3 steps, there's only 1 pixel of width remaining, and it can't move 2 more spaces. Convolution handles this by simply ignoring the remaining data and moving to the next row. This is important to know because it could cause you to lose data (which could accumulate over many layers to be significant chunks of your input). In this case, the last column was just padding anyway, so no real data is lost. Note: the last row is scanned because, unlike the columns, we have an odd number of rows, 7 counting padding.
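The column-skipping described above follows from the standard output-size arithmetic; here's a quick pure-Python check of those numbers (a sketch; `conv_output_size` is just a hypothetical helper name):

```python
def conv_output_size(size, kernel, stride):
    # number of positions a kernel of width `kernel` can take when
    # sliding with `stride` over `size` elements (padding already counted in `size`)
    return (size - kernel) // stride + 1

# 8 columns (counting padding), 3-wide filter, stride 2:
cols = conv_output_size(8, kernel=3, stride=2)
print(cols)  # 3

# the last position starts at column 2*(cols-1) = 4 and covers columns 4..6,
# so column 7 (index 7) is never scanned
print(2 * (cols - 1) + 3 - 1)  # 6: last column index actually covered

# 7 rows: the last position starts at row 4 and covers rows 4..6,
# so the last row IS scanned
print(conv_output_size(7, kernel=3, stride=2))  # 3
```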
So the point he is trying to make is that an "image" is represented as a 3D object/array: height (pixels), width (pixels), and the RGB components. But when we talk about 2D images we usually mean a "grayscale image", which doesn't require mentioning the RGB part explicitly, although it is still a 3D array. So for a colored image of 128x128 pixels, the tensor shape would be (128, 128, 3), and for a grayscale image it would be (128, 128, 1).
@@animatedai Thank you for the answer. By the way, why do you think the typical animations are wrong? Can't we just do a 2D convolution on each slice of an input image and then stack the slices together to get the same feature map as with your animation?
That's a great question. So good, in fact, that I'm planning to make a follow-up video explaining it. I think a lot of people are struggling with this idea, and your question's phrasing really helped me understand where the misconception is. The short answer is that what you're proposing wouldn't be equivalent, because in neural network convolution, each filter sees all the features/channels of the input, not just a slice of the input. That's why the filters themselves are 3D. More concretely, let's say a bumblebee-detecting neural network wanted to look for the color yellow, so it needed a filter that detected yellow. That filter couldn't just look at the red channel or just the green channel or just the blue channel. It needs to look at all of them together to distinguish yellow from red or from green or from white or any other color. So we can't slice the input up into red/green/blue and then operate on the slices separately. Does that make sense?
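The bumblebee example can be sketched numerically. Below, a hypothetical 1x1x3 "yellow detector" filter (made-up weights) only works because it spans all three channels at once:

```python
import numpy as np

# made-up filter weights: reward red and green, penalize blue
yellow_filter = np.array([1.0, 1.0, -2.0])

pixels = {
    "yellow": np.array([1.0, 1.0, 0.0]),
    "red":    np.array([1.0, 0.0, 0.0]),
    "green":  np.array([0.0, 1.0, 0.0]),
    "white":  np.array([1.0, 1.0, 1.0]),
}
for name, rgb in pixels.items():
    # one filter response: a dot product across ALL channels of the pixel
    print(name, rgb @ yellow_filter)
# yellow scores 2.0; red and green score 1.0; white scores 0.0 --
# no single-channel slice of the input could make these distinctions
```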
@@animatedai Yeah, it makes total sense once you realize (which is quite obvious when you think about it) that the input channels pretty much always have some sort of correlation between them. In your example, we need all 3 channels to see how much red, green, and blue we have to get this certain type of yellow. I talked with my professor about this topic (and referenced your video). He believed that the only case where this 2D convolution and stacking of slices is better is when you don't have much data, and that training will be faster. I think a lot of people would appreciate a video exploring the difference between these two ideas in more detail. At least I would. Thank you already for your animations and answers!
I meant to add that the splitting and stacking isn't a crazy idea as long as you understand the limitation (compared to standard convolution) and compensate for it. In fact, it's the basis of the depthwise-separable convolution, which can be much more efficient than standard convolution. I've got a video on it that you might like: th-cam.com/video/vVaRhZXovbw/w-d-xo.html
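A minimal sketch of that idea in PyTorch (the channel sizes here are made up for illustration): `groups=in_channels` gives the per-channel "depthwise" step, and a 1x1 "pointwise" convolution recombines the channels, compensating for the per-slice limitation:

```python
import torch
import torch.nn as nn

# depthwise: each of the 32 input channels gets its own independent 3x3 filter
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)

# pointwise: a 1x1 convolution mixes information back across channels
pointwise = nn.Conv2d(32, 64, kernel_size=1)

x = torch.randn(1, 32, 16, 16)
y = pointwise(depthwise(x))
print(y.shape)               # torch.Size([1, 64, 16, 16])

# far fewer parameters than a standard 3x3 conv with the same in/out channels
params = lambda m: sum(p.numel() for p in m.parameters())
standard = nn.Conv2d(32, 64, kernel_size=3, padding=1)
print(params(depthwise) + params(pointwise), "<", params(standard))
```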
Good questions! First: for a good explanation of "why" the filters are 3D, check out this video: th-cam.com/video/XdTn5md3qTM/w-d-xo.html. And second: the filter depth matches the input depth, and the output depth matches the filter count.
Thanks for the great video! I'm currently trying to learn these concepts, so I might be suggesting something that is incorrect. Here it goes anyway: an example with an input of depth 3 might make this easier to understand by thinking of RGB image data. It also seems to be the (special) case in your animations that the input depth is the same as the number of filters, leading to the same depth in both input and output. That is not always the case, if I've understood this right.
I intentionally avoided using an input depth of 3, because that's a special case. Most convolutional layers in a CNN will have an input depth much higher than 3. It's better to think of the convolutional layer as "feature map in, feature map out" rather than "image in, feature map out". By that same logic, I should have made the input depth different than the output depth, because that's also a special case like you said. I had thought about this at one point, but sadly it didn't make it into the final animation. I'll probably fix that on GitHub in the future.
Thanks so much for this. I also really struggled to find proper animations. I would have liked to see how this looks in the actual neural network, i.e., how the filter can be visualized as the weights, or how the filter parameters are trained. I would greatly appreciate a video on GANs and LSTMs. The LSTM diagrams are terrible; I really struggled to visualize how they connect to the overall network.
Thank you! Great animation. However, I do have a technical nitpick: your animation shows an operation known as cross-correlation, which is related to convolution but mirrored. "Convolutional" neural networks use cross-correlations in the feed-forward phase and convolutions in the backpropagation phase.
Yep, the output is a feature map. Each filter produces one feature of the output, which get stacked together like you see. I've got a video covering that concept here: th-cam.com/video/eMXuk97NeSI/w-d-xo.html.
I find these "new and correct" animations confusing, I have no idea what's happening there. I assume this is just "the correct way to display convolution" for AI models? As an old school person who used convolutions mainly for 2D image processing (blur/edge detection) I don't see anything wrong about the old animations, that's exactly what we used to do there.
As someone who has never built a convolutional neural network, but who has done lots of convolution in image processing algorithms: the convolution they are showing is normal 2D convolution. 3x3 pixel values in -> convolution kernel operation -> single pixel value out. For showing what is actually going on during an operation with a convolution kernel, those first animations are perfect.

As someone who has built at least a couple of neural networks with linear transformations, and knows exactly how convolution kernels work, I'd hoped to be able to intuit what's going on in a convolutional neural network from your animation, but your animation is super confusing without any context. What is your input, a 3D tensor? I thought the input was a 2D image. What is the output? I thought the output was just "features" extracted from the image. What you've created is so abstract that it literally made me more confused than I was to start with.

In my opinion, the best diagram is what you've shown between 1:11 and 1:33. Except for the concept of 'pooling', it's 100% clear what the actual mechanics within the neural network are, and it explains the process of convolution. With no prior knowledge of convolutional neural network mechanics, I understand it roughly, save the concept of 'pooling'. If you think that information is too much for a student to be able to put together, you're doing too much of the thinking for them. Maybe with knowledge of convolutional neural networks the animation you've made would make sense, or with context, but for an introductory course on convolutional neural networks, it is so abstract as to be worse than useless. It's actively confusing.
The input is 3D because it has multiple channels, so these are like 3D convolutions, but they are commonly called 2D because the stride and padding are 2D. Let's say you have an image with 3 channels, so it's 100x100x3. If your layer has 16 output channels, it will have 16 convolution filters, each of size height*width*3. So you will end up with a 100x100x16 output.
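The arithmetic above can be checked with a naive NumPy implementation - a small illustrative sketch with made-up sizes (8x8 instead of 100x100 to keep the loop fast), using "same" zero-padding:

```python
import numpy as np

H, W, C_in, C_out, K = 8, 8, 3, 16, 3
x = np.random.randn(H, W, C_in)
# one 3D (K x K x C_in) filter per output channel
filters = np.random.randn(C_out, K, K, C_in)

# zero-pad the two spatial dimensions only ("same" padding for a 3x3 kernel)
xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
out = np.empty((H, W, C_out))
for i in range(H):
    for j in range(W):
        patch = xp[i:i + K, j:j + K, :]   # each patch spans ALL input channels
        out[i, j] = (filters * patch).sum(axis=(1, 2, 3))

print(out.shape)   # (8, 8, 16): height x width x (number of filters)
```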
@@carnap355 I have done a little bit of work with convolutional neural networks now. I apologize if my response came off as rude or blunt; I tend to not use as many niceties on the internet as I would in a usual conversation. I have to say I still find this confusing. Though I get what you're saying, I don't understand what that big 3x3xN cuboid block in the middle is with regard to each of the convolution kernels, the image, and the image output (at best guess, the 3D block is all the stacked results of the previous convolution operations, being subjected to a new convolution operation?). What do you think of the "CNN Explainer" website (you can google it)? That's how I understand convolutional neural networks as of now. I also understand the max pooling layer now to be akin to what I would call a "maximum filter" operation in image processing. So I *think* I have an understanding of what's going on in a convolutional network, but feel free to correct me.
Unfortunately, only half right. What if we need to understand a 4D or 5D convolution? Humans understand 2D most intuitively, and I think that is why those 2D-based animations were made. (And 2D convolution can be extended to larger dimensions.) And deep learning convolution is unfortunately not mathematically organized: it is derived from the "filter" in image processing, and "filter" in turn was derived from "cross-correlation" long before. Your animation has multiple kernels; it just depicts an argument called "channels" that is only used by neural network frameworks.
Just abstract/generalize it to higher dimensions then. This is the same as asking "why do we visualize vectors in 2D coordinate systems, even though N-dimensional vectors are well-defined, or even infinite-dimensional vectors (Hilbert spaces)?" Visualizations are meant to capture an intuitive/simplified example; they aren't meant for generality. The generality comes from formal mathematical reasoning, which no visualization can capture.
They are not wrong. They are a simplification that helps to understand the concept. As any simplification they are incomplete. But not wrong. It's sad that you use clickbait titles.
They are not wrong. They are just displaying a different case than what you are interested in. Maybe they are misplaced in the material you were looking at, but if they were animations for different things, like convolution filters in image processing, they wouldn't be wrong. Have some humility.
No, they are wrong. They give the feeling that each feature, after being convolved, is a standalone one and is post-processed accordingly, which just isn't true. The features form a new 3D image, which then gets treated as such.
Convolution is defined for tensors of any finite dimension, even 1. While the claims made in this video are valid from the viewpoint of machine learning, I do agree that calling diagrams that describe a different use of a general structure 'wrong', because that's not how your particular field uses it, feels a bit sensationalist.
I really appreciate the effort, and it's a good one, but I would still go with the 2D one. This is way too jittery for me, with so many things happening at once, and I'm not a fan of the choice of colors.
I'm a little confused by the video, because I still don't understand WHY the 2D convolution pictures are wrong. What determines the depth of the first input? Same with the convolutional layer. Is this because we have RGBA layers, or?? What's the benefit of drawing it as 3D instead of 2D? What's the benefit to us of having a tensor instead of an array of convolutional outputs? I'm sure this sounds like thoughtless complaining, but I really am curious, and there must be something about convolution in AI that I'm missing in my own knowledge. Thanks for reading this.
Those are good questions.

1) Why are the 2D animations wrong? Short answer: they simplify away the feature dimension. Long answer: the 2D convolution that you'll find in neural network libraries (all the way down to NVIDIA's hardware interface) is "conceptually" performed on 3D data. These end up being batched, so the interfaces technically take 4D tensors, but that's just multiple convolution operations on 3D data performed in parallel. If your only understanding of convolution came from the 2D animations, you wouldn't understand how to create the 4D tensors (or the 3D piece of data for a particular sample in the batch). In fact, you wouldn't know why the operation took 4D tensors at all. If you'd like more information on the feature dimension, my first video on the fundamental algorithm (th-cam.com/video/eMXuk97NeSI/w-d-xo.html) should provide enough information to understand what the feature dimension is and why it's essential.

2) What determines the depth of the first input? This depends on what data you have. A color image in RGB format would have a depth of 3: red, green, and blue. A color image with transparency in RGBA format would have a depth of 4: red, green, blue, and alpha. A grayscale image with a single brightness value for each pixel would have a depth of 1: brightness.

3) What determines the depth of the output? Check out my video on filter count: th-cam.com/video/YSNLMNnlNw8/w-d-xo.html

4) What's the benefit of drawing it as 3D instead of 2D? 2D convolution conceptually operates on 3D data (2 spatial dimensions and one feature dimension), so drawing it in 3D shows everything and doesn't simplify anything away from the viewer.

5) What's the benefit to us of having a tensor instead of an array of convolutional outputs? Could you clarify this question? I'm not sure I understand it. Are you asking why we pack all the output values into a single 3D tensor instead of multiple 2D tensors (maybe one for each feature)?
It's rare to want the features separated out like that, so it's convenient to have everything together in one tensor. It also has performance benefits from memory locality.
@@animatedai I still don't think the other animations are wrong though. The mathematical concept is the same for an input feature depth of 1, and for higher input depths you are just performing multiple mathematical convolutions at once. I don't think that makes convolution a _fundamentally_ 3d concept, just because it's computationally opportune to package multiple convolutions as one operation.
Wait a friggin' minute... you're telling me that the filter or kernel is 3D? I always thought it was a 2D 3x3 filter that goes through each "layer" of the input and recreates a 3D output tensor. Are you sure it's a 3D filter? Where is this stated?
Haha, I'm sure :) You can check the documentation for your favorite neural network library to verify. The conv2d operation will actually take a 4D tensor for the filters. Each filter is 3D and you pack all the filters together in one tensor to get a 4D tensor. To convince yourself that it only makes sense for the filters to be 3D, check out the sequel video: th-cam.com/video/XdTn5md3qTM/w-d-xo.html Sources: www.tensorflow.org/api_docs/python/tf/nn/conv2d pytorch.org/docs/stable/generated/torch.nn.functional.conv2d.html
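You can verify the 4D weight tensor directly (a PyTorch sketch; the layer sizes here are arbitrary):

```python
import torch.nn as nn

# 16 filters, each a 3D tensor of shape (8, 3, 3) matching the 8 input channels
conv = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3)
print(tuple(conv.weight.shape))   # (16, 8, 3, 3)
```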
Discovered your channel just now. If only I had these resources at my disposal when learning about these topics myself. Keep it up, you will save many careers!
Instead of spending 95% of the video ranting about how other animations are bad, I would have appreciated it more if you had spent that time explaining how this animation works. I don't think I learned anything from this video. How do you go from an input RGB image of size W * H * 3 to some cube of size 5 * 5 * 5 (+padding)? You lost me at step 1.
I don't really care about the animations; the problem is when they start describing convolutions as 2D operations and don't go into detail on the effect of having multiple input and output channels. I wish I had found this video sooner, but anyway, it's easy enough to derive the solutions yourself from 200 Google search results (Google really sucks nowadays). It's actually a good mental exercise to imagine the 3D/4D filter sliding across a batch of images... But good luck finding the correct padding for strided convolutions during backpropagation of both Conv and TransConv layers... I had to derive everything by hand, because the internet has incorrect and, even worse, conflicting formulas for that... 😂
I like it, really, love it! But... I don't see what's wrong with the other illustrations, and peculiarly I think yours just iterates what they already clearly illustrate. I was even expecting CNN representations in XYZ visuals. Am I missing some points here? Honest question; I would appreciate any enlightenment! (btw, thank you for sharing your own version of this splendid animation with the world!) PS: If you're up for the challenge, do spiking NNs, and I'll buy you a beer in Bali!
Premise 1: All convolution animations are wrong
Premise 2: This is a convolution animation
Conclusion: this is wrong
oh shit
0:02, "all convolution animations you've seen _up to this point_ are wrong"
Maybe it was a proof by contradiction
No?
Someone just read discrete algebra. (Kudos!)
Awesome. Finally a good representation of these computations. Thanks for your hard work!!!
Amazing. You have cleared all my doubts in a single shot.
Thanks for that. It was really confusing before your animation came up!
I finally understood why convolution makes more channels. Thank you so so much!
Thank you so much for this! worth mentioning that the animation has a stride of 2
Love it. I always thought there were no accurate visualizations on the internet too. Good job.
best generalization ever, covers all the corner cases
This is what I expected for a long time. This explains everything clearly. Thanks for posting this.
Excellent visualization! I will definitely show these visualizations to my students in Machine Learning course. They will love it.
Lol, I literally learned this the hard way about 2 months ago, when the shape for my 2D convolution required 3 parameters, and it made me super confused :,)
Best video about neural convolution and filters!? YES!!!
Thank you so much!
"a 2D convolution actually takes in a 3D tensor as input and has a 3D convolution as output", well, it depends right? If you have a single channel/grayscale image then the input is in fact a 2D tensor, and each feature outputs a 2D tensor that is joined with all others in the feature map. So if you have a grayscale image with a single feature, the animations would in fact be correct.
I think the animations are perfectly fine, as they simplify a concept to its most basic form for easy understanding. But it is true that after you understand the basic concept, a 3D-to-3D representation is also nice for understanding more common and complex examples.
Disclaimer that I could be wrong as I am by no means an expert, but this is my take from my current understanding of convolutions :)
I'm not aware of a library where it depends, i.e., where the depth dimension is optional. PyTorch's Conv2d will accept a 3D tensor or a 4D tensor (batched 3D tensors). The functional interface only accepts 4D. Keras's Conv2D layer will only accept a 4D tensor. TensorFlow's conv2d operation will accept anything with at least 4 dimensions (the last three are treated as the height, width, and channels, and all the others before that are treated as batch dimensions). NVIDIA's cuDNN implementation of 2D convolution takes a 4D tensor.
And in all of these cases the weights will be in 4D which can be thought of as a 3D weight for each filter corresponding to the size of the 3D patch that the filters operate on in the input.
So as far as the industry standard, there just isn't a 2D convolution where you don't have a depth dimension in the input. Your grayscale case will only have a depth of 1 but will in fact be a 3D tensor.
If you're able to find a mainstream library where the depth dimension in the input is optional, let me know.
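To make those shapes concrete, here's a minimal pure-NumPy sketch of the operation (a naive reference implementation for illustration only, not how any real library computes it):

```python
import numpy as np

def conv2d(x, w):
    """Naive 2D convolution (really cross-correlation, as in NN libraries).
    x: input feature map, shape (c_in, h, w)
    w: filters, shape (c_out, c_in, kh, kw) -- each filter is itself 3D
    Returns an output feature map of shape (c_out, h - kh + 1, w - kw + 1)."""
    c_out, c_in, kh, kw = w.shape
    assert x.shape[0] == c_in, "filter depth must match input depth"
    h_out = x.shape[1] - kh + 1
    w_out = x.shape[2] - kw + 1
    y = np.zeros((c_out, h_out, w_out))
    for f in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                # Each output value sums over ALL input channels at once.
                y[f, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * w[f])
    return y

# Even a "grayscale" image carries an explicit depth dimension (of 1).
gray = np.random.rand(1, 8, 8)
filters = np.random.rand(16, 1, 3, 3)  # 16 filters, each 1x3x3
print(conv2d(gray, filters).shape)     # (16, 6, 6)
```

Note how the weights are 4D and the input is 3D even in the grayscale case, matching the library interfaces described above.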
@@animatedai Still, the concept doesn't need the 3D implementation, as the different features are worked on independently anyway. I definitely think it's a stretch, and it comes off really condescending to call all other animations wrong.
This video feels like an iPhone moment - a video I didn't know I needed until I saw it. Thanks a lot!
With those shapes of input and filter, is there even any reason to have them 3D over 2D? I get the output as you layer filters, but if the input and filter are just the same thing all the way through, representing them as single slices isn't wrong.
I was looking for something like this to dispel my doubts and it worked! thanks :) (I think you are right, common animations are super misleading)
Finally. You are the best. When I was learning this, I was always looking at all those original animations and I was always so confused...
Why does the input shape have so many layers in this animation? Wouldn't it have a shape equal to the image shape, with 3 layers, one each for R, G, and B?
I'm glad you asked. In general, convolution takes a feature map as input, which can have any number of features (depth). An image is a special kind of feature map with 3 features: red, green, and blue. However, this typically only applies for the first layer of convolution in a neural network and the other convolutional layers will have more features for input, e.g., 32, 64, 128, ... 1024, 2048. So to better represent the general case and to encourage viewers to consider more than just the special case of an image, I chose to use 8 for the animations. Although 3 would also be perfectly valid.
What actually is the 3rd dimension in this context for the source giant cube? Is that multiple colors? A batch of multiple images?
This is some amazing content.
Thank you, buddy!
Super helpful! Any plans to make this open source, or to make interactive cases where we can change the stride and see the variation?
Thank you! On my GitHub page (animatedai.github.io), you can see a few different variations, and I'm also working on an interactive webgl app where you can pick the parameters.
Great video, but you didn't explain why the input is a 3D tensor. If we are convolving a 2D image, where does the 3D tensor come from?
The short answer is that both the input and output are feature maps. Check out this video explaining it: th-cam.com/video/eMXuk97NeSI/w-d-xo.html
@@animatedai A better way to ask my question is this: how is the first feature map created in a convolutional network? Surely the *first* time the 2D image is convolved, we have a 2D tensor and 2D filters, just like the typical animation, right? I get that the output of this convolution will be a 3D feature map, and thus all further convolutions will look like your 3D-to-3D animation.
A 2D image is represented as a 3D feature map with 3 features: red, green, and blue. So even the first convolution has a 3D input.
@@animatedai You may take a grayscale image as an input, because for many cases it is sufficient, and for learning it's a good simplification. I would not consider this a wrong animation; it just assumes you have a grayscale image as input. I get the point of your video, but the title is clickbait.
@animatedai How did you learn blender? Which were your sources?
Why does your input tensor have so many channels? Shouldn't the depth be only 3 (one for each color channel)?
Adding the bias term after the convolutions would make it a full representation of the process. Anyway, great visualization!
I feel silly for asking, but the different colored blocks (in the middle) correspond to convolutions over different channels of the original matrix right?
I have been struggling to mentally visualize convolutions, especially going from one dimension to others. I was reading the book Understanding Deep Learning by Simon Prince and I realized that what I thought it looked like was wrong (the 2D-to-2D animations from the beginning). I wish I had stumbled upon yours before having to imagine what was explained in the books XD (good book tho).
4:04 Why isn't the last column scanned?
I'm glad you noticed. I picked an even column number specifically to demonstrate that.
This happens because there's a stride of 2, an even number of columns (8 in this case, counting padding), and an odd-sized filter (3x3). So after it's taken 3 steps, there's only 1 pixel of width remaining, and it can't move 2 more spaces. Convolution handles this by simply ignoring the remaining data and moving to the next row. This is important to know because it could cause you to lose data (which could accumulate over many layers into significant chunks of your input). In this case, the last column was just padding anyway, so no real data is lost.
Note: the last row is scanned because, unlike the columns, we have an odd number of rows, 7 counting padding.
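The arithmetic behind this fits in one line; the numbers below match the animation's padded input (8 columns, 7 rows, 3x3 filter, stride 2):

```python
def conv_output_size(size, kernel, stride, padding):
    # Floor division silently drops any trailing columns/rows the
    # strided filter can't reach.
    return (size + 2 * padding - kernel) // stride + 1

# Width: 6 data columns + 2 padding columns = 8 total. Start positions
# 0, 2, 4 fit; a start at 6 would need columns 6-8, so the last column
# is never scanned.
print(conv_output_size(6, kernel=3, stride=2, padding=1))  # 3
# Height: 5 data rows + 2 padding rows = 7 total. Starts 0, 2, 4
# cover every row.
print(conv_output_size(5, kernel=3, stride=2, padding=1))  # 3
```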
This question had been confusing me for a long time. Thank you!
This conv2d animation you made is right, thanks a lot.
3:30 - the final animation
So the point he is trying to make is that an "image" is represented as a 3D object/array: height (pixels), width (pixels), and RGB components. But when we talk about 2D images, we usually mean a grayscale image, which doesn't require the RGB part to be mentioned explicitly, although it is still a 3D array.
So for a color image of 128x128 pixels, the tensor shape would be (128, 128, 3), and for a grayscale image it would be (128, 128, 1).
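In NumPy terms (array contents are placeholders; only the shapes matter here):

```python
import numpy as np

# A grayscale image often starts life as a plain 2D array...
gray_2d = np.random.rand(128, 128)
# ...but convolution libraries expect an explicit channel axis.
gray_3d = gray_2d[..., np.newaxis]   # shape (128, 128, 1)
color = np.random.rand(128, 128, 3)  # RGB image: (128, 128, 3)
print(gray_3d.shape, color.shape)
```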
Is a cube in the filter (or image) a pixel? Or is it a combination of channels?
Good question. Each cube is a single floating point value.
@@animatedai Thank you for answer, btw, why do you think the typical animations are wrong? Can't we just do a 2D convolution on each slice of an input image and then just stack the slices together to get the same feature map as with your animation?
That's a great question. So good, in fact, that I'm planning to make a follow-up video explaining it. I think a lot of people are struggling with this idea, and your question's phrasing really helped me understand where the misconception is.
The short answer is that what you're proposing wouldn't be equivalent, because in neural network convolution, each filter sees all the features/channels of the input, not just a slice of the input. That's why the filters themselves are 3D.
More concretely, let's say a bumblebee-detecting neural network wanted to look for the color yellow, so it needed a filter that detected yellow. That filter couldn't just look at the red channel or just the green channel or just the blue channel. It needs to look at all of them together to distinguish yellow from red or from green or from white or any other color. So we can't slice the input up into red/green/blue and then operate on them separately.
Does that make sense?
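To make the bumblebee example concrete, here's a toy sketch (the 1x1 "yellow detector" weights are made up purely for illustration):

```python
import numpy as np

# Hypothetical 1x1 "yellow detector": reward red and green, punish blue.
# No single-channel slice could tell yellow apart from white -- the
# filter has to weigh all three channels together.
yellow_filter = np.array([1.0, 1.0, -2.0])  # weights for (R, G, B)

pixels = {
    "yellow": np.array([1.0, 1.0, 0.0]),
    "white":  np.array([1.0, 1.0, 1.0]),
    "red":    np.array([1.0, 0.0, 0.0]),
}
for name, rgb in pixels.items():
    print(name, yellow_filter @ rgb)  # yellow: 2.0, white: 0.0, red: 1.0
```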
@@animatedai Yeah it makes total sense once you realize (which is quite obvious when you think about it) that the input channels pretty much always have some sort of correlation between them. In your example, we need all 3 channels to see how much red, green and blue we have to get this certain type of yellow.
I talked with my professor about this topic (and referenced your video). He believed that the only case where this 2D convolution and stacking of slices is better is when you don't have much data. Also, the training will be faster.
I think a lot of people would appreciate a video exploring the difference between these two ideas in more detail. At least I would. Thank you already for your animations and answers!
I meant to add that the splitting and stacking isn't a crazy idea, as long as you understand the limitation (compared to standard convolution) and compensate for it. In fact, it's the basis of the depthwise-separable convolution, which can be much more efficient than standard convolution. I've got a video on it that you might like: th-cam.com/video/vVaRhZXovbw/w-d-xo.html
Such hard work! Thank you so much!
I don't understand why filters are 3D. 8 deep = 8 channels? In the end you get as many channels as there are filters?
Good questions! First: for a good explanation of "why" the filters are 3D, check out this video: th-cam.com/video/XdTn5md3qTM/w-d-xo.html. And second: the filter depth matches the input depth, and the output depth matches the filter count.
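Those two rules, spelled out in plain Python (the specific numbers are arbitrary examples):

```python
# Shapes in a single conv2d layer, stated as data rather than prose.
input_depth = 8     # features in the incoming feature map
filter_count = 16   # how many filters the layer has
filter_size = (3, 3)

# Rule 1: each filter's depth matches the input depth.
filter_shape = (input_depth,) + filter_size     # (8, 3, 3)
# Rule 2: the output depth equals the number of filters.
output_depth = filter_count                     # 16
# Packing all filters together gives the familiar 4D weight tensor.
weights_shape = (filter_count,) + filter_shape  # (16, 8, 3, 3)
print(filter_shape, output_depth, weights_shape)
```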
Great work! Such an animation for grouped convolution would be nice too.
Convolution is not only used in neural networks.
Thanks for the great video! I'm currently trying to learn these concepts, so I might be suggesting something incorrect, but here it goes anyway: an example with an input of depth 3 might make this easier to understand by thinking of RGB image data. It also seems to be a (special) case in your animations that the input depth is the same as the number of filters, leading to the same depth in both input and output. That is not always the case, if I've understood this right.
I intentionally avoided using an input depth of 3, because that's a special case. Most convolutional layers in a CNN will have an input depth much higher than 3. It's better to think of the convolutional layer as "feature map in, feature map out" rather than "image in, feature map out".
By that same logic, I should have made the input depth different than the output depth, because that's also a special case like you said. I had thought about this at one point, but sadly it didn't make it into the final animation. I'll probably fix that on GitHub in the future.
Thanks so much for this. I also really struggled to find proper animations. I would have liked to see how this looks in an actual neural network, i.e., how the filter can be visualized as the weights, or how the filter parameters are trained. I would greatly appreciate a video on GANs and LSTMs. The LSTM diagrams out there are terrible; I really struggled to visualize how they connect to the overall network.
Thank you for putting this out!
Thank you! Great animation. However, I do have a technical nitpick: your animation shows an operation known as cross-correlation, which is related to convolution but mirrored. "Convolutional" neural networks use cross-correlations in the feed-forward phase and convolutions in the backpropagation phase.
So the output is a feature map? I don't get why the feature map on the right is stacked like that. Can anyone explain?
Yep, the output is a feature map. Each filter produces one feature of the output, which get stacked together like you see. I've got a video covering that concept here: th-cam.com/video/eMXuk97NeSI/w-d-xo.html.
@@animatedai Thank you for the explanations. I could barely understand with other visualizations, but you really do a good job.
I find these "new and correct" animations confusing, I have no idea what's happening there. I assume this is just "the correct way to display convolution" for AI models? As an old school person who used convolutions mainly for 2D image processing (blur/edge detection) I don't see anything wrong about the old animations, that's exactly what we used to do there.
As someone who has never built a convolutional neural network, but as someone who has done lots of convolution in image processing algorithms, the convolution they are showing is normal 2d convolution. 3x3 pixel values in -> convolution kernel operation -> single pixel value out. For showing what is actually going on during an operation with a convolution kernel those first animations are perfect. As someone that's built at least a couple of neural networks with linear transformations, and knows exactly how convolution kernels work, I'd hoped to be able to intuit what's going on in a convolutional neural network from your animation, but your animation is super confusing without any context. What is your input, a 3D tensor - I thought the input was a 2D image? What is the output, I thought the output was just "features" extracted from the image?
What you've created is so abstract that it literally made me more confused than I was to start with. In my opinion, the best diagram is the one shown between 1:11 and 1:33. Except for the concept of 'pooling', it's 100% clear what the actual mechanics within the neural network are, and it explains the process of convolution. With no prior knowledge of convolutional neural network mechanics, I understand it roughly, save for the concept of 'pooling'. If you think that information is too much for a student to be able to put together, you're doing too much of the thinking for them.
Maybe with knowledge of convolutional neural networks the animation you've made would make sense, or with context, but for an introductory course on convolutional neural networks, it is literally so abstract as to be worse than useless. It's actively confusing.
The input is 3D because it has multiple channels, so these are like 3D convolutions, but they are commonly called 2D because the stride and padding are 2D. Let's say you have an image with 3 channels, so it's 100x100x3. If your layer has 16 output channels, it will have 16 convolution filters of size height*width*3. So you will end up with a 100x100x16 output.
@@carnap355 I have done a little bit of work with convolutional neural networks now. I apologize if my response came off as rude or blunt; I tend to not use as many niceties on the internet as I would in a usual conversation. I have to say I still find this confusing. Though I get what you're saying, I don't understand what that big 3x3xN cuboid block in the middle is with regard to each of the convolution kernels, the image, and the image output (at best guess, the 3D block is all the stacked results of the previous convolution operations, being subjected to a new convolution operation?). What do you think of the "CNN Explainer" website (you can google it)? That's how I understand convolutional neural networks as of now. I also now understand the max-pooling layer to be akin to what I would call a "maximum filter" operation in image processing. So I *think* I have an understanding of what's going on in a convolutional network, but feel free to correct me.
Amazing work 😍
Unfortunately, only half right. What about when we need to understand a 4D or 5D convolution? Humans understand 2D most intuitively, and I think that is why those 2D-based animations were made. (And 2D convolution can be extended to larger dimensions.)
And deep-learning convolution is unfortunately not mathematically organized. It is derived from the "filter" in image processing, and "filter" in turn derives from cross-correlation, from long before.
Your animations have multiple kernels; that just depicts an argument called "channels" that is only used by neural-network frameworks.
Just abstract/generalize it to higher dimensions then. This is the same as saying "why do we visualize vectors in 2D coordinate systems, even though N-dimensional vectors are well-defined, or even infinite-dimensional vectors (Hilbert spaces)?". Visualizations are meant to capture an intuitive/simplified example; they are not meant for generality. The generality comes from formal mathematical reasoning, which no visualization can capture.
Convolution is well defined mathematically, and has been way before the invention of image processing.
Your videos are very cool! I wonder if you thought about how to present Conv3d, it is a challenge when considering more than one channel
Is there a way to access your course online? I'm really interested in this subject!
The course is a work-in-progress. You can see the videos that are completed so far in this playlist: th-cam.com/video/eMXuk97NeSI/w-d-xo.html
They are not wrong. They are a simplification that helps to understand the concept. As any simplification they are incomplete. But not wrong. It's sad that you use clickbait titles.
They are not wrong. They are just displaying a different case than what you are interested in. Maybe they are misplaced in the material you were looking at, but if they were animations for different things, like convolution filters in image processing, they wouldn't be wrong. Have some humility.
Also convolutions as an idea are way older and more general than just image processing or neural networks. He comes off as ignorant of this.
No, they are wrong. They give the impression that each feature, after being convolved, stands alone and is post-processed accordingly, which just isn't true. The features form a new 3D tensor, which then gets treated as such.
Convolution is defined for any finite dimension of tensor, even 1 dimensional.
While the claims made in this video are valid when looking from the domain of machine learning, I do agree that calling diagrams that describe a different use of a general structure 'wrong', because that's not how your particular field uses it, feels a bit sensationalist.
3D, what tool are you using? Blender?
I'm using Blender with a lot of Geometry Nodes.
@@animatedai Thank you! Is the source not open-sourced?
Nice animation, are you planning on making animations for Transformers as well?
I'm just curious if this visualisation helps someone who doesn't know what convolution is.
Not wrong bro. They are just incomplete.
I really appreciate the effort, and it is a good one, but I would still go with the 2D version; this one is way too jittery for me, with so many things happening at once, and the choice of colors doesn't help.
Oh cool, didn't know Blender had all that.
I’m a little confused with the video, because I still don’t understand WHY 2d convolution pictures are wrong. What determines the depth of the first input? Same with the convolutional layer. Is this because we have RGBA layers, or?? What’s the benefit of drawing it as 3D instead of 2d? What’s the benefit to us to have a tensor instead of array of convolutional outputs? I’m sure this sounds like thoughtless complaining but I really am curious, and there must be something about convolution in AI that I’m missing in my own knowledge. Thanks for reading this.
Those are good questions.
1) Why are the 2D animations wrong? Short answer: they simplify away the feature dimension. Long answer: The 2D convolution that you'll find in neural network libraries (all the way down to NVIDIA's hardware interface) is "conceptually" performed on 3D data. These end up being batched so the interfaces technically take 4D tensors, but that's just multiple convolution operations on 3D data performed in parallel. If your only understanding of convolution came from the 2D animations, you wouldn't understand how to create the 4D tensors (or the 3D piece of data for a particular sample in the batch). In fact, you wouldn't know why the operation took 4D tensors at all. If you'd like more information on the feature dimension, my first video on the fundamental algorithm (th-cam.com/video/eMXuk97NeSI/w-d-xo.html) should provide enough information to understand what the feature dimension is and why it's essential.
2) What determines the depth of the first input? This depends on what data you have. A color image in RGB format would be a depth of 3: red, green, and blue. A color image with transparency in RGBA format would have a depth of 4: red, green, blue, and alpha. A grayscale image with a single brightness value for each pixel would have a depth of 1: brightness.
3) What determines the depth of the output? Check out my video on filter count: th-cam.com/video/YSNLMNnlNw8/w-d-xo.html
4) What's the benefit of drawing it as 3D instead of 2D? 2D convolution conceptually operates on 3D data (2 spatial dimensions and one feature dimension), so drawing it in 3D shows everything and doesn't simplify anything away from the viewer.
5) What’s the benefit to us to have a tensor instead of array of convolutional outputs? Could you clarify this question? I'm not sure I understand it. Are you asking why we pack all the output values into a single 3D tensor instead of multiple 2D tensors (maybe one for each feature)? It's rare to want the features separated out like that, so it's convenient to have everything together in one tensor. It also has performance benefits from memory locality.
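A tiny illustration of that last point (the shapes here are arbitrary):

```python
import numpy as np

# Each filter yields one 2D feature; libraries stack them into a single
# 3D tensor rather than handing back a list of 2D arrays.
features = [np.random.rand(32, 32) for _ in range(8)]
feature_map = np.stack(features, axis=0)  # shape (8, 32, 32)
print(feature_map.shape)
# One contiguous tensor is convenient for the next layer and keeps the
# values memory-local.
```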
@@animatedai I still don't think the other animations are wrong though. The mathematical concept is the same for an input feature depth of 1, and for higher input depths you are just performing multiple mathematical convolutions at once. I don't think that makes convolution a _fundamentally_ 3d concept, just because it's computationally opportune to package multiple convolutions as one operation.
Great work!
Wait a friggin' minute... you're telling me that the filter or kernel is 3D? I always thought it was a 2D 3x3 filter that goes through each "layer" of the input and recreates a 3D output tensor. Are you sure it's a 3D filter? Where is this stated?
Haha, I'm sure :) You can check the documentation for your favorite neural network library to verify. The conv2d operation will actually take a 4D tensor for the filters. Each filter is 3D and you pack all the filters together in one tensor to get a 4D tensor. To convince yourself that it only makes sense for the filters to be 3D, check out the sequel video: th-cam.com/video/XdTn5md3qTM/w-d-xo.html
Sources:
www.tensorflow.org/api_docs/python/tf/nn/conv2d
pytorch.org/docs/stable/generated/torch.nn.functional.conv2d.html
@@animatedai you're right... my mind is actually blown, it's all been a lie, ty for the response
Discovered your channel just now. If only I had these resources at my disposal when learning about these topics myself.
Keep it up, you will save many careers!
Instead of spending 95% of the video ranting about how other animations are bad, I would have appreciated it more if you had spent that time explaining how this animation works. I don't think I learned anything from this video. How do you go from an input RGB image of size W * H * 3 to some cube of size 5 * 5 * 5 (+padding)? You lost me at step 1.
Check out this 100% rant-free playlist to learn more! th-cam.com/video/eMXuk97NeSI/w-d-xo.html
They are not wrong. They just show a special case. They use the special case because the focus is on things like stride, dilation, padding etc.
It's good to make the 3D tensor animations, but don't call the existing ones wrong. I think I would have still found it easier to understand the existing ones first and then move on to the 3D animations.
Good job
Any plans to add your animations to Wikimedia commons? :)
The examples are just a concept. I don't agree with this sensational title.
The stride value was 2 pixels.
I liked the idea, but the title is too big for this kind of correction.
Which is correct?????
Finally a good animation video !
I don't really care about the animations, the problem is when they start describing convolutions as 2D operations and don't go into detail on the effect of having multiple input and output channels.
I wish I had found this video sooner, but anyway, it's easy enough to derive the solutions yourself from 200 Google search results (Google really sucks nowadays).
It's actually a good mental exercise to imagine the 3D/4D filter sliding across a batch of images... But good luck finding the correct padding for strided convolutions during backpropagation of both Conv and TransConv layers. I had to derive everything by hand, because the internet has incorrect and, even worse, conflicting formulas for that... 😂
brilliant !
I like it, really, love it! But... I don't see what's wrong with the other illustrations, and peculiarly, I think yours just iterates what they already clearly illustrate. I was even expecting CNN representations in XYZ visuals. Am I missing some points here? Honest question; I would appreciate any enlightenment! (btw, thank you for sharing your own version of this splendid animation with the world!)
PS: If you're up for the challenge, do Spiking NN, I'll buy you a beer in Bali!
Cool! Even better, adding names to the objects (kernel, etc.) would be helpful to new people.
Beautiful 💙
amazing!
If all are wrong, then why should i watch this one?
amazing
There is no convolution
thx
Amazing!
Oh.
Geonodes were easier?
Bravo !
sick
this is great
Nice!