This channel is TH-cam's undiscovered gold mine, please keep up the amazing content!!
for real it's wild how much effort he puts into these.
hi there bruhh
It's funny you mention that the number of kernels is the least exciting part; my thesis was an attempt at finding a systematic way to reduce the number of kernels by correlating them and discarding kernels that “extract roughly the same features”. Great video!
3:24 "only 2 categories...so 512 features is enough" - this statement sounds like it comes from familiarity with the problem. Is there something more to it? Did you see that number of features used in past papers? Was it from your own experimentation against 256 or 1024 features? Is there some math that arrives at this? I'd like to understand this better, so any additional color you have on this would be helpful!
You typically want more features than categories. So for something like ImageNet with 1000 categories, 512 wouldn't be enough. You'd want 2048 or higher. But this case only has 2 categories so 512 easily meets that requirement.
And the exact value of 512 came from NVIDIA's StyleGAN papers, which is what I based that architecture on. I don't remember them giving a reason for that value, but it gave them good results and a higher value wouldn't fit into memory during training on the Google Colab hardware.
It's more of an art than a science so let me know if that doesn't completely answer your question. I'm happy to answer follow-ups.
@@animatedai Thank you, that helps
Eagerly waiting for your next videos..
2:13 how does it stay at the same size? Padding the edges of the original image?
Thank you for the video. I really stressed out about this matter; now I am calmer knowing it's a conventional problem and that solving it with heuristics is the way.
0:16 If filters are stored in a 4-dimensional tensor and one of them represents the number of filters, then what does the depth represent?
it represents the depth of the input tensor
2:08 Don't think we can go 512x512x3 to 512x512xN if filterSize>1. If filterSize=3 we'd be going to 510x510xN, right? Thought experiment: 5 items, slidingWindowLen 3. 3 slide-positions (123 234 345).
hmm, I suppose a feature can extend beyond the image by a pixel, might even collect useful information that informs it that it's dealing with an edge. Solving a jigsaw puzzle you usually collect the edge-pieces & try to work with them first.
This question is actually the perfect lead-in to my video on padding: th-cam.com/video/ph4LrdntONo/w-d-xo.htmlfeature=shared
That's actually the video that directly follows this one in my playlist on convolution. You can see the full playlist here: th-cam.com/play/PLZDCDMGmelH-pHt-Ij0nImVrOmj8DYKbB.html&feature=shared
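To make the sliding-window arithmetic in this thread concrete, here is a minimal sketch in plain Python. The function name `conv_output_size` and its parameters are illustrative assumptions, not taken from the video; it uses the standard output-size relation for a convolution along one axis, with zero padding added equally on both sides.

```python
def conv_output_size(input_size: int, kernel_size: int,
                     padding: int = 0, stride: int = 1) -> int:
    """Number of positions a sliding window can take along one axis.

    `padding` is the number of extra cells added on EACH side (assumed symmetric).
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1


# The thought experiment: 5 items, window length 3 -> positions 123, 234, 345
print(conv_output_size(5, 3))               # 3

# 512-wide image, 3-wide filter, no padding: the output shrinks by 2
print(conv_output_size(512, 3))             # 510

# One pixel of padding on each side keeps the size at 512
print(conv_output_size(512, 3, padding=1))  # 512
```

With a 3-wide filter, one pixel of padding per side is exactly what lets the window hang over the border by a pixel, as guessed above.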
hey! do you have any sources on your statements at 2:21 (about doubling channels when downsampling) and 2:50 (downsampling units should be followed by dimension-steady units)? currently writing a paper and trying to argue the same point, but i can't find any real research on it :)
It's just a common pattern that I've seen. There's no shortage of examples if you want to cite them, from the original ResNet all the way up to modern diffusion architectures.
Best video for the CNN.
Is feature and filter count the same as PyTorch channel size?
awesome videos!!!
Please create a tutorial on conv3d as well, and on which would be better for video processing (conv2d or conv3d).
Thanks for another great video!
When it comes to one-dimensional signals, such as time domain signals collected from speed sensors, what is the difference between visualization of 1D CNN and 2D one? Does it just change the height of the cuboid into 1?
And, what algorithms do you recommend for deep learning of one-dimensional time domain signals?
I would really appreciate your reply, because as a Chinese student doing an undergraduate graduation project, I can't find any visualization of 1D CNN on the Chinese Internet.
Which approach did you find more effective for your problem? Dense layers or Conv1D layers? Or did you go another way, e.g., LSTMs?
This is wow!
Are these animations that you also use in the course?
I don't use these in my current course, but I'm planning to incorporate them into a new course that I'm working on now.
How do you halve the resolution? The only way I can imagine is to have a kernel that has half the size of the input data plus one. Is that correct to do or is something else happening?
This will be covered in an upcoming video, but to give away the answer: you can pad with "SAME" in TensorFlow (or manually pad the equivalent of it in PyTorch) and use a stride of 2.
@@animatedai Oh I see, and that is still backpropagation compatible? I guess it would be but I have no clue how to do that little step backwards, I assume just act like the data was always smaller for that step
Yes, that still works fine with backpropagation. Are you working with a library like TensorFlow or PyTorch? If so, they'll handle the backpropagation for you with their automatic differentiation.
If you're using a kernel size of 1x1, it would work to act like the data was always smaller (specifically you would treat it like you dropped every other column/row of the data and then did a 1x1 convolution with a stride of 1). But for a larger kernel size like 3x3, all of the input data will be used so that won't work.
@@animatedai Ah I see that makes sense. I am actually building it from scratch in JavaScript. It is pretty slow but I am doing it to get a better understanding of it and I also find it fun.
Also thank you for the responses, that is really cool. I think what you are doing with these videos is really sleek and useful. I personally would like it if you went into more depth about the actual math and numbers but I completely understand that your goal here is not that and to give a more intuitive explanation for people. Keep it up!
Good luck on your JavaScript CNN project! And thank you; I appreciate your support. The math is something I plan to cover; I even have the rough draft of the script written for a future video that goes over the math in detail. I just wanted to focus on teaching the intuition separately first so that it doesn't get lost in the calculation details.
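The stride-2 discussion above can be sketched in one dimension with plain Python. The helper `conv1d` is a hypothetical name written for this note, and it pads symmetrically with zeros, which is an assumption: a framework's "SAME" mode may place the padding differently while producing the same output length.

```python
def conv1d(signal, kernel, stride=1, padding=0):
    """Naive 1D convolution (cross-correlation) with symmetric zero padding."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    k = len(kernel)
    return [
        sum(padded[start + i] * kernel[i] for i in range(k))
        for start in range(0, len(padded) - k + 1, stride)
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# 3-wide kernel, padding 1, stride 2: 8 inputs become 4 outputs,
# and every input sample still contributes to some output.
halved = conv1d(signal, [1.0, 1.0, 1.0], stride=2, padding=1)
print(halved)   # [3.0, 9.0, 15.0, 21.0]

# 1-wide kernel: stride 2 gives the same result as dropping every
# other sample first and then convolving with stride 1.
strided = conv1d(signal, [2.0], stride=2)
dropped = conv1d(signal[::2], [2.0], stride=1)
print(strided, dropped)   # [2.0, 6.0, 10.0, 14.0] twice
```

The second comparison only holds for a 1-wide kernel; with the 3-wide kernel the skipped samples are still read by neighbouring windows, so a from-scratch backward pass has to route gradients to all of them.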
So, if the filter value is 64, does that mean you stack the 512x512 photo 64 times? Like you stack that face 64 times? Or are there different pixel values for every filter?
Take an example like a 3x3 matrix with 5 filters:
1 0 1
0 1 0
1 0 1
So is this value stacked 5 times because the filter value is 5?
Have you seen my video on the fundamental algorithm? th-cam.com/video/eMXuk97NeSI/w-d-xo.html
You can think of each filter as a different pattern that the algorithm is searching for in the image (or input feature map). Each output value (each cube) represents how closely that area of the input matched the pattern. So you get a 2D output for each filter. And those outputs are stacked depth-wise to form the 3D output feature map. If you have 64 filters, you'll stack 64 of these 2D outputs together.
@@animatedai i see!! thanks for giving me the previous video's link
sorry for my silly question 😅
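As a rough illustration of the answer above (one 2D output per filter, stacked depth-wise), here is a deliberately slow sketch in plain Python. The function `conv2d`, the 4x4 test image, and the five constant-valued filters are all made up for this example; padding, stride, and bias are left out.

```python
def conv2d(image, filters):
    """Unpadded, stride-1 convolution over nested lists.

    image:   height x width x channels
    filters: list of kernels, each k x k x channels
    returns: out_height x out_width x len(filters)
    """
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0])
    output = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            # One value per filter: how strongly this patch matches that pattern.
            features = [
                sum(
                    image[y + dy][x + dx][c] * kernel[dy][dx][c]
                    for dy in range(k)
                    for dx in range(k)
                    for c in range(channels)
                )
                for kernel in filters
            ]
            row.append(features)
        output.append(row)
    return output


# Hypothetical data: a 4x4 RGB image of ones and 5 filters of size 3x3x3,
# where filter number f has every weight equal to f.
image = [[[1.0, 1.0, 1.0] for _ in range(4)] for _ in range(4)]
filters = [[[[float(f)] * 3 for _ in range(3)] for _ in range(3)] for f in range(5)]

out = conv2d(image, filters)
print((len(out), len(out[0]), len(out[0][0])))  # (2, 2, 5)
print(out[0][0])  # [0.0, 27.0, 54.0, 81.0, 108.0]
```

The five values at each output position differ because each filter has its own weights; the input image is not copied five times.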
I have one question. What does number of features mean? For example the initial image is 512x512x3 (3 in this case are red-green-blue colors). But what happened in the next layers? What are these 64, 128 and more numbers of features? Why do we need so many instead of just 3? Thanks. Appreciate your videos!
You have 3 dimensions, 2 spatial and 1 feature dimension. The 2 spatial dimensions encode where the information is and the feature dimension encodes different aspects under which the information can be interpreted. In the beginning you have the 3 color channels but the next layer has a much larger feature dimension in which each index represents one particular aspect like "how much red-green color contrast there is between left and right at this position". These aspects become higher-level, like "is this a circle", so the feature dimension needs to increase to cover all useful interpretations that could be applicable at that point. This agrees well with the shrinking spatial dimensions because each pixel in a later layer represents a larger area of the original image for which these many higher-level interpretations would be necessary.
How did you go from 512x512x3 to 512x512x64 while still using a kernel size of 3x3? Wouldn't you have to use a kernel size of 1x1 and have 64 kernels? That is the only way I understand it so far, I'm hoping you explain it later on. Other than that this is super helpful thank you so much 🙏
You're correct that you need 64 kernels, but the size of the kernels doesn't matter. It's fine to have a kernel size of 3x3 and 64 kernels.
@@animatedai I see now, thanks for clarifying
@@animatedai did you use a padding of 1 to keep the output dimensions the same?
Why do your filter examples have a depth of 8?
Because the input has a depth of eight
@@jntb3000 Eight ... what? And why was 7 insufficient, and why is 9 too much?
I believe the depth of each kernel must match the depth of the input. In his example, the input has a depth of eight, hence the kernel has a depth of eight.
@@tazanteflight8670 I believe the depth of each kernel must match the depth of the input. In his example, the input has a depth of eight, hence the kernel has a depth of eight. If the input only has a depth of 3 (like RGB colors) then the kernel should have a depth of 3 also. I guess we could use a kernel depth of just one for all input sizes also.
I'm having similar questions, I thought there was only 1 filter?
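For the depth questions in this thread, a small sketch of how the filter tensor's four dimensions relate to the input depth and the filter count may help. The function `conv_weight_shape` is a hypothetical helper, the (filters, depth, height, width) ordering is chosen only for illustration since libraries order these axes differently, the filter count of 8 in the second call is an assumed value, and bias terms are ignored.

```python
from math import prod


def conv_weight_shape(kernel_size, input_depth, num_filters):
    """Shape of the 4D weight tensor for a standard convolution layer.

    Each filter spans the full input depth, so `input_depth` is fixed by the
    input; only `kernel_size` and `num_filters` are free choices.
    """
    return (num_filters, input_depth, kernel_size, kernel_size)


# RGB input (depth 3), 64 filters of 3x3: output depth becomes 64
rgb_layer = conv_weight_shape(3, input_depth=3, num_filters=64)
print(rgb_layer, prod(rgb_layer))    # (64, 3, 3, 3) 1728

# Depth-8 input as in the animation, with an assumed 8 filters
deep_layer = conv_weight_shape(3, input_depth=8, num_filters=8)
print(deep_layer, prod(deep_layer))  # (8, 8, 3, 3) 576
```

Read this way, a depth of 8 is forced by an 8-deep input, and the number of filters is the separate dimension that sets the output depth.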
wow!
Imagine the ability to build your own TensorFlow neural network using such 3D visualization.
The 3rd dimension part seems a bit tedious. I believe 2D visualizations are more helpful in practice: just write the feature count as text at the bottom and go to the next.
And for building neural networks in 2D visualization I've recently found KNIME to be amazing, although you are abstracting entire layers to a single box lmao.