Andrew teaching me about memes, epic!
thanks again for the amazing lecture
Didn't know one could cite memes in science papers 😂 gonna do it in mine
I'm so excited about this channel that I'm actually a little paranoid that it will shut down before I can finish watching all of the videos
It's a very new topic, so there isn't much material on it yet, but as time passes, even if this video gets removed, more and better videos on the subject will probably appear.
Yep I am totally going to find a way to cite a meme in one of my papers!
Really, thanks Andrew. 🙏 Through your video, I finally understood what an Inception network is!! (including CNNs)
This is the kind of systematic and easy explanation that is hard to find in other Korean videos. I strongly recommend this video. 👍
Very good explanation. Need to watch it again.
Great lecture! Thank you.
I have a question.
2:18
Can I move the 1x1 conv layer to before the max-pooling layer to reduce the channels, like the ones before the 3x3 and 5x5 convs?
What’s the difference?
That's the same question I have. Did you get any clarity on that?
How do you know how small the bottleneck layer should be? And why would one want to shrink the number of channels as done at 8:46? What is the benefit of doing this?
Also, instead of using a 1x1 conv as a bottleneck layer and then using 5x5 or 3x3 filters to reduce the computational cost, why can't we just use 1x1 filters throughout to get the required output dimensions?
Please let me know if anyone finds the answer.
From what I've understood from this video, the idea is just to reduce the amount of computation needed.
Think of it as densely packing and combining all the features from the previous layer and then working with these "dense/tight" features - this is just computationally cheaper.
You can't just throw away the 3x3 and 5x5 filters because they are useful for finding "patterns" in images, like lines, curves, etc. A 1x1 convolution only looks at 1 pixel at a time - such a network would just be an MLP with extra steps.
This is just my intuition - feel free to add to or correct it.
3x3, 5x5 and similar filters are used to learn edge detection and many other details, whereas 1x1 filters are a really useful tool for changing the number of channels without changing the spatial dimensions.
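To make the saving concrete, here's a quick back-of-the-envelope count in Python. The 28x28x192 input and the 16/32 filter counts are the ones from the example at 8:46 in the video; this is just counting multiplications, nothing more.

```python
# Rough multiply counts for a 5x5 conv on a 28x28x192 input producing 28x28x32,
# with and without a 1x1 "bottleneck" down to 16 channels (shapes from the lecture).

H, W, C_in, C_out = 28, 28, 192, 32
C_mid = 16  # bottleneck channels

# Direct 5x5 conv: every output value needs 5*5*C_in multiplications.
direct = H * W * C_out * 5 * 5 * C_in

# Bottleneck: 1x1 conv down to C_mid, then 5x5 conv from C_mid up to C_out.
via_1x1 = H * W * C_mid * 1 * 1 * C_in + H * W * C_out * 5 * 5 * C_mid

print(f"direct 5x5:   {direct:,} multiplications")   # ~120 million
print(f"1x1 then 5x5: {via_1x1:,} multiplications")  # ~12.4 million
```

The bottleneck cuts the multiplication count by roughly a factor of ten, which is the whole point of the 1x1 layer; the 3x3/5x5 filters are still there to pick up the spatial patterns.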
I love you and your videos.
After the maxpool step, shouldn't the output be 26*26*192 (assuming 192 channels)? In the video it says 28*28*192. Why is that?
Or is it assumed that there is padding?
Thanks a lot for sharing this.
This movie Inception sounds great.
What is the purpose of using max pooling if we are not reducing the height and width? In previous lectures Andrew Ng said that max pooling reduces the dimensions, e.g. from 28*28*8 to 14*14*8. What is the purpose of applying it here? Is it only to keep the most important information by taking the max?
Well, the purpose is not actually to reduce the height and width; it's more about keeping the strongest features (max values) from the activations and leaving out the unnecessary ones. So that's why, I guess, they apply max pooling, to filter out some features, while applying padding at the same time so the dimensions aren't reduced. It's just an assumption, please let me know if I am wrong :)))
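If it helps, here's a minimal tf.keras sketch (not from the lecture; the exact layer arguments are my own choices for illustration) showing that a 3x3 max pool with stride 1 and 'same' padding keeps the 28x28 spatial size, and that the following 1x1 conv only changes the channel count:

```python
import tensorflow as tf

x = tf.keras.Input(shape=(28, 28, 192))

# Stride-1 max pooling with 'same' padding: spatial size stays 28x28,
# channels stay 192; only the values are replaced by local maxima.
p = tf.keras.layers.MaxPooling2D(pool_size=3, strides=1, padding='same')(x)

# A 1x1 conv afterwards shrinks the channels (192 -> 32) without touching 28x28.
p = tf.keras.layers.Conv2D(32, kernel_size=1, activation='relu')(p)

print(p.shape)  # (None, 28, 28, 32)
```

So the pooling branch keeps its spatial size on purpose, so it can later be concatenated with the conv branches.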
How small does the number of filters have to be in the 1x1 convolution? Why 192 -> 16 -> 32 instead of 192 -> 1 -> 32?
The film Inception is about planting an external thought (like a virus) inside another person's mind, and from that thought they start to see the world in a new way.
I haven't understood it 😕
Why are we using different filters?
How do we manage to train the intermediate forks (to make predictions from hidden layers)? Do we stop the backpropagation at the fork?
Besides, how do we define a labeled sample for the fork? 1. Is it the same training sample (x, y)? In that case, what would be the impact of running two backpropagation passes over the same layers, i.e. all the layers before the fork?
2. Or is it (x', y), where x' is the activation of the layer just before the fork? Here I see a problem with the fork itself: for the very first training samples, the activation just before the fork depends heavily on the weights of all the previous layers, which are not well trained yet (still essentially a random guess). That means this activation is not, to some extent, a good representation of the input data in the first place. Using this "bad/distorted" data as input to the forked network would mean training it to discriminate something completely different from our real input data!
I am also curious about this. I do not understand how the outputs from the forks are put together in the end.
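As I understand it (this is my reading of the GoogLeNet paper, not something from the video): the forks are small auxiliary softmax classifiers attached to intermediate activations, trained against the same label y as the main output. Their losses are simply added to the main loss with a small weight (0.3 in the paper), gradients from both heads flow back through the shared layers below the fork (so backprop is not stopped there), and the auxiliary branches are just thrown away at test time. A rough tf.keras-style sketch of that training setup, with made-up layer choices:

```python
import tensorflow as tf

def build_net(num_classes=1000):
    x = tf.keras.Input(shape=(224, 224, 3))
    h = tf.keras.layers.Conv2D(64, 7, strides=2, padding='same', activation='relu')(x)
    h = tf.keras.layers.MaxPooling2D(3, strides=2, padding='same')(h)
    # ... inception blocks would go here ...

    # Auxiliary "fork": a small classifier on an intermediate activation,
    # trained on the same label y as the main head.
    aux = tf.keras.layers.GlobalAveragePooling2D()(h)
    aux_out = tf.keras.layers.Dense(num_classes, activation='softmax', name='aux')(aux)

    # Main head (more layers, then the final classifier).
    h = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')(h)
    main = tf.keras.layers.GlobalAveragePooling2D()(h)
    main_out = tf.keras.layers.Dense(num_classes, activation='softmax', name='main')(main)

    return tf.keras.Model(x, [main_out, aux_out])

model = build_net()
# Both heads are trained on the same labels; the auxiliary loss gets a small
# weight (0.3 in the GoogLeNet paper), and the aux branch is dropped at test time.
model.compile(optimizer='adam',
              loss={'main': 'sparse_categorical_crossentropy',
                    'aux': 'sparse_categorical_crossentropy'},
              loss_weights={'main': 1.0, 'aux': 0.3})
```

So the fork outputs are never "put together" for prediction; they only add extra gradient signal to the earlier layers during training.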
I was thinking of the movie as well, but I also thought I was silly to think that such an important paper would have anything to do with a movie.
A better name would've been: Deep Modular Network
And I came here from - Inception movie explained 💤
Know your meme!
2:52 What's the purpose of "maxpool -> 1x1 conv"? It seems like applying the 1x1 conv directly would be strictly better, because some information is lost in the max pool...
Information is always lost in pooling; we use max pooling to keep the important information (it has "max" in it, so we keep the pixel with the maximum value). We use the 1x1 conv to change the depth (channels) of the max-pooling output so that we can perform the concatenation (which requires matching dimensions). Additionally, not every piece of information is important; using too much information can lead to overfitting, and you can compensate for that with the max pool (which could theoretically be treated as a weak regularizer), a kernel regularizer, dropout, and so on.
What are the max pools for if they don't reduce the size? (SAME max pooling)
thanks, very useful
Padding is applied, isn’t it?
This looks like AutoML, but for ConvNets.
I didn't get the Inception module. How can we stack the convolved images after passing them through different filters without padding, since the dimensions of the convolved image change depending on the filter size?
Hey, see in the lecture: the same output shape (28*28) comes out of all the filters, and each output is stacked onto the others.
They are padded ('same' padding), so the spatial dimensions match.
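For anyone still confused by the stacking: every branch uses stride 1 and 'same' padding, so each one produces a 28x28 output and only the channel counts differ, which is exactly what channel-wise concatenation needs. Here's a minimal tf.keras sketch of one block (the filter counts are just the ones from the lecture's example, not anything official):

```python
import tensorflow as tf
L = tf.keras.layers

def inception_block(x):
    # Branch 1: plain 1x1 conv.
    b1 = L.Conv2D(64, 1, padding='same', activation='relu')(x)

    # Branch 2: 1x1 bottleneck, then 3x3 conv.
    b2 = L.Conv2D(96, 1, padding='same', activation='relu')(x)
    b2 = L.Conv2D(128, 3, padding='same', activation='relu')(b2)

    # Branch 3: 1x1 bottleneck, then 5x5 conv.
    b3 = L.Conv2D(16, 1, padding='same', activation='relu')(x)
    b3 = L.Conv2D(32, 5, padding='same', activation='relu')(b3)

    # Branch 4: stride-1 'same' max pool, then a 1x1 conv to shrink the channels.
    b4 = L.MaxPooling2D(3, strides=1, padding='same')(x)
    b4 = L.Conv2D(32, 1, padding='same', activation='relu')(b4)

    # Every branch is still 28x28, so we can concatenate along the channel axis.
    return L.Concatenate(axis=-1)([b1, b2, b3, b4])

inp = tf.keras.Input(shape=(28, 28, 192))
out = inception_block(inp)
print(out.shape)  # (None, 28, 28, 256): 64 + 128 + 32 + 32 channels
```

Only the channel dimension changes, so the 28x28x192 input comes out as 28x28x256.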
Thank you, very useful.
From now on, I'll expect meme while reading papers
Millennials contributing in their (our) own language.
How do you concatenate 4 outputs with different sizes?
VampireNet also good
Can you make a video about Inception v4?