NFNets: High-Performance Large-Scale Image Recognition Without Normalization (ML Paper Explained)

  • Published on 2 Feb 2025

Comments • 74

  • @YannicKilcher  4 years ago +16

    ERRATA (from Lucas Beyer): "I believe you missed the main concern with "batch cheating". It's for losses that act on the full batch, as opposed to on each sample individually.
    For example, triplet in FaceNet or n-pairs in CLIP. BN allows for "shortcut" solution to loss. See also BatchReNorm paper."

    • @sarvagyagupta1744  4 years ago +1

      This is exactly what I was going to say. They got rid of batch norm; however, they are still normalising the gradients using their respective weights. Also, this seems very similar to OpenAI's PPO algorithm, which also clips during the training process.

    • @pepe_reeze9320  4 years ago

      Could you elaborate on how batch cheating occurs for "losses that act on the full batch"? What do you mean by "full batch"?

    • @GuillermoValleCosmos  3 years ago +1

      @@pepe_reeze9320 I think it refers to loss functions that aren't just a sum over a function of individual examples, but a general function of all the examples.
      For example, a contrastive loss depends on a pair of examples, not on a sum of per-example losses.
      So a contrastive loss may be D(f(X1),f(X2)), where D is some measure of discrepancy between the network's outputs for example X1 and example X2. Batch norm makes the output depend on all the examples in the batch, so f(X1) will actually depend somewhat on X2 (and vice versa), and that can let the network cheat, because its task is to produce an output similar to that for X2 without knowledge of X2.

    • @GuillermoValleCosmos  3 years ago +1

      I think the concern Yannic pointed out for additive losses is technically also valid, but it seems much less likely to cause problems. And the even subtler concern about the gradient clipping seems to me even less likely to cause problems.

    • @pepe_reeze9320  3 years ago +1

      @@GuillermoValleCosmos Thanks, you confirmed my expectation! I'll rephrase in my own words:
      BN injects information about batch statistics into the network and thereby gives each sample in the batch information about the other samples. Each sample “knows” the mean and variance of its batch. This dependency is problematic in contrastive learning because the model can cheat by exploiting information about other samples instead of learning reasonable operations to obtain discriminative features from each sample alone.
      In my opinion, this problem is less relevant for non-contrastive losses, as the samples are compared against an absolute “anchor”, i.e. their respective class label, and not relative to each other.
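To make the "batch cheating" point above concrete, here is a minimal PyTorch sketch (illustrative only, not from the paper or the video) showing that, with BatchNorm in training mode, the output for one sample changes when a different sample shares its batch; a full-batch loss such as a contrastive or triplet loss can exploit exactly this channel:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny encoder with BatchNorm; in training mode the normalisation uses batch statistics.
f = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 4))
f.train()

x1 = torch.randn(1, 8)   # the sample we care about
x2a = torch.randn(1, 8)  # one possible "other" sample in the batch
x2b = torch.randn(1, 8)  # a different "other" sample

out_a = f(torch.cat([x1, x2a]))[0]  # f(x1) computed next to x2a
out_b = f(torch.cat([x1, x2b]))[0]  # f(x1) computed next to x2b

# The two outputs for the *same* x1 differ, because BN's mean/variance depend on the
# whole batch, so a loss defined over the batch can latch onto information about the
# other samples instead of learning per-sample features.
print(torch.allclose(out_a, out_b))  # False
```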

  • @GeekProdigyGuy  4 years ago +76

    I applaud the inclusion of a negative results appendix, and hopefully in the future it will become a standard or even required section (by conference/journal).

  • @billykotsos4642  4 years ago +52

    I liked that you compared this to Speed-running.

    • @OnTrackwithWalker  4 years ago +1

      Glitch-less == no reward hacking?? Feature or cheating? ;)

  • @PhucLe-qs7nx  4 years ago +14

    My take on why they did the architecture search on top of AGC is that it keeps the comparison fair against architectures optimised for BatchNorm. BatchNorm came out in 2015, and since then all the architectures, whether designed by grad students or found by NAS, have been designed or searched with BN as one of their core ingredients. This is especially true for the EfficientNets they are comparing against.
    Now they have designed the AGC trick that fixes the problems of BN through a different mechanism; just plugging it in place of the BN layer in architectures optimised with BN wouldn't be very fair to it, right? So they basically developed a BN replacement, squashed a few years of NAS research into one paper, and did it with a more realistic metric (actual training speed vs. theoretical FLOPs), which I think is pretty good.

    • @Navhkrin  3 years ago +1

      Then the main problem that remains is that we do not know whether the improvements are due to the better NAS alone or to a combination of both.

  • @LouisChiaki  4 years ago +29

    DeepMind: new SOTA model
    Google: you must have used tensorflow right?
    DeepMind: tensor what?

    • @Navhkrin  3 years ago +1

      Google: Uh okay, at least tell me you used TensorBoard
      DeepMind: Yeah su- *cough wandb *cough -re

  • @CarlosGarcia-hs8yg  3 years ago

    The BEST deep learning channel on youtube. Congrats!

  • @mrigankanath7337  4 years ago +27

    Dude is telling us how to beat a SOTA paper on YouTube, EPIC; we would love to cite the channel.

  • @qltang6633  4 years ago +3

    EffNet-B7 and NFNet-F1 achieve the same 84.7 top-1 on ImageNet and have similar FLOPs (37B vs 35B). NFNet-F1 was 8.7x faster to train, but what about inference time? As we know, we can always merge BN into the weights to speed up those normalized networks. Does anyone have any idea?
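For reference, the BN-merging trick this comment alludes to: at inference a BatchNorm layer is an affine map with fixed running statistics, so it can be folded into the preceding convolution. A minimal sketch (standard technique, not specific to this paper):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BN into the preceding conv: y = gamma * (Wx + b - mu) / sqrt(var + eps) + beta."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)            # gamma / sqrt(var + eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)  # scale each output channel
    b = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (b - bn.running_mean) * scale + bn.bias.data
    return fused

conv, bn = nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16)
bn.eval()  # use running statistics, as at inference time
x = torch.randn(2, 3, 8, 8)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True
```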

  • @tho207  4 years ago

    love your work, as always.
    you sometimes make me think about things more deeply than I originally did, or bring a new perspective to certain concepts

  • @TheEbbemonster  3 years ago

    Excellent walk-through! Great points on the gradient clipping; it should definitely be done before averaging. I wonder if replacing the clipping with the median gradient would have a similar effect...

  • @amitkumarjena3184  4 years ago +1

    Was waiting for this! Thank you!!

  • @rufus9508  3 years ago

    Thanks for these videos man, you help a lot with your explanations and your way of navigating the paper!

  • @mrpocock  4 years ago +1

    I wonder how it would compare to per-training-example clipping. Perhaps the compute overhead would be too much.

  • @nocturnomedieval  4 years ago +2

    I wonder how to adapt these NFNets for semantic segmentation?

  • @benlee5042  4 years ago

    Great for undergraduates, thank you!

  • @dr.mikeybee  4 years ago

    You really are a fantastic teacher. Thank you, again!

  • @andytroo  4 years ago +3

    It isn't a clipping, it's a rescaling (it divides by magnitude/lambda); if you have a batch of 4k with 1 bad data point and you rescale, you're going to down-weight all the other items in the batch along with the problematic sample. Consider the limit of all data in one batch, with a bad sample (correctly detected): all you would do is take small steps in the bad direction repeatedly.
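For context, here is a minimal sketch of the adaptive gradient clipping (AGC) rule this thread is discussing, as I read it from the paper: a unit's gradient is rescaled only when its norm exceeds lambda times the norm of the corresponding weights (the helper names and the exact unit-wise reduction here are illustrative):

```python
import torch

def unitwise_norm(t: torch.Tensor) -> torch.Tensor:
    # Norm per output unit: per-row for linear/conv weights, whole tensor for biases.
    if t.ndim <= 1:
        return t.norm()
    return t.norm(dim=tuple(range(1, t.ndim)), keepdim=True)

def adaptive_grad_clip_(params, clip=0.01, eps=1e-3):
    """In-place AGC sketch: rescale g only where ||g|| / max(||w||, eps) > clip."""
    for p in params:
        if p.grad is None:
            continue
        w_norm = unitwise_norm(p.detach()).clamp_min(eps)
        g_norm = unitwise_norm(p.grad.detach()).clamp_min(1e-6)
        ratio = g_norm / w_norm
        scale = torch.where(ratio > clip, clip / ratio, torch.ones_like(ratio))
        p.grad.mul_(scale)

# Usage sketch: loss.backward(); adaptive_grad_clip_(model.parameters()); optimizer.step()
```

Note that this acts on the already batch-averaged gradient, which is exactly the situation the comment above describes: rescaling a unit because of one bad sample also scales down every other sample's contribution to that unit.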

  • @channel-su2di  4 years ago +25

    "Don't come at me math people" - Yannic 2021

  • @fak-yo-to  3 years ago

    I would go as far as to say: batch normalisation shows the model's inability to accommodate the complexity coming from the variance of the data. It basically compensates for the fact that research has not yet figured out why batch normalization is necessary at all for a model given its training data.
    Of course, the same argument can be made about the data being unfit for the network, but there we have no choice other than to accept the data as it is.
    Same with preprocessing such as transforming audio time signals to the frequency domain first.

  • @rickrunner2219  4 years ago +4

    And you still see some people tell you to go for TensorFlow because, well, you know, PyTorch is just for hardcore research. Machine learning is still a work in progress, and much of the knowledge and many of the skills considered solid now will be dismissed tomorrow. It's a time for research and improvement, not certainties.

    • @h.raouzi175  4 years ago

      I agree PyTorch is better. I started deep learning with Keras and TensorFlow and struggled to understand the semantics of the code, but since I started PyTorch it has allowed me to understand every bit of the code and take advantage of my programming skills; there is a kind of freedom in using it.

  • @MsFearco  3 years ago

    Thanks for this. I love your explanations.

  • @robotter_ai  4 years ago +1

    Yes, great paper! I always distrusted this strange hack-norm!!1

    • @robotter_ai  4 years ago +2

      Okay, but still not solved :(

  • @sourabmangrulkar9105  3 years ago

    Thank you! Great explanation. Keep up the good work :)

  • @Gogargoat  4 years ago

    Does the gradient clipping also occur (repeatedly) for the gradients that get passed back further into the network, or only at the point where they accumulate at each specific parameter? In the latter case I'd probably try transforming them with a scaled asinh() or erf() function instead of clipping.
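A quick sketch of the idea in this comment (the commenter's suggestion, not something from the paper): replace hard clipping with a smooth, saturating transform such as a scaled asinh, which leaves small gradients nearly untouched and compresses large ones only logarithmically:

```python
import torch

def soft_clip_asinh(grad: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # Approximately the identity for |g| << scale; grows only logarithmically for |g| >> scale.
    return scale * torch.asinh(grad / scale)

g = torch.tensor([0.01, 0.1, 1.0, 10.0, 100.0])
print(soft_clip_asinh(g))  # small values nearly unchanged, large values strongly compressed
```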

  • @MsFearco  4 years ago +16

    I feel like every second day there is a new SOTA in everything

    • @NextFuckingLevel  4 years ago

      Can't help it, these companies do speed running to get the big money asap

  • @navidhakimi7122  4 years ago +6

    Could you do a video on transformers for time series data?
    Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
    or
    Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

  • @anticlementous  4 years ago +1

    You might be the first cited YouTube channel in a machine learning research paper, but not the first YouTube channel cited in a paper. Thunderfoot's channel was cited in the Nature paper "Coulomb explosion during the early stages of the reaction of alkali metals with water" in 2015.

  • @opx-tech  4 years ago +6

    The sum is finite, so you can swap it round. You’re good. -Math person

    • @simongiebenhain3816  4 years ago +1

      Yes, but only because the gradient is a linear operator. Just like when you want to find the minima/maxima of a polynomial, you can calculate the derivative of every term separately.
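Written out, the point both comments are making is just linearity of the gradient over a finite sum of per-example losses:

```latex
\nabla_\theta \sum_{i=1}^{N} L_i(\theta) = \sum_{i=1}^{N} \nabla_\theta L_i(\theta)
```

The order of summing and differentiating only starts to matter once a nonlinear operation such as clipping is inserted on one side.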

  • @piotr.ziolo.  4 years ago

    Could you share what software you use to annotate PDFs? I couldn't find a tool that gives you a blank page to draw on alongside every PDF page, and I'd like to use it in my classes.
    I tried to find this information on your channel, but I gave up as there are too many videos where it could be hidden.

    • @1998sini  4 years ago +1

      There actually is a tutorial he made about how he reads papers and which apps he uses. AFAIK he uses OneNote for paper annotation

    • @piotr.ziolo.  3 years ago

      @@1998sini Thank you :-) I found it. It helps me a lot.

    • @1998sini  3 years ago

      @@piotr.ziolo. No problem man, glad I could help :)

  • @DistortedV12  4 years ago

    This isn't state of the art though, no? EfficientNet-L2 + meta pseudo labeling has the highest top-1, or am I missing something?

  • @thongnguyen1292  4 years ago +5

    Thanks for the awesome video, as usual. Now the touche: How's your thesis going? :D

    • @YannicKilcher  4 years ago +13

      I'm procrastinating :D

    • @Pmaisterify  4 years ago +1

      @@YannicKilcher We are rooting for you; your videos have literally helped me advance dramatically, I really can't say thank you enough.

  • @chanalex3971  4 years ago

    I love your work!

  • @benlee5042  4 years ago

    Can't wait to see you as the first cited YT channel XD

  • @awsaf49  4 years ago

    Just wondering, was their result actually verified, or did they simply put that in their paper....

  • @bg2junge  4 years ago

    Don't they also change the image resolution for the training set (smaller images while training and larger images at inference)? There are a lot of little things that went into achieving this kind of performance gain.

  • @visionscaper  4 years ago

    LayerNorm still needs a lot of memory, but doesn't it at least partially resolve the issues that BatchNorm has? Funny it wasn't mentioned.

  • @DavidDohmen  3 years ago

    Similar to "get rid of batch norm" - have you had a look into the HSIC bottleneck paper, trying to get rid of backprop?
    arxiv.org/pdf/1908.01580v3.pdf
    Coming from the information theory direction myself, i find this approach fascinating. Kind of strange this paper hasn't got more attention.

  • @hiramcoriarodriguez1252  4 years ago +12

    A PyTorch implementation is at github.com/rwightman/pytorch-image-models/blob/master/timm/models/nfnet.py, because not everyone knows JAX. BTW, it's a shame that Google DeepMind doesn't use TensorFlow for these kinds of projects.

    • @cedricmanouan2333  4 years ago +1

      Thanks... lmao, they only partially use TF, since they know TF is not fast when it comes to deep research.
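For anyone following the timm link above, a minimal usage sketch; the model name 'dm_nfnet_f0' is my assumption for the DeepMind-pretrained F0 variant, so check timm.list_models('*nfnet*') against your installed version:

```python
import timm
import torch

# List the NFNet variants available in the installed timm version.
print(timm.list_models('*nfnet*'))

# 'dm_nfnet_f0' is assumed here to be the DeepMind-pretrained F0 variant.
model = timm.create_model('dm_nfnet_f0', pretrained=True)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # 1000-way ImageNet logits
print(logits.shape)
```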

  • @hoaxuan7074  4 years ago

    If you do a random projection just before the input to a net, in a lot of ways that normalizes the statistics of the input; literally normalizes, as in makes Gaussian. A fixed, randomly chosen pattern of sign flips before a fast Hadamard transform is a quick random projection. Repeat for better quality. You may still choose to normalize the vector length as well. You may find you no longer need bias terms!!!!!!!👿👽👻

  • @TechyBen  4 years ago +7

    At this rate someone "Yoloing" the AI will create the singularity...

  • @rudolfarsenibraun7819  4 years ago

    I feel you're being too critical; complaining about dropout, really? That's totally independent of the BN stuff.

  • @maxheadroom5532  4 years ago

    7:36 Spot the teenage mutant ninja turtle.

  • @cedricmanouan2333  4 years ago

    If you are here because you don't want to dive into the original paper... be blessed!

  • @skeletonlord2126  4 years ago

    I love you

  • @TheHyperion11  4 years ago +2

    First!

  • @iwy510  4 years ago

    "What's the property of dropout"

  • @herp_derpingson  4 years ago

    Efficienter NET XD

  • @chuongnguyen4980  4 years ago

    so fast

  • @DeadtomGCthe2nd  3 years ago

    BS matters

  • @droidcrackye5238  3 years ago

    It is a fake high-performance model