Why Do Neural Networks Love the Softmax?

  • Published 20 Jun 2024
  • The machine learning consultancy: truetheta.io
    Join my email list to get educational and useful articles (and nothing else!): mailchi.mp/truetheta/true-the...
    Want to work together? See here: truetheta.io/about/#want-to-w...
    Neural Networks see something special in the softmax function.
    SOCIAL MEDIA
    LinkedIn : / dj-rich-90b91753
    Twitter : / duanejrich
    Github: github.com/Duane321
    Enjoy learning this way? Want me to make more videos? Consider supporting me on Patreon: / mutualinformation
    SOURCE NOTES
    I decided to make this video when inspecting jacobians/gradients starting from the end of a small network. Right near the softmax, the jacobian looked simple enough that I suspected interesting math behind it. And there was. I came across several excellent blogs on the Softmax's jacobian and its interaction with the negative log likelihood. Source [1] was the primary source, since it was quite well explained and used condensed notation. [2] was useful for understanding the broader context and [3] was a separate, thorough perspective.
    SOURCES
    [1] M. Petersen, "Softmax with cross-entropy," mattpetersen.github.io/softma..., 2017
    [2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016, section 6.2.2.3
    [3] L. J. V. Miranda, "Understanding softmax and the negative log-likelihood," ljvmiranda921.github.io/noteb..., 2017
    TIME CODES
    0:00 Everyone uses the softmax
    0:23 A Standard Explanation
    3:20 But Why the Exponential Function?
    3:57 The Broader Context
    6:05 Two Choices Together
    6:51 The Gradient
    10:07 Other Reasons

Comments • 121

  • @mgostIH
    @mgostIH 1 year ago +116

    Softmax is also invariant to constant addition! This is often used in implementations that compute combinations with the resulting probabilities; for example, the paper "Self-attention Does Not Need O(n^2) Memory" uses this to avoid blow-ups in the computation while computing the combination sequentially, avoiding the need to form the entire attention matrix. (A quick numeric check follows this thread.)

    • @Mutual_Information
      @Mutual_Information 1 year ago +5

      Very interesting - pinned!

    • @BuddyVQ
      @BuddyVQ 1 year ago +6

      In my courses, this was one of the big advantages of using the exponential (alongside the convenient Jacobian). Without invariance to constant shifts, shifted scores like [0, 1, 2] vs. [100, 101, 102] would produce significantly different probabilities when they should not.
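
A quick NumPy check of the shift invariance discussed in this thread (a sketch; the score vectors are the ones from the comment above):

```python
import numpy as np

def softmax(s):
    z = np.exp(s)
    return z / z.sum()

a = np.array([0.0, 1.0, 2.0])
b = a + 100.0   # the [100, 101, 102] case

print(softmax(a))                           # [0.09003057 0.24472847 0.66524096]
print(np.allclose(softmax(a), softmax(b)))  # True: the shift cancels
```

The constant cancels because exp(s_i + c) = exp(c) * exp(s_i), and the common factor exp(c) appears in both the numerator and the denominator.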

  • @mCoding
    @mCoding 1 year ago +52

    Great intuition explained in simple terms, and top tier visualizations as always!

  • @nerkulec
    @nerkulec 1 year ago +41

    Highest quality Deep Learning content out here on youtube.

    • @Mutual_Information
      @Mutual_Information 1 year ago +7

      I'm certainly working on it!

    • @Otomega1
      @Otomega1 1 year ago +2

      Just about to write this comment
      So I'll just like this one

    • @Friemelkubus
      @Friemelkubus 11 months ago +1

      +1

  • @zezkai7887
    @zezkai7887 1 year ago +25

    Interestingly, as mentioned in one of Andrej Karpathy's videos, this shift (at 1:58) is also performed in softmax implementations to ensure numerical stability.
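
A sketch of why that max-subtraction matters numerically (NumPy; the score of 1000 is just an illustrative large logit):

```python
import numpy as np

s = np.array([1000.0, 0.0, -5.0])

naive = np.exp(s) / np.exp(s).sum()   # exp(1000) overflows to inf, so this yields nan
shifted = np.exp(s - s.max())         # subtracting the max leaves the result unchanged...
stable = shifted / shifted.sum()      # ...but keeps every exponent <= 0

print(naive)    # [nan  0.  0.]  (with overflow warnings)
print(stable)   # [1.  0.  0.]
```

Because of the shift invariance discussed above, subtracting the maximum changes nothing mathematically; it only prevents overflow.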

  • @CalvinHirschOoO
    @CalvinHirschOoO 11 months ago +8

    Great video. Softmax is great but is sometimes restrictive. A recent paper, "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing," discovered that transformers actually abuse/circumvent the softmax in order to perform no-ops (i.e., avoid normalization). Highly recommend reading it if you're interested in how softmax affects learning.

    • @Mutual_Information
      @Mutual_Information 11 months ago +4

      Thanks, there's always so much research to keep up with..

  • @alvinjamur1
    @alvinjamur1 1 year ago +10

    As a long-time NN guy (since '93) & quant… IMHO… this channel deserves 10 gazillion viewers. Very well done!

    • @Mutual_Information
      @Mutual_Information 1 year ago +3

      Appreciate it - but I don't quite think there's 10 gazillion NN guys out there. We'll just have to settle for being the cool club 😎

    • @monikaherath7505
      @monikaherath7505 1 year ago

      Hello friend. As a beginner in NNs: why does ML and similar stuff only now seem to have exploded, especially at universities? Why has it become a fad just now when it has been researched for decades? Thanks for your help.

    • @manavt2000
      @manavt2000 5 months ago

      @@monikaherath7505 Because of the advancements in computing power... superfast GPUs.

  • @hellfishii
    @hellfishii 1 year ago +3

    The amount of math behind the justification of why the softmax function is chosen is insane. What a fascinating time to be alive.

  • @nuko_paon1351
    @nuko_paon1351 1 year ago +8

    Lately, YouTube has been recommending me a ton of garbage LangChain tutorials with their garbage ideas among recently posted videos.
    But sometimes I have to truly appreciate that it also shows me the way to a hidden gem like you.
    Again, thanks for making great content. Keep it up!
    Edited: subbed!

    • @Mutual_Information
      @Mutual_Information 1 year ago +1

      Appreciate it! I am doing my best to not succumb to the hype trends :)

  • @1495978707
    @1495978707 1 year ago +8

    Just found you. Great content; taking the time to seriously explain why a choice is made, and why it's a good choice, is rarely done this well.

  • @rufus9508
    @rufus9508 1 year ago +1

    Great quality content, hope this channel gets more attention!

  • @broccoli322
    @broccoli322 1 year ago +1

    Well explained! This channel deserves more subscribers.

  • @nikoskonstantinou3681
    @nikoskonstantinou3681 1 year ago +1

    Really insightful video. Good work!

  • @jonashallgren4446
    @jonashallgren4446 11 months ago +4

    Man, I can feel a binge watch of your videos coming, great content, at least I will procrastinate while doing something very useful lol

    • @Mutual_Information
      @Mutual_Information 11 months ago

      As far as procrastination tasks go, these videos aren't a terrible use of time.
      That said, my old stuff is a lot harder to watch lol

  • @gar4772
    @gar4772 3 months ago +1

    Hands down one of the very best ML channels on YouTube! Simple, concise explanations. DJ is one of the best educators I have ever seen on the subject. Thank you!

  • @xfts1988
    @xfts1988 11 months ago

    Seeing how you represented the Jacobian of the softmax at 7:15 helped me see how softmax is the actual generalization of the logistic function to higher dimensions. I always took softmax as an algebraic tool to crunch matrices into probability scores. Thank you for your amazing content.
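
A small check of that view (a NumPy sketch; the score value is arbitrary): the two-class softmax is exactly the logistic function, and its Jacobian, diag(p) - p p^T, collapses to the familiar logistic derivative sigma * (1 - sigma).

```python
import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

s = 1.7                              # an arbitrary score for class 0 (class 1 fixed at 0)
p = softmax(np.array([s, 0.0]))

print(p[0], sigmoid(s))              # same number: two-class softmax = logistic

J = np.diag(p) - np.outer(p, p)      # softmax Jacobian
print(J[0, 0], sigmoid(s) * (1.0 - sigmoid(s)))   # same number: logistic derivative
```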

  • @oceannuclear
    @oceannuclear months ago +1

    This is so beautifully done! Thank you! The pacing is perfect (for me anyway, since I can pause and mentally check the maths), and the inclusion of the actual expression at 5:58 is helpful too.
    The choice of topic is extremely insightful as well! I never realized that the softmax is prevalent partly because the maths cancels out, simplifying matrix multiplication operations into a simple vector subtraction!
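
A quick numerical illustration of that cancellation (a NumPy sketch; the scores and the true class are made up): the gradient of the negative log-likelihood with respect to the scores is just softmax(s) - y, a vector subtraction, which a finite-difference check confirms.

```python
import numpy as np

def softmax(s):
    z = np.exp(s - np.max(s))
    return z / z.sum()

def nll(s, k):
    # negative log-likelihood of true class k under softmax(s)
    return -np.log(softmax(s)[k])

s = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])        # one-hot label, true class k = 1

analytic = softmax(s) - y            # the shortcut: p - y

eps = 1e-6
numeric = np.array([
    (nll(s + eps * np.eye(3)[i], 1) - nll(s - eps * np.eye(3)[i], 1)) / (2 * eps)
    for i in range(3)
])

print(analytic)
print(numeric)                       # matches the analytic gradient to ~1e-9
```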

  • @CodeEmporium
    @CodeEmporium 1 year ago +2

    Very interesting! Thanks for the quality content!

    • @Mutual_Information
      @Mutual_Information 1 year ago

      Excellent to hear from you CodeEmporium - appreciate the compliment!

  • @yubtubtime
    @yubtubtime 11 months ago +2

    What a beautiful visualization of the Jacobian 🤩

  • @matthewtang1489
    @matthewtang1489 11 months ago +1

    just realized I found gold. Beautiful video, never thought of it even though I use it every day.

  • @jacksonstenger
    @jacksonstenger 1 year ago +1

    Thanks for another great video!

  • @Nahte001
    @Nahte001 1 year ago +5

    Great vid. The only thing I think might have been worth mentioning is the grounding in, and reliance on, a probabilistic objective. While I certainly see the point you were making with the various shortcuts it affords the gradient calculation, it's only useful insofar as the prediction space is tractable and the input-output relation is many-to-one. Also, I'm a big fan of BCE with logits as an implementation of this idea; there's more to life than cross-entropy!!!

  • @madcauchyren6209
    @madcauchyren6209 9 months ago +1

    This is really an informative video. Thank you!

  • @wedenigt
    @wedenigt 1 year ago +8

    Great walkthrough! Your channel definitely deserves more attention.
    At 8:47, maybe one should emphasize that the derivative of the loss w.r.t. f(s) is independent of our choice of f. Thus, the simplicity of this term cannot be attributed to the softmax - it's only due to the choice of the loss.

    • @Mutual_Information
      @Mutual_Information 1 year ago +5

      It's funny you say that. At 9:01, I had originally said this: "Now the simplicity isn't due to the softmax, but the natural log from within the loss. Fortunately, that also plays nicely with derivatives."
      But I cut it out. I felt it was deducible from what was on screen at the time, but in retrospect... it's a good thing to clarify.
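
To make the point in this thread concrete, here is the chain-rule bookkeeping in standard notation (p = softmax(s) is the model's probability vector, y a one-hot label):

```latex
L(s) = -\sum_i y_i \log p_i , \qquad
\frac{\partial L}{\partial p_i} = -\frac{y_i}{p_i}
\quad \text{(independent of the choice of } f \text{ mapping scores to probabilities)} ,

\frac{\partial p_i}{\partial s_j} = p_i(\delta_{ij} - p_j)
\;\Rightarrow\;
\frac{\partial L}{\partial s_j}
= \sum_i \Big(-\frac{y_i}{p_i}\Big)\, p_i(\delta_{ij} - p_j)
= p_j \sum_i y_i - y_j
= p_j - y_j .
```

The first factor comes only from the log inside the loss, as the reply above notes; the cancellation of the p_i's in the second step is where the exponential earns its keep.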

  • @gregorykafanelis5093
    @gregorykafanelis5093 1 year ago +3

    This greatly resembles the Boltzmann factor used in physics, which states that the probability of a given state (in thermal equilibrium) will be p_i = exp(-ε_i/(kT))/Z, with Z being the sum of these factors over all states, to normalize the result. Also, the expression you give for the loss function mimics the way entropy can be defined using Boltzmann probabilities. Nature truly provides the most elegant way to reach equilibrium.

    • @Mutual_Information
      @Mutual_Information 1 year ago +2

      Yea, how wild is that!? Statistical physics was way ahead of (or inspired?) modern DL.

  • @chaddoomslayer4721
    @chaddoomslayer4721 1 year ago +1

    Always wait for your videos more than I wait for my birthday ha

  • @yingqiangli6026
    @yingqiangli6026 1 year ago +2

    One of the best lectures on Softmax on the entire Internet!

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago +1

    Truly insightful

  • @aliasziken7847
    @aliasziken7847 1 year ago +3

    In fact, softmax can be derived from an optimization point of view, that is, from the point of view of maximum entropy.
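
For readers curious about that route, here is the usual sketch (the standard maximum-entropy argument, not something from the video): among all distributions with a fixed expected score, the entropy-maximizing one is a softmax of the scores.

```latex
\max_{p}\; -\sum_i p_i \log p_i
\quad \text{s.t.} \quad \sum_i p_i = 1, \;\; \sum_i p_i s_i = \mu .

\text{Lagrangian: } \mathcal{L}
= -\sum_i p_i \log p_i
+ \lambda \Big( \sum_i p_i s_i - \mu \Big)
+ \nu \Big( \sum_i p_i - 1 \Big) ,

\frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \lambda s_i + \nu = 0
\;\Rightarrow\;
p_i \propto e^{\lambda s_i}
\;\Rightarrow\;
p_i = \frac{e^{\lambda s_i}}{\sum_j e^{\lambda s_j}} .
```

With λ = 1 this is exactly the softmax; other values of λ act as an inverse temperature.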

  • @abdolrahimtooraanian5615
    @abdolrahimtooraanian5615 5 months ago +1

    Well explained!!!

  • @h4ck314
    @h4ck314 1 year ago +1

    quite insightful, thanks

  • @phovos
    @phovos 4 months ago +1

    wow this is amazing, ty! I'm going to watch this every day until I can explain this.

    • @Mutual_Information
      @Mutual_Information 4 months ago

      I certainly don't mind that plan. If there's something in particular that's confusing, feel free to ask!

  • @user-xk6rg7nh8y
    @user-xk6rg7nh8y 8 months ago +1

    amazing !! thank you so much for your great explanation !!

  • @user-nv3fy6bd4p
    @user-nv3fy6bd4p 1 year ago +2

    new vid just dropped from the absolute legend!

  • @cartercheng5846
    @cartercheng5846 1 year ago +1

    finally updated!

  • @MathVisualProofs
    @MathVisualProofs 1 year ago +2

    Excellent!

  • @niccolomartinello7610
    @niccolomartinello7610 11 months ago +1

    I always assumed (based on the quality and quantity of the videos) that Mutual Information was one of the most famous stats channels on YT; I can't believe it has only 20k subscribers (at the time of writing).
    The second the algorithm favours one of your videos, the channel will blow up; I'd bet good money on that.
    Keep up the good work.

    • @Mutual_Information
      @Mutual_Information 11 months ago

      Thanks! I hope you're right. So far it's coming along. I think my upcoming videos will do well enough, so I'm optimistic too

  • @azaleacolburn
    @azaleacolburn 11 months ago +2

    Great content thank you!

  • @ali-om4uv
    @ali-om4uv 1 year ago +10

    That was impressive! I would really appreciate it if you could give us a list of the books and papers you read during your ML learning journey!

    • @Mutual_Information
      @Mutual_Information 1 year ago +5

      I don't have that on hand at the moment, but each video has sources on what I researched for that video. Maybe that helps?
      Also, I can tell you generally my overall favorite books. Those are Probabilistic Machine Learning by Kevin Murphy, Elements of Statistical Learning by Hastie et al and Deep Learning by Goodfellow et al. There are other greats as well.

  • @GregThatcher
    @GregThatcher 22 days ago

    Thanks!

  • @luciengrondin5802
    @luciengrondin5802 1 year ago +1

    What I find the most interesting is how it relates to statistical physics.

  • @piratepartyftw
    @piratepartyftw 11 months ago

    The softmax function ultimately comes out of statistical mechanics. It's the same function as the pdf of the Boltzmann distribution, and the denominator is the "partition function" (super important in statistical physics).
    In stat mech, the softmax (Boltzmann distribution) comes out of the fact that each system state has some probability of occurring, and this is represented by the softmax (the numerator being the "frequency" with which some state might happen, and the denominator being the sum over all states). Basically, the frequency of the state (e.g., temperature and pressure of a gas) is proportional to the number of microstates (e.g., positions and velocities of gas molecules) that might represent it. But in physics we don't really work with counts of microstates; we work with entropies, which are the log of the count of microstates. So the formula for softmax ends up being the exponential of the entropy, to undo the log in the entropy formula and get the raw microstate count back. That's where the exponential comes from: inverting the log in an entropy formula. But when you generalize the idea to other systems that have entropy (or things you want to treat as entropy in a max-entropy sort of way, like machine learning scores), you still have to take the exponential, even when there are no "microstates" to count or think about.
    A lot of the hardcore math people in machine learning were originally trained as physicists, so ML inherited a lot of stuff from physics.

    • @piratepartyftw
      @piratepartyftw 11 months ago

      Incidentally, Boltzmann's entropy is just a special case of Shannon's entropy where each outcome is equally likely (because each microstate is indistinguishable and therefore has to be equiprobable for consistency). So that's why there's a log in the entropy - same reason as in the Shannon entropy.

    • @fluo9576
      @fluo9576 11 months ago

      The structure of nature always comes out when you go deep enough.

  • @ebefanta7338
    @ebefanta7338 1 year ago +1

    New to all of this stuff and even though I didn’t quite understand all the terms due to lacking some mathematical background for it, I still find it absolutely incredible that by using the -log and exp function you can take such a nightmarish matrix multiplication and reduce it to literally just subtracting 2 vectors. Makes me really excited to learn even more about deep learning. Would you happen to have any good recommendations for resources where I can shore up my math knowledge for this kind of content?

    • @Mutual_Information
      @Mutual_Information 1 year ago

      I'm a big fan of the deep learning book by Goodfellow and others: www.deeplearningbook.org/ It's got some essential math in there useful for DL. It's a well known book, so you may have already come across it.

    • @ebefanta7338
      @ebefanta7338 1 year ago

      @@Mutual_Information Sounds awesome. I'll check it out right away. Thanks for the recommendation.

  • @fizipcfx
    @fizipcfx 1 year ago +4

    Here is the embarrassing thing: I have been coding neural networks for two years and only just now learned what it means to have a "differentiable activation function". I always wondered how PyTorch differentiates my custom function algebraically 😂😂 (See the sketch after this thread.)

    • @Mutual_Information
      @Mutual_Information 1 year ago +2

      PyTorch is magic! Or a sufficiently advanced technology that sometimes it looks like magic

    • @fizipcfx
      @fizipcfx 1 year ago +1

      @@Mutual_Information yess
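
On the question above: PyTorch doesn't differentiate a custom function algebraically; it records the elementary operations and applies the chain rule to them numerically. A sketch (assuming PyTorch is available) comparing autograd's Jacobian of a hand-written softmax with the analytic diag(p) - p p^T:

```python
import torch

def my_softmax(s):
    # a "custom function" built only from ops autograd already knows how to differentiate
    z = torch.exp(s - s.max())
    return z / z.sum()

s = torch.tensor([1.0, -0.5, 2.0])

# Autograd builds this by backpropagating through exp, sum and division.
J_auto = torch.autograd.functional.jacobian(my_softmax, s)

p = my_softmax(s)
J_analytic = torch.diag(p) - torch.outer(p, p)

print(torch.allclose(J_auto, J_analytic, atol=1e-6))  # True
```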

  • @yorailevi6747
    @yorailevi6747 1 year ago +1

    The way I see it, it's related to ideal gas models and the normal distribution. In a sense, our dataset is particles of certain energies and we need to find the distribution that best fits them.
    Sadly, the gas (data) is mixed, but we are lucky, because we do have a Maxwell's demon (labels).
    So if we just let it do its job, it will find a good uniform distribution (latent space) that can be used to pick out the particles.

    • @Mutual_Information
      @Mutual_Information 1 year ago

      I don't quite understand this but if it informs your intuition, that's good news

  • @taxtr4535
    @taxtr4535 1 year ago +1

    this video was fucking great! keep it up bro

  • @brianprzezdziecki
    @brianprzezdziecki 11 months ago +2

    That was fucking incredible

  • @Alexander_Sannikov
    @Alexander_Sannikov 11 months ago

    When I was fooling around with homebrew neural networks, I didn't know softmax was a thing, so I used an L2 norm (sum of squared differences). Why is that worse?

  • @gix_lg
    @gix_lg 11 months ago +1

    I'm not an expert on NNs, so forgive my dumb question: isn't the most commonly used activation function nowadays ReLU?

    • @Mutual_Information
      @Mutual_Information 11 months ago

      Yes! Softmax is not used as an activation much anymore; ReLU is the common choice. But the very last layer, for classification tasks, is almost always a softmax.

  • @kalisticmodiani2613
    @kalisticmodiani2613 1 year ago

    Isn't the softmax in some of these models used deep inside the network? Why would the loss function applied to the outputs influence the selection of the exponential function deep inside the network?

    • @Mutual_Information
      @Mutual_Information 1 year ago +1

      Softmax as an activation function is a separate question. It used to be popular, but ReLU ultimately took its place b/c it saturates less. Using the softmax as an activation can make for a lot of near-zero gradients... and make learning tricky. (A small saturation demo follows this thread.)

    • @Nahte001
      @Nahte001 1 year ago +1

      @@Mutual_Information Softmax (as well as other kWTA-esque activation functions) plays a vital role in multi-headed attention as a way of encouraging disentanglement between the heads. The reason they're ineffective in CNNs/RNNs is that, without the structural prior of multi-headed attention, its selective nature forces information loss whenever it's applied. When the loss happens before the natural reduction to class logits, this isn't an issue, but when the layer in question is trying to learn features yet can only express one thing at a time, you can see where the issue arises.

    • @Mutual_Information
      @Mutual_Information 1 year ago

      @@Nahte001 I see, thank you
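
A small illustration of the saturation point above (a NumPy sketch; the score vectors are arbitrary): once one score dominates, every entry of the softmax Jacobian diag(p) - p p^T is tiny, so almost no gradient flows through that layer.

```python
import numpy as np

def softmax_jacobian(s):
    z = np.exp(s - np.max(s))
    p = z / z.sum()
    return np.diag(p) - np.outer(p, p)

mild      = np.array([1.0, 0.0, -1.0])    # spread-out probabilities
saturated = np.array([12.0, 0.0, -3.0])   # one score dominates -> near one-hot output

print(np.abs(softmax_jacobian(mild)).max())       # ~0.22
print(np.abs(softmax_jacobian(saturated)).max())  # ~6e-6: the gradient is all but gone
```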

  • @lulube11e111
    @lulube11e111 11 months ago +1

    Was this video made with manim?

    • @Mutual_Information
      @Mutual_Information 11 months ago +1

      No actually, I'm using a personal library built on top of Altair

  • @stacksmasherninja7266
    @stacksmasherninja7266 11 months ago

    I really doubt your sentence about the NLL "matching" the empirical distribution. Are there any known results that prove this? I don't think it has to match the empirical distribution at all.
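
For reference, the standard identity behind the claim being questioned (not specific to this video): the average NLL is the cross-entropy between the empirical label distribution p̂ and the model q, and the entropy term doesn't depend on the model parameters.

```latex
\mathbb{E}_{y \sim \hat{p}} \big[ -\log q_\theta(y) \big]
= H(\hat{p}) + D_{\mathrm{KL}}\!\big( \hat{p} \,\|\, q_\theta \big) ,
```

So minimizing the NLL minimizes the KL divergence to the empirical distribution, and the minimum of zero is attained exactly when the model matches it. Whether the model family can actually reach that match is a separate question, which may be what the comment is getting at.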

  • @parlor3115
    @parlor3115 10 months ago

    Idk, I've lately been noticing a trend of many AI research and development groups (OpenAI included) transitioning towards using the hardmin function instead.

  • @aram9167
    @aram9167 8 months ago

    Am I tripping or are you using 3B1B's voice in some places? For example at 8:28 and 9:47??

    • @Mutual_Information
      @Mutual_Information 8 months ago

      Lol I assure you I am not using his voice

  • @betacenturion237
    @betacenturion237 11 months ago

    In physics we get our dicks really hard about symmetry, symmetry, symmetry. It's everywhere and you can't function in this field without it. I thought that 'real life' was more often asymmetric, and thus that the value of symmetry was overblown in practical contexts. I'm not saying that symmetry is useless; it just felt like once you had to start dealing with real, noisy data, all of those symmetry tricks go out the window. Little did I expect that these AI engineers were exploiting a similar idea to perform simple computations of the loss function.
    Instead of dealing with a dense matrix of derivatives (which are inherently less stable computationally than integrals), why don't we just construct our function in such a way that we only get diagonal terms? This is exactly what physicists do! I'm just surprised by its far-reaching consequences, I guess...

  • @azophi
    @azophi 11 months ago

    Problem: negative model scores
    Solution: make a neural network to map model scores onto probabilities 🧠

  • @hw5622
    @hw5622 1 year ago +2

    Well explained! Thank you ❤ By the way, I kept getting distracted by your nice-looking face…

    • @Mutual_Information
      @Mutual_Information 1 year ago +1

      haha thank you, I will wear a mask next time :)

    • @hw5622
      @hw5622 1 year ago

      @@Mutual_Information haha that won’t be necessary. It’s good to see. Thank you again, I love all your videos and I am slowly going through them.❤

  • @MTd2
    @MTd2 1 year ago +2

    Isn't this basically trying to calculate and then minimize Shannon's entropy?

    • @Mutual_Information
      @Mutual_Information 1 year ago

      I don't see Shannon's entropy showing up explicitly here. The NLL looks a lot like a Shannon entropy, but it uses two different distributions (the empirical one, y, and the model's, sigma(s))... Shannon entropy is a calculation on one distribution.

    • @MTd2
      @MTd2 1 year ago

      @@Mutual_Information What do you mean by empirical? And it's very difficult not to think about Shannon entropy, because one of the first papers on chess AI was written by Shannon and used entropy to calculate some sort of minimax function.
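
A short note on the distinction being drawn in this exchange (standard definitions): "empirical" here means the one-hot label vector y observed in the data, and the loss is a cross-entropy between two distributions, whereas Shannon entropy involves only one:

```latex
H(p) = -\sum_i p_i \log p_i
\qquad \text{vs.} \qquad
H(y, p) = -\sum_i y_i \log p_i = -\log p_{\text{true class}} \quad (y \text{ one-hot}).
```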

  • @AR-iu7tf
    @AR-iu7tf 11 months ago

    Very nice explanation of the utility of softmax and the advantage of using it with cross entropy loss. Thank you! Here is another recent video that complements this video - why use cross entropy loss that leverages softmax to create a probability distribution?
    th-cam.com/video/LOh5-LTdosU/w-d-xo.html

  • @peceed
    @peceed 1 year ago +1

    So it is almost certain that biological brains use the same transformation.

    • @Mutual_Information
      @Mutual_Information 1 year ago

      ha well I don't have nearly the evidence to suggest that, but you never know

    • @peceed
      @peceed 1 year ago

      @@Mutual_Information Biology searches for "mathematical opportunities". Do you believe that the physically distributed weights of neurons (which can be large) are multiplied as a matrix, or rather that they compute local differences? And there is evidence that the brain uses a "logarithmic representation".

  • @Stopinvadingmyhardware
    @Stopinvadingmyhardware 11 months ago

    Because it pushes the bad results out of the domain.

  • @alfrednewman2234
    @alfrednewman2234 1 year ago

    AI generated image, text

  • @kimchi_taco
    @kimchi_taco 11 months ago

    Softmax is a misleading name. It should be softARGmax.

  • @june6959
    @june6959 1 year ago

    "Promo SM" 😞

  • @yash1152
    @yash1152 11 months ago

    0:09 Sorry! But a dislike for the low audio volume levels.

    • @yash1152
      @yash1152 11 months ago

      1:39 I so much want to like the video, but I won't...
      I am tired of this plague across the entirety of small YouTube channels these days... it's the one thing that news channels always get right.
      The plague being either super-low audio levels in the vocal speech, or deafeningly high levels of intro/background music.

    • @yash1152
      @yash1152 11 months ago

      4:46 5:04 5:18 Oh, so here it was likely the result of an ultra-focused mic pointed at the neck and not the mouth.
      YouTubers, PLEASE listen to your videos in comparison with a pre-tested, accepted sample _at least_ once _before_ posting to YouTube.
      You spent hours and hours on the visuals; don't mess up the audio, please.

    • @yash1152
      @yash1152 11 months ago

      This is being experienced widely on YouTube, likely because YouTubers are pouring money into _upgrading_ to expensive _focused_ mics but, due to inexperience with them, are still editing according to their previously used mics...
      Heh, money alone ain't ever enough, eh!! It's not the tools by themselves; it's the worker who excels with them.

    • @Mutual_Information
      @Mutual_Information 11 months ago

      You caught me - I know very little about audio quality. All I do is apply a denoiser in Adobe Premiere... What specifically should I do to fix this? Is it just that the volume is too high and sometimes too low? What type of audio processing would you recommend (hopefully it's available in Adobe Premiere...)?
      Thanks!

  • @ronaldnixon8226
    @ronaldnixon8226 11 months ago

    Obama caint force me to learn none a this nonsense! Trump own's them document's!

  • @444haluk
    @444haluk 1 year ago

    I find the eagerness to come up with simpler terms disturbing. It is basically laziness with a few extra steps. Simple doesn't mean useful, better, or true.

    • @arturprzybysz6614
      @arturprzybysz6614 1 year ago +5

      Simpler terms could be useful, as they provide bigger conceptual "boxes" (less accurate ones), which sometimes allow for generalization and using intuition from other areas.
      Sometimes simple means more useful.

    • @Eye-vp5de
      @Eye-vp5de 1 year ago

      Simple does mean better in this case, because a neural network must be trained, and I think its training would be much more expensive if the expression weren't this simple.

    • @lydianlights
      @lydianlights 1 year ago +1

      Well I find the fact that you "find the eagerness to come up with simpler terms disturbing" to be itself disturbing. Equating simplicity with laziness is itself lazy.