This was wonderful!
🙏
Wonderful video, thank you so much. Your style is very pleasant.
Ah, this feels interesting.
I got to know about contrastive learning through a paper we had to read as part of our coursework, which was on Probabilistic Contrastive Learning (nicknamed ProCo).
Will explore this more in the coming months. Thanks for the simple explanation!
You sound like a young, excited prof who takes time to explain obscure stuff.
Really a beautiful tutorial!! I wish you could update it with the newer InfoNCE and point out the similarities and differences.
🙏
Thank you for a nice explanation. I have a small question at 12:45. You mentioned the function is essentially a logistic regression or a binary cross-entropy loss function. But those functions are usually written something like this: Loss = -\sum_i y_i \log(\hat{y}_i). Without the y_i term in front of the log function, it is a bit hard for me to relate the loss function in the clip to the loss function I wrote here. Could you please tell me how I can relate them?
Indeed, it can be a bit confusing, and I should have done a better job of clarifying it in the tutorial.
In a binary classification problem, you use the log-loss function, which looks something like this -
-E[ y log(p(y)) + (1 - y) log(1 - p(y)) ]
The above equation is written using the expected value notation and will translate into an average for your mini-batch. Also, see the Wikipedia
article - en.wikipedia.org/wiki/Cross_entropy .... go to the section "Cross-entropy loss function and logistic regression"
Now, in binary classification, you can see that one of the terms will always be zero, e.g., if y has a label of 1, then the second term will be zero. So the utility of the y in front of the log is to eliminate one of the terms (as per the label 1 or 0).
That said, the formulation of the NCE loss function looks similar to the binary log loss except for the y in front of the log, since you could use it for multi-class classification and not limit it to the binary case. In other words, its structure looks like that of the log loss.
Hope this helps.
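To make the mapping concrete, here is a tiny Python sketch (my own illustration, not from the video or the paper) showing how the y in front of the log simply selects one of the two terms, and why an NCE-style expression can drop the y by writing the data and noise terms out separately:

```python
import numpy as np

def binary_log_loss(y, p):
    # Standard binary cross-entropy: the label y selects which log term
    # survives for each sample.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def nce_style_loss(p_pos, p_neg):
    # Same structure, but the labels are implicit: positive (data) samples
    # contribute log(p) and noise samples contribute log(1 - p), so no y is
    # needed in front of the logs.
    total = np.sum(np.log(p_pos)) + np.sum(np.log(1 - p_neg))
    return -total / (len(p_pos) + len(p_neg))

# Both give the same number once the labels are just "data vs noise":
p_pos = np.array([0.9, 0.8])   # model's P(data) for the positive samples
p_neg = np.array([0.2, 0.1])   # model's P(data) for the noise samples
y = np.array([1, 1, 0, 0])
p = np.concatenate([p_pos, p_neg])
assert np.isclose(binary_log_loss(y, p), nce_style_loss(p_pos, p_neg))
```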
Hello, thank you for the explanation, it was great. But I have a point of confusion.
At 6:45, you mentioned that we introduced NCE because we can't solve the partition function.
But when we use NCE (15:40), we still need to deal with the partition function. And here we deal with it by setting it to a constant 1.
So why not just make the partition function constant 1 at the beginning, so we don't have to introduce NCE?
The confusion is understandable. Here is an attempt to resolve it:
1) Our “primary” goal is the estimation of a valid probability distribution function.
2) We identified that the normalizing constant is the problem. There should be no doubt about it; you need it for creating a “valid” probability distribution function.
3) When we identify the problem, we generally end up focusing on it, so one approach is to “learn” it (the constant) with the help of “neural networks”. This does not work. I explained it in the tutorial, but you can also just take my word for it for now.
Now, sometimes, instead of focusing on the “problem creator”, another way to get our solution is to focus on what our “original” goal was. It was estimating a “valid” probability distribution function.
The finding of the paper was that a decently parameterized neural network (i.e., one with a good amount of parameters … a large network), when “trained” using the NCE loss function, can estimate the valid/final normalized probability distribution function. In this setup you don’t need to worry about dealing with the normalizing constant explicitly. The “learning” part implicitly takes care of it for you.
“Learning” => training
Training => need for a loss function
Which loss function can help you train such a thing? It is NCE in this case.
In brief/summary:
Your confusion occurs when I (or rather the paper) say “we can ignore the constant” … Another way to think of it is that we are simply ignoring that term in the mathematical expression, but by making use of a proper loss function we end up learning/estimating the valid probability distribution function … our actual goal!
Hope this makes sense!
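PS - if it helps to see the “implicit normalization” view concretely, here is a rough NumPy sketch (my own illustration; the function and argument names are assumptions, and the normalizing constant is simply treated as 1). Minimizing this binary-classification style objective is what pushes the unnormalized network output towards the properly normalized distribution:

```python
import numpy as np

def nce_loss(log_p_model_data, log_p_noise_data,
             log_p_model_noise, log_p_noise_noise, k):
    """Sketch of the NCE objective (illustrative, not production code).

    log_p_model_* : log of the UNNORMALIZED model score (Z treated as 1);
                    in practice this comes from your neural network.
    log_p_noise_* : log density under the known noise distribution p_n.
    k             : number of noise samples drawn per data sample.
    """
    # Posterior probability that a sample u came from the data:
    #   P(D=1 | u) = p_theta(u) / (p_theta(u) + k * p_n(u))
    #              = sigmoid( log p_theta(u) - log(k * p_n(u)) )
    logit_data = log_p_model_data - (np.log(k) + log_p_noise_data)
    logit_noise = log_p_model_noise - (np.log(k) + log_p_noise_noise)

    # Numerically stable log-sigmoid terms:
    #   log sigmoid(z)       = -log(1 + exp(-z))
    #   log (1 - sigmoid(z)) = -log(1 + exp(z))
    log_post_data = -np.logaddexp(0.0, -logit_data)   # log P(D=1 | data sample)
    log_post_noise = -np.logaddexp(0.0, logit_noise)  # log P(D=0 | noise sample)

    # Negative of the binary-classification log-likelihood; minimizing it
    # trains the unnormalized model without ever computing Z explicitly.
    return -(np.mean(log_post_data) + k * np.mean(log_post_noise))
```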
@KapilSachdeva Thanks 😀
🙏
Best video available on this topic. Thank you so much for such a detailed explanation. I wish there were more channels like this. Looking forward to more videos of yours on other papers. Please keep making content like this.
🙏
Thanks for sharing. For someone who wants to read the theoretical paper on contrastive learning, this video is really helpful.
🙏🙏
Hi Kapil, thank you for the very intuitive explanation of Noise-Contrastive Estimation. I came here while reading a paper on self-supervised learning (NPID). Your explanation really helped a lot. 🤘
🙏
Thank you very much for such a comprehensive tutorial
🙏
Terrific tutorial! But I have a question. The model distribution p_theta is given by the neural network, and theta is the parameters of the neural network. But what exactly will p_n look like? Does it mean that we should generate the k negative samples from something like a Gaussian distribution?
🙏 Generating the k negative samples is indeed a tricky thing to do. Don't think in terms of a Gaussian distribution; rather, think about how to get relevant/appropriate negative samples. There are methods for what they call hard negative sample mining, etc. I am not well versed in those, but that is the direction you should think in.
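Just to illustrate that direction (a toy sketch of my own, not a recipe from the paper): the negatives usually come from your own data/corpus rather than from a Gaussian, and hard-negative mining would simply bias this selection towards the most confusable items:

```python
import numpy as np

def sample_negative_indices(n_items, positive_idx, k, rng=None):
    # Toy example: draw k negatives uniformly from the corpus, excluding the
    # positive item. Hard-negative mining strategies would replace the uniform
    # choice with a similarity-weighted one.
    rng = rng or np.random.default_rng()
    candidates = np.delete(np.arange(n_items), positive_idx)
    return rng.choice(candidates, size=k, replace=False)

# e.g. 5 negatives for anchor item 42 out of a corpus of 10,000 items
neg_idx = sample_negative_indices(10_000, positive_idx=42, k=5)
```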
@KapilSachdeva Thanks for your reply!
Thank you, Kapil. Your explanation of this complex topic was very easy to digest!
What software do you use to create videos like this? I would also like to create videos like this.
🙏 PowerPoint. Nothing special.
Very clear explanation. Thank you so much.
But I still want to clarify one thing: here you mention L_NCE, which seems to be the log-likelihood rather than the loss function, which should have an overall minus sign in front. Am I right?
When doing minimization you would put the negative sign in front.
What an excellent and detailed explanation of noise contrastive estimation! Thanks for sharing.
🙏
Thanks a lot for your detailed explanation. It was quite helpful to understand the missing parts in the original paper.
🙏
Thank you, sir, for such a detailed explanation!
🙏
Really nice!
🙏
Amazing!!
Thanks for your explanation.
🙏
Thx so much!
I didn't get a headache!
😃🙏