Demo code: github.com/szhaovas/blog-ytb/blob/master/NES/kl_demo.py
Thank you for making a very intuitive video about the KL divergence 🙏
Thank you. This really helped.
thank you so much :)
Amazing explanation and the code is such a smart idea
Thank you for sharing🙏
This is probably the best and simplest explanation. Thanks @CabbageCat for the video
👍
I think the points on the PDF curves are not probability values: for a continuous random variable, the probability at any single point is 0. Integrating between two points is what yields a probability. Hence, when you integrate over the distribution's entire support, the area under the curve is 1 (a probability cannot exceed 1).
You are right, it was a sloppy use of terms. Should have been probability density.
@@cabbagecat9612 Nonetheless, it was a great effort at explaining the concept.
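A minimal numerical illustration of the density-vs-probability point in this thread (a sketch assuming NumPy/SciPy are available; it is not part of the linked kl_demo.py):

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p = norm(loc=0.0, scale=0.1)  # a narrow Gaussian PDF

# The height of the curve at a point is a density, not a probability,
# so it can exceed 1:
print(p.pdf(0.0))  # ~3.989

# Probabilities come from integrating the density between two points:
prob, _ = quad(p.pdf, -0.05, 0.05)
print(prob)  # ~0.383

# Integrating over the whole real line gives total probability 1:
total, _ = quad(p.pdf, -np.inf, np.inf)
print(total)  # ~1.0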
What do you mean by the statement that the “positive and negative log ratios will cancel each other out?”
Attempting to verify this, suppose we have X∈{1, 2, 3, 4} and two simple PMFs:
- P(X), with probabilities 0.1, 0.2, 0.3, and 0.4 respectively
- Q(X), with probabilities 0.25, 0.25, 0.25, and 0.25 respectively
But ln(0.1/0.25) + ln(0.2/0.25) + ln(0.3/0.25) + ln(0.4/0.25) = -0.487109, not 0. Perhaps I’m doing something wrong/misinterpreting the video, but I don’t get why this should be true.
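(For what it's worth, the arithmetic above checks out; a minimal Python sketch, assuming NumPy, not part of the linked kl_demo.py:)

import numpy as np

P = np.array([0.1, 0.2, 0.3, 0.4])
Q = np.array([0.25, 0.25, 0.25, 0.25])

# Unweighted sum of log-ratios from the comment above:
print(np.log(P / Q).sum())  # ~-0.487109, indeed not 0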
Since the area under the curve of a PDF is 1, if P(x1) is very large at some point x1, then P at other points (P(x2), P(x3), etc.) has to be smaller, so that the total area under P(x) still comes out to exactly 1. So if P(x1) > Q(x1), there must be other points xi where P(xi) < Q(xi), and the positive log-ratio at x1 will be cancelled out by the negative log-ratios at those xi (I use xi here because there can be an arbitrary number of such points).
TBH I haven't proven whether the positive and negative log-ratios cancel out to exactly 0 (see the numerical check after this reply), but they definitely cancel to some extent, and that is not what we want here: both P(xi) > Q(xi) and P(xj) < Q(xj) should be contributing together towards a larger divergence between P and Q, not against each other.
My math is a bit rusty, but here's a sketch of the proof idea:
- First of all, you would need to integrate log(P(x)/Q(x)) over x from -inf to +inf. (You can't choose specific x values here because you don't know in advance where P(x) or Q(x) is large.)
- log(P(x)/Q(x)) = log(P(x)) - log(Q(x))
- The integral then splits into (the integral of log(P(x)) dx) minus (the integral of log(Q(x)) dx).
- Both the integral of P(x) and the integral of Q(x) equal 1, because the area under the curve of a PDF is 1. I'm not sure how taking the log changes things, so I can't say whether the difference above is exactly 0; but each term is constrained by its distribution's unit area rather than by how far P and Q sit from each other, so even if the difference isn't 0, it doesn't track the actual "distance" between the distributions.
Hope this helps.
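Following up on the cancellation point: here is a small numerical check (a sketch assuming NumPy; P and Q are made-up PMFs for illustration, not from the video or kl_demo.py). When Q is a permutation of P, the unweighted log-ratios cancel out exactly even though P and Q differ, while the KL divergence, which weights each log-ratio by P(x), remains positive:

import numpy as np

# Q is a permutation of P, so prod(P) == prod(Q) and the unweighted
# log-ratios cancel exactly: sum(log(P/Q)) = log(prod(P)/prod(Q)) = 0.
P = np.array([0.2, 0.1, 0.4, 0.3])
Q = np.array([0.1, 0.2, 0.3, 0.4])

print(np.log(P / Q).sum())        # ~0.0 even though P != Q
print((P * np.log(P / Q)).sum())  # ~0.098 > 0: KL(P||Q) still sees the gap

# Weighting by P(x) is what prevents the cancellation: by Gibbs'
# inequality, KL(P||Q) >= 0, with equality only when P == Q.

So an unweighted sum of log-ratios can report "no difference" for clearly different distributions, which is exactly why the KL definition weights each log-ratio by P(x).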