For those who don't know the research and esoteric academic work on this stuff: just imagine how difficult this would be to learn from academic documents alone. Ben has saved us hours upon hours of work and frustration.
This is incredible. I've struggled reading through both papers mentioned but finally got an intuitive idea of how HMC works 4 mins into this video.
Don’t cry, dear. It’s easy. Lol
This was an excellent video, thanks! I hadn't understood until now that the only reason proposals in HMC would ever be rejected is a slight increase in the value of H, which arises because we're only approximately integrating the equations of motion.
That's a very intuitive explanation of the HMC sampler, great job! Thanks for sharing.
Thank you!
Ben Lambert no thank you! This video is really clear and helpful!
Thanks for this video. I was trying a lot to understand physical analogy of HMC. This video really helped.
Thank you for such a well prepared and explained material!
I am actually a bit confused about the explanation, specifically about the relationship to statistical mechanics. With the exception of the initialization of the momentum, there are no random fluctuations in the dynamics (as opposed to Ito processes). The Boltzmann distribution result holds specifically for Ito processes (it can be derived from the Fokker-Planck equation), but not for such deterministic dynamics. I would be happy about a clarification.
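(Not an answer from the video, but one standard way to reconcile this: the Boltzmann form here does not rely on an Ito/Fokker-Planck argument, because the randomness enters only through the momentum refresh, and the deterministic flow preserves the canonical density on its own. A rough sketch, in standard HMC notation:)

```latex
% Sketch using standard HMC notation, not taken from the video.
% The joint target is the canonical density
\[
p(\theta, m) \;\propto\; \exp\{-H(\theta, m)\}, \qquad H(\theta, m) = U(\theta) + K(m).
\]
% Hamiltonian flow conserves H and, by Liouville's theorem, preserves
% phase-space volume, so transporting the density along the flow changes
% neither the exponent nor the volume element:
\[
p(\theta_t, m_t)\, d\theta_t\, dm_t \;=\; p(\theta_0, m_0)\, d\theta_0\, dm_0 .
\]
% The stochastic ingredient is the Gibbs-style momentum refresh m ~ N(0, M)
% between trajectories, which is what moves the chain between energy levels;
% together the two steps leave exp{-H} invariant.
```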
@SpartacanUsuals - Why wouldn't Gibbs sampling work in this scenario? Since we are sampling from a multidimensional probability space, why not just use Gibbs sampling? How is HMC different from Gibbs sampling? I am having a tough time understanding this; it would be great if you could answer.
By the way, great explanation. It did help me understand and clarify the basic concepts of HMC.
I highly appreciate the effort you put into creating the animations!
Thank you for this, it's pure gold. Highly appreciated.
Brilliant work, thank you for posting.
Thank you for this video, it goes very nicely with the papers referenced. I was wondering if you have a video, or could recommend one, that goes into more depth on solving for the path of the particle? I am still struggling to fully understand what that means; otherwise I am clear on the other steps :)
Thanks for your wonderful video!
Here are my questions:
First, I didn't understand why changing M* to -M* makes the conditional probability change from 0 to 1.
And also, why is the path from (theta, M) to (theta*, M*) deterministic?
The path from (theta, M) to (theta*, M*) is deterministic because it is the result of integrating the particle's path through the parameter space, in accordance with conservation of the total energy. This is purely mechanical reasoning; nothing probabilistic is used for this step.
Understanding the change of sign of M* is still a bit magic for me! ^_^
The -M* is a clever trick. You essentially reverse the motion.
Imagine throwing a ball with a certain momentum m1 from a certain position x1. After a time t it will end up in a place x2 with momentum m2. If you then throw the ball from x2 but with momentum -m2 after time t it will end up at x1.
So from (x1, m1) you go to (x2, m2), but you report the result as (x2, -m2), because from (x2, -m2) you get back to (x1, -m1), which you would report as (x1, m1).
(You don't actually go back in the algorithm. It's just to make sure there is no bias in the proposed values. See Metropolis-Hastings)
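To make the ball analogy concrete, here is a minimal Python sketch (my own illustration, not code from the video) of a leapfrog integrator for a toy quadratic potential. It shows both points at once: the forward path is completely deterministic, and integrating again from the endpoint with the momentum flipped lands you back where you started.

```python
import numpy as np

def grad_U(theta):
    # Gradient of a toy quadratic potential U(theta) = theta**2 / 2,
    # standing in for minus the log posterior.
    return theta

def leapfrog(theta, m, step_size=0.1, n_steps=25):
    """Deterministically integrate Hamilton's equations with the leapfrog scheme."""
    theta, m = np.copy(theta), np.copy(m)
    m = m - 0.5 * step_size * grad_U(theta)      # initial half step for momentum
    for _ in range(n_steps - 1):
        theta = theta + step_size * m            # full step for position
        m = m - step_size * grad_U(theta)        # full step for momentum
    theta = theta + step_size * m                # last full position step
    m = m - 0.5 * step_size * grad_U(theta)      # final half step for momentum
    return theta, m

# (x1, m1) -> (x2, m2): the forward "throw" is completely deterministic.
theta0, m0 = 1.0, 0.5
theta1, m1 = leapfrog(theta0, m0)

# Throw again from (x2, -m2): you land back at (x1, -m1), i.e. the motion reverses.
theta_back, m_back = leapfrog(theta1, -m1)
print(theta_back, -m_back)   # approximately (1.0, 0.5) again, up to floating point
```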
Thank you for this great explanation!
Thank you for the great explanation.
Excellent Video! But one question.
At 17:10, where does the normalizing constant 1/z come from? The integral over the support of the normal's pdf is 1, and the other terms on the RHS are simply constant w.r.t. the variable of integration.
Is the LHS not a valid pdf at this point, and is that why we obtain the normalizing constant?
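For what it's worth, here is a guess at the step being asked about, assuming the video builds the joint density of (theta, m) by multiplying the posterior by the momentum's normal pdf; the 1/z would then be the evidence p(x), needed because the posterior is only known up to proportionality.

```latex
\[
p(\theta, m \mid x) \;=\; p(\theta \mid x)\, p(m)
\;=\; \frac{1}{z}\, p(x \mid \theta)\, p(\theta)\, p(m),
\qquad z \;=\; p(x) \;=\; \int p(x \mid \theta)\, p(\theta)\, d\theta .
\]
% The normal pdf for m integrates to 1, but p(x|theta) p(theta) does not
% integrate to 1 over theta, so 1/z is exactly what makes the left-hand side
% a valid density. It cancels in the Metropolis acceptance ratio, which is
% why HMC never has to compute it.
```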
Thank you~ I am watching this video in preparation for my master's program.
I have read something like 100 blog posts, and none of them states things as clearly as this video.
Great video, thanks Ben. It would have been good to see a visualisation showing the -m trick working; it took me a long time to satisfy myself of this (essentially thinking through how the equations of motion applied to the proposal point, after flipping the momentum, get us back to the original point). It is also very interesting that in an ideal world this essentially never rejects a proposal, as the original point and the proposal should have the same energy; a visualisation driving that point home would also have been useful. Once I felt I understood these concepts, the algorithm made sense to me.
Of course, it is very important for the -m* trick that the joint distribution is independent in m and theta, so that we can be certain we have symmetry between the paths at m and -m*.
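A small worked version of this point, in my own notation and assuming a zero-mean Gaussian momentum with covariance M:

```latex
\[
p(\theta, m) \;=\; p(\theta)\, \mathcal{N}(m \mid 0, M), \qquad
\mathcal{N}(m \mid 0, M) \;\propto\; \exp\!\Big(-\tfrac{1}{2}\, m^{\top} M^{-1} m\Big)
\;=\; \mathcal{N}(-m \mid 0, M),
\]
% so (theta, m) and (theta, -m) have the same joint density, and the reversed
% trajectory started from (theta*, -m*) is exactly as probable as the forward
% trajectory started from (theta, m).
```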
This is awesome. Could you please do one for variational inference?
Thanks! Yes, one is currently being planned. Cheers, Ben
This presentation is fantastic. Is there any chance you'd be able to expand it to relate the sampling concepts to warning messages in Stan and the necessary tweaking (tree depth, adapt_delta, etc.)?
Excellent lecture!
So HMC almost always accepts (rather than rejects) proposals as long as the momentum term is set correctly initially?
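Roughly, yes, though the initial momentum only needs to be drawn from the right distribution rather than "set correctly". Using the textbook HMC acceptance rule (not a quote from the video):

```latex
\[
\alpha \;=\; \min\Big(1,\; \exp\big\{H(\theta, m) - H(\theta^{*}, m^{*})\big\}\Big).
\]
% If the equations of motion were integrated exactly, H would be conserved,
% the exponent would be 0, and alpha = 1 for every proposal. Rejections occur
% only because the leapfrog discretisation lets H drift slightly.
```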
Very nice visualization.
Hi, I have a question. Could you send me the Wolfram program that you presented here? I have to give a presentation for my PhD and it would be useful!
Thank you so much
What was the intuition or reason for choosing a bimodal distribution? Was it just to make it easier to visualize both the position and momentum components in the graph as convergence occurs?
Thanks for your comment. Yes, it was a somewhat arbitrary choice. I mainly chose this rather than a unimodal Gaussian so that the paths were more interesting (i.e. not just circles). Best, Ben
Thank you for the video.
I just have a question. For the bimodal posterior, wouldn't you receive warnings from Stan regarding the chains' performance? I think it was Betancourt who said that, in the presence of bimodality, HMC is not appropriate.
Yes, very helpful, Mr. Lambert. Thank you :)
Thanks for the nice job~🍺 It's precise and easy to understand.
Amazing video to explain the intuition! Love it! I have a question though. Mathematically, I understand why we need to flip m to -m, i.e. to make the proposal symmetric. But in effect, since we are going to throw away m as we are only interested in theta, would we still need to flip it? It doesn't affect the r value (since it's Gaussian) or the next momentum value (since we are squaring it).
OK, I found the answer in Neal's paper: "... This negation need not be done in practice, ..."
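A one-line gloss on why the negation can be skipped in practice (my own summary, assuming the usual Gaussian kinetic energy):

```latex
\[
K(-m^{*}) \;=\; \tfrac{1}{2}\,(-m^{*})^{\top} M^{-1} (-m^{*}) \;=\; K(m^{*})
\quad\Longrightarrow\quad
H(\theta^{*}, -m^{*}) \;=\; H(\theta^{*}, m^{*}),
\]
% so flipping the sign changes neither the acceptance ratio nor anything the
% next iteration uses, since the momentum is discarded and resampled anyway;
% the flip matters only for the reversibility argument.
```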
Great job! Thanks a lot! :)
I love you, thank you!
Genius!
What is x?
Thanks for your comment. 'x' is the data sample. So p(X|theta) is the likelihood. Hope that helps! Best, Ben