12:24 You write "For every realization of X, there exist many realizations of Z". But I can't get my head around this. A neural net should be deterministic, no? If I feed in MNIST image #1739 from my training set and pass it in 10 times, the z-representation is going to be the same each time. And 6D -> 2D is a squash. For a given z I'd expect many x (4D worth), but for a given x I would expect a unique realisation of z. It's difficult to watch the video further with this lingering confusion in my mind.
Apologies if I was not clear enough, but here is another attempt at explaining it.
Let's start by saying that you have an observed random variable X. A philosophical question we can ask ourselves is: what is it that created X in this universe? As you know, everything we observe is caused by something else. It just so happens that we cannot always observe the things that cause what we observe. In statistical modeling, we use the symbol Z for the things that we cannot observe (it could be one thing or a vector of things).
Now, if you are ok with the above thought, I hope you would agree with me that X could have been caused by different Zs. Can we really find (i.e. accurately and with 100% assurance predict) the cause given an observation? My answer is: not really! Now this should make us a bit humble and acknowledge that the best we can do is perhaps predict different possible Zs for a given X. Each Z may have a probability associated with it; some Zs may be more likely than others.
Take water on the road as an example. Was it the rain that wet the road or was it the sprinklers?
What I have described above is a probabilistic latent variable model. In this model, you aim to predict many possible causes, i.e. Zs, given an observation X.
Now, let's bring in your neural network. Yes, it is deterministic, and you are correct that if you try to predict Z from X you will always get the same value. This is why, in a VAE, we do not try to predict Z; rather, we predict the parameters of the distribution from which Z comes. Which distribution it comes from is an assumption that we make. Really, we do not know anything! We make assumptions! Sometimes, by chance or through our prior knowledge, we make good assumptions.
If we say (assume) that Z is a normal (Gaussian) random variable, then the encoder part of the VAE will predict mu and sigma. Once your network is trained and in test mode, then yes, for a given X you will always get the same mu and sigma. We will then use mu and sigma to sample many possible Zs for a given X.
One more time :)
For a given X (a sample, an image) we predict the mu & sigma of the possible cause. We then use the mu and sigma to take either 1 sample or as many as we want. That sample (or those samples) is Z! ... The act of sampling is probabilistic in nature!
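To make the "same X, many Zs" point concrete, here is a tiny NumPy sketch (the numbers and the choice of latent_dim = 2 are made up for illustration, not taken from the video):

```python
import numpy as np

# Hypothetical numbers that the (deterministic) encoder might output for ONE image x;
# in a real VAE they would come from encoder(x). Here they are made up for illustration.
mu = np.array([0.3, -1.2])     # mean of the distribution of Z for this x (latent_dim = 2)
sigma = np.array([0.8, 1.1])   # standard deviation of that distribution

# The same x always gives the same mu and sigma, yet we can draw as many Zs as we want.
n_samples = 5
eps = np.random.randn(n_samples, mu.shape[0])   # standard normal noise
z = mu + sigma * eps                            # 5 different realizations of Z for one X
print(z)
```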
Hopefully this helps a bit. Let me know.
What a great reply! Hats off!!
@@KapilSachdeva Thank you! One more little question: why do we predict latent_dim many (mu, sigma) values and not just a single mu and a single sigma? In other words, why do we need several distributions with different parameters?
@@KapilSachdeva "Can we really find (i.e. accurately and with 100% assurance predict) the cause given an observation? My answer is not really!" Now I understand why the marginal probability, ∫ p(z) p(x|z) dz, is an intractable integral, right sir?
I have been watching VAE videos for the past week and this is the only video that helped me understand. Your way of teaching is very clear. Thanks a lot!! Keep making more videos.
🙏
This might be the clearest and most in-depth explanation of variational autoencoders on TH-cam. I was struggling for a while to understand VAEs and your video has really helped me. Thanks a lot!
🙏
This is the best explanation of Variational Autoencoder I have ever seen.
🙏
Thank you so much for the clear explanation of stochastic neurons! looking forward to the reparam trick tutorial!💯
🙏
It’s here - th-cam.com/video/nKM9875PVtU/w-d-xo.html
Your videos have the most amazing clarity on the mathematics behind AI, and a very nice sense of humor.
🙏
Thank you so much for this amazing series. I am now able to connect the dots. I have a couple of questions and would really appreciate your take on these. The first one is about how I understand the difference between the latent representations of VAEs vs AEs, and I would like your take on this understanding.
1. Consider three samples or images (i_1, i_2, i_3) and the same architecture as you have in this video. A trained autoencoder with two latent neurons (z_1, z_2) would simply yield one point in the 2D latent space for each image, so we would have 3 points in the latent space, one each for i_1, i_2 and i_3. For a trained VAE with two latent dimensions (z_1, z_2), the first image i_1 will be sampled from a joint Gaussian distribution (N1(mean_1, log_var_1), N2(mean_2, log_var_2)), i.e. by randomly plucking a point from the ellipsoid (the projection of the bivariate Gaussian on a 2D plane). And images i_1, i_2, i_3 are basically plucked randomly from three such ellipsoids in the same 2D space. So instead of three points in the latent space of the AE, we will have three ellipsoids in the latent space of the VAE, each representing the distribution of i_1, i_2 and i_3. Is this understanding correct?
2. I am currently going through beta-VAEs and trying to understand the concept of disentanglement. I am not able to understand how playing with the value of beta, which is a coefficient in front of the KL divergence term of the ELBO, achieves disentanglement. How can I build an understanding of this concept in terms of Gaussian distributions? I understand that disentanglement does not mean orthogonality of the latent vectors, but rather that the distributions are uncorrelated. I am not able to digest this concept clearly. Could you please throw some light on this topic?
Again thank you so much for putting immense effort into making these great lectures for free!
You are the boss... I can say: a fantastic and magnificent way of presenting, captivating till the end until we understand. A million thanks!
🙏
Dear @Kapil Sachdeva,
again, my biggest thank you for this amazing series.
Greetings from Germany.
🙏
Clear and easy to understand compared to other videos that throw lots of math formulas at the beginning. Great work! Subscribed to your channel.
🙏
Kapil... Awesome... you are back. So glad.
🙏
Best video on VAE 👏👏👏👏
🙏
Thank you very much.. marvelous explanation!!
🙏
Awesome video. Is there any intuition on why we are using reverse KL as opposed to forward KL?
Brilliant exposition of the underlying math, Kapil. Going to watch your videos on KL divergence and evidence lower bounds.
🙏
Great video! This series (KL, ELBO and VAE) is the most easy-to-understand one I've ever seen.
I have two questions about the ELBO:
1. Why can the first term be interpreted as a reconstruction loss? It's an expectation of the log likelihood.
2. How can we calculate the 2nd term? If a code explanation were added, it would be perfect.
Sorry for the late reply:
1) x|z means that you are generating x using the predicted z. However, since x is being generated from z, it will not be perfect (at least initially). We also have the ground-truth x. The generated x is compared with the ground-truth x (using a loss function), and hence this term can be interpreted as a reconstruction loss (a rough code sketch of both terms follows below).
2) I explain this in the VAE tutorial. There are code snippets. That said, I think the VAE example in keras is pretty good. If you follow the tutorial along with that code it should be helpful - keras.io/examples/generative/vae/
Hope this is helpful.
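To make both points a bit more concrete, here is a rough sketch of the two ELBO terms (my own illustration, not the code from the video or the Keras example; it assumes a Bernoulli likelihood for the pixels and a standard normal prior N(0, I)):

```python
import numpy as np

def elbo_terms(x, x_recon, mu, log_var):
    """Sketch of the two ELBO terms for a diagonal Gaussian q(z|x) and an N(0, I) prior.

    x, x_recon: arrays of shape (batch, n_pixels) with values in (0, 1).
    mu, log_var: arrays of shape (batch, latent_dim).
    Shapes and names are assumptions for illustration, not the video's exact code.
    """
    eps = 1e-7
    # 1) Reconstruction term: the expected log-likelihood log p(x|z). With a Bernoulli
    #    decoder this is the negative binary cross-entropy between x and its reconstruction.
    recon_ll = np.sum(x * np.log(x_recon + eps) + (1 - x) * np.log(1 - x_recon + eps), axis=1)
    # 2) KL(q(z|x) || p(z)): available in closed form because both distributions are Gaussian.
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1)
    # ELBO = E[log p(x|z)] - KL. We maximize it, i.e. minimize -(recon_ll - kl) as the loss.
    return recon_ll.mean(), kl.mean()

# Tiny usage example with random numbers, just to show the shapes.
rng = np.random.default_rng(0)
x = rng.random((4, 784)); x_recon = rng.random((4, 784))
mu = rng.normal(size=(4, 2)); log_var = rng.normal(size=(4, 2))
print(elbo_terms(x, x_recon, mu, log_var))
```

The second term has this closed form precisely because both q(z|x) and the prior are assumed to be Gaussian.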
Wow! This was such a wonderful explanation.
🙏
Amazing tutorial. Thank you sir!
🙏
Spectacular 👏
🙏
Thank you so much for explaining it so clearly!
🙏
Thanks for sharing, cool stuff. Could you please tell me what tools/software you use to make these animations and drawings?
🙏 Mostly PowerPoint; for a few advanced animations I use manim (github.com/manimCommunity/manim).
Hi Kapil. At 30:03, do the green θ and the yellow θ both refer to the decoder's parameters?
Green theta is for the decoder, phi is for the encoder … the yellow thetas are not learned parameters. I explain it in more detail in the reparameterization trick video; see the next video in this series.
Very easy to understand video.
Thanks 😊
🙏
Thank you for the video.
While watching the video, some questions came up and I hope to hear from you.
1. Do we do backpropagation for the encoding part and the decoding part at the same time, or separately?
2. For the prior probability of the latent variable z, would any normal distribution that I think is appropriate be fine?
1) At the same time. As you can see, after connecting the encoder and decoder it takes the same shape as an autoencoder. The loss function will be the ELBO (see the sketch after this reply).
2) Selecting the prior on z is the most difficult aspect. In the original VAE, the standard normal N(0, 1) was used. However, if you have a better idea (based on your domain knowledge) you should use it. There is newer research that attempts to "learn" the prior as an additional step in the VAE.
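As a rough sketch of point 1 (toy TensorFlow/Keras code of my own, not the code from the video; the layer sizes, latent_dim = 2, and the dummy batch are assumptions), note that a single loss and a single gradient tape update the encoder and the decoder together:

```python
import tensorflow as tf

latent_dim = 2  # assumption for illustration; not necessarily the video's setting

# Toy stand-ins for the encoder and decoder of flattened 784-pixel images.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2 * latent_dim),            # outputs [mu, log_var]
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(784, activation="sigmoid"),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

x = tf.random.uniform((8, 784))                        # dummy batch, just to show the mechanics

with tf.GradientTape() as tape:
    mu, log_var = tf.split(encoder(x), 2, axis=1)
    eps = tf.random.normal(tf.shape(mu))
    z = mu + tf.exp(0.5 * log_var) * eps               # reparameterized sample of Z
    x_recon = decoder(z)
    recon = tf.reduce_mean(tf.reduce_sum(              # reconstruction term (Bernoulli log-likelihood)
        -(x * tf.math.log(x_recon + 1e-7) + (1 - x) * tf.math.log(1 - x_recon + 1e-7)), axis=1))
    kl = tf.reduce_mean(-0.5 * tf.reduce_sum(          # KL(q(z|x) || N(0, I)), closed form
        1 + log_var - tf.square(mu) - tf.exp(log_var), axis=1))
    loss = recon + kl                                   # negative ELBO

# A single backward pass updates encoder AND decoder weights together.
variables = encoder.trainable_variables + decoder.trainable_variables
grads = tape.gradient(loss, variables)
optimizer.apply_gradients(zip(grads, variables))
```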
Please do more paper explanations!!!!! God bless you!
🙏
Thanks for the video! One question: what is the amortization property in a Latent Variable Model, at 33:13?
Hello Eric, as such there is no such thing as an amortization property in a Latent Variable Model.
I am sure it is the terminology that has created some confusion for you.
Let me try to clarify.
1)
Model vs Algorithm -
Quite often these 2 terms are used interchangeably, but they are not the same. A model is a specification of how various variables (that describe a physical phenomenon) interact with each other, the potential outcomes, and what distribution types they follow (if we are considering random variables). This so-called interaction between the various variables of the model takes the form of parameters or weights. Also, if we have random variables, then we would be interested in the parameters of the associated distribution type; e.g. if we say (in our model) that our random variable follows a Normal distribution, then its parameters would be mu and sigma. During modeling, we may not specify the values of mu and sigma.
On the other hand, an algorithm is a mechanism to discover/identify the values of the weights that create the associations between the various variables, and the parameters of the distribution types they follow (if they are random variables).
2)
Latent Variable Model -
It is a specification of a phenomenon in which we assume the presence of random variables that we cannot observe.
3)
The job of an algorithm for a Latent Variable Model is to determine the parameters of the distributions of both the unobserved (latent) and observed random variables. The algorithm would also find the weights that establish the relationship between the latent and observed variables.
In general, the traditional algorithms (e.g. Expectation-Maximization) for Latent Variable Models are not very efficient in terms of computational cost, and they become worse when we deal with very large training datasets.
4)
Amortization is a concept, or rather an inspiration, that we draw from the human brain. Our brain has the capability to utilize past inferences/computations and apply them to new data. You can also see it as "smart memorization" ... as in, it is not pure "memorization" but rather a capability to utilize it smartly for new observations.
5)
Variational AutoEncoder is an "algorithm" for Latent Variable Models that offers this amortization-type capability.
Amortization is achieved because the architecture/setup/methodology of neural networks (in general) has the capability to offer that smart memorization.
This way, the VAE gives us the capability to use Latent Variable Models even when we have large datasets (a toy illustration follows below).
Hope this makes sense now. Let me know if you need more clarifications.
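A deliberately over-simplified toy contrast between the two (all names and numbers below are made up, purely to illustrate the idea of shared, reusable inference weights):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 784))        # pretend dataset of 1000 flattened images

# Classical (non-amortized) variational inference: a separate set of variational
# parameters (mu_i, log_var_i) has to be kept, and optimized, for EVERY datapoint.
per_point_params = [{"mu": np.zeros(2), "log_var": np.zeros(2)} for _ in range(len(data))]

# Amortized inference (what the VAE encoder does): ONE function with shared weights
# maps any x to (mu, log_var) -- the "smart memorization" lives in these weights.
W = rng.normal(scale=0.01, size=(784, 4))   # toy encoder weights, shared by all datapoints

def encode(x):
    out = x @ W                              # reuse the same weights for every observation
    return out[:2], out[2:]

# A never-before-seen x gets its posterior parameters with no extra per-point optimization.
mu_new, log_var_new = encode(rng.random(784))
```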
Thank you so much for this video. I have one query: how can we predict the random variable when it is not observed?
Even if we predict it, how can we be sure that it can be accepted with confidence?
Think of it this way:
If you look outside your window and see that the street is wet, you predict that it must have rained. The wet street was caused by the rain (high probability) or maybe by a sprinkler (low probability). Nevertheless, you managed to predict the cause based on data (the wet street) and your prior experience.
Now apply this example to the data at hand as well. In many ways you can see the latent variables as the cause. The idea here is to be able to predict the cause (latent variable) from the observed data.
How we make this happen is by first predicting it (the latent variable) and then trying to regenerate the original data from our prediction. This is where the auto-encoder style architecture is used.
We can put some confidence in this entire thing by doing some test/validation as you do in any other ML or statistical model.
Hope this makes sense. Let me know if you need more clarifications.
Perhaps I missed this, but could you explain why we multiply logvar by 0.5? Line 8 of the code at 22:01
Since we are "making" our neural network "predict" 2*log(sigma), we multiply by 0.5 to cancel the 2 and then use exponentiation to cancel the log.
You will find this explanation starting at 17:28
log(var) = log(sigma^2) = 2 * log(sigma)
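In code this is just (a two-line sketch with a made-up log_var vector):

```python
import numpy as np

log_var = np.array([-0.5, 0.1])   # the network's output: log(var) = log(sigma^2) = 2*log(sigma)
sigma = np.exp(0.5 * log_var)     # the 0.5 cancels the 2, the exp cancels the log -> plain sigma
```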
What is the meaning of a realization of a Random Variable?
A sample of a Random Variable. A random variable has an associated probability distribution; a realization is a sample from that distribution.
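For example (a tiny NumPy illustration, assuming X follows a standard normal distribution):

```python
import numpy as np

# Each draw below is one realization, i.e. one concrete sampled value, of X ~ Normal(0, 1).
x1 = np.random.normal(loc=0.0, scale=1.0)
x2 = np.random.normal(loc=0.0, scale=1.0)
print(x1, x2)   # two different realizations of the same random variable
```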
Hah... Christmas gift?
🙏 ... Trust all is well Sunny!
@@KapilSachdeva Sir, the best thing is that I am glad you are back. I am looking forward to more content.