Variational Inference by Automatic Differentiation in TensorFlow Probability

  • Published 20 Sep 2024

Comments • 47

  • @MachineLearningSimulation
    @MachineLearningSimulation  3 years ago +11

    Big error at 15:00: the approximation to the ELBO (and, in general, to the expectation) should not contain the q(mu)!
    The approximation to the ELBO should instead read:
    ELBO ≈ 1/L * sum_{l=0}^{L-1} ( log p(mu_l, X=D) - log q(mu_l) )
    The argument that one usually "ignores this term" is then moot, as there is no such term in the first place.
    Sorry for the confusion. The file on GitHub has been updated.
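    The corrected estimator is easy to check numerically. A minimal NumPy sketch on a hypothetical conjugate Gaussian model (not the video's TFP code; all values here are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: prior mu ~ N(0, 1), likelihood X_i ~ N(mu, 1), fixed data D.
D = np.array([0.5, 1.2, 0.8])

def log_normal(x, loc, scale):
    return -0.5 * np.log(2 * np.pi * scale**2) - 0.5 * ((x - loc) / scale) ** 2

def log_joint(mu):
    # log p(mu, X=D) = log p(mu) + sum_i log p(x_i | mu)
    return log_normal(mu, 0.0, 1.0) + log_normal(D, mu, 1.0).sum()

def elbo_estimate(mu_s, sigma_s, L=10_000):
    # ELBO ~= 1/L * sum_l ( log p(mu_l, X=D) - log q(mu_l) ),  mu_l ~ q
    mu_l = rng.normal(mu_s, sigma_s, size=L)
    log_q = log_normal(mu_l, mu_s, sigma_s)
    log_p = np.array([log_joint(m) for m in mu_l])
    return np.mean(log_p - log_q)

# For this conjugate toy, the exact posterior is N(sum(D)/(n+1), 1/(n+1)).
# Plugging it in as the surrogate should maximize the ELBO (= log p(D)).
n = len(D)
post_mean, post_std = D.sum() / (n + 1), np.sqrt(1.0 / (n + 1))
print(elbo_estimate(post_mean, post_std))  # the best achievable value here
print(elbo_estimate(0.0, 1.0))             # a worse surrogate scores lower
```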

    • @ashitabhmisra9123
      @ashitabhmisra9123 2 years ago +1

      Hello, I have a doubt at 18:14, taking the correction into consideration.
      Assume L = 1 (the simplest case) and Data = {d_1}:
      Loss(u_s, sigma_s) = - log(P(u[0], X = d_1)) + log(q(u[0])) ----- Eq. 1
      Now, if we take the derivative of Loss() w.r.t. u_s or sigma_s, won't the first term just be 0, leaving only the derivative of the log of the surrogate term? So how would the joint play any role in the update stage of gradient descent?
      I've been stuck on this detail :( and can't seem to figure it out. Any guidance would be appreciated! Thanks!!

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      Hi @@ashitabhmisra9123,
      thanks for the great question. It has been some time since I uploaded the video, so I will just give a quick answer; maybe that already points you in the right direction. Let me know if it is still unclear.
      You assumed L=1, so you sample one mu value from the proposed surrogate posterior. You use this value in the computation of -log(P(...)) and of log(q(...)). If the sampled mu were given as a constant, you would be right that only the second term contributes non-zero derivative information. However, the parameters that autodiff differentiates with respect to also appear in the sampling stage of mu; this is how the first term contributes a non-zero gradient. Intuitively, if the surrogate had different parameters, the sample would (most likely) be different.
      Hope that helped.
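      The reparameterization argument above can be checked numerically: freeze the base noise, move the surrogate mean, and the "-log p" term alone already has a non-zero finite-difference gradient (a sketch with toy values assumed here, not the video's code):

```python
import numpy as np

# With the reparameterization mu = mu_s + sigma_s * eps, the sample itself
# depends on the surrogate parameters.
eps = 0.7  # one fixed draw of the standard-normal base noise (assumed)
D = np.array([1.0, 1.5])  # hypothetical data; prior N(0,1), likelihood scale 1

def log_normal(x, loc, scale):
    return -0.5 * np.log(2 * np.pi * scale**2) - 0.5 * ((x - loc) / scale) ** 2

def first_term(mu_s, sigma_s):
    # The "- log p(mu, X=D)" part of Eq. 1, with mu reparameterized.
    mu = mu_s + sigma_s * eps
    return -(log_normal(mu, 0.0, 1.0) + log_normal(D, mu, 1.0).sum())

# Central finite difference w.r.t. mu_s: non-zero, because moving mu_s moves mu.
h = 1e-6
grad = (first_term(h, 1.0) - first_term(-h, 1.0)) / (2 * h)
print(grad)  # analytic value here: 3*mu - sum(D) = 3*0.7 - 2.5 = -0.4
```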

    • @ashitabhmisra9123
      @ashitabhmisra9123 2 years ago

      @@MachineLearningSimulation this really helped! I completely missed this detail. Thanks a ton!

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      @@ashitabhmisra9123 Great :). You're welcome!

    • @johnsnow8591
      @johnsnow8591 1 year ago

      This one should be pinned.

  • @davidlorell5098
    @davidlorell5098 2 years ago +10

    This was an *extremely* valuable video for me, and that goes for your entire Variational Inference series. Extremely high quality content. Visually appealing, clearly narrated, and thorough. Thank you for your hard work!

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      I'm super glad the videos were helpful :). Thanks for the kind words!

    • @davidlorell5098
      @davidlorell5098 2 years ago

      @@MachineLearningSimulation One remaining question from me: why do we lean on the law of large numbers when approximating the ELBO but not the marginal? That is, someone might be inclined to say that with a large enough N, P(D) is approximately (1/N)*SUM(P(X=D | Z=Z_i)), where each Z_i is sampled from P(Z). Getting P(D) this way, we could normalize the joint and obtain an approximate posterior without going through the trouble of variational inference.
      One possibility is that VI results in a proper distribution, whereas this approach is never guaranteed to scale the integral over the joint to exactly 1. Is there some other obvious reason I'm missing? EDIT: Ah, it might be that while we can evaluate the joint probability of specific proposals, it can be difficult to sample from the prior in general. With VI, we only need to sample from the tractable approximate distribution.
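      For intuition, the naive prior-sampling estimate can be tried on a toy where the answer is known in closed form (a hypothetical conjugate Gaussian model, values assumed). It works here precisely because prior and posterior overlap well; its variance blows up when they don't, which is one reason VI optimizes the ELBO instead:

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.array([0.5, 1.2, 0.8])  # data; prior mu ~ N(0,1), X_i ~ N(mu, 1)

def log_normal(x, loc, scale):
    return -0.5 * np.log(2 * np.pi * scale**2) - 0.5 * ((x - loc) / scale) ** 2

# Naive Monte Carlo: p(D) ~= 1/N * sum_i p(X=D | mu_i), mu_i sampled from the prior.
mus = rng.normal(0.0, 1.0, size=50_000)
log_liks = np.array([log_normal(D, m, 1.0).sum() for m in mus])
shift = log_liks.max()  # log-sum-exp trick for numerical stability
log_pD_mc = shift + np.log(np.mean(np.exp(log_liks - shift)))

# Closed form for this conjugate toy: marginally D ~ N(0, I + 1 1^T).
C = np.eye(3) + np.ones((3, 3))
log_pD_exact = (-0.5 * len(D) * np.log(2 * np.pi)
                - 0.5 * np.linalg.slogdet(C)[1]
                - 0.5 * D @ np.linalg.solve(C, D))
print(log_pD_mc, log_pD_exact)  # close in this low-dimensional, well-overlapping case
```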

  • @andersgeil
    @andersgeil 1 year ago +1

    Thanks for an amazing series of high quality videos! As a suggestion for future content, I would love to see a few videos on MCMC approaches as well, especially on HMC/NUTS. Also, I'm curious if you happen to have any opinions on Pyro as opposed to TFP?

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago +1

      Great suggestions! And thanks for the kind feedback :).
      The probabilistic ML series is definitely not finished, and the MCMC approaches are fascinating to look into, also with respect to their usage in a probabilistic framework. For now, I want to continue a bit with AD & adjoint methods and a series on finite difference methods. After that, I think I will return to the probability playlist.
      Regarding TFP & Pyro: I don't believe I am qualified to really compare the two. The reason I started with TFP was my prior familiarity with TensorFlow. At some points, TFP's API feels like a bit much, but for the educational use on this channel, I think it has done great so far. I haven't used Pyro yet.

  • @mullermann2899
    @mullermann2899 1 year ago +1

    Great video! A quick question: when I fit my model via SVI, whether through the tfp.vi.fit_surrogate_posterior() method or "by hand", I get nothing but NaN values as the loss from a certain iteration onward (usually somewhere past the 3,000th), and all parameters go to NaN as well, of course. I want to fit the parameters of my latent gamma variables and have tried everything already: softplus so they cannot go negative, etc. I am at my wit's end. Maybe you know a solution?

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      First of all, thanks for the positive feedback :).
      It has been a while since I last touched TFP. Probabilistic models are also no longer part of my research, so I can only give recommendations of limited quality.
      NaN values could point to a learning rate (= step size in the optimizer) that is too high. For example, you could try running Adam with 1e-4; otherwise, (L)BFGS (also implemented in TFP) is a very robust choice if the parameter space is small enough (

  • @dongweiye2906
    @dongweiye2906 1 year ago +1

    Hi, thank you so much for the nice video. I have a question about your code example: at 30:01 you type the neg_elbo with a minus sign at the beginning. Shouldn't the log(q(z)) then get a '+' in front of it instead of a minus, since the log of the quotient contributes a '-' and the additional negation of the ELBO makes it positive?

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      Hi, thanks for the comment and the kind words :)
      It's been some time since I uploaded the video, so I hope I understood you correctly. Please ask a follow-up question if something is unclear.
      The values of the ELBO are (almost always) negative. The optimization problem in Variational Inference is to maximize the ELBO, i.e., to find the highest (still negative) value and the corresponding design parameters of the proposed surrogate posterior. However, the optimization literature usually frames problems as minimization problems. Luckily, we can turn a maximization problem into a minimization problem by negating the target/objective. That's why I called it the negative ELBO (the ELBO with a minus sign in front). You are right, of course: if the ELBO takes (almost) only negative values and we negate it, the result is positive. Ultimately, it's a (personal) choice of how to name your variables. :)
      Hope that helped.

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +1

    great TensorFlow Probability example

  • @danielscott6302
    @danielscott6302 1 year ago +2

    Hi. Thank you so much for your work (indeed, not just in this video) in elaboration on all the various intricacies associated with Variational Inference. This has been very helpful for me, as it ties very closely to a research project that I'm currently working on. Would you be able to help me better understand how I could augment ADVI (Automatic Differentiation Variational Inference) with Copulas? If you are able to do this, please let me know how I could return the favour 😊.

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      Thanks a lot for the kind feedback :). I am really happy the videos are of help.
      Unfortunately, this is the first time I have heard about Copulas (I had to google them), so I don't think I can be of much help here. Good luck with the research :)

  • @AshishPatel-yq4xc
    @AshishPatel-yq4xc 18 days ago

    Are there any introductory books on Variational Inference you can recommend, so that I can build models?

  • @JoshuaBartkoske
    @JoshuaBartkoske 5 months ago

    I just wanted to say this has been very useful. One bug I found while following your Python: when I included spaces in the names of the surrogate parameters, I got the error "ValueError: 'mu surrogate_0_momentum' is not a valid scope name. A scope name has to match the following pattern: ^[A-Za-z0-9_.\\/>-]*". When I replaced the spaces with _, the code ran normally. Not sure if anyone else ran into this issue, but I am leaving this here so others know what to do if they see this error.
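    The fix can be automated with the pattern quoted in the error message (a small helper sketch, not part of the video's code; the helper name is made up):

```python
import re

# TensorFlow scope-name pattern as quoted in the error message above.
SCOPE_NAME = re.compile(r"^[A-Za-z0-9_.\\/>-]*$")

def sanitize(name):
    # Replace anything outside the allowed character set (e.g. spaces) with "_".
    return re.sub(r"[^A-Za-z0-9_.\\/>-]", "_", name)

assert not SCOPE_NAME.match("mu surrogate_0_momentum")  # rejected, as in the error
assert SCOPE_NAME.match(sanitize("mu surrogate_0_momentum"))  # accepted after fixing
print(sanitize("mu surrogate"))  # -> mu_surrogate
```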

  • @Celdorsc2
    @Celdorsc2 1 year ago +1

    Thanks for the video. I feel I am quite close to understanding the topic thanks to your videos, but I am still confused about what to use in the expectation for the joint distribution and q*. I can't follow the practical part because I don't know TensorFlow, and the functions you use seem like black boxes that do something under the hood. The TF documentation is not very helpful.
    It's not clear what the joint and q* are in the expectation. Should it be the product of a prior N(mu_0, sigma_0) and N(x_i | mu, sigma)? But then, should I sample from N(x | mu, sigma) to get the product? Could you give more explanation on that? Thanks.

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      Hi,
      thanks for the comment and the kind words :). It's been some time since I uploaded the video, and it is hard to remember what I said at which point. Can you give timestamps for the parts of the video you are unsure about?

    • @Celdorsc2
      @Celdorsc2 1 year ago

      @@MachineLearningSimulation Hi, and thanks for your response. It's in the section _Defining the log joint_. From 22:20 you are building TF components starting from the generative model. I don't understand what exactly TF does and how the joint probability is created. It seems it takes $\mu_0$ and $\sigma_0$ and generates samples $\mu_i$.
      Then, for each generated sample $\mu_i$ and the fixed $\sigma_{fix}$, we generate $X_i$. Is this correct? That seems to be how it simulates the joint distribution.

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      You are right, this is the way the joint distribution is modelled in TensorFlow Probability. 😊 If you are interested in an introduction to this, check out this video of mine: th-cam.com/video/yBc01ZeaFxw/w-d-xo.html
      You are also right that the TFP documentation is not too insightful. Still, I think there is a lot one can learn just by studying the API design; it helps me understand the inputs and outputs of certain algorithms. The routine requires our log joint probability fixed to the data (which is then a function of the latent variables alone) as well as the (parametrized) proposed surrogate posterior. It then solves an optimization problem in the parameter space of the surrogate model.
      I think this is not the best reply possible; I'm currently replying from mobile. Feel free to ask a follow-up question if something is still unclear, and I can come back to you next week.
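      Ancestral sampling through the generative story is all that happens under the hood. A plain NumPy sketch of the two ways to use a joint distribution, sampling and density evaluation (constants assumed here for illustration, not the video's values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical constants mirroring the setup discussed above.
MU_0, SIGMA_0, SIGMA_FIX = 0.0, 1.0, 1.0
N = 5

def sample_joint():
    # Ancestral sampling straight down the generative story:
    mu = rng.normal(MU_0, SIGMA_0)           # latent
    X = rng.normal(mu, SIGMA_FIX, size=N)    # observed, conditioned on mu
    return mu, X

def log_prob(mu, X):
    # log p(mu, X) = log p(mu) + sum_i log p(x_i | mu)
    def log_normal(x, loc, scale):
        return -0.5 * np.log(2 * np.pi * scale**2) - 0.5 * ((x - loc) / scale) ** 2
    return log_normal(mu, MU_0, SIGMA_0) + log_normal(X, mu, SIGMA_FIX).sum()

mu, X = sample_joint()
print(log_prob(mu, X))  # density of the joint, evaluated at the sampled pair
```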

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +1

    At 25:30, it's not clear to me why you created a lambda function with mu as an argument. The generative model samples from the prior to get a sample mu, which is used as the mu parameter of the normal distribution. So there doesn't seem to be a need for mu as an input argument to the lambda function. Hope my question is clear.

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago

      I can understand that this is a little hard to understand. :)
      To start from the beginning:
      1) We have the joint distribution p(mu, X)
      2) We fix the joint to some data p(mu, X=D)
      3) This essentially yields a function that depends on mu. Think of it the following way: if I propose a mu, let's say 3.0, and want to calculate the joint with my fixed data, I would evaluate p(mu=3.0, X=D) (this is fully evaluatable and returns the scalar probability density, though it is better to work in log-space). I can propose another mu=2.0 and evaluate p(mu=2.0, X=D). The lambda function I created gives me a way to evaluate the joint with fixed data for any arbitrary choice of mu. This is similar to Python's functools.partial (stackoverflow.com/questions/15331726/how-does-functools-partial-do-what-it-does)
      4) I need this lambda (in log-space) for the optimization
      5) In the optimization, I approximate the ELBO by sampling (in essence, this is a Monte-Carlo approximation). For this, I sample from the surrogate and evaluate the log-probability of the surrogate on it, as well as the log-probability of the joint with fixed data on it.
      6) With this I calculate the loss (here the negative ELBO) by averaging over the samples
      7) I can then back-propagate over the computational graph to get the gradients of loss wrt the parameters of the surrogate.
      8) These gradients are then used to iteratively optimize the parameters.
      I don't think my reply answered your question well. Could you please elaborate on it, perhaps with the help of the additional information I gave here?

    • @davidlorell5098
      @davidlorell5098 2 years ago

      The lambda function you are referencing has one purpose: Get the log_probability from the joint distribution (generative model) of (mu, data) pairs. In fact, the "data" input gets bound because it will be the same every time. Thus, the lambda function tells you what the log_probability is of (mu, DATA) where mu can be anything and DATA is determined beforehand. Another way to think of it is that doing this produces the function/distribution P(mu, data | data=DATA).
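      The data binding described above can be sketched in plain Python (the log_joint below is a hypothetical stand-in for the TFP object, with an unnormalized density just for illustration):

```python
from functools import partial

# Hypothetical stand-in for the model's log joint: a plain function of (mu, data),
# unnormalized Gaussian terms (constants dropped).
def log_joint(mu, data):
    return -0.5 * mu**2 - 0.5 * sum((x - mu) ** 2 for x in data)

DATA = [0.5, 1.2, 0.8]  # the data gets bound because it is the same every time

# Two equivalent ways to bind the data, leaving a function of mu alone:
target_lambda = lambda mu: log_joint(mu, DATA)
target_partial = partial(log_joint, data=DATA)

assert target_lambda(3.0) == target_partial(3.0)
```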

  • @timothyliu6110
    @timothyliu6110 9 months ago

    Hi, I'm very new to variational inference and I have a couple of possibly naive questions. 1) In this problem setup, we know all the parameters (mu_0, sigma_0, sigma); what if I have unknown parameters in the model, how could I then calculate the ELBO using sampling? 2) I also don't quite understand log p(mu_l, X=D): does it mean the log probability of all data samples given mu = mu_l?

    • @MachineLearningSimulation
      @MachineLearningSimulation  9 months ago

      Hi,
      thanks a lot for the comment 😊. I would love to help you, could you give me some timestamps as to which point in the video you are referring to? It's been some time since I uploaded it, so I do not remember every point I made.

  • @longfellowrose1013
    @longfellowrose1013 2 years ago +1

    I have a question: for multiple latent variables, such as the problem in the previous video, is it possible to use a sampling method to solve it? Should I first use the Mean Field approach to get the unnormalized posterior distribution of these variables, and then use TensorFlow's automatic differentiation to learn the parameters? Can you describe the general steps? Hope my question is clear.

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Hi, thanks for the comment. :)
      I am a bit unsure whether I follow you correctly. Generally speaking, VI methods are approaches to obtain full (surrogate) posterior distributions, whereas sampling strategies (like variants of MCMC) give you samples from the posterior, from which you can compute statistics (like mean, variance etc.). Maybe you can elaborate a bit; I would like to help you.

  • @mashfiqulhuqchowdhury6906
    @mashfiqulhuqchowdhury6906 1 year ago

    This is also an interesting topic, and you explain it beautifully. Do you have any plan to make a video on variational autoencoders (VAE) and generative adversarial networks (GAN)? As I understand it, the VAE is a very nice application through which to understand variational inference. I believe this topic also covers generative modeling and unsupervised clustering.

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      Hi, yes, you are right: VAEs are a great example for showing variational inference, in particular in combination with automatic differentiation. I definitely want to cover this, but not at the moment. The playlist on probabilistic ML is on hold right now, since I am currently covering topics that are closer to my own research.
      I will return to it at some point in the future, though :)

  • @BoneySinghal
    @BoneySinghal 1 year ago +1

    Hi,
    I am wondering about the gradients: here we didn't derive a formula to compute the gradient, but TF did it by itself (I am assuming that, since the example models were simple, TF could compute it for this case). Will this also be true for generative models that are more complicated than this one (say, if we use Poisson and gamma distributions instead of normal distributions for mu and X)? Do we in that case need to work out the gradient formula and write it ourselves?

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      Hi,
      that's a great question. Indeed, TF can take the gradients of arbitrary models, as long as they are created within the limits of the operations TF supports. That's the beauty of automatic differentiation. I think your question arose because you assumed TF took the gradients by symbolic differentiation (i.e., what you'd do by hand, or with sympy etc.). AD works differently from that.
      At the moment, I am producing quite a few videos on AD, as I find it intriguingly beautiful. Maybe those are also of interest to you.
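      The difference is easiest to see with forward-mode AD, the simplest AD flavor: values carry their derivatives along through every operation, and no symbolic expression is ever built. A minimal dual-number sketch (illustrative only; TF actually uses reverse mode):

```python
# Minimal forward-mode AD with dual numbers: each value carries its derivative.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # Product rule, applied numerically at this point -- not symbolically.
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1  # any composition of supported ops

x = Dual(2.0, 1.0)  # seed dx/dx = 1
y = f(x)
print(y.val, y.dot)  # f(2) = 17, f'(2) = 14
```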

  • @McSwey
    @McSwey 2 years ago +1

    Indeed, without the q in the mean, the algorithm started working properly (I made my own hand-crafted implementation :D). I wonder: is there a method that tries to estimate the expectation E[log P(X, Z)] without sampling, via a function that approximates it?
    For instance, suppose we observe the variables: Bernoulli B = step_function(X + m), X ~ N(mu, sigma^2), where step_function(x) = 1 if x > 0 else 0. We would like to estimate m, with uncertainty modeled by a normal distribution. Unfortunately, the step function has derivative equal to zero, so by sampling we still get a derivative equal to zero (which is false; the derivative of the expectation is non-zero, i.e. the expectation of the derivative is not the derivative of the expectation: E'f(X) != Ef'(X)).
    However, if we do not sample the m, we could say that given X = -2, B = 0 and the current estimate of m's uncertainty N(a, b), the P(step_function(X + m) = B | a, b) = P(X + m

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Hi,
      yes, indeed, that was a rather big mistake with having the q in the mean. :D
      I am not quite sure, whether I get you correctly. Could you maybe also give a timestamp to the point in the video you are referring to?

    • @McSwey
      @McSwey 2 years ago +1

      @@MachineLearningSimulation Yeah, sorry, it's a bit off-topic 😅. I was referring to the method in general, but let's look at 15:24 for reference. There's the equation with sampling. In general, instead of reading the following badly written, overly complex elaboration, I think you would understand the problem faster by trying to solve it yourself. The question is what other options there are for approximating the ELBO, because sampling sometimes (I reckon) doesn't work (well, it's not exactly sampling's fault; it's more the gradient's).
      So in my example, we have two observed variables: X and B. Actually, we don't want or need assumptions for the X's distribution. Let's say, we have one observation. Now, what's the probability:
      p := P(X=X_1, B=B_1, m=m_l)
      It's a part of the equation inside the log (P.../ Q), we are looking at the l'th sample. Well, if the m_l is sufficiently large or small, the probability is equal to one, and if it's not, the probability is equal to zero. Example:
      Let's say:
      * X_1 = 1
      * B_1 = 1
      * and m_l is -1.1
      Step by step:
      --- B = step_function(X + m)
      --- But B_1 = 1 != step_function(-0.1) = step_function(1 - 1.1) = step_function(X_1 + m_l)
      --- The probability p is zero.
      --- Importantly, when we change the m_l by a tiny bit, the probability is still zero!
      --- So the gradient with respect to m_l is zero.
      --- Because m_l = mu_m + std_m * randn(), and gradient with respect to m_l is zero, the gradient with respect to parameters of m's distribution (mu_m and std_m) is zero.
      --- This works for all samples
      --- Also in the case, when p = 1, gradient is similarly 0.
      --- and if the gradients of p are zero, so have to be by the chain rule gradients of ELBO approximation
      Now I see there's also a problem with log(0) = -inf, so the whole equation doesn't make sense, but let's skip that for now. We could introduce a more complex model where the B is a bit randomized: B = flip_first_if_second(step_function(.), Bernoulli(0.1)). It's not the same model, but it has the same gradient problem without the log problem.
      The point is (had we fixed the log(0) issue) it seems like the changes of m's distribution do not change the ELBO incrementally. So the gradient is zero. But it's false! When we change the mu_m and std_m by a tiny bit, the probability that we sample the correct m changes.
      --- Let H be a set of all m, where B_1 = step_function(X_1 + m).
      --- In our case, H = (-1, infty) and is measurable.
      --- supposing m ~ q
      p = P(X=X_1, B=B_1, m=m) = P(B_1 | X_1, m)P(X_1, m) = P(step_function(X_1 + m) = B_1 | m=m, X=X_1)P(m, X_1) = I(m in H)P(m, X_1)
      --- changing mu_m and std_m changes P(m in H) incrementally.
      --- So the ELBO also changes.
      --- overall gradient is non-zero with respect to mu_m and std_m
      Uffff, it's a lot. But it's simple in the idea.
      Of course, our example is simple, and we could solve it analytically. The problem with more complex systems is that we usually cannot, so sampling is a good option when the gradient is informative.
      So I was wondering what the name of the method is where we try to approximate the ELBO itself, especially for complex systems. The idea is to create an approximation of the ELBO that we can differentiate (with gradients that are somewhat correct, for instance non-zero) and use to optimize the model.
      PS: After writing it, I realize if you get to this point, I'm sorry I'm wasting your time 😬.
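      The zero-gradient problem described in this thread can be reproduced in a few lines: the pathwise (reparameterized) sample gradient through the step function is exactly zero, while the analytic gradient of the expectation, via the Gaussian CDF, is not (toy values assumed):

```python
import math

# Toy from the thread: B = step(X + m), m ~ N(mu_m, std_m^2), X fixed.
# E[B] = P(m > -X) = 1 - Phi((-X - mu_m)/std_m) IS differentiable in mu_m,
# but a pathwise sample gradient is zero, since step'(.) = 0 almost everywhere.

def step(x):
    return 1.0 if x > 0 else 0.0

X, mu_m, std_m = 1.0, -1.1, 0.5
eps = 0.3  # one fixed base-noise draw; reparameterization m = mu_m + std_m * eps

h = 1e-6
m_plus = (mu_m + h) + std_m * eps
m_minus = (mu_m - h) + std_m * eps
pathwise_grad = (step(X + m_plus) - step(X + m_minus)) / (2 * h)

def prob_B_is_1(mu):
    # Exact P(step(X + m) = 1) using the Gaussian CDF (via erf).
    z = (-X - mu) / std_m
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

analytic_grad = (prob_B_is_1(mu_m + h) - prob_B_is_1(mu_m - h)) / (2 * h)
print(pathwise_grad, analytic_grad)  # 0.0 versus a clearly non-zero value
```

This is the setting where score-function (REINFORCE-style) estimators or continuous relaxations of the step function are typically used instead of the pathwise estimator.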

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      @@McSwey Thanks for the elaborate reply. :) I think I am getting closer to what you want to ask. However, I would say that I am probably not qualified enough to give a good answer. For my applications (so far only VAEs & normalizing flows), sampling the expectations has never been an issue. Frankly, my applications have also never involved overly complicated problems, so I do not have good advice on how to proceed in case sampling does not work.
      Maybe someone more qualified will stumble across this question here in the comments. I would also recommend asking about your problem in the forum of a probabilistic programming language. When I was casually browsing, I found the Stan forum (discourse.mc-stan.org/ ) to be a helpful resource from time to time.

  • @ccuuttww
    @ccuuttww 3 years ago +1

    LOL, I never tried this instead of EM.
    I think this is a Bayesian neural network.

    • @ccuuttww
      @ccuuttww 3 years ago

      23:03 This is cheating; I didn't know TensorFlow could describe a Bayesian network so easily.

    • @ccuuttww
      @ccuuttww 3 years ago

      I don't use TensorFlow, and there are some functions I don't understand:
      log_prob: I think it is the log of the pdf
      TransformedVariable: unknown purpose

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago

      Probably the lines are blurry between where VI/EM for directed graphical models ends and where Bayesian neural networks begin :D
      I think that nowadays, due to the advent of widely available automatic differentiation frameworks (like TensorFlow, PyTorch, Theano, Julia Zygote, etc.), many techniques from different branches of science merge and allow for fascinating discoveries.

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago

      Regarding the 23:03 comment: I think the ease of modelling Bayesian networks/directed graphical models is at the heart of all probabilistic languages/frameworks. There are more projects that make this easy, e.g. PyMC3 ( docs.pymc.io/ ) or Julia Turing ( github.com/TuringLang/Turing.jl )

    • @MachineLearningSimulation
      @MachineLearningSimulation  3 years ago

      Regarding the questions on TensorFlow Probability:
      1) yes, log_prob evaluates the logarithm of the probability density (the equivalent in SciPy is logpdf), working in log space is advantageous for low probability densities (which are quite common for complicated models/big datasets) in order to avoid underflow
      2) The TransformedVariable is needed to transform our constrained optimization problem into an unconstrained one (see e.g. chapter 10.3 of algorithmsbook.com/optimization/files/optimization.pdf ). The constraint is necessary because the variance/standard deviation has to be strictly positive.
      Let me know if that helped :) I can also elaborate on the second point if that was not clear.
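      What the TransformedVariable accomplishes in point 2) can be sketched without TFP: optimize an unconstrained raw value and map it through softplus, so the constrained parameter is always strictly positive (a sketch of the idea only, not TFP's API):

```python
import math

# Softplus maps all of R to (0, inf): any gradient step on the raw
# (unconstrained) variable still yields a valid positive scale parameter.
def softplus(x):
    return math.log1p(math.exp(x))

def softplus_inverse(y):
    # Used to initialize the raw variable from a desired positive value.
    return math.log(math.expm1(y))

raw = -3.0            # unconstrained; an optimizer may move it anywhere in R
sigma = softplus(raw)
assert sigma > 0.0    # the constrained value can never become negative

raw0 = softplus_inverse(1.0)           # start the scale parameter at 1.0
assert abs(softplus(raw0) - 1.0) < 1e-12
```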