Introduction to KL-Divergence | Simple Example | with usage in TensorFlow Probability

  • Published 20 Sep 2024

Comments • 15

  • @yongen5398 · 3 years ago · +5

this channel needs more attention, such good content :D

    • @MachineLearningSimulation · 3 years ago

      Thanks a lot :)
Feel free to share the channel with friends/peers to help it grow in reach. I would greatly appreciate that.

  • @maartendevries4678 · 2 years ago · +3

    Such a great channel! Sharing this with my friends and colleagues.

  • @vincentwolfgramm-russell7263 · 2 years ago · +2

Great video! Best explanation of this topic on YouTube, thank you!

  • @mullermann2899 · 1 year ago · +1

    Amazing!

  • @anupkulkarni6986 · 1 year ago · +1

    Great video

  • @InquilineKea · 1 year ago · +2

    When p and q do not overlap (or when q is 0), how does this affect D(p||q)?

    • @MachineLearningSimulation · 1 year ago · +1

That's a great question, and I do not have a good answer. Maybe this Stack Exchange question can help: stats.stackexchange.com/questions/362860/kl-divergence-between-which-distributions-could-be-infinity
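      For what it's worth: wherever q(x) = 0 but p(x) > 0, the term p(x) log(p(x)/q(x)) in the sum/integral diverges, so D(p||q) is infinite whenever p puts mass outside the support of q. A minimal sketch with TFP's Categorical distributions (assuming tensorflow_probability is installed; the printed value may vary by version, but mathematically the result is infinite):

        import tensorflow_probability as tfp

        tfd = tfp.distributions

        # p puts mass on all three outcomes; q assigns zero probability to the third.
        p = tfd.Categorical(probs=[0.5, 0.4, 0.1])
        q = tfd.Categorical(probs=[0.6, 0.4, 0.0])

        # The term p(x) * log(p(x) / q(x)) diverges where q(x) = 0 but p(x) > 0,
        # so the analytic forward KL is infinite.
        print(tfd.kl_divergence(p, q))  # -> inf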

  • @chadgregory9037 · 2 years ago · +1

    Potentially silly question... since KL deals with distances between distributions: say we're building a model that forecasts a distribution, whether via a DistributionLambda layer or a regular dense multi-output head with a mu/sigma, trained with a negative log-likelihood loss (see the sketch after this comment). That setup basically forecasts the mu, but makes it more Bayesian, if you will, in that it also produces a sigma, which acts like a confidence interval. A plain dense network would give the same output distribution for the same inputs, whereas with DenseVariational and other probabilistic layers the weights and biases are drawn from distributions at inference time, so the output can vary slightly every time we draw a prediction.
    What I'm wondering is this: what if, a priori, we have distributions as inputs to the model? Rather than just a value we want to forecast and call mu, we have a fully defined distribution in the training set. Are we then back to a multiple-output regression, where we'd just minimize an MSE or something around the model learning those distribution parameters outright? Or would there be something more meaningful in using KL divergence as the loss function?
    I'm not even sure this makes sense... but I think it does, in that: how can it be logical to optimize a distribution with a normal-style loss function, which would essentially just minimize a Euclidean distance between the mu and sigma (or whatever parameters we're dealing with)? It doesn't seem right to treat the mean and std dev that way, because you can't just linearly scale both toward the goal. I feel like there must be something deeper going on. Obviously a mu that's further from reality can be "saved" to an extent by allowing a larger std dev...
    And I could imagine taking that last concept of having actual distributions and outputting the mu and sigma each as their own separate distributions... though I'm not sure how muddy the water gets at that point. Then I start wondering about a model that outputs the parameters of a multivariate distribution, and how those results would differ from treating the parameters jointly.
    For context, my motivation is building predictive models around non-stationary data that "randomly" has stationary segments through time. So there can be a long run of sequential inferences over which the desired std dev output wouldn't change much, until it does... At the same time, it's not as simple as equating the std dev to confidence in the mu prediction: I obviously want the models to be as accurate as possible, but I also don't want them constantly shifting their predictions. The longer the same prediction remains valid going forward without needing an update, the more it should be rewarded. Imagine a box through time, with the process living mostly within that box (it doesn't have to be perfect), and then jumps where the box shifts, which doesn't necessarily imply a shift in the std dev.
    So obviously I'm modeling volatility... and one of the BIGGEST things I'm searching for now is what kind of component I can add to a model, and in what way, so that it learns the concept of a jump rather than just inflating the confidence interval around the mu prediction. The last and most important step is engineering a custom loss function. I've been trying to work out whether something like OU (Ornstein-Uhlenbeck) process logic could provide a framework for how model outputs should behave given how the underlying process behaves. I've also gone down the rabbit hole of using a copula to describe how things should relate to each other, and even adding trainable variables and branches to a network, with extra outputs representing correlations that vary over time.
    Just 3 weeks ago I knew very little about machine learning, and 4 weeks ago I knew little about stochastic processes... but I've taken the deep dive, and here we are. Honestly, after finding TensorFlow Probability I feel like a kid in a candy store lol. The possibilities seem endless and the amount of power at your fingertips is insane: so many ways to build models, engineer how data flows through a model, define custom trainable variables, and write custom loss functions... it's literally an entire framework to play god. One thing I will mention: I'm approaching this by hacking the loss function and using extra data (training data used to compute the loss, but not the actual target variables) rather than going the reinforcement-learning route. I think all that matters is that the loss function is differentiable; then backprop is possible, which means SGD or something similar can be used, since we have either a differentiable reward function or a differentiable loss function. I'm still new to all of this, but not stupid, and I'd like to think I have intuition for things lol.
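    A minimal sketch of the mu/sigma setup described above, assuming TensorFlow Probability is installed (layer sizes and the input shape are made up for illustration, not taken from the video):

      import tensorflow as tf
      import tensorflow_probability as tfp

      tfd = tfp.distributions

      # A dense head emits two numbers per sample; DistributionLambda turns
      # them into a Normal(mu, sigma), with softplus keeping sigma positive.
      model = tf.keras.Sequential([
          tf.keras.layers.Dense(16, activation="relu", input_shape=(1,)),
          tf.keras.layers.Dense(2),
          tfp.layers.DistributionLambda(
              lambda t: tfd.Normal(loc=t[..., :1],
                                   scale=1e-3 + tf.math.softplus(t[..., 1:]))),
      ])

      # Negative log-likelihood loss: the model outputs a distribution,
      # so the loss is simply -log p(y) under it.
      negloglik = lambda y, rv_y: -rv_y.log_prob(y)
      model.compile(optimizer="adam", loss=negloglik)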

    • @chadgregory9037 · 2 years ago

Another thing I wondered about was using some kind of copula to tie together the random processes that govern the overall process, then sampling the current distributions and using those with MCMC or whatever to forecast forward from current distributions and correlation levels. I feel like there are so many ways to skin the sheep lol

    • @MachineLearningSimulation · 2 years ago · +3

      Hey Chad,
sorry for the late reply; I wanted to give you a thorough answer to your comment but did not have time for it until now. I will try my best :D
Let me first answer your last point regarding TensorFlow Probability: I can totally understand you. It was around one year ago that I discovered the package as part of a submission for a university course, and I somehow fell in love with it. That is because I enjoy seeing implementations of, and in particular interfaces to, concepts I read in textbooks (like Christopher Bishop's "Pattern Recognition and Machine Learning" or Kevin Murphy's "Machine Learning: A Probabilistic Perspective"). It helps me grasp the intricate difficulties of the algorithms as well as their practical relevance. And it was an eye-opening moment for me to see how they implemented the KL divergence between distributions. Since distributions in TFP are just instantiated objects, all information relevant to computing the (analytical) KL resides within them, i.e., only the definition of the two random variables (with their associated distributions) is necessary, and no random variates (i.e., samples from the distributions) are needed to compute it. Although, of course, in the numerical approximation you would work with samples. Retrospectively speaking, this is surely no different from how other probabilistic programming systems like Stan, PyMC3 or Turing.jl handle it, but it was the first time I figured it out.
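      As a small illustration of that point (a sketch, not the video's code): the analytic KL needs only the two distribution objects, while a Monte Carlo estimate of the same quantity needs samples:

        import tensorflow as tf
        import tensorflow_probability as tfp

        tfd = tfp.distributions

        p = tfd.Normal(loc=0.0, scale=1.0)
        q = tfd.Normal(loc=1.0, scale=2.0)

        # Analytic KL: computed from the distribution objects alone, no samples.
        print(tfd.kl_divergence(p, q))

        # Monte Carlo approximation: average log p(x) - log q(x) over draws from p.
        x = p.sample(100000, seed=42)
        print(tf.reduce_mean(p.log_prob(x) - q.log_prob(x)))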
      And I can understand the feeling of being overwhelmed. Unproductively for the course submission, I became eager to understand TFP as a whole :D. That took quite some time and effort, maybe was not the most effective way of learning about probabilistic programming, but definitely one that I enjoyed a lot. Was it necessary: absolutely not :D
Regarding the other aspects of your comment: if I understand you correctly, the first section is about loss functions in optimization (or more particularly, optimization for Machine Learning). You make a good observation that we can build arbitrary loss functions that depend on arbitrary computations, and, if all intermediary steps are differentiable, the entire computation is too. Then we can essentially use a gradient-based optimization framework to adjust the parameters within the computation, like the weights in a Neural Network. More generally, this is called "Differentiable Programming" and has been gaining more and more attention in recent years, especially in the Julia programming language (en.wikipedia.org/wiki/Differentiable_programming ).
You were then wondering about optimizing full distributions? That reminded me of Variational Inference (th-cam.com/video/HxQ94L8n0vU/w-d-xo.html ), or more generally the Calculus of Variations. It is indeed mathematically possible to optimize over function spaces (or, in the context of probabilistic programming, over distribution spaces). However, this is not commonly done due to the lack of analytical optima; they are only available in certain simplified contexts (like here: th-cam.com/video/J7U8mRew2g0/w-d-xo.html ). In real-world scenarios, one usually resorts to somehow discretizing the problem, e.g., proposing a parametric family of distributions (like Normals with unknown mean) and then optimizing over a finite-dimensional parameter vector space, as in the sketch below.
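      A toy sketch of that last step (my own illustration, with made-up numbers): choose a parametric family, here a Normal with trainable mean and scale, and run gradient descent on the analytic KL to a fixed target distribution:

        import tensorflow as tf
        import tensorflow_probability as tfp

        tfd = tfp.distributions

        target = tfd.Normal(loc=3.0, scale=0.5)

        # Finite-dimensional parameter vector: a mean and an unconstrained scale.
        loc = tf.Variable(0.0)
        raw_scale = tf.Variable(0.0)  # softplus keeps the actual scale positive

        optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)
        for _ in range(500):
            with tf.GradientTape() as tape:
                approx = tfd.Normal(loc=loc, scale=tf.math.softplus(raw_scale))
                loss = tfd.kl_divergence(approx, target)  # KL as differentiable loss
            grads = tape.gradient(loss, [loc, raw_scale])
            optimizer.apply_gradients(zip(grads, [loc, raw_scale]))

        print(loc.numpy(), tf.math.softplus(raw_scale).numpy())  # approx. 3.0 and 0.5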
You also mentioned training data in the form of a distribution. In a purely probabilistic interpretation of Machine Learning, your training data is a distribution, but instead of having access to its underlying analytical form, you just have samples from it. Say you have Euclidean data that is 20-dimensional, and you have 10'000 samples of it. Then you have 10'000 random variates (i.e., draws) from a 20-dimensional distribution over your input space. This distribution can be arbitrarily complex and, for most analyses, exists only in theory, but its existence is typically relevant for expressing certain concepts in Machine Learning. You could of course look at statistics of your training data, like the 20-dimensional empirical mean or the 20x20 empirical covariance matrix.
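      In code, those sample-based statistics are one-liners (sketch with synthetic data standing in for a real training set):

        import numpy as np

        # 10'000 draws from some unknown 20-dimensional distribution
        samples = np.random.default_rng(0).normal(size=(10000, 20))

        empirical_mean = samples.mean(axis=0)          # shape (20,)
        empirical_cov = np.cov(samples, rowvar=False)  # shape (20, 20)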
You then mentioned something about your application regarding "building predictive models around non-stationary data". I must admit I could not quite follow your explanations there. Could you elaborate? Maybe you have an example? Potentially also for why you want to learn a jump function: since this is a discontinuous function, it might come with some mathematical challenges.
I hope that gives you some first insights :) Feel free to point out anything I missed in your comment.

  • @BillHaug · 10 months ago · +1

    thank you