I have a doubt. At 21:00, shouldn't the expected reward of each path be prob1*reward_1 + prob2*reward_2, as compared to the one used, (prob1*prob2) * (reward_1 + reward_2)?
If the example is that prob2 comes after prob1, it should be prob1*reward1 + (prob1*prob2)*reward2. Said another way, only prob1 of the time will you ever receive reward1, and only (prob1*prob2) of the time will you ever receive reward2. This is because the probability of receiving each reward is the product of all the probabilities along the path to that reward. Does that make sense? Let me know if you still have any doubts.
@@elliotwaite yeah my bad, so assuming we get the rewards at each particular state as well, and not all the net rewards at the end upon reaching the terminal state, the estimated reward would be (prob1*reward_1) + (prob1*prob2*reward_2). For the example that you are using, the total rewards along the path are only received after termination is reached (like chess, and unlike an MDP where you get some reward at each state), hence (prob1*prob2) * (net rewards along the way) works. I am assuming that if the first case were used, the "expected rewards" would have been different for each path, as well as the policy gradients based on them. Would love it if you could elaborate on whether it would also work with case 1, the example above.
@@AmanSharma-bj2yc, I think I'm confused. Can you let me know if we agree on the following things? In the example I am using, the rewards are received along the way, not just at the end. It can be modeled as an MDP with a start state and terminal states. However, the MDP does not allow any cycles, so it's like a one way graph. The expected reward of the policy is the expected reward from the start state. The expected reward from any of the other states could be different. Do we agree on these points or are we thinking about these things differently?
Thanks, glad you liked it. I haven't been making videos lately, but I appreciate the suggestion of making more RL videos, I may make more in the future. Also, I currently don't have the images I used for this video in slide format.
@@elliotwaite I am watching the video for the second time. At 7:05, how did you calculate the expected return? Based on my calculation, for the given policy the expected return for state 0 is (1*3) + 1*0.1*(-10) = 2.9, but you obtained -0.7.
@@Torabi_ahmad18973, there are different ways to break up the sum of the expected return. I was summing over all the paths, and for each path, adding the probability of taking that path times the total reward received along that path. There are two possible paths. Path A has transition probabilities (1, 1, 0.1) giving a total probability of 0.1, and the total reward on that path is (3 - 10 = -7). And path B has transition probabilities (1, 1, 0.9, 1) and a total reward of (3 + 0 + 10 = 13), so to get the expected return we sum over both paths, which is ((0.1 * -7) + (0.9 * 13)), which equals (-0.7 + 11.7 = 11). If you want, you can also break it up by each reward and the probability of getting that reward. So there is a 1.0 chance of getting the reward of +3, a 0.1 chance of getting the -10, and a 0.9 chance of getting the +10. So ((1.0 * 3) + (0.1 * -10) + (0.9 * 10)) = 11. You get the same result, just in a different way. Let me know if this helped clarify or if there is anything that is still confusing.
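A quick way to check this arithmetic is to compute the expected return both ways in plain Python (the path probabilities and rewards below are the ones from the reply above):

```python
# Path A: transition probs (1, 1, 0.1), rewards 3 and -10 (total -7).
# Path B: transition probs (1, 1, 0.9, 1), rewards 3, 0, and 10 (total 13).
paths = [
    ([1.0, 1.0, 0.1], [3, -10]),
    ([1.0, 1.0, 0.9, 1.0], [3, 0, 10]),
]

# Sum over paths: P(path) * (total reward along that path).
expected_return = 0.0
for transition_probs, rewards in paths:
    path_prob = 1.0
    for p in transition_probs:
        path_prob *= p
    expected_return += path_prob * sum(rewards)
print(expected_return)  # (0.1 * -7) + (0.9 * 13) ≈ 11.0

# Equivalent per-reward breakdown: P(reaching that reward) * reward.
print(1.0 * 3 + 0.1 * -10 + 0.9 * 10)  # also ≈ 11.0
```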
Thanks! I haven't been interested in making more educational videos lately, but I've added your suggestion to my potential future videos list, so maybe I'll get to it eventually.
@@elliotwaite I guess you should give it a shot. There are hardly any people on YouTube who are explaining such complex topics in an intuitive way. All the best.
At 29:19 in the video, while writing the expression for Y, when you take the derivative wrt x0, you have only considered the value .7 as a constant. However, .7 is the exact value of x0. Why is this valid?
I see why this is confusing, I probably should have explained this more clearly in the video. The first thing to remember is that Y is just a function we are inventing to make it easier to calculate the gradient of R (the expected return). When calculating the gradient of R, we have to consider x0 as a variable wherever it shows up in that function, and we can't just use its 0.7 value as a constant (like you mentioned in your comment). But we can use the 0.7 value as a constant in Y, because we are just inventing a function that will give us the same gradient as R. The reason we can use the 0.7 value as a constant in those places is because we add in those [avg return after x0] values, which essentially compensate for us using the 0.7 value as a constant in the other places. It's kind of like we are saying, "okay, I know the 0.7 is really the value of x0, but just consider it a constant here, and we'll correct for that by adding in these [avg return after x0] values over there." For example, at 30:06, I show an example of taking the gradient of both Y and R with respect to x3. In Y, x3 is a variable in one place, and in the other places we use the value of x3 (0.6) as a constant. However, in R, we do as you would expect: we use x3 as a variable in all places. But you can see that dY/dx3 and dR/dx3 end up equaling the same thing, which is what we are looking for, and is why we can take the gradient of Y to get the gradient of R. Let me know if this clears it up or if you're still confused in this area.
@@elliotwaite Thanks for the explanation, however I do not understand how the [avg return after x0] compensates for using .7 as a constant. Although we are constructing a new made-up function, .7 is exactly x0, and had x0 been .3 then we would have used .3 instead of .7. So I am confused, as the made-up function clearly seems like a function of all the [x0, x1, x2, ...]s exactly?
@@rishipatel8010 Perhaps looking at a simpler example of an R will help (this equation of R won't be related to a specific RL environment, it's just a random simple equation). Let's imagine R = 10 * x0 + 11 * x1 + 12 * x0 * x1. And let's say x0 = 0.7 and x1 = 0.3. Then we could create this Y function: Y = 10 * x0 * [1 + (1 / 10) * 12 * 0.3] + 11 * 0.7 * x1 * [(1 / 0.7) + (1 / 11) * 12]. And if we take the partial derivatives of R and Y with respect to x0 and x1, they will be the same. But if we rewrite Y as: Y = 10 * x0 * [extra stuff for x0] + 11 * 0.7 * x1 * [extra stuff for x1], you might ask why we use 0.7 as a constant in this equation rather than using x0 as a variable. And it's because if we replaced the 0.7 with x0, the gradient of Y would no longer be the same as the gradient of R. This is because the parts in the square brackets compensate for the fact that 0.7 is used as a constant in that place rather than using x0 as a variable there. The same thing is happening with the [avg return after x0] parts of the original example. Does this example help?
@@elliotwaite Thank you so much for the lightning-fast reply, this one finally helped me develop my intuition, and I convinced myself with a proof. Again, thanks a ton, you are probably the best content creator on policy gradient explanations. I thought you would be interested in what I developed from my understanding, so here it goes. What I understand from your explanation is that for any polynomial function f(x0, x1, x2, x3, ...) where each xi has a max power of 1 (i.e., no (x0)*(x0) terms), I can always create g(x0, x1, x2, x3, ...) = x0*(partial derivative of f wrt x0 evaluated at the fixed values of x1, x2, x3, ...) + x1*(partial derivative of f wrt x1 evaluated at the fixed values of x0, x2, x3, ...) + ... The quantity in the brackets has no dependency on x0 and is a constant value once we substitute the fixed values of x1, x2, x3, ..., which is exactly where we want to calculate our derivative for gradient ascent, at some fixed point combination of x0, x1, x2, ... The point is that both f and g have the same partial derivatives, and this is exactly what you have done in the video as well.
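A quick numerical check of the simpler example above (a sketch using PyTorch autograd; the 0.7 and 0.3 constants are the ones from the reply):

```python
import torch

x0 = torch.tensor(0.7, requires_grad=True)
x1 = torch.tensor(0.3, requires_grad=True)

R = 10 * x0 + 11 * x1 + 12 * x0 * x1
Y = (10 * x0 * (1 + (1 / 10) * 12 * 0.3)
     + 11 * 0.7 * x1 * ((1 / 0.7) + (1 / 11) * 12))

print(torch.autograd.grad(R, (x0, x1)))  # ≈ (13.6, 19.4)
print(torch.autograd.grad(Y, (x0, x1)))  # also ≈ (13.6, 19.4), even though Y treats 0.7 and 0.3 as constants
```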
great video. The only thing I can't quite grasp after multiple watchings is: what are the targets for the neural network prior to backpropagating? Can you break it down a little more for each time step? I've got the forward propagation, I just need the per-time-step targets that are required in order to calculate the partials. My question is about 45:00. Tell me more about the 'loss': what dimension is it? Is it d=2 because you have two actions? Doesn't the loss function require two arguments, the predicted output and the target? Thanks a lot.
Ah, I can see why this is confusing. In this algorithm we actually don't use target values. If we knew which actions we were supposed to take, we could use those actions as target values, but we don't have that information. We could also try to say that our targets are to receive infinite reward for each episode, and the farther away we are from that target, the greater our loss will be, but then all our losses would be infinity, so that doesn't work so well either. So what we do instead is just come up with a loss value directly, without using target values. And this loss just needs to be a value such that when we update the parameters of our model in the direction that minimizes that loss, it results in our policy receiving more total reward on average. The reasoning behind why we came up with the loss that we did is explained in the video, but once we have that loss, we just call loss.backward() on it directly, without using any target values, and we update the parameters of our policy using that gradient. Let me know if you're still confused about anything and I can try to explain it further.
@@elliotwaite ok this is new! We come up with a loss value directly, he says... Is this the same as the difference between the output and target value? In order to backpropagate, I'm used to calculating the errors at the output nodes. What role does the loss play in the backpropagation algorithm? Thank you for your reply, very kind of you to write.
@@hoddy2001 the loss we use here is not the difference between an output and a target, since we don't have a target value. This loss is just something that produces the correct gradient when we try to minimize it. Unlike a loss that is the difference between the output and a target, the loss we use here can be negative, which is impossible with an output-and-target loss that represents an error amount, because you can't have a negative error amount. But for the loss we use, a negative loss is fine, we just try to make it lower. So if it's negative, we just try to make it more negative. The main insight is that our loss isn't an error amount, it's just a value that generates the correct gradient when we backpropagate from it. Let me know if anything is still confusing.
@@hoddy2001 ... The role the loss plays in the backpropagation algorithm is that the loss is what is getting minimized. The gradient that we get from calling loss.backward() is the gradient that moves the parameters in the direction that minimizes the loss value the fastest. For example, if our model only has 3 parameters, we can think of the parameter space of that model as a 3D space, and calculating the gradient of the loss figures out which direction in that 3D space from the current parameter values will decrease the loss the fastest. Then we update the parameters by taking a small step in the direction of that gradient. If instead we have a million parameters, it's the same thing but the parameter space is a million-dimensional space (which is hard to visualize, but the concept is the same).
@@elliotwaite thank you for your patience my good man. I am in a context where I don't have access to loss.backward() and need to know computationally how to compute the partial derivatives using the loss value. I understand now that the loss is a scalar value representing what would otherwise be the value of the error function. It is a value that, when the partials are calculated with respect to it, points us in the direction of the gradient. Can you refer me to the definition of loss.backward()? How does it compute partials from this scalar value?
@@NeuralGamerAI yes, that is an interesting point, that vector will be a vector of ones, but we don't just use the values of that vector (which would be somewhat useless since they are all just ones, as you mentioned), we instead use the values of the gradient of that vector with respect to the loss. Another way to think about it is that when we are back propagating the gradient values through that operation, whatever the gradient for x is at that point will get multiplied by whatever the value of 1/x was in the forward pass, and then that new gradient will continue being back propagated.
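A small sketch of what that looks like in PyTorch (the probabilities and returns here are made-up placeholders): in the forward pass `probs / probs.detach()` is indeed just a vector of ones, but backpropagating from the scalar loss multiplies the upstream gradient by 1/probs, as described above.

```python
import torch

probs = torch.tensor([0.7, 0.3], requires_grad=True)  # pretend these came out of the policy's softmax
future_returns = torch.tensor([1.0, 2.0])              # hypothetical returns for the two sampled actions

ratio = probs / probs.detach()
print(ratio)        # tensor([1., 1.]) in the forward pass, as noted above

loss = -(ratio * future_returns).sum()
loss.backward()     # walks the recorded graph, applying the chain rule back from the scalar loss
print(probs.grad)   # tensor([-1.4286, -6.6667]) == -future_returns / probs
```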
This is the most intuitive derivation of Policy Gradient Theorem on the internet. Thank you for being my teacher! Every RL intro class should begin with this video.
Thank you. Much appreciated.
@@elliotwaite seriously, the best. Andrej Karpathy has got some tough competition :) Please do a follow-up relevant to RLHF as used in LLMs, ideally focusing on the newer/more approachable algos like DPO. Pleeease :)
@@lmfaow00t thanks! I may start making videos again soon, but I'm not sure if I'll make videos similar to this one. However, I'll keep your suggestion in mind. Thanks for the support.
True dude, the intuitive derivation is always the best for most people
I wasn't expecting to see such a nice, graphic explanation. It's timely, thank you.
Thanks! Glad you liked it.
@@elliotwaite This is the best explanation, you have to create many more AI explanation videos, thank you🔥
@@ManusiaSetengahChiKuadrat thanks! I hope to make more videos at some point in the future.
I tried to implement the algorithm by hand. I created a 3-layer network with one input layer (2), one hidden layer (32), and one output layer (2). Before training, I collect 100 episodes, then train the network, and repeat this 10,000 times. Finally, the probabilities of x0, x3, x5 converge to (0.999877, 0.97339, 0.99900). This is my first reinforcement learning program, thank you for your help, great tutorial!
Awesome! That's one of my favorite ways to better understand an algorithm, seeing if I can implement it myself. I'm glad to hear you got it working.
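For anyone wanting to try the same exercise, here is a minimal sketch of a policy network matching the description above (2 inputs, 32 hidden units, 2 action outputs); the state encoding is a made-up placeholder:

```python
import torch
import torch.nn as nn

# 2 inputs (e.g., a one-hot state encoding), 32 hidden units, 2 action probabilities.
policy = nn.Sequential(
    nn.Linear(2, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
    nn.Softmax(dim=-1),
)

state = torch.tensor([1.0, 0.0])             # hypothetical encoding of the current state
probs = policy(state)                         # action probabilities, summing to 1
action = torch.multinomial(probs, 1).item()   # sample an action while collecting an episode
print(probs, action)
```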
I can't believe I get this for free. Such an amazing visualization. I have a vague understanding of backpropagation and expected return; however, I never imagined the relationship could be this straightforward and clear.
Thank you for the kind comment 😊. I'm glad to hear it helped you develop a more clear understanding of those concepts.
This is the only video you need to learn the intuition and basics of Reinforcement Learning. Amazingly done! Thanks!
Thanks, Pablo!
This is pure gold! Thank you so much for time, hard work, and energy you put into this video. It's highly appreciated.
😀 I'm glad you liked it. Thanks for the kind comment.
One of the best! I'll have to watch the video a few times, but it's already helped me figure things out more intuitively. The example and animations are excellent. Great work.
Thanks!
This is a unique and awesome channel. Seeing these cool animations really helps to build an intuition. Thank you for uploading.
Best RL learning video for beginners
Thanks!
Great Video! Using graphics and combining the logical explanation with pseudocode was really helpful to me. Most of the time you only see one or the other.
Thanks for the feedback. I've been wondering if adding all those different ways of explaining it was excessive, so I'm glad to hear you found it helpful.
Watching such quality educational video for free is insane, thank you so much.
:) I'm glad you liked it.
Best video on the subject. Thank you, Elliot for creating the such a thorough content.
Thanks, Kanai!
Thank you for the super nice and intuitive explanation!
Wow, thanks YaGuang for the Super Thanks! It's much appreciated. I'm glad you found the explanation intuitive. I saw you are working on the Google DeepMind team up in Mountain View. I worked at Google in Mountain View from 2015 to 2017 and have fond memories, lots of great people. I hope you are also enjoying your time there. I would imagine it's an exciting time to be working with that team.
Thank you Elliot! Glad to hear that we were colleagues. It is indeed an exciting time, and I wish the best of luck to your startup.
@@YaguangLi Thanks!
Damn, this is like a gift to the learners of RL. You should do more videos on RL if you can, YouTube only has a handful of good RL videos
Thanks, I'm glad you liked it. I may make more videos in the future, thanks for the encouragement.
it's so entertaining and intriguing, thanks for your work, Elliot!
Hey Elliot, Loved this explanation so much, man! Keep up the awesome work. I'm a beginner and I feel RL is often looked upon as a difficult paradigm due to the heavy math in there, but people like you are a blessing, for putting the ideas in such an outstanding fashion :)
Thanks!
Really great video! I enjoyed the graphical representation of the gradient. Can't wait to see more in-depth policy gradient related videos.
Thanks, Mohsen!
Thank you! It is extremely helpful to see a detailed worked example. Your video is very much appreciated.
I'm glad to hear you found it helpful.
A perfect video, I have never seen anything so good 👏👏 Thanks from Brazil 😁
Thanks, Vinicius!
One of the best Videos I have ever watched for RL AI! Great work and thanks a lot.
Thanks! Glad you liked it.
This was a very helpful video for me, thank you Elliot! The toy problem with the robot was so helpful to visualize everything, especially showing the state-action sequences with the probabilities
Thanks! I'm glad you found it helpful.
This is simply the best video I have seen on this topic this year!
Thanks!
the best way for beginners to understand the policy method! thanks for your hard work.
Thanks! I'm glad you liked it.
Oh my god this is so clear, thank you! You make great visualizations!
Thanks!
Keep up the good work man! This was amazing :)
Thanks!
Love it. It's never been this clear to me as a visual person, thank you. And we need a lot, a lot more. Keep it up, subscribed 100%
Thanks!
Thank you for a simplified great explanation. You took a lot of trouble to create this material and it shows in your presentation.
Thanks! I'm glad you found it helpful.
The best policy gradient explanation
Thanks!
Superb, bro, this is something which is missing in most of the other videos... This really helps build the intuition for beginners. Keep making such videos, thanks a ton
Thanks, Neeraj! Glad you found it helpful.
This has to be the best explanation ever!
Thanks!
Amazing video! Thank you!!
Well, having watched it a few times, I can say that there's no concept left untouched from the statistics, probability, math, NN, and DL domains in this video. I'll list them as much as possible:
reinforcement learning (of course), neural networks, backtracking, gradients/derivatives, chain rule, partial derivatives, probability trees, multiple random events, sampling, probability sampling, Monte Carlo, mean, standard deviation, PyTorch, and some more I can't count. Thanks to the author again.
I'm glad you liked the video. That's one of the things I love about reinforcement learning, it involves so many different areas of interesting math.
Very nice video. I searched for a long time for an explanation and finally found it.
Thanks, kolo!
This was a phenomenal video and should be required viewing for any class on machine learning. Any chance you are considering a video about policy gradients with continuous action spaces?
Thanks, Andrew! I don't have any plans for making new videos in the near future, but I may eventually, and that would be a good potential topic to cover. I'll add that to my list of future video ideas. Thanks for the suggestion.
This video is a life saver. Thanks a lot for the great explanation
Glad to hear you liked. Thanks for the comment.
BEST ONE SO FAR I HAVE SEEN. SO INTUITIVE
Thanks!
At 12:44, if we want to calculate dR/dx, you changed 0.7 to x, but you still need to replace 0.3 with 1 - x, otherwise you cannot differentiate.
This is one of the counterintuitive parts. I start explaining it at 13:49. This dR/dx is only calculating the first step in the backward pass, as in we are only calculating the partial derivatives of the expected return with respect to the outputs from our neural network (as in the outputs from the softmax function at the end of our neural network). So in this step, we ignore the fact that the probability of going right (prob(right)) has to equal 1 - prob(left), because we don't make any assumptions about what the outputs of our neural network are; we don't assume they are probabilities that have to add up to one. It's only after this step, when we continue backpropagating these gradients through the softmax function of our neural network, that this restriction is taken into account, that prob(left) has to equal 1 - prob(right). This is because it is the softmax function that normalizes the probabilities to require them to add up to 1. So after the gradients get backpropagated through the softmax, they will be correct, even though they start off looking a bit strange, since we start off not assuming anything about our neural network outputs, meaning we start off ignoring the fact that they have to sum to 1.
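Here is a small sketch of that idea in PyTorch (the two rewards and the logits are arbitrary made-up numbers): whether you treat the two softmax outputs as free variables or substitute prob(right) = 1 - prob(left) before differentiating, the gradients with respect to the logits come out the same after backpropagating through the softmax.

```python
import torch
import torch.nn.functional as F

r_left, r_right = 1.0, -10.0                     # hypothetical rewards for each action
logits = torch.tensor([0.5, -0.3], requires_grad=True)
probs = F.softmax(logits, dim=0)

# Treat the two outputs as free variables: dR/dprobs = [r_left, r_right],
# ignoring that the probabilities must sum to 1.
(grad_a,) = torch.autograd.grad(probs, logits,
                                grad_outputs=torch.tensor([r_left, r_right]))

# Impose prob(right) = 1 - prob(left) before differentiating instead.
logits2 = logits.detach().clone().requires_grad_(True)
probs2 = F.softmax(logits2, dim=0)
R = probs2[0] * r_left + (1 - probs2[0]) * r_right
(grad_b,) = torch.autograd.grad(R, logits2)

print(grad_a, grad_b)  # the same gradients with respect to the logits
```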
Excellent! This approach of developing the intuition works best for me . I often wonder how anyone ever came up with the end product of something this? A bunch of dense math as shown in the beginning? I suspect it starts with intuition like this and then is formalized to equations. I think that to improve something or invent new things, one has to go logically and intuitively first, highly formalized math second. When things are too abstract from the start, it’s too easy to get lost in the weeds. Thank you for these videos!
I agree. However, sometimes I figure out new things (or at least new to me) by working out the math first, and then after I see that the math leads to a simple result, I try to figure out an intuitive reason for why it does. So I think both modes, working intuitively and just working with the math symbols, can be useful for exploring new ideas. But I agree that just working with the math symbols feels a bit like exploring in the dark, and it's hard to know where to go next without turning on the lights of intuition.
This is really awesome... the way you explained was just perfect
Thanks, Ruchin! Glad you found it helpful.
It's a shame I only can like this once. Subscribed. Looking forward to your future contents.
Thanks, Shunz!
Really like your explanation!!! Great video, and thanks for sharing!!!
Truly logical, clear from theory to implementation, bravo. One question I still have is "is it really true that the mathematical derivations at the start of video tell so much?" Anyway, they didn't speak to me.
Thanks! For me, those mathematical derivations shown at the beginning of the video are hard to interpret, but after working through the logic I explained in the video, I can make sense of them. One trick that helps me is removing the logs in those equations by using the rule that "the gradient of the log of x" equals "1/x times the gradient of x". So wherever it says "the gradient of the log of P(...)", I just think of it as "1 / P(...) times the gradient of P(...)". Which can be read as "the gradient of the probability of that trajectory given the current neural network weights, divided by the probability of that trajectory given the current neural network weights." I'm not sure if that helps others, but it makes it easier for me to understand. I think they just write out the math in the log form to make the equation more concise.
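A tiny check of that identity with PyTorch autograd (the expression for p is an arbitrary stand-in for P(trajectory | weights)):

```python
import torch

x = torch.tensor(0.7, requires_grad=True)
p = x ** 3 + 2 * x   # arbitrary stand-in for P(trajectory | weights)

(g_log,) = torch.autograd.grad(torch.log(p), x, retain_graph=True)
(g_p,) = torch.autograd.grad(p, x)

print(g_log, g_p / p)  # equal: grad(log p) == grad(p) / p
```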
This explanation is so insightful.
@@user-wr4yl7tx3w Thanks. I'm glad you liked it.
Thank you! Such a clear explanation. I finally understand this!
Thanks for this video
The best explanation anyone can ask for
Really amazing
Thanks!
This is awesome! Thank you for this detailed explanation!
Wow... this video was simply amazing! Thanks a lot... like others have mentioned, I wasn't expecting this! :)
Thanks! I'm glad you liked it.
Just awesome!! Thanks for all your efforts.
amazing video and good explanation
Thanks!
Thanks a lot for making me understand the policy gradient theorem. Super intuitive
@@sidhiquepynkulam Thanks! I'm glad to hear you found it intuitive.
Very nice and intuitive video, thank you!
I wish you visualized several iterations of the optimization procedure to show us the convergence of the algorithm to the desired policy. Maybe we could get a better understanding of exploration/exploitation from your amazing example.
Thanks for the suggestion. That's some good feedback for future videos.
Amazing video, clear and very detailed explanation 👍. Please, keep up doing such a beautiful educational job.
Thanks!
Wonderful Explanation of Policy Gradient. Please do Actor Critic methods also!
@@inforoundup9826 Thanks. I haven’t been making videos lately, but I may again eventually. Thanks for the recommendation.
great explanation, everything makes sense now😃
@@kashifzaheer7804 thanks, that's great to hear.
Thank you for this elaborate explanation.
I'm glad you liked it.
Phenomenally explained, many thanks
Thanks, Björn!
This is brilliant
Thanks!
Marvelous explanation!
Thank you so much for this video! You are a great teacher :)
Thanks, Till! Glad you liked it.
Thanks for the great video and detailed explanation! I'm curious about the reason for multiplying by e.g. 1/0.7 at 36:30. Is it just the strategy we apply during sampling so that later the sum and the calculated gradient will equal the derivation we got theoretically? Thanks!
Yep. You can also think of it as balancing out the fact that we are sampling actions. When we add a value to our total sum, it is kind of like a nudge to our policy. As in, if we get a positive reward after an action, we nudge our policy to take that action again, and if we get a negative reward, we nudge our policy to not take that action again. But we want our nudges to be balanced, so that if we are getting the same positive reward for going left as we do for going right, we want the nudges to go left more to equal the nudges to go right more, so that the policy doesn't actually change (because when there is no benefit to going left or right, we don't want the policy to change). So, for example, if our policy was left 0.1 and right 0.9, we'd sample going left 1 out of 10 times and right 9 out of 10 times. So when we add our going-left sample to our total sum, we multiply the value by 1/0.1 (or we multiply that nudge by 10), and when we add our going-right samples to the total sum, we multiply the value we add by 1/0.9 (or multiply those nudges by 10/9), so that the effects of going left and going right are considered equally (the going-left nudge is multiplied by a bigger number to account for the fact that it is sampled less often).
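A small numeric version of that left-0.1 / right-0.9 example (a sketch; the reward of 1.0 for both actions is a made-up placeholder): with the 1/prob weighting, the two nudges come out equal, so a policy with no reason to prefer either action isn't pushed in either direction.

```python
import torch

probs = torch.tensor([0.1, 0.9], requires_grad=True)  # prob(left), prob(right)
reward = 1.0                                           # same reward either way

# Out of 10 sampled actions we expect 1 left and 9 right,
# and each sample is weighted by 1 / (its sampling probability).
loss = -(1 * (probs[0] / 0.1) * reward + 9 * (probs[1] / 0.9) * reward) / 10
loss.backward()
print(probs.grad)  # tensor([-1., -1.]): equal nudges, so the policy isn't pushed left or right
```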
This was incredible, thank you
Thanks, Matt!
Great explanation! so intuitive
Thanks!
Awesome explanation!
what an excellent lecture thanks
Thanks!
This is just amazing. One of the best explanations out there for vanilla PG. I do have a question though. At 48:20, you code probs/probs.detach() to detach probs from backpropagation. However, wouldn't that just return 1 since we're dividing it by itself? Shouldn't it be just probs.detach()?
That part of the code does look odd, and you are right, dividing the value by itself will equal one. But if we replaced `probs / probs.detach()` with just `probs.detach()`, then that line would be:
loss = -(probs.detach() * future_returns).sum() / num_episodes
...but when you detach a tensor, it becomes equivalent to a constant (and by that I mean the tensor will no longer backpropagate any gradients through it). And since `future_returns` came from the environment, it's also a value that we can't backpropagate through, and `num_episodes` is a value that didn't depend on our policy, so it's also a constant that we can't backpropagate through. So if we called `loss.backward()` using the above `loss`, it wouldn't do anything because all the values that were used to calculate that loss were treated as constants.
It's like saying: if `loss = -(3 * 5) / 7`, how will changing the parameters of our policy change the loss? And the answer is that it won't, since all the values used to calculate the loss are constants. But when we keep the attached `probs` in the equation, it looks more like this: `loss = -((policy_output / 3) * 5) / 7`, and now we can see how changing the output of our policy will change the loss, and we can calculate a gradient for the parameters of our model.
So then the question is why do we also need the detached probs, why not just use:
loss = -(probs * future_returns).sum() / num_episodes
...and the reason is that adding the division of the detached probs balances out the fact that we are expected to only sample those specific actions with a probability of `probs`. For example, let's imagine we only have two actions, A and B, and that our current policy samples A 99/100 times and B only 1/100 times. And that action A returns a future reward of 1.01 and action B returns a future reward of 1.02. So even though B is sampled much less often, it's actually the slightly better action to take, and our gradient should favor taking action B more often. So now let's imagine we sample 100 actions and as expected we sample action A 99 times and B only 1 time. Then if we used the loss directly above, our loss would look like this:
loss = -(.99 * 1.01 + .99 * 1.01 ... (that repeated a total of 99 times) ... + .01 * 1.02) / 100
But the `.99` and the `.01` are actually the outputs of our model, which are the prob for A and the prob for B, so we can think of it more like this:
loss = -(prob_A * 1.01 + prob_A * 1.01 ... (that repeated a total of 99 times) ... + prob_B * 1.02) / 100
So now if we ask how can we minimize this loss, the answer will be to increase prob_A, because that will make this sum more negative, but this isn't what we want. We actually want to increase B since it returns a better future return.
So let's instead try it again but this time also with the `probs.detach()` values, it will look like this:
loss = -((prob_A / .99) * 1.01 + (prob_A / .99) * 1.01 ... (that repeated a total of 99 times) ... + (prob_B / .01) * 1.02) / 100
And now if we ask how can we minimize that loss, the answer will be to increase prob_B (it has a slight edge now over increasing prob_A). We can see this more clearly if we simplify the sum:
loss = -(99 * (prob_A / .99) * 1.01 + 1 * (prob_B / .01) * 1.02) / 100
And now if we replace dividing by .99 with multiplying by 100 / 99, and also replace dividing by .01 with multiplying by 100 / 1, we get:
loss = -(99 * (prob_A * 100 / 99) * 1.01 + 1 * (prob_B * 100 / 1) * 1.02) / 100
And the 99s cancel out, and the 1s cancel out, and the 100s cancel out, and we are left with:
loss = -(prob_A * 1.01 + prob_B * 1.02)
And now we can clearly see that increasing prob_B rather than prob_A will make the loss more negative.
So we need to divide our probabilities by the detached versions of themselves to compensate for which probabilities will be sampled more often or less often. Dividing by `probs.detach()` is like saying, "we are only going to sample this action with a probability of `prob`, so we need to boost the weight of this sample in our loss by the inverse of that probability (if it's a low probability, boost it by a lot, and if it's a high probability, only boost it by a little)."
So in `probs / probs.detach()`, the numerator is there so that we can propagate gradients, and the denominator is there to boost those calculated gradients by the inverse of those probabilities.
I hope this helps.
@@elliotwaite Hey, thank you SOOO much for this detailed explanation. You are just too awesome, and to properly explain it like this is really commendable. So we can say that probs/probs.detach() is something like a normalizing term? Because the equation after probs/probs.detach() looks similar to a supervised learning technique where we are finding the loss on the basis of how good the prediction was. Did I get that right?
@@sarvagyagupta1744 yeah, the 1/probs.detach() part can be thought of as a kind of normalizing term. However, I usually think of normalizing as making sure a sum adds up to 1, which I don't think that term does, so I think of it more as a re-weighting term, or a balancing term. But yeah, it's very similar to the supervised learning equation, but because we aren't given the dataset, and are instead generating the dataset by sampling actions using the probabilities output from our current policy, we have to add that 1/probs.detach() term to compensate for that difference.
@@elliotwaite yeah. I kinda felt that I was using the term very loosely. I was thinking of it as clipping instead of normalising. So is that a better comparison? Like how we do in PPO?
@@sarvagyagupta1744 I think of clipping as something that is conditionally applied. For example, in PPO-clip, the clipping only kicks in if the new policy starts to get too far away from the old policy. Whereas the 1/probs.detach() term is always applied. So for me, clipping means something different. But that might just be how I think of that word.
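Here is the A/B example from a few replies above as a runnable PyTorch sketch (the probabilities, returns, and sample counts are the made-up numbers from that reply), comparing the gradients with and without the `probs.detach()` division:

```python
import torch

probs = torch.tensor([0.99, 0.01], requires_grad=True)  # prob_A, prob_B from the policy
returns = torch.tensor([1.01, 1.02])                     # future returns after A and B
counts = torch.tensor([99.0, 1.0])                       # how often each action was sampled (out of 100)

# Without the probs.detach() division: minimizing this pushes prob_A up the hardest.
loss_naive = -(counts * probs * returns).sum() / 100
(grad_naive,) = torch.autograd.grad(loss_naive, probs)
print(grad_naive)  # tensor([-0.9999, -0.0102])

# With the division: now prob_B gets the (slightly) bigger push, as it should.
loss = -(counts * (probs / probs.detach()) * returns).sum() / 100
(grad,) = torch.autograd.grad(loss, probs)
print(grad)        # tensor([-1.0100, -1.0200])
```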
THIS IS GOLD!
Thanks!
Thank you very much! I finally understand gradient!
Hi Elliot, do you have sample code, say using PG to play games in Gym? I can't find anything that explains it as cleanly as you do!!
I would pay good money for this video
Thanks! I'm glad you found it valuable.
This is really helpful thank you.
😄 I'm glad you found it helpful.
Great work! Thank you!
🙂 I'm glad you liked it.
Loved this - Please do a video on contextual bandits too!
Thanks for the comment and suggestion. I do like the bandits problems. I haven’t been making videos recently, but maybe I’ll get back into it eventually. I was first introduced to the bandits problems in David Silver’s RL course lecture here, which you might also like: th-cam.com/video/sGuiWX07sKw/w-d-xo.htmlsi=4MtbQCR4kO4FL-DD
@@elliotwaite Thank you! I actually finished that series up - it's amazing :) But even that covers contextual bandits only very briefly (video #9 in the series). And if I liked anything on RL after that lecture series, it's your video - kudos to you on it!
@@pk_1320 Thanks!
TY very much! It was really good!
@@eduardtsuranov712 thanks!
This video really helps! Thx Elliot!
Thanks for the amazing lecture. I have a question: at 32:41, I can't understand the green term.
I thought that 0.7 is the value of one of the parameters (x1), and that we change this x1 for the best policy, but at that point in the video Ŷ has n * 0.7 * x1 * [avg return after x0]. Could you elaborate on why we have to multiply by 0.7?
At that point in the video (32:41), Ŷ is not the correct estimate of Y. I probably should have clarified that by renaming that Ŷ to something like "an initial guess of Ŷ". I was trying to show how you could go about trying to figure out a way to estimate Y by coming up with an initial guess for Ŷ, and then modifying that initial guess as needed to make it correct. For the initial guess, I intentionally use the wrong algorithm of "sample episodes, and then for each action, add that action times the reward received after taking that action to a sum, and that sum will be our initial guess of Ŷ". After this initial guess, I see that the sum will be incorrect by having the extra "n" multiple and the extra "0.7" multiple. So to correct for this, I update the algorithm by adding the step: "also, divide each item in the sum by n (the total number of episodes sampled) and by the constant value of the action probability (which would be 0.7 when sampling the x0 action)." After these changes, Ŷ (our sampled estimate of Y) is correct, meaning that it will tend to converge towards Y as we sample more episodes. And we have to divide by 0.7 (the value of x0) instead of the variable x0 to get the correct estimate of Y. This is because if we divided by x0, it would cancel out the x0 variable instead of the 0.7 value, and then Ŷ would not look like the Y equation above it that we are trying to estimate.
Does that help? Let me know if you have any follow-up questions.
it helps me a lot, thank you
Great one
tnx
Great video!
Thanks!
excellent tutorial !!! thanks a ton
Thanks. Glad you liked it.
Great video! Can you make another one like this (with animations instead of maths) that demonstrates how LLMs work (including fine-tuning)? I don't understand math at all but am very interested in learning more about LLMs. I just cannot understand the math and statistics behind it.
Thanks! I haven't been making videos lately, so I probably won't make an LLM video soon. But I appreciate the suggestion.
33:40 Could someone explain this? x5 is the probability of taking the action to go right to receive the reward (as they're the variables for the action probabilities 27:52), so why do we do .2 * x5? As .2 and x5 are both the probabilities of going right, why are we multiplying them with each other? Why do we multiply the probability of going right by itself? It doesn't make sense. :/
Great question, this is a confusing part, and you are correct, we should not be multiplying by both .2 and x5. I probably should have made it clearer at that part of the video that that example is what should not be done. The reason I started off showing the wrong way to do it is that at 33:40, where I am trying to figure out how to estimate that Y value, I wanted to show how someone might go about figuring out how to estimate that value if they didn't already know the correct method, so I pretend I don't actually know how to estimate it and I guess at a solution (which at first is the wrong solution, but I figure out how to correct it later). My first attempt at a solution is to sample episodes, and each time I take an action, I add to my sum that action probability (as a variable) multiplied by the reward received after taking that action. The sum I actually estimate with this method is the Ŷ value at 33:40, which as you noticed is an incorrect estimate of Y. I have an extra multiple of n, and I am multiplying by the action probabilities twice, once as a constant value and once as a variable. So to correct for this, in my next attempt at estimating the sum, I divide the whole thing by n, and also, each time I take an action, I add to the sum that action probability (as a variable) divided by that action probability (as a constant value) multiplied by the reward received after taking that action. By dividing by that action probability as a constant, I cancel out the implicit multiplication by that action probability that comes just from the fact that I only add this new value to the sum after having already taken that action. This corrected estimate can be seen as the Ŷ value at 35:40, and this method is how to correctly estimate Y.
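(If it helps, here is a toy simulation of those two attempts, with made-up numbers rather than the exact environment from the video:)

```python
import torch

# A toy check of the two attempts described above, with made-up numbers rather
# than the video's exact environment. Suppose the "go right" action is sampled
# with probability 0.2 and the return received after it is 10, so the term we
# want to estimate behaves like Y = x5 * 10 around x5 = 0.2 (value 2, gradient 10).
torch.manual_seed(0)
x5 = torch.tensor(0.2, requires_grad=True)
n = 100_000
took_right = (torch.rand(n) < x5.detach()).float()  # 1.0 on the episodes where "right" was sampled

# First attempt: add x5 * reward each time the action is taken. The result is
# off by the extra multiple of n and the extra multiple of 0.2.
naive = (took_right * x5 * 10.0).sum()
print(naive)  # ≈ n * 0.2 * x5 * 10 = 40000, nowhere near Y = 2

# Corrected attempt: also divide by n and by the constant 0.2 (x5.detach()).
corrected = (took_right * (x5 / x5.detach()) * 10.0).sum() / n
print(corrected)                           # ≈ 2.0, matching Y at x5 = 0.2
print(torch.autograd.grad(corrected, x5))  # ≈ (tensor(10.),) matching dY/dx5
```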
I hope this helps. Let me know if there is anything you are still confused about.
Thank you for posting this question. I got stuck there as well
thank you so much! god bless you
Good work bro.
Thanks!
This is great
Thanks!
This is amazing
Thanks!
Dude you're awesome!
😊
Yesss! Awesome video
Thanks!
Extremely crisp
Thanks, Olav!
Nice one!!!
Thanks!
Thanks for the video! But I'm confused why the formula used in the log probability code at the end or at 37:22 was different from the formula at 38:05 (the one with log). These formulas have different gradients for the same variables.
Are you talking about the difference between the one at 37:22 that starts with Ŷ = (1 / 4) * (log(x0) * [3 + 0 + 0] + ...), and the one at 38:05 that starts with Ŷ = (0.7) * log(x0) * [avg return after x0] + ... ? If so, the first one is what we got from sampling 4 random episodes from the environment, and the second version is what we would expect to get on average if we kept sampling more and more episodes, meaning, the value would converge to that equation. So those equations do have different gradients. The second equation gives us the gradient we actually want, and the first equation just gives us an estimate of that gradient, using only 4 random episodes.
The first equation is an example of what we will actually get in practice when we sample episodes, and the second equation is the rationale for why using the first equation works, because on average, it will point us in the direction we are trying to go.
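(Here's a toy simulation of that idea, with a made-up one-step environment rather than the video's example, showing the sampled gradients averaging out to the expected-value gradient:)

```python
import torch

# A toy simulation (made-up one-step environment, not the video's example):
# action 0 is sampled with probability p = 0.7 and gives a return of 3,
# action 1 gives a return of 1. The expected return is R = p*3 + (1-p)*1,
# so the true gradient is dR/dp = 2. Each sampled episode gives a noisy
# gradient estimate (like the 37:22-style equation), but averaging those
# estimates over many episodes converges to 2 (the 38:05-style equation).
torch.manual_seed(0)
p = torch.tensor(0.7, requires_grad=True)
returns = torch.tensor([3.0, 1.0])

grads = []
for _ in range(20000):
    action = int(torch.bernoulli(1 - p.detach()))    # action 1 with prob 0.3, else action 0
    prob_taken = p if action == 0 else 1 - p
    y_hat = torch.log(prob_taken) * returns[action]  # single-episode log-prob estimate
    (g,) = torch.autograd.grad(y_hat, p)
    grads.append(g)

print(torch.stack(grads).mean())  # ≈ 2.0, even though each g is either 3/0.7 or -1/0.3
```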
Let me know if this didn't answer your question, or if you have any other questions.
@@elliotwaite Thanks alot!
Masterpiece
Eureka! Thank You!
yay! I found a gem :) thanks!
Thanks, Ufuk!
I have a doubt. At 21:00, shouldn't the expected reward of each path be prob1*reward_1 + prob2*reward_2, as compared to the one used, (prob1*prob2) * (reward_1 + reward_2)?
If the example is that prob2 comes after prob1, it should be prob1*reward1 + (prob1*prob2)*reward2. Said another way, only prob1 of the time will you ever receive reward1, and only (prob1*prob2) of the time will you ever receive reward2. This is because the probability of receiving each reward is the product of all the probabilities along the path to that reward. Does that make sense? Let me know if you still have any doubts.
@@elliotwaite yeah my bad, so assuming we get the rewards at each particular state as well, and not all the net rewards at the end upon reaching the terminal state, the estimated reward would be (prob1*reward_1) + (prob1*prob2*reward_2). For the example that you are using, the total rewards along the path are only received after termination is achieved (like chess, and unlike an MDP where at each state you get some reward), hence (prob1*prob2) * (net rewards along the way) works. I am assuming that if the first case were used, the "expected rewards" would have been different for each path, as well as the policy gradients regarding them. Would love it if you could elaborate on whether it would also work with case 1, the above example.
@@AmanSharma-bj2yc, I think I'm confused. Can you let me know if we agree on the following things? In the example I am using, the rewards are received along the way, not just at the end. It can be modeled as an MDP with a start state and terminal states. However, the MDP does not allow any cycles, so it's like a one way graph. The expected reward of the policy is the expected reward from the start state. The expected reward from any of the other states could be different. Do we agree on these points or are we thinking about these things differently?
Wow. Very intuitive explanation. I hope you can create such videos for all RL approaches. Btw, could I get the slides of this lecture?
Thanks, glad you liked it. I haven't been making videos lately, but I appreciate the suggestion of making more RL videos, I may make more in the future. Also, I currently don't have the images I used for this video in slide format.
@@elliotwaite I am watching the video for the second time. At 7:05, how did you calculate the expected return? Based on my calculation, for the given policy, the expected return for state 0 is (1*3) + 1*0.1*(-10) = 2.9, but you obtained -0.7.
@@Torabi_ahmad18973, there are different ways to break up the sum of the expected return. I was summing over all the paths, and for each path, adding the probability of taking that path times the total reward received along that path. There are two possible paths. Path A has transition probabilities (1, 1, 0.1) giving a total probability of 0.1, and the total reward on that path is (3 - 10 = -7). And path B has transition probabilities (1, 1, 0.9, 1) and a total reward of (3 + 0 + 10 = 13), so to get the expected return we sum over both paths, which is ((0.1 * -7) + (0.9 * 13)), which equals (-0.7 + 11.7 = 11).
If you want, you can also break it up by each reward and the probability of getting that reward. So there is a 1.0 chance of getting the reward of +3, a 0.1 chance of getting the -10, and a 0.9 chance of getting the +10. So ((1.0 * 3) + (0.1 * -10) + (0.9 * 10)) = 11. You get the same result, just in a different way.
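(Just to double-check the arithmetic in code, both ways of breaking up the sum give the same 11:)

```python
# Checking the arithmetic from the reply above, both ways of breaking up the sum.

# Per path: (probability of the whole path) * (total reward along that path).
paths = [(0.1, -7), (0.9, 13)]
print(sum(p * r for p, r in paths))    # -0.7 + 11.7 = 11 (up to float rounding)

# Per reward: (probability of ever receiving that reward) * (that reward).
rewards = [(1.0, 3), (0.1, -10), (0.9, 10)]
print(sum(p * r for p, r in rewards))  # 3 - 1 + 9 = 11
```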
Let me know if this helped clarify or if there is anything that is still confusing.
One of the best ways to introduce policy gradients. Can u also make a video on actor critics
Thanks! I haven't been interested in making more educational videos lately, but I've added your suggestion to my potential future videos list, so maybe I'll get to it eventually.
@@elliotwaite I guess u should give it a shot. There are hardly any people on TH-cam who are explaining such complex topics in an intuitive way. All the best.
Beautifully done.
At 29:19 in the video, while writing the expression for Y, when you take the derivative with respect to x0, you have only considered the value 0.7 as a constant. However, 0.7 is the exact value of x0.
Why is this valid??
I see why this is confusing, I probably should have explained this more clearly in the video.
The first thing to remember is that Y is just a function we are inventing to make it easier to calculate the gradient of R (the expected return). When calculating the gradient of R, we have to consider x0 as a variable wherever it shows up in that function, and we can't just use its 0.7 value as a constant (like you mentioned in your comment). But we can use the 0.7 value as a constant in Y, because we are just inventing a function that will give us the same gradient as R. The reason we can use the 0.7 value as a constant in those places is because we add in those [avg return after x0] values, which essentially compensate for us using the 0.7 value as a constant in the other places. It's kind of like we are saying, "okay, I know the 0.7 is really the value of x0, but just consider it a constant here, and we'll correct for that by adding in these [avg return after x0] values over there." For example at 30:06, I show an example of taking the gradient of both Y and R with respect to x3. In Y, x3 is a variable in one place, and in the other places we use the value of x3 (0.6) as a constant. However, in R, we do as you would expect, we use x3 as a variable in all places. But you can see that dY/dx3 and dR/dx3 end up equaling the same thing, which is what we are looking for, and is why we can take the gradient of Y to get the gradient of R.
Let me know if this clears it up or if you're still confused in this area.
@@elliotwaite Thanks for the explanation, however I do not understand how the [avg return after x0] values compensate for using 0.7 as a constant.
Although we are constructing a new made-up function, 0.7 is exactly x0, and had x0 been 0.3 then we would have used 0.3 instead of 0.7. So I am confused, as the made-up function clearly seems like a function of all of the [x0, x1, x2, ...]s, exactly???
@@rishipatel8010 perhaps looking at a simpler example of an R will help (this equation of R won't be related to a specific RL environment, it's just a random simple equation). Let's imagine R = 10 * x0 + 11 * x1 + 12 * x0 * x1. And let's say x0 = 0.7 and x1 = 0.3. Then we could create this Y function: Y = 10 * x0 * [1 + (1 / 10) * 12 * 0.3] + 11 * 0.7 * x1 * [(1 / 0.7) + (1 / 11) * 12]. And if we take the partial derivatives of R and Y with respect to x0 and x1, they will be the same. But if we rewrite Y as: Y = 10 * x0 * [extra stuff for x0] + 11 * 0.7 * x1 * [extra stuff for x1], you might ask why we use 0.7 as a constant in this equation rather than using x0 as a variable. And it's because if we replaced the 0.7 with x0, the gradient of Y would no longer be the same as the gradient of R. This is because the parts in the square brackets compensate for the fact that 0.7 is used as a constant in that place rather than using x0 as a variable there. The same thing is happening with the [avg return after x0] parts of the original example. Does this example help?
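(In case it's useful, that toy example can also be checked directly with autograd, same numbers as above:)

```python
import torch

# Verifying the toy example above with autograd: R and Y are different
# functions, but at x0 = 0.7, x1 = 0.3 their partial derivatives match.
x0 = torch.tensor(0.7, requires_grad=True)
x1 = torch.tensor(0.3, requires_grad=True)

R = 10 * x0 + 11 * x1 + 12 * x0 * x1
Y = 10 * x0 * (1 + (1 / 10) * 12 * 0.3) + 11 * 0.7 * x1 * ((1 / 0.7) + (1 / 11) * 12)

print(torch.autograd.grad(R, (x0, x1)))  # ≈ (13.6, 19.4)
print(torch.autograd.grad(Y, (x0, x1)))  # ≈ (13.6, 19.4), the same gradients
```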
@@elliotwaite Thank you so much for the lightning fast reply, this one finally helped me develop my intuition, and I convinced myself with proof.
Again thanks a ton, you are probably the best content creator on policy gradient explanation. I thought you will be interested in what I developed from my understanding. So here it goes.
So what I understand from your explanation is that for any polynomial function depending on x0, x1, x2, ..., i.e. f(x0, x1, x2, x3, ...) where each xi has a max power of 1 (i.e. no (x0)*(x0) terms),
I can always create a g(x0, x1, x2, x3, ...) = x0*(partial derivative of f wrt x0 at the points x1, x2, x3, ...) + x1*(partial derivative of f wrt x1 at the points x0, x2, x3, ...) + ...
The quantity in the brackets (the partial derivative of f wrt x0 at the points x1, x2, x3, ...) will have no dependency on x0 and will have a constant value once we substitute in the fixed values of x1, x2, x3, ...
Which is exactly where we want to calculate our derivative for our gradient ascent!!!!!!! At some fixed point combination of x0, x1, x2, ...
The point is both f and g have the same partial derivatives. And this is exactly what you have done in the video as well.
@@rishipatel8010 nice. Yeah, that's a good general way of thinking about it.
great video. The only thing I can't quite grasp after multiple watchings is ... what are the targets to the neural network prior to backpropagating? Can you break it down a little more for each time step? I've got the forward propagation. I just need the per-time-step targets that are required in order to calculate the partials. I'm also asking you a question at 45:00. Tell me more about the 'loss': what dimension is it? Is it d=2 because you have two actions? Doesn't the loss function require two arguments: the predicted output and the target? Thanks a lot.
Ah, I can see why this is confusing. In this algorithm we actually don't use target values.
If we knew which actions we were supposed to take, we could use those actions as target values, but we don't have that information.
We could also try to say that our target is to receive infinite reward for each episode, and the farther away we are from that target, the greater our loss will be, but then all our losses would be infinity, so that doesn't work so well either.
So what we do instead is just come up with a loss value directly, without using target values. And this loss just needs to be a value such that, when we update the parameters of our model in the direction that minimizes that loss, it results in our policy receiving more total reward on average. The reasoning behind why we came up with the loss that we did is explained in the video, but once we have that loss, we just call loss.backward() on it directly, without using any target values, and we update the parameters of our policy using that gradient.
Let me know if you're still confused about anything and I can try to explain it further.
@@elliotwaite ok this is new! We come up with a loss value directly, he says... Is this the same as the difference between output and target value? In order to backpropagate, I'm used to calculating the errors at the output nodes. What role does the loss play in the backpropagation algorithm? Thank you for your reply, very kind of you to write.
@@hoddy2001 the loss we use here is not the difference between an output and a target since we don't have a target value. This loss is just something that produces the correct gradient when we try to minimize it. Unlike a loss that is the difference between the output and a target, the loss we use here can be negative, which is impossible with an output-and-target loss that represents an error amount, because you can't have a negative error amount. But for the loss we use, a negative loss is fine, we just try to make it lower. So if it's negative, we just try to make it more negative.
The main insight is that our loss isn't an error amount, it's just a value that generates the correct gradient when we back propagate from it.
Let me know if anything is still confusing.
@@hoddy2001 ... The role the loss plays in the backpropagation algorithm is that the loss is what gets minimized. Calling loss.backward() gives us the gradient of the loss with respect to the parameters, and stepping the parameters against that gradient moves them in the direction that decreases the loss the fastest. For example, if our model only has 3 parameters, we can think of the parameter space of that model as a 3D space, and calculating the gradient of the loss figures out which direction in that 3D space from the current parameter values will decrease the loss the fastest. Then we update the parameters by taking a small step in that direction. If instead we have a million parameters, it's the same thing but the parameter space is a million-dimensional space (which is hard to visualize, but the concept is the same).
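(A tiny concrete sketch of that, unrelated to the video's actual network: a scalar loss with no target, one backward() call, and one small step against the resulting gradient.)

```python
import torch

# A minimal sketch (not the network from the video): two parameters, a scalar
# loss built from them with no target value, one call to loss.backward(), and
# one gradient-descent step. backward() applies the chain rule through the
# recorded operations and fills params.grad with d(loss)/d(params).
params = torch.tensor([0.5, -1.0], requires_grad=True)
inputs = torch.tensor([2.0, 3.0])

loss = -(params * inputs).sum()   # any scalar that depends on the params works
loss.backward()
print(params.grad)                # tensor([-2., -3.])

with torch.no_grad():
    params -= 0.01 * params.grad  # step against the gradient to decrease the loss
```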
@@elliotwaite thank you for your patience, my good man. I am in a context where I don't have access to loss.backward() and need to know, computationally, how to compute the partial derivatives using the loss value. I understand now that the loss is a scalar value representing what would otherwise be the value of the error function. It is a value such that, when the partials are calculated with respect to it, they point us in the direction of the gradient. Can you refer me to the definition of loss.backward()? How does it compute partials from this scalar value?
But wouldn't probs/probs.detach() give a vector of ones?
@@NeuralGamerAI yes, that is an interesting point, that vector will be a vector of ones, but we don't just use the values of that vector (which would be somewhat useless since they are all just ones, as you mentioned), we instead use the gradient that gets backpropagated through that vector (the gradient of the loss with respect to it).
Another way to think about it is that when we are backpropagating the gradient values through that operation, whatever gradient arrives at the output of that operation will get multiplied by whatever the value of 1/x was in the forward pass, and then that new gradient will continue being backpropagated.
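(A quick check of that in code, with made-up probabilities:)

```python
import torch

# A quick check of the point above, with made-up probabilities: the forward
# value of probs / probs.detach() is all ones, but its backward pass is not a
# no-op. Each incoming gradient gets scaled by 1 / (the forward value of that
# probability).
probs = torch.tensor([0.7, 0.2, 0.1], requires_grad=True)
ratio = probs / probs.detach()
print(ratio)            # tensor([1., 1., 1.], grad_fn=<DivBackward0>)

ratio.sum().backward()  # send an upstream gradient of ones through the ratio
print(probs.grad)       # tensor([1.4286, 5.0000, 10.0000]), i.e. 1 / probs
```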
@@elliotwaite Thank you very much, very useful explanation.
Subscribed in 0.5s, congratulations 🎉
Thanks! What does 05s mean?
Sorry, I fat fingered it 😄. Corrected now! @@elliotwaite
@ah, hah, got it.