Great explanation, thank you!
For me, the most confusing thing with PPO was what is the new policy vs the old policy (for the calculation of the ratio in the loss function) and how are gradients calculated when there are "two" policies.
For some reason this is very often quickly glossed over, but it's pretty clear once you see the implementation. For anyone who was confused like I was, here's how I understand it.
The old policy is the policy you use to choose actions: you calculate all the old_policy values during the environment sampling phase. These are numerical values that don't need to contribute to the gradients, so you can treat them as constants (plain scalars).
Then you do several epochs of updates to the model:
On epoch 0, new_policy = old_policy, so the ratio is always 1 and the loss reduces to just the advantage term (and no clipping, obviously).
On the next epoch, the model has been updated, and that updated model is now the new policy; but the old policy values are still the same (already calculated). The ratio may therefore differ from 1, and you make an update according to the full loss formula (clipping and all).
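To make that concrete, here's a minimal PyTorch-style sketch of how I picture it (the function and variable names are my own, not necessarily the video's code):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # old_log_probs was computed during the sampling phase; detaching it makes it
    # a constant, so gradients only flow through new_log_probs
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # on the first epoch ratio == 1 everywhere, so both terms reduce to the advantage
    return -torch.min(surr1, surr2).mean()
```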
You know, I've seen a lot of people get confused over PPO, but in reality it is not that different from any other policy gradient method. The agent has an initial policy that it uses to collect samples for its trajectory bank, using the initial parameters of its actor function approximator. Once it has enough trajectories, it updates its parameter estimate with the modified surrogate cost function (it optimizes L). That gives it an updated policy, which it then treats as its old policy and uses to gather new experience.
Great talk, very clearly presented
Thanks a lot!!! This video is a true masterpiece. Got a lot of learning "clicks" out of it.
Thanks for sharing. I was looking for code that implements PPO for parallel environments. You rule!
I've been looking for a straightforward explanation of a functioning PPO implementation for a while until I found this. Thank you for taking the time to make it. I plan on using an LSTM model but am a little confused about how I would grab previous states to feed to the LSTM as a 3-dim tensor. If the environments are run in parallel, would I grab the last n timesteps from each environment to feed the network? It seems like that would throw off the preprocessing before sending it to the model to update.
You would definitely need to adjust the pre-processing to accommodate that, but it's not difficult.
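For example, one way to shape it is to keep the rollout buffer as (T, num_envs, obs_dim) and treat each environment's column as its own sequence; the sizes below are illustrative, not the video's settings:

```python
import torch

T, num_envs, obs_dim = 256, 8, 3                  # illustrative sizes
rollout_obs = torch.randn(T, num_envs, obs_dim)   # states gathered during sampling

seq_len = 16                                      # last n timesteps fed to the LSTM
# take the trailing window from every environment and put envs on the batch axis
lstm_input = rollout_obs[-seq_len:].permute(1, 0, 2)   # (num_envs, seq_len, obs_dim)

lstm = torch.nn.LSTM(input_size=obs_dim, hidden_size=64, batch_first=True)
output, (h_n, c_n) = lstm(lstm_input)             # output: (num_envs, seq_len, 64)
```

The key point is to keep each environment's timesteps together so sequences from different environments never get interleaved.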
In the paper, they mention T should be much less than the total number of steps in the episode, but here you have it as 256 (which is greater than the 200 steps per Pendulum episode, for example). What's your experience with small vs. large values of T?
Great talk! I'm just wondering what I can do if the episodes don't all have the same length.
Is "std" of actions kept constant during the training? Shouldn't we update both mean and std of actions?
Both. The standard deviation represents the spread of the distribution, and the mean represents the offset from 0. The std must be optimized so that actions fall in a constrained range of values (this way the agent's actions will be not so random), and the mean so that it settles on a general value (negative/positive, small/large). Hope I'm not missing something.
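For example, a common pattern is to make log_std a learnable parameter so the optimizer adjusts both the mean and the spread; this is a minimal sketch with my own names, not necessarily what the video does:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                      nn.Linear(64, act_dim))
        # log_std is a free parameter, so gradient updates change the spread too
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)
```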
excellent explanation
Why do you need entropy loss here? Entropy of the normal distribution is a function of standard deviation only, and your standard deviation seems to be fixed.
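To illustrate the point: the entropy of a Gaussian depends only on its standard deviation, 0.5 * (1 + log(2*pi)) + log(std), so if the std never changes the entropy bonus is a constant. Quick check (assuming PyTorch distributions):

```python
import torch
from torch.distributions import Normal

print(Normal(0.0, 0.5).entropy())   # same value...
print(Normal(3.0, 0.5).entropy())   # ...no matter what the mean is
```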
Thank you for saying that actor-critic is sensitive. Everyone said it's state of the art, but when I tried A2C I was convinced there was nothing wrong with my code; I kept trying different networks and learning rates but nothing worked. I even got NaN as the action in the continuous case. I mean, why?
Hey there, thank you for posting your videos. I have watched them all several times. I wish I had started with PyTorch instead of Keras :) Can you explain why the actor and critic losses have opposite signs? (around 16:09)
I recommend you look at the original PPO paper (2017). Long story short, we want to *maximize* the actor's term (i.e. minimize the negative).
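Roughly, the combined objective has this shape; the 0.5 and 0.01 coefficients and the variable names here are typical/illustrative, not necessarily the exact values in the video:

```python
import torch
import torch.nn.functional as F

def combined_loss(surr1, surr2, value_preds, returns, entropy,
                  value_coef=0.5, entropy_coef=0.01):
    actor_loss = -torch.min(surr1, surr2).mean()    # negated: we *maximize* the surrogate
    critic_loss = F.mse_loss(value_preds, returns)  # plain minimization of value error
    # the optimizer minimizes the sum, so the actor term effectively gets gradient ascent
    return actor_loss + value_coef * critic_loss - entropy_coef * entropy.mean()
```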
I see that in the GAE you append the value of the last next state to the list of values for the states visited. See this line: values = values + [next_value].
Why do you do that?
Can you provide the specific timestamp you are referring to?
@@SkowsterTheGeek 13:23
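For anyone else wondering: next_value is the critic's estimate for the state that comes after the last collected step, so the final timestep's TD error can still bootstrap off values[t + 1]. A sketch following the common compute_gae pattern (names like masks, gamma, and lam come from that pattern, not necessarily this video's exact code):

```python
def compute_gae(rewards, values, masks, next_value, gamma=0.99, lam=0.95):
    values = values + [next_value]   # bootstrap value for the state after the last step
    gae, returns = 0.0, []
    for t in reversed(range(len(rewards))):
        # the TD error needs values[t + 1], which only exists for t == T - 1
        # because next_value was appended above
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        gae = delta + gamma * lam * masks[t] * gae
        returns.insert(0, gae + values[t])
    return returns
```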
7:10
Is it necessary to have several parallel environments? Can I achieve good performance with just one?
No. I've had success on simpler problems running just one env. Take a look at the original paper. Algo is in there and it's easy to follow: arxiv.org/abs/1707.06347
Parallel environments just mean you collect more trajectories from the old theta. By the law of large numbers, the more trajectories you get your hands dirty with, the better your estimate of the quality of the policy. There's no need for parallel computation; it just means you'll have to let your single agent loose for a little while longer. That's the only purpose it serves.
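In code it really is just more copies of the same collection loop; a toy sketch using the old gym step API (Pendulum and the sizes here are only examples, and num_envs = 1 is the single-agent case):

```python
import gym

num_envs = 8
envs = [gym.make("Pendulum-v0") for _ in range(num_envs)]
states = [env.reset() for env in envs]

for step in range(256):                        # one collection phase
    for i, env in enumerate(envs):
        action = env.action_space.sample()     # stand-in for sampling from the policy
        next_state, reward, done, _ = env.step(action)
        states[i] = env.reset() if done else next_state
```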
Thanks for sharing.
PPO in Python with full comments, plus a user play mode:
github.com/fatalfeel/PPO_Pytorch
PPO lessons, from beginner to experienced in 30 days:
fatalfeel.blogspot.com/2013/12/ppo-and-awr-guiding.html
Could you please help me with how to use Roboschool? I am getting compatibility issues with gym and its dependencies.