The video that will hopefully save my semester. Not all heroes wear capes ❤
Thank you for the tutorial.
In this class, why do we need max_policy_train_iters and value_train_iters? Is it because the training is repeated multiple times on the same batch of data? (A sketch of how they are used follows the snippet below.)
ppo = PPOTrainer(
    model,
    policy_lr=3e-4,
    value_lr=1e-3,
    target_kl_div=0.02,
    max_policy_train_iters=40,
    value_train_iters=40)
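For context, here is a minimal sketch of how those two arguments are typically used inside a trainer like this (the names mirror the snippet above, but the body is an assumed reconstruction, not the exact notebook code): the policy is updated repeatedly on the same rollout until either max_policy_train_iters is reached or the sampled KL divergence exceeds target_kl_div, while the value network simply gets its own fixed number of regression passes.

import torch

def train_policy(policy, optimizer, obs, acts, old_log_probs, gaes,
                 clip_val=0.2, target_kl_div=0.02, max_policy_train_iters=40):
    # Reuse the same rollout for several gradient steps.
    for _ in range(max_policy_train_iters):
        optimizer.zero_grad()
        dist = torch.distributions.Categorical(logits=policy(obs))
        new_log_probs = dist.log_prob(acts)
        ratio = torch.exp(new_log_probs - old_log_probs)
        clipped = ratio.clamp(1 - clip_val, 1 + clip_val) * gaes
        loss = -torch.min(ratio * gaes, clipped).mean()  # clipped surrogate objective
        loss.backward()
        optimizer.step()
        # Rough sample-based KL estimate; stop early if the policy has moved too far.
        approx_kl = (old_log_probs - new_log_probs).mean()
        if approx_kl >= target_kl_div:
            break

def train_value(value_net, optimizer, obs, returns, value_train_iters=40):
    # The value head just gets a fixed number of regression passes on the returns.
    for _ in range(value_train_iters):
        optimizer.zero_grad()
        value_loss = (returns - value_net(obs).squeeze(-1)).pow(2).mean()
        value_loss.backward()
        optimizer.step()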
Hi Edan,
First of all, thank you for the lecture; it was very good.
Would you mind answering a couple of questions, please? (Sorry, I'm a newbie in RL.)
1) The new_log_probs (min 19:17) are computed with the same policy that generated the old_log_probs, right? Because at this point you haven't trained it yet. My guess is that this will only be true for the first iteration, am I right?
2) The target KL_div value is 1%, which feels very small considering the clipping allows the ratio to move by up to 20%, which can be a lot at the beginning of training. With that, the first step would be the only step and the training loop would barely use the ratio at all. Could that be the case? (I hope my question is not too confusing.)
I'm happy you found the video helpful! To answer your questions:
1) Yes, they will be the same, which makes the policy ratio 1 for the first iteration. For every iteration after that, the new_log_probs will change while the old_log_probs stay the same, so the ratio between the two gives you the difference in policy since the start of the update iterations.
2) If I understand your question correctly, then yes, you are right that this could terminate after the first iteration. If it does, however, that means our policy is changing very quickly, which is exactly what we are trying to avoid. I think we should be less concerned with how many iterations we do, and more with how much the sample KL divergence changes, because that is what we are trying to control. So if it stops after one iteration, that shouldn't be an issue. If your question comes more from a place of trying to understand why we use such low values for the threshold, I unfortunately cannot give you a great answer. I don't remember the details of the PPO paper, but I think these values were arrived at via an ablation study and they just happened to be what worked best (I would double-check this against the paper).
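To make the ratio point above concrete, here is a tiny numeric illustration (the probabilities are made up, not taken from the notebook): on the first inner iteration the new and old log probabilities coincide, so the ratio is exactly 1; after a gradient step the probabilities shift, and the ratio and the sampled KL start to move.

import torch

old_probs = torch.tensor([0.25, 0.60, 0.15])         # action probabilities before the update (made up)
old_log_probs = torch.log(old_probs)

# First inner iteration: the policy has not been updated yet, so new == old.
ratio = torch.exp(old_log_probs - old_log_probs)      # tensor([1., 1., 1.])

# After one gradient step the probabilities shift a little (again made up).
new_probs = torch.tensor([0.22, 0.65, 0.13])
new_log_probs = torch.log(new_probs)
ratio = torch.exp(new_log_probs - old_log_probs)      # roughly [0.88, 1.08, 0.87]
approx_kl = (old_log_probs - new_log_probs).mean()    # roughly 0.06, already above a 0.01-0.02 target

So even a modest shift in the probabilities can push the sampled KL past a 1-2% threshold and trigger the early stop, which is the behaviour being discussed above.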
@EdanMeyer Hi Edan, awesome, and thank you for your answers. You did answer both; I was more concerned about finishing after the first iteration, but your explanation of this makes total sense. The training loop is also something I could not infer well from any papers I've read before, and now it feels like things finally fell into place regarding the ratio problem. I guess the KL is not the biggest of problems anyway, but it's good to know we should be aiming for something very low.
Thank you again for taking the time to respond. Great stuff!
Nice and concise. Great job!
Where can I access the code? There is no link.
Where can I find the ipynb? Thank you for this video.
Edit: found in a comment below
Hi Edan. Thank you for the video. I just found your ipynb and ran it. It worked fine. The reward is increasing over time. However, I notice that the value loss is also increasing instead of decreasing. Can you give some insights about that? Thanks!
My suspicion: the 'returns' are calculated incorrectly because the discount_rewards function requires the input 'rewards' to be in chronological order, but the actual input has already been permuted. The 'returns' need to be computed first from the unshuffled rewards and then permuted.
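If that diagnosis is right, a minimal sketch of the fix would look like this (discount_rewards here is a generic reward-to-go helper written for illustration, not necessarily identical to the notebook's; the point is only the ordering of operations): compute the discounted returns while the rewards are still in chronological order, then apply the same permutation to rewards and returns together.

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    # Discounted reward-to-go; assumes rewards are in chronological order.
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = np.array([1.0, 1.0, 1.0, 0.0], dtype=np.float32)  # made-up episode rewards

returns = discount_rewards(rewards)           # compute BEFORE any shuffling
perm = np.random.permutation(len(rewards))    # then shuffle everything with the same permutation
shuffled_rewards = rewards[perm]
shuffled_returns = returns[perm]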
Can you make a video exploring FINRL 2.0? Another topic: a simple implementation of multi-agent RL?
Thank you so much!
Can I have the notebook?
I think this is it: colab.research.google.com/drive/1MsRlEWRAk712AQPmoM9X9E6bNeHULRDb?usp=sharing
04:34 --> It is not a reward, Edan :(
Cool, thanks a lot, very helpful!
Hi Edan. Thank you so much for this super informative tutorial on PPO, in which you also introduced the KL divergence and GAEs that were not used in the original paper. I think the KL divergence in PPO was borrowed from TRPO. The way you calculated the KL divergence is simply the subtraction between old and new log probabilities. I wonder, would it be better if we used the authentic KL divergence formula, KL(P || Q) = - sum_{x in X} P(x) * log(Q(x) / P(x))? Thank you again. I am waiting for your reply.
I think it's because of the way the data is given. The actions x in X are already sampled, and they follow the policy, i.e. the probability P(x), so we can make a Monte Carlo estimate of log(P(x)) - log(Q(x)) over many samples to get the KL divergence.
I'm still learning and am not fully sure of all the details, so take it as a best guess.
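For what it's worth, here is a small sketch of that sampled estimate (the numbers are made up; the only assumption is that the actions were drawn from the old policy P, which is the case during rollouts): since KL(P || Q) = E_{x~P}[log P(x) - log Q(x)], averaging the log-probability difference over the sampled actions gives a Monte Carlo estimate without summing over the whole action space.

import torch

# Log-probabilities of the SAME sampled actions under the old policy P and the new policy Q.
old_log_probs = torch.tensor([-1.2, -0.7, -2.3])   # log P(x), made-up values
new_log_probs = torch.tensor([-1.4, -0.8, -2.2])   # log Q(x), made-up values

# Monte Carlo estimate of KL(P || Q); individual samples can be negative,
# but the average approaches the true (non-negative) KL as the sample grows.
approx_kl = (old_log_probs - new_log_probs).mean()

The exact KL could also be computed from the full distributions the policy outputs, but the sampled difference is cheaper and only uses quantities the training loop already has on hand.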
Nice one, thank you!
Can we do it from scratch? I'm more interested in learning the maths involved.
Can you share your Jupyter notebook?
Can this implementation work for "computation offloading in edge computing"?
Do you have a video for coding ChatGPT?