Colin, great tutorial. Can you explain how the new policy probs are different from the old policy probs? The new policy is given the same observations and actions taken, and since at the onset of training the old and new policy are the same neural net, how do we get an update? My score of np.exp(new_log_probs - old_log_probs) is 1 because the policies are the same, the update is nonzero initially only due to the entropy bonus. Do I need a target network similar to DDQN? Thanks for making these btw, they are solid.
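A minimal, self-contained sketch of why the update still works (this is not the code from the video; the tiny Gaussian policy, layer sizes, and dummy rollout batch below are made up for illustration): the old log-probs are computed once and detached, and the same network is then optimized for several PPO epochs on that batch, so the ratio moves away from 1 right after the first optimizer step. No DDQN-style target network is needed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny Gaussian policy over a 1-D continuous action.
class Policy(nn.Module):
    def __init__(self):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
        self.log_std = nn.Parameter(torch.zeros(1))

    def log_prob(self, states, actions):
        dist = torch.distributions.Normal(self.mu(states), self.log_std.exp())
        return dist.log_prob(actions).sum(dim=-1)

policy = Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Dummy rollout batch (would come from the environment in practice).
states = torch.randn(64, 4)
actions = torch.randn(64, 1)
advantages = torch.randn(64)

# Frozen snapshot of the behaviour policy's log-probs, detached from the graph.
old_log_probs = policy.log_prob(states, actions).detach()

clip_eps = 0.2
for epoch in range(4):
    new_log_probs = policy.log_prob(states, actions)    # recomputed every epoch
    ratio = torch.exp(new_log_probs - old_log_probs)    # == 1 only before the first step
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    loss = -torch.min(surr1, surr2).mean()

    optimizer.zero_grad()
    loss.backward()   # gradient through `ratio` is nonzero even while ratio == 1
    optimizer.step()  # after this step the policy differs from the old snapshot
    print(epoch, ratio.mean().item())
```

Note that even while the ratio equals 1, the gradient of the surrogate with respect to the policy parameters is already nonzero (it is proportional to advantage times grad log-prob), so the very first step is a real policy-gradient update, not just the entropy bonus.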
Hi! It's a nice video. If I have continuous action values ranging from 50 to 150, which activation function should I use in the actor network's output layer, and how do I sample actions in that range from its probability distribution?
I think you can still use the tanh activation: scale its output by half the range of your action values and add a bias. The bias shifts the tanh output to the mean of the action range!
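A rough sketch of that suggestion for the 50 to 150 case (the network sizes and the initial std below are assumptions, not code from the video): a tanh head produces a value in [-1, 1], which is mapped to the action range with low + (tanh + 1) / 2 * (high - low), and the action is then sampled from a Normal centred on that mean and clipped back into range.

```python
import torch
import torch.nn as nn

low, high = 50.0, 150.0

class Actor(nn.Module):
    def __init__(self, obs_dim=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),        # output in [-1, 1]
        )
        # Learned std, initialised so exp(2.3) ~= 10, roughly 10% of the range.
        self.log_std = nn.Parameter(torch.full((1,), 2.3))

    def forward(self, obs):
        mean = low + (self.body(obs) + 1.0) / 2.0 * (high - low)  # map [-1, 1] -> [50, 150]
        return torch.distributions.Normal(mean, self.log_std.exp())

actor = Actor()
obs = torch.randn(1, 8)                    # dummy observation
dist = actor(obs)
action = dist.sample().clamp(low, high)    # keep the sample inside the valid range
log_prob = dist.log_prob(action)           # used later in the policy-gradient loss
print(action.item(), log_prob.item())
```

Clipping the sample is a simplification; squashing the sample itself through tanh before rescaling is another common design choice.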
I rarely write comments but I really like your tutorials, as a RL beginner I didn't find anyone explaining it as simple and clear as you. Thank you!
This tutorial series is great. Linear, concise, clear. Thank you so much
This explanation helps me a lot! Thank you!
Thank you for the nice explanations and video. It is useful. I hope your videos about ML & Data Science will continue.
Thanks for the video. I just started looking into RL and this helped me solve OpenAI's mountain car in continuous action space.
Very clear explanation. Thank you so much.
This is a great tutorial. Thanks for the talk.
why use obscure libraries like ptan in your code? It just makes it frustrating to work with...
Sir, can you share a book or any other resource links to learn more about continuous action space RL concepts? Please!
Very good explanation, please follow up with more algorithms like DDPG and TD3.
Nice explanation, but you could mention Maxim Lapan, since you take all the code from him.
Thanks for this tutorial!
Is the tutorial for a stochastic policy or for continuous action spaces?
Is AC still the leading algorithm for tasks such as self-driving?
Very good man!
Great lecture. Is there a paper that studies what you introduced?
The code he uses is from a book "Deep reinforcement learning hands on" by Maxim Lapan
Really helpful! Thx a lot.
When is the next video coming?
Thanks!!!
U r awesome !!!!
Thanks!!
Moo
Thanks!!