It's cool to see a different workflow. Thank you.
initialize_random_policy does not need to assign a random action, since that serves no purpose in the current use inside calculate_greedy_policy: the value is replaced by the best_action_value result anyway. By the way, very good job explaining the subjects.
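Roughly what I mean, as a minimal sketch (the bodies and signatures below are my guesses at the structure, not the actual code from the video, and ACTIONS is just a placeholder):

```python
import random

ACTIONS = ["up", "down", "left", "right"]  # assumed action set

def initialize_random_policy(states):
    # The random choice here is harmless but unused:
    # calculate_greedy_policy overwrites every entry anyway.
    return {s: random.choice(ACTIONS) for s in states}

def calculate_greedy_policy(states, V, best_action_value):
    policy = initialize_random_policy(states)
    for s in states:
        # Whatever action was randomly assigned above is replaced
        # unconditionally with the best_action_value result.
        _, best_action = best_action_value(V, s)
        policy[s] = best_action
    return policy
```

So initializing the policy to any fixed action (or even to None) would work just as well.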
Great series. Thank you!!!
This shit went from 0 to 100 real fast
Thank you for doing such an amazing tutorial!
I suggest you write 6) as "sum of looked-up values V[s'] multiplied by their probabilities, for each possible s'". The idea is to show that you are doing something similar in both 5) and 6).
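In code, the parallel between 5) and 6) could look like this sketch; the (probability, reward, next_state) tuple format is only an assumption for illustration, not the video's data layout:

```python
def expected_return(V, transitions, gamma=0.9):
    # transitions: list of (probability, reward, next_state) tuples
    # 5) sum of all possible rewards multiplied by their probabilities
    expected_reward = sum(p * r for p, r, s_next in transitions)
    # 6) sum of looked-up values V[s'] multiplied by their probabilities
    expected_next_value = sum(p * V[s_next] for p, r, s_next in transitions)
    return expected_reward + gamma * expected_next_value
```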
Hi, how can I get the code for this tutorial, Dynamic Programming Tutorial for Reinforcement Learning?
Great videos. Thanks for doing them :)
Super clear, thanks a lot!
Thank you for the explanation.
I'm really taking it very slowly going through your videos. Thank you for doing a great job addressing the exact points that a newbie who is math-illiterate needs explained.
Having said that, I have one issue with the adapted Bellman equation: you replaced V(s') with the sum over s' of P(s, a, s') * V(s'). I get that part. But should you not also attach a probability to the R(s, a) term?
Two reasons I'm saying that:
1) your step 5) in the Value Iteration Algorithm: sum of all possible rewards MULTIPLIED BY THEIR PROBABILITIES;
2) your best_action_value function also weights the rewards by their probabilities, and does not just use the deterministic reward from taking a particular action.
The short answer is that it is already included. These formulas are recursive, meaning that each square's value is determined by the values of the squares around it. The reward term R(s, a) is only active in the squares which have a reward. Take the princess square, for example. That state has a value of 1. Now let's move to the square to its left. There is no reward for being here. There is, however, a reward for moving to the right. But this move's reward is included in the calculations, since one of the V(s') terms IS the square to the right. Therefore, the reward and its probability for this square are already present in the equation. In the same way, each square's value calculation will include this possibility by incorporating V(s') for all of its possible moves. Hope this helps.
Note: The same logic applies to the -1 reward square.
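To make the propagation argument concrete, here is a tiny self-contained sketch: a 1-D corridor instead of the video's grid, with deterministic moves (so every transition probability is 1). It is only an illustration, not the video's code.

```python
# Corridor of 5 squares: square 0 is "lava" (value -1), square 4 is the
# "princess" (value +1), squares 1-3 are gray. The +1/-1 never shows up
# as a separate R(s, a) term for the gray squares; it reaches them only
# through the V[s'] of their neighbours.
GAMMA = 0.9
V = [-1.0, 0.0, 0.0, 0.0, 1.0]
TERMINAL = {0, 4}

for sweep in range(100):
    for s in range(len(V)):
        if s in TERMINAL:
            continue  # terminal squares keep their fixed values
        # Bellman backup over the two possible moves (left, right)
        V[s] = max(GAMMA * V[s_next] for s_next in (s - 1, s + 1))

print([round(v, 3) for v in V])
# -> [-1.0, 0.729, 0.81, 0.9, 1.0]: every gray square's value reflects
#    the princess reward purely via its neighbours' values.
```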
Yeah, he made a mistake. At 4:10, it should say "initialize a table V of value estimates for each gray square to 0, the princess square to 1, and the lava square to -1". Then, everything else is correct.
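In code, the corrected initialization would be something like the snippet below; the grid size and the princess/lava positions are made up for illustration, not taken from the video.

```python
# Hypothetical 4x3 grid with one princess square and one lava square.
PRINCESS, LAVA = (0, 3), (1, 3)

V = {}
for row in range(3):
    for col in range(4):
        if (row, col) == PRINCESS:
            V[(row, col)] = 1.0    # princess square starts at +1
        elif (row, col) == LAVA:
            V[(row, col)] = -1.0   # lava square starts at -1
        else:
            V[(row, col)] = 0.0    # every gray square starts at 0
```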
Do you mean that we can use a recursive approach (dynamic programming) to find the value of all states, or that we can find the value of all states by iteration?
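For what it's worth, the two views coincide in value iteration: the definition of V(s) is recursive (it is written in terms of V(s')), but it is computed by iterating sweeps until the values stop changing, since literal recursion would not terminate on a grid with cycles. A minimal sketch, where actions_for is a hypothetical helper rather than anything from the video:

```python
def value_iteration(states, actions_for, gamma=0.9, tol=1e-6):
    # actions_for(s) is assumed to return, for each action available in s,
    # a list of (probability, reward, next_state) tuples.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            action_outcomes = actions_for(s)
            if not action_outcomes:
                continue  # terminal state: keep its current value
            # Recursive definition, evaluated by repeated sweeps:
            # V(s) = max_a sum_s' p(s'|s,a) * (r + gamma * V(s'))
            new_v = max(
                sum(p * (r + gamma * V[s_next]) for p, r, s_next in outcomes)
                for outcomes in action_outcomes
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```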
Great!! Keep on!!
If you feel the video misses a numerical example before jumping into the code, you can take a look at this: th-cam.com/video/l87rgLg90HI/w-d-xo.html
good one, ty!
It's very on point. I like learning from you, so please make more videos. The course link is not working for me.
Siraj is the worst. 10 points from Gryffindor.
Exactly. He just acts cool, but in reality he is the worst, as he merely reads the slides. Even a grade 5 student can read slides out loud.
You mean Slytherin?
Link below?
th-cam.com/video/DiAtV7SneRE/w-d-xo.html
THANKS!
I don't understand much of this.
This video is very confusing, far from the previous two videos on this subject, which were graphical and easy to understand.
I agree. Try this for a better understanding; it helped me a lot:
th-cam.com/play/PLQyWwjpavAmGrpyfnR28Kqeq_VV2xeV00.html