It's cool to see a different workflow. Thank you.
This shit went from 0 to 100 real fast
Great series. Thank you!!!
initialize_random_policy does not need to assign a random value for the action, since it serves no purpose, at least in its current use in calculate_greedy_policy: the value is replaced by the best_action_value result anyway. By the way, very good job explaining the subjects.
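Rough sketch of what I mean (the names and the return shape of best_action_value are my guesses, not the video's exact code):

    import random

    ACTIONS = ["up", "down", "left", "right"]

    def initialize_random_policy(states):
        # picking a random action per state is enough; no value needs to be stored here
        return {s: random.choice(ACTIONS) for s in states}

    def calculate_greedy_policy(states, V, best_action_value):
        policy = initialize_random_policy(states)
        for s in states:
            # whatever was stored above gets overwritten here anyway
            best_action, _ = best_action_value(V, s)
            policy[s] = best_action
        return policy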
Thank you for doing such an amazing tutorial!
Great videos. Thanks for doing them :)
thank you for explanation
Super clear, Thanks a lot!
I'm really taking it very slowly going through your videos. Thank you for doing a great job addressing the exact points a newbie who is math-illiterate needs explained.
Having said that, I have one issue with the adapted Bellman equation: you replaced V(s') with the sum over s' of P(s,a,s') * V(s'). I get that part. But shouldn't you also attach a probability to the R(s,a) term?
Two reasons why I'm saying that:
1) Your step 5) in the Value Iteration Algorithm says: sum of all possible rewards MULTIPLIED BY THEIR PROBABILITIES.
2) Your best_action_value function also weights the rewards by their probabilities, and does not just use the deterministic reward from taking a particular action.
The short answer is that it is already included. These formulas are recursive, meaning that each square's value is determined by the values of the squares around it. The reward term R(s, a) is only active in the squares that have a reward. Take the princess for example: that state has a value of 1. Now move to the square to its left. There is no reward for being there; there is, however, a reward for moving to the right. That move's reward is already included in the calculation, because the princess square is one of the s' whose V(s') we sum over, weighted by its probability. In the same way, every square's value calculation includes these possibilities by incorporating all of its possible moves' V(s'). Hope this helps.
Note: The same logic applies to the -1 reward square.
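To make that concrete, here is a rough sketch of the stochastic backup. The transitions format and the function shape are my assumptions, not the video's exact code, but the point is that the probability p multiplies both the immediate reward and the looked-up V(s'):

    # transitions[s][a] -> list of (p, s_next, r) triples (assumed format)
    def best_action_value(V, s, transitions, gamma=1.0):
        best_action, best_value = None, None
        for a, outcomes in transitions[s].items():
            # expected reward and expected next value, both weighted by p;
            # at non-reward squares r is 0, so the +1 arrives only through V[s_next]
            q = sum(p * (r + gamma * V[s_next]) for p, s_next, r in outcomes)
            if best_value is None or q > best_value:
                best_action, best_value = a, q
        return best_action, best_value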
Yeah, he made a mistake. At 4:10, it should say "initialize a table V of value estimates for each gray square to 0, the princess square to 1, and the lava square to -1". Then, everything else is correct.
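In code that initialization is just something like this (the square names are placeholders):

    V = {s: 0.0 for s in gray_squares}
    V[princess_square] = 1.0   # goal square
    V[lava_square] = -1.0      # penalty square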
good one, ty!
I suggest you write 6) as: "sum of looked-up values V[s'] multiplied by their probabilities, for each possible s'". The idea is to show that you are doing something similar in both 5) and 6).
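Something like this, so that 5) and 6) visibly mirror each other (variable names are only illustrative):

    expected_reward = sum(p * r     for p, s2, r in outcomes)  # step 5
    expected_value  = sum(p * V[s2] for p, s2, r in outcomes)  # step 6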
If you feel the video misses a numerical example before jumping into the code, you can take a look at this: th-cam.com/video/l87rgLg90HI/w-d-xo.html
Great!! Keep on!!
Do you mean that we can use a recursive approach (dynamic programming) to find the value of all states, or that we can find the value of all states by iteration?
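That is, is it essentially a sweep like this, repeated until the values stop changing, rather than literal recursive calls? (Just my guess at the structure.)

    while True:
        delta = 0.0
        for s in states:
            _, new_v = best_action_value(V, s)   # assuming it returns (action, value)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < 1e-6:   # stop once the values have converged
            break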
It's very on point. I like learning from you; please make more videos. The course link is not working for me.
Siraj is the worst. 10 points from Gryffindor.
Exactly. He just acts cool, but in reality he's the worst, since he merely reads the slides. Even a grade 5 student can read slides out loud.
You mean Slytherin?
THANKS!
Hi, how can I get the code for this tutorial (Dynamic Programming Tutorial for Reinforcement Learning)?
link below?
th-cam.com/video/DiAtV7SneRE/w-d-xo.html
I don't understand much of this.
This video is very confusing, unlike the previous two videos on this subject, which were graphical and easy to understand.
I agree. Try this for a better understanding, it helped me a lot:
th-cam.com/play/PLQyWwjpavAmGrpyfnR28Kqeq_VV2xeV00.html