Ok, I will indulge your quiz-time questions since your videos are really great!
Question 1: A is correct. It would not learn at all, since the target policy is the policy we are trying to learn. Fixing it would mean it never changes, so it stays random, and therefore we are not learning.
Question 2: I'm not completely sure, but I would say B is correct, since SARSA uses its target policy both to choose the action and to "look" at its follow-up state (by taking the action in that state according to the same target policy). A quick sketch of the two update rules is below.
Hope more people comment so the algorithm boosts your channel!
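(Not from the video, just my own minimal sketch to back up the Question 2 reasoning: SARSA's bootstrap term uses the action the epsilon-greedy policy will actually take in the next state, while Q-learning's bootstrap takes a greedy max regardless of which action the behavior policy executes. Function names and hyperparameters here are illustrative only.)

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    # Behavior policy: random action with probability epsilon, else greedy.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: bootstrap on a_next, the action the policy actually takes.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstrap on the greedy max, whatever the behavior did.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```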
Ding ding ding! You have been paying attention :) Also thanks a ton for indulging me here. I am trying new ways to make sure this content is engaging and educational at the same time. So the more people like yourself that participate, the more I see the value in this content.
@@CodeEmporium I'm taking a course on RL at the moment which is quite disorganized; your content definitely helps a ton with understanding!
@@CodeEmporium I love quiz time! It felt best when professors would quiz us on topics so I could re-engage.
I just applied for a ML research position as a mech e freshman, and as part of the interview process I have to present a paper on the underlying algorithms. This helped SO MUCH in my understanding, even as someone without any practical experience or knowledge in the field, so thank you and great job :)
Great video. Would like to point out a mistake at 13:59 where you talk about ON policy but the heading says "Off Policy". I think that needs correction.
Also would love to see content on multi-agent reinforcement learning and Decision Transformers.
If you are talking about the heading in the algorithm, it is correctly labeled off-policy. The screenshot is from the textbook cited in the description.
And yea. Still scoping out the best concepts to do here in the reinforcement learning playlist! Thanks for the suggestion!
@@CodeEmporium No, I meant in the summary slide, bullet No. 6 (the last bullet point).
Amazing Video, thank you!
Do we really update the Q-value function at the exploration step in the SARSA method? It seems we should skip this update, since we take a random step while exploring.
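(My understanding, not from the video: in the textbook SARSA pseudocode the update runs on every step, including the ones where epsilon-greedy picks a random action; that is exactly what makes it on-policy. A rough sketch of the loop, with a hypothetical tabular env exposing reset() and step():)

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    # env is a hypothetical tabular environment with
    # reset() -> state and step(action) -> (next_state, reward, done).
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def pick(s):
        # Epsilon-greedy: sometimes exploratory, sometimes greedy.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()
        a = pick(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = pick(s2)
            # The update runs on every step, even when the chosen action
            # was a random exploratory one; no step is skipped.
            target = r + gamma * (0.0 if done else Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```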
Where is the normalization term for the state probability in off-policy algorithms?
QT-1: The "target policy" is supposed to learn from the exploratory actions taken by the "behavior policy" so that its Q-values are set right. If the target policy were set to be random instead of greedy, then there is no learning at all. Hence the answer should be the first option: the agent does not learn at all.
Is Soft Actor-Critic value-based or policy-based, or neither, since you mentioned the RL categories are only value-function and policy-based? Could you clarify whether there are actually three categories in RL: value-function, policy-based, and actor-critic?
Nice video, well explained. Question: why would I use one or the other? Are there advantages or disadvantages?
Great video, thanks!
Thanks for the video! ☺
You are very welcome :)
Great stuff
Good video! There is a small typo on the summary page about on-policy.
thank you
I think I found an error in the summary: you wrote "Off Policy RL Algorithms" twice. Apart from that, thanks so much for the video, it helped me a lot.
Well explained!
Very nice video man
well explained brother
Question 1: Option A is correct, since the target policy will never be stable and the Q-values will change randomly, resulting in no learning.
Question 2: Option B.
Thank you so much dude
C
C