Policy and Value Iteration

  • Published on Dec 4, 2024

Comments • 82

  • @TheClockmister
    @TheClockmister 1 year ago +24

    My bald teacher will talk about this for 2 hours and I won’t understand anything. This helps a lot

    • @Moch117
      @Moch117 9 months ago

      lmfaooo

    • @fa7234
      @fa7234 17 days ago

      does your bald teacher's name start with Charles Isbell

  • @kiranmurphy3887
    @kiranmurphy3887 3 years ago +15

    Great video! Walking through the first few iterations of VI on a gridworld problem helped me to understand the algorithm much better!

  • @studyaccount9662
    @studyaccount9662 1 year ago +32

    this is a better explanation than my teacher from MIT, thanks

    • @allantourin
      @allantourin 6 months ago +3

      you're not from MIT lol

    • @rickkar6789
      @rickkar6789 2 months ago

      @@allantourin cool deduction, but to be fair they only said their teacher's from MIT haha

  • @harbaapkabaap2040
    @harbaapkabaap2040 8 months ago +1

    Best video on the topic I have seen so far, to the point and well explained! Kudos to you brother!

  • @furkanbaldir
    @furkanbaldir 3 years ago +3

    I searched many, many times to find this solution, and finally I found it. Thank you.

  • @parul821
    @parul821 2 years ago +8

    Can you provide an example of policy iteration too?

  • @quentinquarantino8261
    @quentinquarantino8261 2 months ago +2

    Doesn't R(s,a,s') actually mean ending up in s' by choosing action a while being in s? So why is this not the same as being in state s'?

  • @anishreddyanam8617
    @anishreddyanam8617 10 months ago

    Thank you so much! My professor explained this part a bit too fast so I got confused, but this makes a lot of sense!

  • @Joseph-kd9tx
    @Joseph-kd9tx 1 month ago

    For the value update equation, wouldn't it be simpler to take the R(s,a,s') out of both the sum and the argmax? Given that R(s,a,s') would equal just R(s) in this case? So it would be R(s) + argmax(sum(P*gamma*V(s'))). The sum of P over all possible next states always equals 1, so this R(s)/R(s,a,s') term would be the same whether it sits inside or outside.
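
For reference, the algebra behind the comment above, assuming the reward really does depend only on the current state (so R(s,a,s') = R(s)):

$$V_{k+1}(s) \;=\; \max_a \sum_{s'} P(s'\mid s,a)\,\bigl[R(s,a,s') + \gamma V_k(s')\bigr] \;=\; R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\,V_k(s')$$

since the transition probabilities sum to 1 for every action. (The value update itself uses the max; the argmax is what extracts the policy.)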

  • @kkyars
    @kkyars 1 year ago +6

    For v1, would the two terminal states not be 0.8, since you have to multiply by the probability to get the expected value?

    • @aidan6957
      @aidan6957 5 months ago

      Remember, we're taking the sum, and as all the probabilities add to 1, we get 0.8(1) + 0.1(1) + 0.1(1) = 1; same for -1

    • @kkyars
      @kkyars 5 months ago

      @@aidan6957 thanks, can you send me the timestamp since I no longer have it?

    • @physixaid4334
      @physixaid4334 3 months ago

      @@kkyars 9:03 I think

  • @jackdoughty6457
    @jackdoughty6457 2 months ago

    For v2 why would (2,3)=0 if there is still a small chance we go right towards -1? Wouldn't (2,3)=-0.09 in this case?

  • @jlopezll
    @jlopezll 7 months ago +1

    9:06 Why, when iterating v2, are the values of all the other squares 0? Shouldn't the squares near the terminal states have non-zero values?

    • @alexwasdreaming9440
      @alexwasdreaming9440 2 months ago

      I believe it's because not moving is a valid move, otherwise I feel you are right

    • @Joseph-kd9tx
      @Joseph-kd9tx 1 month ago +1

      I know why. When evaluating any state adjacent to the -1 terminal state, the argmax will always prefer the action that yields 0 rather than -1. Thus it stays at 0. The argmax is choosing the action that goes directly away from the -1 state so that there's no chance in hell that it could land there, even if it slips.
      However, there is an interesting case where a state adjacent to a -1 would update: if the state is sandwiched between two -1 terminal states. In this case, no matter what action you take, there is a chance of slipping into one of the negative states, and it would therefore update negatively.

    • @NehaKariya-d1f
      @NehaKariya-d1f 19 days ago

      @@Joseph-kd9tx Thanks for clarifying!
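
A worked check of the explanation above, assuming the layout used in the 0.43/0.52 computation further down the thread ((4,3) is the +1 terminal, (4,2) the -1 terminal, (2,2) the wall, discount 0.9, living reward 0). For the square next to the -1 terminal, the two candidate actions in v2 score

$$\text{toward } +1:\;\; 0.8\,[0.9 \cdot 0] + 0.1\,[0.9 \cdot 0] + 0.1\,[0.9 \cdot (-1)] = -0.09, \qquad \text{away (into the wall)}:\;\; 0.8\,[0.9 \cdot 0] + 0.1\,[0.9 \cdot 0] + 0.1\,[0.9 \cdot 0] = 0,$$

so the max over actions keeps the value at 0, exactly as described.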

  • @sanchitagarwal8764
    @sanchitagarwal8764 6 days ago

    Excellent explanation sir

  • @abdullah.montasheri
    @abdullah.montasheri 8 months ago

    The state-value function Bellman equation includes the policy's action probability at the beginning of the equation, which you did not consider in your equation. Any reason why?

  • @imreezan
    @imreezan 27 days ago

    Why is V2 0.72 and not 0.8? The reward for moving right from (3,3) is supposed to be +1, right? And V(s') is supposed to be 0, since there will be no value once we are in that state, since it is terminal. So V2 is supposed to be 0.8, right?
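
A short worked line, following the convention the other threads here use (living reward 0, the +1 sitting in the terminal square's value, γ = 0.9): the 0.8 gets discounted, which is where 0.72 comes from.

$$V_2(3,3) = 0.8\,[0 + 0.9 \cdot 1] + 0.1\,[0 + 0.9 \cdot 0] + 0.1\,[0 + 0.9 \cdot 0] = 0.72$$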

  • @nwudochikaeze6309
    @nwudochikaeze6309 1 year ago +2

    Please can you explain how you got the 0.78 in V3?

    • @ThinhTran-ys4mr
      @ThinhTran-ys4mr 1 year ago

      do you understand :( if yes, please explain it to me

    • @2010mhkhan
      @2010mhkhan 3 months ago +2

      if we are at point (3,3), the optimal path is to go to (4,3). There is a 0.8 probability of this occurring, hence the value update is 0.8*[0+0.9(1)] (same as before). Now there is a probability of 0.1 of moving up, hence: 0.1*[0+0.9(0.72)]; and a 0.1 chance of moving down: 0.1*[0+0.9(0)]. Add these up and you will get ~0.78!
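
Written out as a single update with the same numbers as the reply above:

$$V_3(3,3) = 0.8\,[0 + 0.9 \cdot 1] + 0.1\,[0 + 0.9 \cdot 0.72] + 0.1\,[0 + 0.9 \cdot 0] = 0.72 + 0.0648 + 0 \approx 0.78$$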

  • @keyoorabhyankar5863
    @keyoorabhyankar5863 5 months ago +1

    Is it just me or are the subscripts in the wrong directions?

  • @kyrohrs
    @kyrohrs 1 year ago

    Great video, but how can we use policy iteration for an MDP when the state space grows considerably with each action? I know there are various methods of approximation for policy iteration, but I just haven't been able to find anything. Do you have any resources on this?

  • @eklavyaattar1810
    @eklavyaattar1810 1 year ago +1

    Why would you substitute the value of +1 into the equation in green? The formula says it should be V(s'), not the reward value!!!

  • @ellyjessy5044
    @ellyjessy5044 1 year ago

    I see the values at V3 are for gamma only, shouldn't they be for gamma squared?

  • @tahmidkhan8132
    @tahmidkhan8132 1 month ago +2

    Like this if you're Todd Neller.

  • @newtondurden
    @newtondurden 1 year ago

    Fantastic video, man. I was so confused for some reason when my lecturer was talking about it; it's not supposed to be hard, I guess, I just couldn't see how exactly it worked. This video helped fill in the details.

  • @SaloniHitendraPatadia
    @SaloniHitendraPatadia 1 year ago

    According to the Bellman equation, I got the value 0.8 * (0.72 + 0.9 * 1) + 0.1 * (0.72 + 0.9 * 0) + 0.1 * (0.72 + 0.9 * 0) = 1.62. Please correct me where I went wrong.

    • @mghaynes24
      @mghaynes24 1 year ago +1

      The living reward is 0, not 0.72. 0.72 is the V at time 2 for grid square (3,3). Use the 0.72 value to update grid squares (2,3) and (3,2) at time step 3.

  • @Leo-di9fq
    @Leo-di9fq 1 year ago

    In the second iteration, V = 0.09?

  • @jemtyjose7088
    @jemtyjose7088 1 year ago

    In V2, why is it that there is no value for (2,3)? Doesn't the presence of -1 give it a value of 0.09? I am confused there.

    • @Leo-di9fq
      @Leo-di9fq 1 year ago

      lol same. any confirmations?

    • @stuartgill6060
      @stuartgill6060 1 year ago

      I think there is a value V for (2,3) at V2-- it is 0. You get that value taking the "left" action and bumping into the wall, thereby avoiding the -1 terminal state. What action could you take that would result in a value of .09?

    • @citricitygo
      @citricitygo 1 year ago

      Remember you are taking the max of the action values. So for (2,3), the max action is to move left, which may result in (2,3), (3,3), or (1,3). The values are all 0.

  • @UsefulArtificialIntelligence
    @UsefulArtificialIntelligence 3 years ago +3

    nice explanation

  • @daved1113
    @daved1113 1 year ago

    Helped me learn it. Thank you.

  • @yottalynn776
    @yottalynn776 2 years ago +2

    Thanks for the video. In v3, how do you get 0.52 and 0.43?

    • @HonduranHunk
      @HonduranHunk 2 years ago +9

      Instead of starting in square (3,3), you start in squares (3,2) and (2,3). After that, you do the same kind of calculation used to get 0.78. The optimal action in square (3,2) is to go up, so the equation looks like: 0.8[0 + 0.9(0.72)] + 0.1[0 + 0.9(0)] + 0.1[0 + 0.9(-1)] = 0.43. The optimal action in square (2,3) is to go right, so the equation looks like: 0.8[0 + 0.9(0.72)] + 0.1[0 + 0.9(0)] + 0.1[0 + 0.9(0)] = 0.52.

    • @ThinhTran-ys4mr
      @ThinhTran-ys4mr 1 year ago +6

      @@HonduranHunk How do we calculate the 0.78? Please help me, sir
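
Pulling the per-square computations in this thread together, here is a minimal value-iteration sketch in Python. The layout (+1 at (4,3), -1 at (4,2), wall at (2,2)), the 0.8/0.1/0.1 noise model, the zero living reward, and gamma = 0.9 are assumptions inferred from the numbers in these comments, not taken from the video itself.

    # A minimal value-iteration sketch for the 4x3 gridworld discussed in this thread.
    # Assumed (not taken from the video): +1 terminal at (4,3), -1 terminal at (4,2),
    # wall at (2,2), living reward 0, discount 0.9, and 0.8/0.1/0.1 action noise.

    GAMMA = 0.9
    P_MAIN, P_SIDE = 0.8, 0.1
    COLS, ROWS = 4, 3
    WALLS = {(2, 2)}
    TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    SIDES = {"up": ("left", "right"), "down": ("left", "right"),
             "left": ("up", "down"), "right": ("up", "down")}

    def move(state, action):
        """Deterministic move; bumping into a wall or the edge leaves the state unchanged."""
        dx, dy = ACTIONS[action]
        nxt = (state[0] + dx, state[1] + dy)
        if nxt in WALLS or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
            return state
        return nxt

    def q_value(V, state, action):
        """Expected discounted value of one noisy action (all rewards are 0 here)."""
        side_a, side_b = SIDES[action]
        outcomes = [(P_MAIN, move(state, action)),
                    (P_SIDE, move(state, side_a)),
                    (P_SIDE, move(state, side_b))]
        return sum(p * GAMMA * V[s2] for p, s2 in outcomes)

    def value_iteration(sweeps):
        states = [(c, r) for c in range(1, COLS + 1) for r in range(1, ROWS + 1)
                  if (c, r) not in WALLS]
        V = {s: 0.0 for s in states}
        V.update(TERMINALS)                  # terminal squares simply hold their reward
        for _ in range(sweeps):
            V_new = dict(V)
            for s in states:
                if s in TERMINALS:
                    continue                 # terminals are never updated
                V_new[s] = max(q_value(V, s, a) for a in ACTIONS)
            V = V_new
        return V

    # The starting values here already match the video's V1 (terminals at +1/-1, rest 0),
    # so two further sweeps correspond to the video's V3.
    V3 = value_iteration(2)
    print(round(V3[(3, 3)], 2))  # ~0.78
    print(round(V3[(2, 3)], 2))  # ~0.52
    print(round(V3[(3, 2)], 2))  # ~0.43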

  • @aymanadam7825
    @aymanadam7825 2 years ago +1

    great video!! thanks!!

  • @anoushkagade8091
    @anoushkagade8091 3 years ago +7

    Hi, thank you for the explanation. Can you please explain how you got 0.78 for (3,3) in the 3rd iteration (V3)? According to the Bellman equation, I got the value 0.8 * (0.72 + 0.9 * 1) + 0.1 * (0.72 + 0.9 * 0) + 0.1 * (0.72 + 0.9 * 0) = 1.62. Please correct me where I went wrong. Assignment due tomorrow :(

    • @ankurparmar5414
      @ankurparmar5414 3 years ago

      +1

    • @anoushkagade8091
      @anoushkagade8091 3 years ago +1

      @@maiadeutsch4424 Thank you so much for the detailed explanation. This was really helpful. I was not considering the agent's own discounted value when going towards the wall and coming back.

    • @donzhu4996
      @donzhu4996 3 years ago +11

      @@maiadeutsch4424 we don't need to multiply by 0.1??

    • @Ishu7287
      @Ishu7287 2 years ago

      @@maiadeutsch4424 hey! nice explanation, but can you tell me if we get a table of the probabilities, like 0.8, 0.1, 0.3, etc., for going right, left, up and vice versa?

    • @vangelismathioudis3891
      @vangelismathioudis3891 2 years ago

      @@maiadeutsch4424 hey there, nice explanation, but for the cases with a 10% chance it should be 0.1*(0 + 0.9*V_previter).

  • @ziki5993
    @ziki5993 2 years ago

    great explanation ! thanks.

  • @gate_da_ai
    @gate_da_ai 2 years ago

    Thank God, get RL videos from an Indian....

  • @sunnygla4323
    @sunnygla4323 3 years ago

    This is helpful, thank you

  • @yjw8958
    @yjw8958 1 year ago +3

    If you also suffer from the vague explanations in GT's ML course, here comes UPenn to rescue you!

    • @711tornado
      @711tornado 8 months ago

      Literally why I'm here. CS7641 has been pretty good so far but the RL section was honestly crap in the lectures IMO.

  • @tirth8309
    @tirth8309 2 months ago +2

    Suyog sir, learn something from here

  • @huachengli1786
    @huachengli1786 3 years ago

    Quick question: at 6:10, is R(s,a,s_prime) always 0 in the example?

    • @cssanchit
      @cssanchit 3 years ago +2

      Yes, it is fixed at zero

    • @Leo-di9fq
      @Leo-di9fq 1 year ago

      @@cssanchit except in terminal states

  • @VIJAYALAKSHMIJ-h2b
    @VIJAYALAKSHMIJ-h2b 7 months ago

    Can't understand how it is 0.52

  • @user-canon031
    @user-canon031 8 months ago

    Good!

  • @stevecarson7031
    @stevecarson7031 3 years ago

    Nice job, thanks

  • @dailyDesi_abhrant
    @dailyDesi_abhrant 2 years ago

    Yes! Finally found such a video! Yay!

  • @tower1990
    @tower1990 1 year ago

    There shouldn’t be any value for the terminal state… my god…

  • @don-ju8ck
    @don-ju8ck 10 months ago

    🙏🙏🏿

  • @prengbiba3474
    @prengbiba3474 3 years ago

    nice

  • @pietjan2409
    @pietjan2409 1 year ago +1

    Seriously, people can't explain this in an easy way. Same for this video

  • @alialho7309
    @alialho7309 3 years ago

    For the first iteration, you do not need to calculate the terminal states and get +1, -1 for them. It's wrong!
    We have things like terminal states in gridworld; use them.

    • @Leo-di9fq
      @Leo-di9fq 1 year ago

      what do you mean?