Model Based Reinforcement Learning: Policy Iteration, Value Iteration, and Dynamic Programming

  • Published on 26 Sep 2024

Comments • 60

  • @yiyangshao2003
    @yiyangshao2003 2 years ago +2

    This is just awesome, especially for an undergraduate without much prior knowledge of machine learning. Many thanks from a Chinese freshman.

    • @yiyangshao2003
      @yiyangshao2003 2 years ago

      The relationship between different concepts always confuses me, but your video explained it in an explicit diagram, and this really helps me a lot. Feeling really thrilled. Thanks again!

  • @fredflintstone7924
    @fredflintstone7924 4 months ago

    I love the way you explain it through the formulas. Most experts tell you the formula and then go to an actual case, which leaves the learner disconnected from the math. Thanks!

  • @samueldelsol8101
    @samueldelsol8101 7 months ago +1

    Your videos are incredibly well thought out and very educational. I should have known about them sooner. Greetings from Munich, Germany!

  • @kevinchahine7553
    @kevinchahine7553 a month ago

    Thank you for making these videos. I'm learning so much! You are such a great explainer.

  • @ghazal246486
    @ghazal246486 2 years ago +4

    I've watched other lectures on RL before, but I can understand the formulas much better now. The way you explain formulas is brilliant; you're a wonderful math lecturer.

  • @aaroncollinsworth9365
    @aaroncollinsworth9365 2 years ago +1

    I actually feel smarter after watching this. Excellent video on all fronts!

  • @august4633
    @august4633 a year ago +1

    Thank you so much. I've watched a lot of videos and didn't fully get these concepts for some reason. Now I think I finally get it. You're a great teacher.

  • @RasitEvduzen
    @RasitEvduzen 2 years ago +6

    Optimal control, control theory, reinforcement learning, machine learning, system theory, and system identification are an intellectual banquet.

  • @AnnieBhalla-pj9yu
    @AnnieBhalla-pj9yu 2 months ago

    probably the best explanation

  • @adinovitarini6173
    @adinovitarini6173 2 years ago +6

    Thank you, Prof! This video is really helpful for classifying RL methods. I really appreciate your diagram and your explanation.

    • @Eigensteve
      @Eigensteve 2 years ago

      Thanks -- glad it is helpful!

  • @Moonz97
    @Moonz97 2 years ago +8

    Love this series! I hoped the video would go on and on, but it ended too quickly. Can't wait for the next part! Keep up the great work :)

    • @Eigensteve
      @Eigensteve 2 years ago +1

      Thanks so much!!

  • @suri6294
    @suri6294 a year ago

    SUPERBBBBBB! Now I understand every inch of the research paper I was reading. Thanks!!!!

  • @robinwang6399
    @robinwang6399 26 days ago

    The Bellman equation reminds me of sequential games from game theory, where you traverse a game tree and the optimal choice at each branch leads to globally optimal states.

  • @matthewchunk3689
    @matthewchunk3689 2 years ago +1

    This is an excellent companion to your book. Thanks for both!

  • @asier6734
    @asier6734 a year ago

    Very well structured and laid out, clearly explained. Thank you.

  • @huyvuquang2041
    @huyvuquang2041 a year ago +1

    At 3:57, I think the R(s', s, a) function you are referring to is the "reward function", which returns the immediate reward (r) if you are in state (s) and take the action (a) that leads to state (s'). That would make more sense than returning a PROBABILITY of a reward (r) given (s, a, and s'). I saw this in your book too, but cannot find this kind of function anywhere else. In all the other resources I found, this function R means the immediate reward of taking action a in state s and reaching new state s', NOT the "probability of the reward".
    Later in the clip, when you use it in the value function, I also see you use it as a means of measuring the "value of the reward", not the "probability of the reward", so I think this might really be a mistake.
    If I'm getting something wrong, please help me clear up my thinking. I'm just curious.
    Love your great work.
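
The distinction raised in this comment can be made concrete with a small sketch. It assumes a hypothetical two-state, two-action MDP with made-up numbers; the arrays `P` and `R` and the helper `expected_reward` are illustrative names, not anything taken from the video or the book. `R` holds reward values, `P` holds transition probabilities, and the expected immediate reward combines the two.

```python
import numpy as np

# Hypothetical toy MDP; every number here is invented for illustration.
# P[a, s, s2] = probability of landing in state s2 after action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
# R[a, s, s2] = immediate reward (a value, not a probability) for that transition.
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],   # action 0
    [[0.0, 5.0], [0.0, -1.0]],  # action 1
])

def expected_reward(P, R):
    """Return r[a, s] = sum over s2 of P[a, s, s2] * R[a, s, s2]."""
    return np.sum(P * R, axis=2)

r = expected_reward(P, R)
print(r)  # one expected immediate reward per (action, state) pair
```

Under this reading, probabilities enter only through `P`; `R` itself can be any real number, including negative ones, which a probability could not be.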

  • @AliRashidi97
    @AliRashidi97 2 years ago +1

    Thanks a lot, Professor Brunton!
    You're creating great materials!

  • @NaveenKumar-yu3vw
    @NaveenKumar-yu3vw 2 years ago

    Thank you for simplifying a lot of things. I had read the corresponding chapters from the Sutton and Barto book, but I got more clarity on the practical aspects from this video.

  • @micknamens8659
    @micknamens8659 2 years ago +2

    16:55 The value iteration function (VI) differs slightly from Bellman's equation (BE) because VI uses a max over a (hence a single value), whereas BE uses a max over all pi. Because pi is a probabilistic function, i.e. it yields a specific action 'a' with a certain probability, VI would need another level of summation over a, multiplying the terms by pi(s,a).
    20:05 Here we construct pi(s,a) as the argmax of VI. This means we set pi(s, argmax(s))=1 and pi(s, a')=0 for all other actions a' ≠ argmax(s), so pi(s,a) is deterministic instead of probabilistic.
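
The two points in this comment (a max over actions inside the update, then a deterministic policy built from the argmax) can be sketched as follows. This is a minimal tabular value iteration under assumptions: the toy arrays `P` and `R`, the discount `gamma=0.9`, and the function name `value_iteration` are all hypothetical and not taken from the video.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Tabular value iteration; P and R are indexed as [action, state, next_state]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[a, s] = sum_s2 P[a, s, s2] * (R[a, s, s2] + gamma * V[s2])
        Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
        V_new = Q.max(axis=0)              # max over actions, not a sum weighted by pi
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy written as pi[s, a]: 1 for the argmax action, 0 elsewhere.
    Q = np.sum(P * (R + gamma * V[None, None, :]), axis=2)
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), Q.argmax(axis=0)] = 1.0
    return V, pi

# Hypothetical toy MDP with invented numbers.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 5.0], [0.0, -1.0]],
])

V, pi = value_iteration(P, R)
print(V)
print(pi)  # each row is one-hot, i.e. a deterministic policy
```

Writing the greedy policy as a one-hot table keeps the pi(s,a) notation while making the deterministic special case explicit.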

  • @mariogalindoq
    @mariogalindoq 2 years ago +31

    Beautiful. Please continue. Will you explain algorithms like PPO, TD3, DDPG, etc.? If so, I will appreciate each one. Also, it will be very interesting if you can give your opinion on some RL libraries like ray/RLlib, baselines3, etc. I know that this may be much more than what you are thinking of including in this course, but I do not lose anything by suggesting those topics to you :) Thank you.

    • @Eigensteve
      @Eigensteve 2 years ago +17

      Great suggestions! I will think about how to add these in the future. Might need to be in a future filming session, since it might take some time.

    • @superuser8636
      @superuser8636 2 years ago +4

      PPO would be very welcome. Deep RL is big now. Thanks for your videos, Dr. Long-time fan.

    • @cisimon7
      @cisimon7 2 years ago

      Good suggestion, hope we get videos on those soon

  • @nj4089
    @nj4089 2 years ago

    Thank you so much. Really appreciated the explanation at 24:20

  • @samirelzein1095
    @samirelzein1095 2 years ago

    Now I know I'll understand RL well when you explain it!

  • @paaabl0.
    @paaabl0. 2 years ago

    Great and clear explanation, Steve! Thank you.

  • @minapagliaro7607
    @minapagliaro7607 6 months ago

    Great video, thank you for your contribution 🎉

  • @pavybez
    @pavybez 3 months ago

    Hi professor. Thanks for the wonderful videos. I was wondering why you classify actor critic as model based when the model of the environment is not learnt in this algorithm?

  • @danielmilyutin9914
    @danielmilyutin9914 2 years ago +1

    I love the way you present the material.
    I became curious about how you project those formulae onto the screen and are still able to see them.
    Is it a glass screen with a projector beside the camera? Or is it a special screen?

  • @jeroenritmeester73
    @jeroenritmeester73 2 years ago +4

    Hi Steve, could you please add the videos to a playlist to avoid accidentally skipping videos?

    • @Eigensteve
      @Eigensteve 2 years ago +1

      Good call -- just added to playlist

  • @imolafodor4667
    @imolafodor4667 8 months ago

    Thank you for the video. I wonder if there is a value function algorithm of the form V(s,t), i.e. the value of state s at time t?
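
A time-indexed value function of this kind does arise in finite-horizon dynamic programming, where values are computed by backward induction from the final step. The sketch below assumes a hypothetical toy MDP and a horizon of 3 steps; `finite_horizon_values`, `P`, and `R` are illustrative names, and nothing here is taken from the video.

```python
import numpy as np

def finite_horizon_values(P, R, horizon, gamma=1.0):
    """Backward induction; V[t, s] is the best expected return from state s at time t."""
    n_actions, n_states, _ = P.shape
    V = np.zeros((horizon + 1, n_states))        # V[horizon] = 0: nothing left to collect
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in range(horizon - 1, -1, -1):
        # Q[a, s] uses the values one step later in time, V[t + 1].
        Q = np.sum(P * (R + gamma * V[t + 1][None, None, :]), axis=2)
        V[t] = Q.max(axis=0)
        policy[t] = Q.argmax(axis=0)             # the best action may depend on t
    return V, policy

# Hypothetical toy MDP with invented numbers, indexed as [action, state, next_state].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 5.0], [0.0, -1.0]],
])

V, policy = finite_horizon_values(P, R, horizon=3)
print(V)       # row t holds V(s, t) for every state s
print(policy)  # row t holds the greedy action for every state at time t
```

Unlike the infinite-horizon case, no fixed point is needed here: one backward sweep over the time steps is enough.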

  • @metluplast
    @metluplast 2 years ago

    Thanks Professor Steve

  • @mohammadabdollahzadeh268
    @mohammadabdollahzadeh268 a year ago

    Dear Dr. Steve, I have a question.
    Based on what you explained, I think that in value iteration we need to use an optimal algorithm, whereas in policy iteration we don't. Is that right?
    I'm looking forward to hearing from you.
    Sincerely, Mohammad

  • @rishabsingh6933
    @rishabsingh6933 2 years ago

    Amazing Content

  • @h2o11h2o
    @h2o11h2o a year ago

    Thank you

  • @parmachine470
    @parmachine470 2 years ago

    Recursion must be what supplies the reinforcement (feedback) to the value functions and eventually the policy. Otherwise we're flying blind.

  • @mohammadabdollahzadeh268
    @mohammadabdollahzadeh268 a year ago

    Dear Dr. Steve, I have a question.
    I think that in value iteration we need to use an optimal algorithm, whereas in policy iteration we don't. Is that true?

  • @hassannawazish9300
    @hassannawazish9300 2 years ago +3

    Can I find some more detail, or code with an example of the Bellman equation?

    • @RobinCarter
      @RobinCarter 2 years ago +6

      I strongly recommend the book Reinforcement Learning: An Introduction by Sutton and Barto. Also the Winter 2019 online lectures by Stanford (on YouTube). Both have lots of maths and programming exercises.

    • @hassannawazish9300
      @hassannawazish9300 2 years ago

      @RobinCarter Thanks for your reply.

    • @Eigensteve
      @Eigensteve 2 years ago +1

      @RobinCarter Agreed, these are great resources

    • @mariogalindoq
      @mariogalindoq 2 years ago +1

      Let me suggest the book:
      Grokking Deep Reinforcement Learning by Miguel Morales
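
For readers of this thread who want a small coded instance of the Bellman equation, here is a minimal policy iteration sketch. It rests on assumptions: the toy MDP, the discount `gamma=0.9`, and the names `policy_iteration`, `P`, and `R` are hypothetical and do not come from the video or the books recommended above. Policy evaluation solves the linear Bellman equation V = r_pi + gamma * P_pi V exactly, and policy improvement takes the greedy action.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iter=100):
    """Tabular policy iteration; P and R are indexed as [action, state, next_state]."""
    n_actions, n_states, _ = P.shape
    r = np.sum(P * R, axis=2)                  # expected immediate reward r[a, s]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)     # start by always taking action 0
    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi for the current policy.
        P_pi = P[policy, states, :]            # row s is the transition row under policy[s]
        r_pi = r[policy, states]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        Q = r + gamma * (P @ V)                # Q[a, s]
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break                              # policy is stable, so V is optimal
        policy = new_policy
    return V, policy

# Hypothetical toy MDP with invented numbers.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 5.0], [0.0, -1.0]],
])

V, policy = policy_iteration(P, R)
print(V)
print(policy)
```

The linear solve is practical only for small state spaces; for larger problems the evaluation step is usually replaced by repeated Bellman backups.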

  • @lookman_
    @lookman_ a year ago

    Thank you, but why does it always have to be so theoretical? Why can't you show an example, like the tic-tac-toe you mentioned, to explain value iteration?

  • @esmaeelmohammadi4683
    @esmaeelmohammadi4683 2 years ago

    Hi, thank you for this great video. Can you please explain how we can use a model of a system (for example, an LSTM) that predicts the future as a simulator in which to run our reinforcement learning algorithm? Assume I trained an RL algorithm with a model-free approach, but I can't test it in the real environment and need to test it in a simulated one. How can we do this with a model that predicts the future from time-series data?

  • @herb.420
    @herb.420 a year ago

    WOOOOOOOOOO THERE IT IS, TIC TAC TOE HAS BEEN SOLVED

    • @herb.420
      @herb.420 a year ago

      th-cam.com/video/xJR1oTDt1Ak/w-d-xo.html

  • @emmanuelameyaw6806
    @emmanuelameyaw6806 2 years ago

    How many agents can we have in the model?

  • @WhenThoughtsConnect
    @WhenThoughtsConnect 2 years ago

    Implicit Rolle's theorem.

  • @schumzy
    @schumzy 2 years ago

    Interesting. It's funny that model-based learning isn't highly regarded and so maybe not as explored. I get the feeling that this method will turn out to be as important as the data table function in Excel, quietly and matter-of-factly determining a lot of our daily lives. The number of Excel simulation models that impact our daily lives is kind of scary. Think of banks, insurance, etc. back in the '90s and 2000s; think of all the mergers that were run through an "Excel model" and all the go/no-go business decisions determined by Excel models, all based on the data table simulation process. I'm sure model-based deep learning has already taken over a lot of that. The problem is that no one wants to share their business secret sauce, and academia isn't interested in exploring this further. Shame.

  • @nononnomonohjghdgdshrsrhsjgd
    @nononnomonohjghdgdshrsrhsjgd 2 years ago

    wow, you talked 28 minutes and didn't solve any optimization problem with the techniques. I hope you know practically how to apply anything.

  • @mohammadsalah2307
    @mohammadsalah2307 2 years ago

    Could you possibly explain more about "policy iteration and value iteration, leading to the quality function" at 25:40? Specifically, what is "redundant"?
    I believe there is a mistake. Here Q(s, a) and V_\pi(s) seem to have exactly the same form. I still do not understand how this leads to the conclusion that the quality function enables "model-free learning".
    I think the correct formula for Q is :
    Q(\mathbf{s}, \mathbf{a})=\mathbb{E}\left(R\left(\mathbf{s}^{\prime}, \mathbf{s}, \mathbf{a}\right)+\gamma Q\left(\mathbf{s}^{\prime}, a\right)\right)
    By the way, I am also a little confused about what the "model" of the future reward is at 25:10.
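
One common way to write the link between the quality function and the value function, for the optimal case and in the notation of the formula in this comment, is sketched below. The transition probability P(s' | s, a) and the primed action a' are notation assumed here and may not match the video exactly.

```latex
\begin{align}
Q(\mathbf{s}, \mathbf{a})
  &= \mathbb{E}\left( R(\mathbf{s}', \mathbf{s}, \mathbf{a}) + \gamma V(\mathbf{s}') \right)
   = \sum_{\mathbf{s}'} P(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})
     \left( R(\mathbf{s}', \mathbf{s}, \mathbf{a}) + \gamma V(\mathbf{s}') \right), \\
V(\mathbf{s}) &= \max_{\mathbf{a}} Q(\mathbf{s}, \mathbf{a}), \qquad
\pi(\mathbf{s}) = \operatorname*{argmax}_{\mathbf{a}} Q(\mathbf{s}, \mathbf{a}), \\
Q(\mathbf{s}, \mathbf{a})
  &= \mathbb{E}\left( R(\mathbf{s}', \mathbf{s}, \mathbf{a})
     + \gamma \max_{\mathbf{a}'} Q(\mathbf{s}', \mathbf{a}') \right).
\end{align}
```

Written this way, V depends only on the state while Q also depends on the action, so the two are not the same object. Once Q is known, choosing the best action needs only the argmax over a, with no reference to P or R, which is why Q can be estimated from sampled transitions without a model.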

  • @cuongnguyenuc1776
    @cuongnguyenuc1776 a year ago

    Thanks for the lecture. Are value iteration and policy iteration also forms of temporal difference learning?

  • @jimklm3560
    @jimklm3560 2 years ago

    At 8:20, shouldn't we have considered all the possible states s1=s' that we could end up in when we follow a policy π?

  • @frankdelahue9761
    @frankdelahue9761 2 years ago

    I am too dumb to understand this.

  • @azadarashhamn
    @azadarashhamn 2 years ago

    Another great work. Thanks again.