Q-Learning: Model Free Reinforcement Learning and Temporal Difference Learning

  • Published May 30, 2024
  • Here we describe Q-learning, which is one of the most popular methods in reinforcement learning. Q-learning is a type of temporal difference learning. We discuss other TD algorithms, such as SARSA, and connections to biological learning through dopamine. Q-learning is also one of the most common frameworks for deep reinforcement learning.
    Citable link for this video: doi.org/10.52843/cassyni.ss11hp
    This is a lecture in a series on reinforcement learning, following the new Chapter 11 from the 2nd edition of our book "Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control" by Brunton and Kutz
    Book Website: databookuw.com
    Book PDF: databookuw.com/databook.pdf
    Amazon: www.amazon.com/Data-Driven-Sc...
    Brunton Website: eigensteve.com
    This video was produced at the University of Washington
  • Science & Technology

Comments • 85

  • @thiagocesarlousadamarsola3990 · 2 years ago +77

    I personally love the big picture perspective that Prof. Brunton always shows. Please, continue to make these high quality videos!

  • @anlehoang7030 · 3 months ago +1

    A more casual example of TD learning:
    Imagine a curious robot exploring a maze, searching for a hidden treasure. Unlike other methods that wait until it finds the treasure to learn, TD learning is all about learning on the fly. It uses what it already knows (like the estimated value of different paths) and immediate feedback (rewards) to improve its predictions about future moves.
    - The robot keeps track of a Q-value, Q(s_t, a_t), for each path, which tells it how good it thinks that path is based on its past experiences.
    - When it takes a path and receives a reward (like finding a clue), it compares the new estimate, r_t + \gamma Q(s_{t + 1}, a_{t + 1}), with what it expected, Q(s_t, a_t). This difference is called the prediction error.
    - If the outcome is better than expected (positive error, i.e. r_t + \gamma Q(s_{t + 1}, a_{t + 1}) - Q(s_t, a_t) > 0), the robot increases the Q-value for that path, making it seem more attractive next time.
    - If the outcome is worse than expected (negative error, i.e. r_t + \gamma Q(s_{t + 1}, a_{t + 1}) - Q(s_t, a_t) < 0), the robot decreases the Q-value, steering it away from less promising paths.
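
    A minimal tabular sketch in Python of the update this comment describes (a SARSA-style TD update on Q-values); the epsilon-greedy helper, the parameter values, and the maze environment it would plug into are illustrative assumptions, not code from the video or the book:

    import random
    from collections import defaultdict

    alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
    Q = defaultdict(float)                  # Q[(state, action)] -> current estimate, defaults to 0

    def choose_action(state, actions):
        # Epsilon-greedy: usually the best-looking path, occasionally a random one.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def td_update(s, a, r, s_next, a_next):
        # Compare what actually happened (r plus the discounted estimate of the next step)
        # with what was expected, Q[(s, a)]; the difference is the prediction error.
        target = r + gamma * Q[(s_next, a_next)]
        error = target - Q[(s, a)]
        Q[(s, a)] += alpha * error          # positive error -> this path looks better next time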

  • @TheSinashah · 4 months ago +1

    CS PhD student here. This video provides such amazing content. Highly recommended.

  • @usonian11 · 2 years ago +10

    Thank you for the outstanding production quality and content of these lectures! I especially enjoy the structure diagram organizing the different RL methods.

  • @davidelicalsi5915 · 1 year ago +11

    Professor I must sincerely thank you for the astonishing quality of this video. You were able to clearly explain an advanced concept without simplifying, going into the details and providing brilliant insights. Also I sincerely thank you for saving my GPA from my R.L. exam 😆

  • @codypappa1667 · 7 days ago

    Great videos. Definitely keep your face in the thumbnails. Even if the channel name doesn't ring a bell, sometimes I find your videos by just the thumbnails and always enjoy them

  • @jashwantraj2987 · 10 months ago +2

    Prof. Brunton, you are amazing. I never expected someone to take so much time to explain a concept about TD. I'm one of the few people who hate reading textbooks to understand concepts; I'd rather watch a video or learn about it in class.
    Thanks a lot

  • @OmerBoehm · 2 years ago +2

    Thank you, dear Prof. Brunton, for this outstanding lecture. The detailed explanations and focus on subtleties are so important. Looking forward to your next videos.

  • @kalimantros845 · 2 years ago +1

    I was hoping that your next video would have been about Q-learning, and here it comes!

  • @areebayubi5469 · 1 year ago +2

    Thank you so much for using very relevant analogies and very clear explanations. I think I have a much better grasp of the concepts behind Temporal Difference learning now.

  • @imanmossavat9383 · 1 year ago +1

    I enjoy your talks. They are very clear and well structured and have the right level of detail. Thank you.

  • @haotianhang3997 · 2 years ago +3

    Thank you! It's a great video. My understanding of TD learning was deepened a lot.

  • @BoltzmannVoid · 2 years ago +3

    This was the best explanation ever!
    Thank you so much, professor!

  • @alirezatavakoli7325 · 21 days ago

    The videos are wonderful! Thank you, professor.

  • @cruise0101 · 1 year ago +1

    Excellent class! Extremely easy to understand!

  • @multiversityx · 2 years ago

    You gave the best explanations I've ever seen!

  • @FRANKONATOR123 · 2 years ago +2

    Hi Prof. Brunton. Great video as always! Please keep producing quality ML content.

  • @bevansmith3210 · 1 year ago +1

    For MC you only get the reward at the end and then divide it up among all the states. But for TD, if you are only taking one step forward, where does the reward come from? A little confused here.
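
    For what it's worth, writing the two targets out side by side makes the difference concrete (this follows the notation used in the comments above, with R_\Sigma assumed to be the total discounted reward collected over the whole episode):

    Monte Carlo:  V^{new}(s_k) = V^{old}(s_k) + (1/n) ( R_\Sigma - V^{old}(s_k) )
    TD(0):        V^{new}(s_k) = V^{old}(s_k) + \alpha ( r_k + \gamma V^{old}(s_{k+1}) - V^{old}(s_k) )

    In TD(0) the reward is just the immediate reward r_k received on the single step actually taken; the rest of the future is approximated by the bootstrap term \gamma V^{old}(s_{k+1}).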

  • @complexobjects · 2 years ago +1

    I do like the description of Q Learning. I had come up with another analogy for why it makes sense. If you took the action of going out to a party, and then happened to make some mistakes while there, we wouldn't want to say "you should never go out again." We'd want to reinforce the action of going out based on the best* possible outcome of that night, not the suboptimal action that was taken once there.

  • @marzs.szzzzz · 1 year ago +3

    These are fantastic lectures. I use these as an alternative explanation to David Silver's DeepMind x UCL 2015 lectures on the same topic; the different perspective really suits how my brain understands RL. Thank you!!

  • @TheFitsome · 1 year ago

    This is the best RL tutorial on the internet.

  • @farhadebrahimzadeh3420 · 1 year ago

    Thank you for the clear picture. It was really well explained and, as others have already mentioned, I can now say that I understand these techniques fairly well. 🙏

  • @denchen1950 · 2 years ago +1

    The video quality is incredible lol, and all the concepts are discussed extremely clearly, OMG!!
    Brilliant masterpiece bro, KEEP GOING!!

  • @krullebolalex · 1 year ago +2

    Thanks a bundle Steve, this was really well explained!

  • @caiolp4 · 2 years ago +6

    Great lecture! Concerning the difference between SARSA and Q-learning, I didn't get the emphasis on Q-learning being better for exploration. In principle, one can choose an epsilon-greedy policy for both methods. As a matter of fact, SARSA is defined in Sutton's book with an epsilon-greedy policy. I get the point that the TD target of Q-learning does not depend on the policy itself and, therefore, it is called an off-policy method. However, if one can choose an exploratory policy (e.g., epsilon-greedy) for both methods, why would SARSA be safer or less exploratory?

    • @duncanw9901 · 2 years ago

      I got the impression he was asserting that the updates to the quality function can (and often will) become an undesirable feedback loop when non-optimized states are used, and I would infer that the training steps done on those states would then have an undesirably high probability of entering such states.
      What you said does seem to be convincing evidence otherwise, though.

    • @jorge-george6958 · 2 years ago

      That is a correct comment. The difference between the two lies in the Q-function updates. The way you choose your action is orthogonal (and can be more or less exploratory in either method). Also, from the video, Q-learning comes off as a "better" method than SARSA, at least in problems where you don't need safe exploration, which is not accurate. It's more like a trade-off, where neither method is clearly better.
      I love your videos in general. I think, though, that this particular one needs a bit of a revision. I hope you don't see this as a critique, but rather as constructive feedback.
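
    To make the distinction discussed in this thread concrete, here is a minimal sketch in Python: both methods can act with the same epsilon-greedy behavior policy, and only the TD target in the update differs (the next action actually chosen for SARSA, the greedy max for Q-learning). The function names and default parameters are illustrative assumptions, not code from the video or the book.

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
        # On-policy target: uses the action a_next that the behavior policy actually selected.
        # Q is a dict-like table of Q-values, e.g. a defaultdict(float) keyed by (state, action).
        target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        # Off-policy target: uses the greedy action at s_next, regardless of what is executed next.
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])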

  • @user-cj5ff8mw6v · 2 years ago

    I really like your descriptions of deep learning.

  • @antimon40 · 2 years ago

    Somehow I find that the explanations given by Prof. Brunton are easier to understand than those provided by video lectures from Stanford (which are also available on YouTube).

  • @mariogalindoq · 2 years ago +3

    Steve: thank you again. I appreciate your work. Trying to help, let me say that I believe there is a small typo: at minute 5:29 you wrote π(s,a) = argmax_a Q(s,a); shouldn't it be written π(s) = argmax_a Q(s,a)? Also, at time 22:35, the first equation has a sum over k; shouldn't it be over n? Anyway, this is a very good video.

  • @prateekcaire4193 · 6 months ago

    Thanks a lot. Not just the math but also the intuition that I was looking for.

  • @r.d.7575 · 2 years ago

    Read the chapter, and I've been waiting for this video for a while. Happy to know I'm the first to comment :) Thanks, Steve. Can you explain more about the bias (and especially the variance) in RL in a later video?

  • @jimlbeaver · 2 years ago +1

    Great explanation... very clear!

  • @maria4880 · 5 months ago

    Thank you so much for these lectures sir!

  • @nahuelpiguillem2949 · 1 year ago

    Thanks so much, Steve! I couldn't understand any of this before this video. Thanks again.

  • @stuartferguson11 · 9 months ago

    This whole series was good, but this one pushed me past my confusion. My neural network finally learned to play tic-tac-toe!

  • @ajaykumar-rh2gz · 2 years ago +4

    Hi Steve, thanks for the amazing lecture! I think the main challenge in RL is designing our own custom environment (multiple states and actions). It would be a great help if you could upload a lecture or suggest some links on how to do this. Other comments are also welcome. Currently, I am doing some experiments on retail pricing optimization using offline data. Looking forward.

  • @jicabe577 · 1 year ago

    Thanks a lot, Prof. Brunton!

  • @XandreClementsmith · 1 year ago

    In the Monte Carlo method, you have the reward discounted by gamma. Why do you discount the reward function for the entire episode?
    Furthermore, wouldn't (1/n) R then be the average reward for any step k?

  • @anirudhthatipelli8765 · 1 year ago

    Thanks for the fantastic explanation!

  • @Julsten3107 · 6 months ago

    Such a great video, thank you!

  • @preston748159263 · 1 year ago

    I would like to know what is being used to project the equations. It is apparently not video editing because he points directly to them.

  • @JustinMasayda · 1 year ago

    One thing that seems to be either an error or just inconsistent notation is the use of TD(N) to mean N-step TD. It seems like the value in the parentheses is supposed to be the value of lambda, not the number of steps of TD. TD(0) apparently should be read as "N-step TD when lambda = 0," while TD(1) means "N-step TD when lambda = 1." I'm basing this off of the book "Reinforcement Learning: An Introduction, Second Edition" by Sutton and Barto.
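
    For reference, a sketch of the two notions in (roughly) Sutton and Barto's notation: the n-step return bootstraps after n steps, and the λ-return is a geometrically weighted average of all the n-step returns,

    G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n})
    G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}

    so λ = 0 keeps only the one-step return (ordinary one-step TD), while λ = 1 (for an episodic task) recovers the full Monte Carlo return. On that reading, the quantity in parentheses in TD(λ) is λ itself, and TD with an n-step lookahead is usually written as n-step TD rather than TD(n).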

  • @Throwingness · 2 years ago

    Loud and clear.

  • @David-nw6rz · 2 years ago +1

    Great lecture, but I guess your definition of on/off-policy is different from the definition of Sutton/Barto. On-policy doesn't necessarily mean you always take the optimal action. "[on-policy] learns action values not for the optimal policy, but for a near-optimal policy that still explores" [excerpt from a different chapter but also valid for TD].
    SARSA usually still follows an epsilon-greedy strategy.

  • @miminh98 · 1 year ago +1

    Hello :) Thank you for the video. I have a small question. I don't fully understand the difference between TD(0) and SARSA. If SARSA uses the optimal action 'a' at each time step 'k', doesn't the Q-function in SARSA equal the value function in TD(0)? Or is there an error in my understanding? Can you please help me see more clearly? :)

  • @sunaxes · 1 year ago

    For TD(0), why not say we do alpha times the new value (at s_{k+1}) plus (1 - alpha) times the old value (at s_k)? It's a very basic update method...
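
    That is close to what the update does: the TD(0) rule can be rearranged into exactly such a convex combination, as long as the "new value" is taken to be the bootstrapped target r_k + \gamma V^{old}(s_{k+1}) rather than V^{old}(s_{k+1}) alone:

    V^{new}(s_k) = V^{old}(s_k) + \alpha ( r_k + \gamma V^{old}(s_{k+1}) - V^{old}(s_k) )
                 = (1 - \alpha) V^{old}(s_k) + \alpha ( r_k + \gamma V^{old}(s_{k+1}) )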

  • @anupamadhikari139 · 1 year ago

    Can you explain how off-policy Q-learning can sometimes take a suboptimal route if you are always taking the max of the actions presented?

  • @polinagrinko1678 · 4 months ago

    brilliant explanation

  • @martinschulze5399 · 2 years ago

    Being a PhD student of yours must be a gift :D

  • @yuktikaura · 1 year ago

    Very well explained.

  • @somethingirreversib · 2 years ago

    Great lecture!

  • @vamsimanoharreddy1468 · 9 months ago

    Excellent explanation.

  • @Chetan_Hansraj · 1 year ago

    Wow, thank you, so well explained with a lot of patience. God bless.

  • @chuanjiang6931 · 10 months ago

    One question: in terms of updating the Q-function using the observed (real) reward at state k + 1, how do we know the observed (real) reward at state k + 1, since it is one timestep in the future?

  • @JustinMasayda · 1 year ago

    22:35 As someone else mentioned, the first equation has a sum over k, shouldn't it be over n?

  • @farhadebrahimzadeh3420 · 1 year ago

    A question pops up in my head: if SARSA is an on-policy method, is it OK to use an epsilon-greedy algorithm in SARSA? As you mentioned, it always takes the safe, on-policy action rather than random, off-policy actions.

  • @dbracale · 1 year ago

    Nice talk as always. A question: at minute 5:00, does pi depend on a? It should not, right? The same holds for the previous video.

  • @tubege · 2 years ago

    Pi of s,a should only be a function of s since the RHS calculates a. What am I missing?

  • @titaniumsheepdog · 1 year ago

    What is the reward r_k if we don't receive a reward after every action? Is it just assumed to be zero, so that the value is based only on the quality function?

  • @rev0cdevs38 · 2 years ago

    I cannot access the new chapter in the 2nd edition. Has anybody accessed the link?

  • @BipinOli90 · 1 year ago

    Brilliant! Thanks 🙏

  • @faqeerhasnain · 4 months ago

    Invaluable content. Can't thank you enough.

  • @Pedritox0953 · 2 years ago

    Great video! A simple example would be nice.

  • @stevenchiu8560 · 1 year ago

    Could you please explain it with some examples? That would be really helpful for understanding these formulas. Thanks!

  • @chymoney1 · 2 years ago

    Great stuff

  • @51nibbler · 1 year ago

    Thank you for the great explanation. Greetings from Switzerland :)

  • @npr1m991 · 1 year ago +1

    This is amazing content.
    I just have a question (I still struggle with the concept of on- and off-policy)... at 30:18, the max Q (in Q-learning): which Q is it, the old or the new?

    • @eugeneL_N1E104 · 10 months ago

      The old one, but evaluated at the next state $s_{k+1}$.

  • @linyidai9076 · 1 year ago

    Helped a lot with my AI course final!!!!

  • @samirelzein1095 · 2 years ago

    top prof!

  • @tearistovic · 4 months ago

    Thank youuu !

  • @user-jg4mh6hb2g · 1 year ago

    You are the best:))))))

  • @dam-ib9fs · 1 year ago

    very useful

  • @dominic_lee · 1 year ago

    wow, nice

  • @--JYM-Rescuing-SS-Minnow · 2 years ago

    👍

  • @Throwingness · 2 years ago

    A+

  • @dihancheng952 · 4 months ago

    I don't think the comparison of Q-learning and SARSA is accurate.

  • @mayfields5092 · 2 years ago

    Isn't Q-learning learning to play Master Yi?

  • @djsocialanxiety1664 · 2 months ago

    Too much talk, too few examples.

  • @robert-dr8569 · 1 year ago +4

    Instead of using so many words to explain, why couldn't you just use a couple of examples to explain the relationship between Q(s, a) new vs Q(s', a')? It would be so easy to understand through examples.

    • @sharannagarajan4089 · 4 months ago

      It’s a theoretical formula.

    • @stevewu1920 · 4 months ago +7

      Instead of using so many words to complain, why couldn’t you just make a video illustrating these yourself?

    • @romxpl4885 · 2 months ago

      @stevewu1920 worst take ever