Monte Carlo Reinforcement Learning Tutorial

  • Published Dec 5, 2024

Comments • 17

  • @GoldenChuricken
    @GoldenChuricken 4 years ago +15

    All of your tutorials are excellent! To avoid confusion, however, I’d just like to mention that it is actually argmax instead of max for the policy at 2:52. The policy maps states to actions. Whereas the value of a state returns a real number, the policy of a state returns an action (hence max for the former, argmax for the latter).
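The max-vs-argmax distinction the comment draws can be sketched in a few lines of Python (the Q-table and state/action names here are made up purely for illustration):

```python
# Hypothetical action-value table Q[state][action], for illustration only.
Q = {"s0": {"left": 1.0, "right": 3.5}}

# The value of a state is the *max* over action values -- a real number.
state_value = max(Q["s0"].values())

# The greedy policy is the *argmax* -- it returns an action, not a number.
policy_action = max(Q["s0"], key=Q["s0"].get)

print(state_value)    # a float
print(policy_action)  # an action label
```

So at 2:52, pi(s) picks the action achieving the maximum, while V(s) is the maximum itself.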

  • @picklerick3136
    @picklerick3136 5 years ago

    Man, this is a great tutorial on reinforcement learning.

  • @paulbrown5839
    @paulbrown5839 3 years ago

    This is nice code, I had a look through it, good job

  • @uthoshantm
    @uthoshantm 4 years ago +4

    3:00 π(s) should use "argmax sub a" instead of "max sub a".

    • @cnbrksnr
      @cnbrksnr 4 years ago

      yeah this confused me

  • @patite3103
    @patite3103 3 years ago

    Great tutorial! Try to make a tutorial with a visualisation of this algorithm by using a very simple example. That would be great!

  • @sambo7734
    @sambo7734 4 years ago

    your tutorials are awesome, thank you! :)

  • @karthik-ex4dm
    @karthik-ex4dm 6 years ago +1

    You say G(t) is the immediate reward received plus the discounted future reward received thereafter, so shouldn't that be G(t) = r_t + gamma*G(t+1) and not G(t) = r_{t+1} + gamma*G(t+1)? @5:15

    • @xiaokejie3456
      @xiaokejie3456 5 years ago +1

      it's just a notation convention; as far as I know, both are okay
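The indexing question is purely a convention: in the common Sutton & Barto notation, r_{t+1} is the reward received after acting at time t, so G_t = r_{t+1} + gamma*G_{t+1}. A minimal sketch of computing returns backwards through an episode (the reward list and gamma value are made-up examples):

```python
def compute_returns(rewards, gamma=0.9):
    """Compute G_t = r_{t+1} + gamma * G_{t+1} for each step, iterating
    backwards; rewards[t] is the reward received after step t."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))
```

For example, `compute_returns([0, 0, 1], gamma=0.5)` gives `[0.25, 0.5, 1.0]`: the single terminal reward is discounted once per step as you move back toward the start of the episode.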

  • @michiuno2238
    @michiuno2238 4 years ago

    Thanks a lot for your tutorials! One conceptual problem I'm having:
    Running the Monte Carlo simulation takes so much longer to come up with approximately the same values as calculating the perfect values with the value iteration algorithm. What exactly is the rationale for having to use Monte Carlo? Why would I not let Mario simply visit all possible fields/states, record the rewards, and then calculate the perfect values with value iteration?

    • @mcohammer909
      @mcohammer909 4 years ago

      Because in certain simulations/environments there are simply too many possible states. Take for instance the "Bomberman" game, where you can have up to 1.5 * 100^168 different possibilities. This is more than the square of the number of atoms in the universe :D The same goes for games like Chess or Go. Therefore, you need a model-free algorithm such as Monte Carlo, which tries to learn by itself without requiring knowledge of all possible states. I hope this clarifies it :)
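To make the model-free point concrete, here is a minimal first-visit Monte Carlo prediction sketch. Note it only ever touches states that appear in sampled episodes; it never enumerates the state space or asks for transition probabilities. The episode format (lists of (state, reward) pairs) and the default gamma are assumptions for illustration, not the tutorial's exact code:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns over sampled episodes.
    Each episode is a list of (state, reward) pairs, where the reward is the
    one received after visiting that state."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # Walk the episode backwards, accumulating the discounted return.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # Overwriting on the backward pass leaves the return of the
            # earliest (first) visit to each state.
            first_visit_return[state] = G
        for state, g in first_visit_return.items():
            returns[state].append(g)
    # V(s) is the sample average of the observed first-visit returns.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

With enough sampled episodes the averages converge to the true values, which is why Monte Carlo pays a speed cost relative to value iteration but never needs the full model.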

  • @Skandar0007
    @Skandar0007 5 years ago

    Thank you! This is amazing and well explained

  • @emmanuellopez3471
    @emmanuellopez3471 6 years ago +3

    *sigh, this is my second time writing this. It is for the best...
    Your videos suck, but I don't think you should stop, on the contrary, you should make more!
    Back when I was in presentation class, they told us NEVER TO READ OFF THE POWER POINT.
    NEVER READ OFF THE POWER POINT.
    When discussing algorithms in your slides, you tell the viewers what's written on the screen. This is really bad. But why is it bad? It's redundant. I could skip ahead 30 seconds, totally ignore what you're saying, and be no worse off if I just read the slides. It's important to be aware of the medium you're presenting in, and to take full advantage of it. For example, do you ever see novels hosted on YouTube? Of course not! Try to remember a moment when someone told a joke or recounted a funny moment, but used too many words and it came out kinda plain. This is like that.
    Solution: Instead of repeating what's on the screen, try to complement it. Use that time to give insight into what the code is really doing. For example:
    Let's say I have to explain this line of code:
    `mul=lambda x, y: x*y`
    I could say "here I set mul equal to a lambda with x and y as input, and the product x*y as output", but that is literally useless. A programmer could read the code and get that themselves, and even worse, a non-programmer wouldn't have any idea what I'm saying.
    Instead I should say, "After I do this, you can use mul to get x times y." That's shorter and sweeter. It lets the programmer know what the function does. Even the non-programmer could probably walk away understanding that they can use mul for something related to multiplication.
    It's important to show multiple perspectives of what you're doing. Giving context to what you're doing is crucial for a good presentation. Another thing I can tell from your videos is that you read off a script — from your static intonation, your weird pauses, and your inconsistent speed. It's OK to use a script, but you shouldn't be completely dependent on it, because it will steal away all your voice's dynamism and remove the context in which you're trying to speak. There are many ways to deal with this.
    My favorite method is to throw away the script and replace it with a set of keywords. The keywords can act as reminders to set the order of your video, but not so rigidly that you feel you're trying to 'catch up'. Another thing I like is iteration: you write a small script and iteratively practice and expand upon it. This is good if you have a very solid purpose. If you're a good writer, you can write a single script and repeat it until it sounds good, but it sounds like you're already doing that and it sounds boring as hell. You can also throw the script away and simply ad-lib the whole thing. I like to rehearse my presentation over and over and write down the keywords and ordering I like.
    Though if I were you, I would try to make 20-second and 5-minute "dummy videos" before making a real one. This would force you to simplify the information in a more holistic way that even a street person could understand.
    Another way to circumvent this is to just write articles on the subject, which from a coding perspective can be even better than video!

    • @aryan_kode
      @aryan_kode 4 years ago +1

      but I liked it

    • @michiuno2238
      @michiuno2238 4 years ago +2

      Your points are valid, but given the choice of having this kind of presentation versus none, I pick this presentation anytime. Because it brings across content that is really valuable for people like myself who have no math background and are looking to learn reinforcement learning. Skowster bridges the gap between all these highly academic papers, which I have no way to connect with, and myself. So, bravo for doing this Skowster, and thank you.

    • @rachelgilyard3430
      @rachelgilyard3430 months ago

      I think you're pretty lousy at giving criticism.