Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions

  • Published on 3 Dec 2024

Comments • 31

  • @whatsinthepapers6112
    @whatsinthepapers6112 5 years ago +19

    Not going to lie - was fooled up until magnetic chess board! Can't put anything past Schmidhuber

  • @herp_derpingson
    @herp_derpingson 5 years ago +28

    Academics now have to use meme knowledge and tactics to get their papers noticed. What a time to be alive.

  • @ronen300
    @ronen300 3 years ago +3

    One of the funniest 3 minutes in the field! I was seriously laughing out loud 😂

  • @michael-nef
    @michael-nef 5 years ago +15

    starting strong, upside down characters in an academic paper. high tier memer

    • @michael-nef
      @michael-nef 5 years ago +4

      @Dmitry Akimov Lighten up a bit, these people just want recognition for their work and using catchy titles and more light-hearted introductions draws attention. It's not really their fault when it's what they're incentivized to do, something something reward-action.

    • @herp_derpingson
      @herp_derpingson 5 years ago

      @dmitry I don't think it's going to happen. There are so many research papers; if you want to get noticed, you need to stand out.

    • @michael-nef
      @michael-nef 5 years ago +3

      @Dmitry Akimov ok boomer

  • @CyberneticOrganism01
    @CyberneticOrganism01 2 years ago +1

    interesting new perspective on how to do RL ☺️

  • @CosmiaNebula
    @CosmiaNebula 4 years ago +6

    skip to 4:08 if you don't want memes

  • @richardwebb797
    @richardwebb797 4 years ago +1

    If you have 2 actions A and B, and you explore / train an input of desired reward 0 to produce action A, how does that help you do the right thing with an input desired reward 1 (select action B)?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      I guess ideally you would learn both, or at least recognize that you now want a different reward, so you should probably do a different action

    • @richardwebb797
      @richardwebb797 4 years ago

      @YannicKilcher possible to explain in more concrete terms? The idea is to sample actions better than randomly, but it seems hand-wavy to say that optimizing a probability distribution given one input will make the output distribution for another input good. Then again, I guess that's exactly what a neural net tries to do
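
As I understand the UDRL setup (this is my own illustration, not something stated in the thread), the reason training at one desired reward can still inform another is hindsight relabeling: every stored trajectory is labeled with the return it actually achieved, so the behavior function is fit on many (state, desired return, action) triples and generalizes between commands the way any supervised model interpolates between inputs. A minimal sketch, assuming a discrete action space and simplifying the command to the desired return only; BehaviorFunction, hindsight_batch and train_step are hypothetical names of mine, not from the paper or the video:

```python
import torch
import torch.nn as nn

class BehaviorFunction(nn.Module):
    # Hypothetical name: maps (observation, desired return) to action logits.
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, desired_return):
        x = torch.cat([obs, desired_return.unsqueeze(-1)], dim=-1)
        return self.net(x)  # action logits

def hindsight_batch(obs, actions, rewards):
    # Relabel one episode: the command for each step becomes the return
    # actually achieved from that step onward, not the command used while acting.
    returns_to_go = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    return obs, returns_to_go, actions

def train_step(f, optimizer, obs, actions, rewards):
    # obs: [T, obs_dim] float, actions: [T] long, rewards: [T] float
    obs, achieved, acts = hindsight_batch(obs, actions, rewards)
    logits = f(obs, achieved)                         # [T, n_actions]
    loss = nn.functional.cross_entropy(logits, acts)  # pull f(s, achieved) towards a
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

On this reading, even if exploration only ever asked for reward 0, the relabeled data contains whatever returns the episodes actually achieved, and querying the network with desired reward 1 is ordinary supervised interpolation over that data.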

  • @softerseltzer
    @softerseltzer 3 years ago

    Thank you for the video!
    One thing I don't understand, though, is why the first paper says that you must use RNNs for non-deterministic environments, yet in the experiments paper they just stack a few frames for the VizDoom example without any RNNs.

  • @foobar1231
    @foobar1231 5 years ago +2

    Sorry if something is wrong, I'm not a specialist in RL.
    It is a kind of dynamic programming: the agent remembers its previous experience (the command) and acts according to observation and experience. Experience comes from the episodes (positive and negative, they are like feelers). The longer an episode (more steps), the bigger the horizon. So, calculate the mean reward from the episodes and demand a little bit more (one standard deviation more). What does it mean to demand more? As I understood it, keep developing only the successful episodes further, and cut the negative ones.

    • @quickdudley
      @quickdudley 4 years ago +1

      Let's call the agent f, the observations s, the reward r, the demand d, and actions a. At each step of experience generation a = f(s,d). Then later once the reward is known f is updated such that f(s,r) is pulled towards a.
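
A minimal sketch of the loop described in this thread, assuming a behavior function f with the (observation, desired return) to action-logits interface from the sketch further above: act with a = f(s, d), later pull f(s, r) towards the action actually taken, and set the next exploratory demand to the mean return of the best recent episodes plus one standard deviation. Function names are mine, not from the paper.

```python
import numpy as np
import torch
import torch.nn as nn

def act(f, s, d):
    # a = f(s, d): sample an action from the behavior function, conditioned on
    # the current observation s and the demanded return d.
    logits = f(torch.as_tensor(s, dtype=torch.float32),
               torch.as_tensor(d, dtype=torch.float32))
    return torch.distributions.Categorical(logits=logits).sample().item()

def supervised_update(f, optimizer, s, achieved_r, a):
    # Once the achieved reward is known, pull f(s, achieved_r) towards the
    # action a that was actually taken (a classification-style update).
    logits = f(torch.as_tensor(s, dtype=torch.float32),
               torch.as_tensor(achieved_r, dtype=torch.float32))
    loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([a]))
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def next_command(recent_returns, k=10):
    # "Demand a little bit more": mean of the k best recent episode returns
    # plus one standard deviation.
    best = np.sort(np.asarray(recent_returns, dtype=np.float64))[-k:]
    return float(best.mean() + best.std())
```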

  • @robosergTV
    @robosergTV 5 years ago +1

    what a great video, thanks!

  • @NanachiinAbyss
    @NanachiinAbyss 5 years ago +1

    Can't you do the same by simply adding some logic to the function where the actions are chosen?
    If you have a network that outputs expected values, you can just choose actions whose expected value matches what you want.

    • @YannicKilcher
      @YannicKilcher  5 years ago

      The value function has a hard-coded horizon (until the end of the episode), whereas UDRL can deal with any horizon.
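
To make the horizon point concrete: in UDRL the command is a (desired return, desired horizon) pair, and both are just extra inputs to the network, so the horizon can be varied per query, whereas a learned value function is tied to one fixed notion of horizon. A small hedged sketch under those assumptions; the class name is my own.

```python
import torch
import torch.nn as nn

class CommandConditionedPolicy(nn.Module):
    # Hypothetical name. The command (desired return, desired horizon) is simply
    # concatenated to the observation, so the same network can be asked for
    # "return R within the next H steps" for any H; a Q-network, by contrast,
    # estimates return for one fixed horizon (e.g. until the episode ends).
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, desired_return, desired_horizon):
        cmd = torch.stack([desired_return, desired_horizon], dim=-1)
        return self.net(torch.cat([obs, cmd], dim=-1))  # action logits

policy = CommandConditionedPolicy(obs_dim=4, n_actions=2)
# "Get me 5 reward in the next 10 steps" vs "1 reward in the next 2 steps":
logits_long = policy(torch.zeros(4), torch.tensor(5.0), torch.tensor(10.0))
logits_short = policy(torch.zeros(4), torch.tensor(1.0), torch.tensor(2.0))
```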

  • @justinlloyd3
    @justinlloyd3 1 year ago

    during the first few minutes I am like "hmm I don't think that's gonna work" LOL

  • @scottmiller2591
    @scottmiller2591 5 years ago +7

    My cursor, hovering, hovering over the downvote icon - "This guy totally neither read nor understood the paper..." Finally, he says "Just kidding!" and actually reviews the paper.

  • @DeepGamingAI
    @DeepGamingAI 5 years ago +4

    Pronounced "Lara"?

  • @snippletrap
    @snippletrap 4 years ago +3

    Negative 5 billion billion trillion is a pretty bad reward.

  • @jonathanballoch
    @jonathanballoch 3 years ago

    This is just a generalization of goal-conditioned imitation learning, no?

    • @patf9770
      @patf9770 3 years ago

      Or maybe that's just a special case of ⅂ꓤ ;)

  • @ambujmittal6824
    @ambujmittal6824 5 years ago

    Hi, can you do a video on Capsule networks also? Thank you :)
    Btw, I love your videos.

    • @DanieleMarchei
      @DanieleMarchei 5 years ago +2

      he already did it ^^
      th-cam.com/video/nXGHJTtFYRU/w-d-xo.html