Expected Return - What Drives a Reinforcement Learning Agent in an MDP

  • Published 30 May 2024
  • 💡Enroll to gain access to the full course:
    deeplizard.com/course/rlcpailzrd
    Welcome back to this series on reinforcement learning! In this video, we're going to build on the way we think about the cumulative rewards that an agent receives in a Markov decision process and introduce the important concept of return.
    We'll see that the return is exactly what's driving the agent to make the decisions it makes. We'll also introduce the idea of episodes and talk about episodic tasks vs. continuing tasks.
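    (A short worked form of the return and discounted return appears right after this description.)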
    Sources:
    Reinforcement Learning: An Introduction, Second Edition by Richard S. Sutton and Andrew G. Barto
    incompleteideas.net/book/RLboo...
    Playing Atari with Deep Reinforcement Learning by DeepMind Technologies
    www.cs.toronto.edu/~vmnih/doc...
    🕒🦎 VIDEO SECTIONS 🦎🕒
    00:00 Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources
    00:30 Help deeplizard add video timestamps - See example in the description
    06:18 Collective Intelligence and the DEEPLIZARD HIVEMIND
    💥🦎 DEEPLIZARD COMMUNITY RESOURCES 🦎💥
    👋 Hey, we're Chris and Mandy, the creators of deeplizard!
    👉 Check out the website for more learning material:
    🔗 deeplizard.com
    💻 ENROLL TO GET DOWNLOAD ACCESS TO CODE FILES
    🔗 deeplizard.com/resources
    🧠 Support collective intelligence, join the deeplizard hivemind:
    🔗 deeplizard.com/hivemind
    🧠 Use code DEEPLIZARD at checkout to receive 15% off your first Neurohacker order
    👉 Use your receipt from Neurohacker to get a discount on deeplizard courses
    🔗 neurohacker.com/shop?rfsn=648...
    👀 CHECK OUT OUR VLOG:
    🔗 / deeplizardvlog
    ❤️🦎 Special thanks to the following polymaths of the deeplizard hivemind:
    Tammy
    Mano Prime
    Ling Li
    🚀 Boost collective intelligence by sharing this video on social media!
    👀 Follow deeplizard:
    Our vlog: / deeplizardvlog
    Facebook: / deeplizard
    Instagram: / deeplizard
    Twitter: / deeplizard
    Patreon: / deeplizard
    YouTube: / deeplizard
    🎓 Deep Learning with deeplizard:
    Deep Learning Dictionary - deeplizard.com/course/ddcpailzrd
    Deep Learning Fundamentals - deeplizard.com/course/dlcpailzrd
    Learn TensorFlow - deeplizard.com/course/tfcpailzrd
    Learn PyTorch - deeplizard.com/course/ptcpailzrd
    Natural Language Processing - deeplizard.com/course/txtcpai...
    Reinforcement Learning - deeplizard.com/course/rlcpailzrd
    Generative Adversarial Networks - deeplizard.com/course/gacpailzrd
    🎓 Other Courses:
    DL Fundamentals Classic - deeplizard.com/learn/video/gZ...
    Deep Learning Deployment - deeplizard.com/learn/video/SI...
    Data Science - deeplizard.com/learn/video/d1...
    Trading - deeplizard.com/learn/video/Zp...
    🛒 Check out products deeplizard recommends on Amazon:
    🔗 amazon.com/shop/deeplizard
    🎵 deeplizard uses music by Kevin MacLeod
    🔗 / @incompetech_kmac
    ❤️ Please use the knowledge gained from deeplizard content for good, not evil.
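
For reference, here are the return and the discounted return in the notation the video follows (Sutton & Barto); the only assumptions are bounded rewards and a discount factor γ with 0 ≤ γ < 1:

    Return for an episodic task that ends at final time step T:
    G_t = R_(t+1) + R_(t+2) + R_(t+3) + ... + R_T

    Discounted return for a continuing task:
    G_t = R_(t+1) + γR_(t+2) + γ²R_(t+3) + ... = Σ_(k=0..∞) γ^k R_(t+k+1)

With 0 ≤ γ < 1 and bounded rewards, the infinite sum converges, which is what makes the expected return a usable objective even when the task never ends.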

Comments • 53

  • @deeplizard
    @deeplizard  5 years ago +7

    Check out the corresponding blog and other resources for this video at:
    deeplizard.com/learn/video/a-SnJtmBtyA

  • @muhammadmustaqimbinsamsudi4784
    @muhammadmustaqimbinsamsudi4784 3 years ago +31

    Example of a continuing task: my assignments. After I finish one assignment, there will always be another one.

  • @scottk5083
    @scottk5083 5 years ago +10

    Your definitions of episodic tasks and non-episodic tasks and their relation to reward type are on point. Love the content.

  • @SandwichMitGurke
    @SandwichMitGurke 5 years ago +1

    I am SO glad that I found this series!!

  • @asdfasdfuhf
    @asdfasdfuhf 3 years ago +3

    Clear and straightforward once again, no criticism at all so far.
    I really like how you can access the blog post for each video; I can't stress enough how useful that is!
    The blog posts look absolutely astonishing.

  • @hansrichter5227
    @hansrichter5227 2 years ago +2

    There are many examples of continuing tasks in the context of traffic simulation, like traffic light signal phases for maximizing flow.

  • @ScienceMasterHK
    @ScienceMasterHK 9 months ago +1

    Thank you for the RL series!

  • @tejasarlimatti8420
    @tejasarlimatti8420 5 years ago +2

    Another great one,
    keep them coming!

  • @sarabucci8209
    @sarabucci8209 4 years ago +2

    I don't understand how to calculate the expected return. How does the algorithm know about all the future rewards?

  • @supratikmondal9229
    @supratikmondal9229 5 years ago +1

    Thanks.... These videos were really helpful

  • @tinyentropy
    @tinyentropy 5 years ago +3

    What always confuses me here is the uncertainty about the precise reward values at future steps. I understand that this is why we use the discount factor, but still, what will be the concrete values of the rewards R_(t+k) at future time steps that we sum up here for time step t?

  • @bjarke7886
    @bjarke7886 4 years ago +1

    Question: Increasing the exponent of the factor gamma by 1 for each increase in the time step t seems arbitrary. Could we not use any other decreasing function with values in (0, 1]?

  • @dallasdominguez2224
    @dallasdominguez2224 1 year ago +1

    Another banger 🔥 👌

  • @sidddddddddddddd
    @sidddddddddddddd 2 years ago +1

    The formula at 3:20 seems to be a bit incorrect. Apart from that obvious typo, there seems to be another mistake: G_t should be equal to R_(t+1) + gamma*G_(t+2), because for the time step (t+1) we have already calculated the reward, and hence we need to calculate the cumulative reward ahead of G_(t+1), i.e. starting from time step (t+2).

  • @ProfessionalTycoons
    @ProfessionalTycoons 5 years ago +1

    thank you

  • @mateusbalotin7247
    @mateusbalotin7247 2 years ago

    Thank you!

  • @chukwuka-steveorefo1812
    @chukwuka-steveorefo1812 5 years ago +3

    Another dose of awesomeness and knowledge, thank you! I was wondering if you will be covering multi-armed bandit reinforcement learning and its benefits compared to using a single reinforcement algorithm. Thanks

    • @deeplizard
      @deeplizard  5 years ago +1

      Thanks, Chukwuka! We're not currently planning to cover the multi-armed bandit problem explicitly, but we will be covering the exploration-exploitation trade-off and greedy action selection, which originate from the bandit problem. We'll make use of these concepts in our later discussions on Q-learning and deep Q-learning.

    • @chukwuka-steveorefo1812
      @chukwuka-steveorefo1812 5 years ago +1

      Thanks, that's great! I'm looking forward to more of your awesome videos!

  • @herewego8093
    @herewego8093 1 year ago

    Can someone help me? On some websites such as AI Stack Exchange or Quora, they say that the value function means the "expected return", which roughly means the average return over all episodes, and I need some examples to understand that more deeply. Let's say we are running a testbed of multi-armed bandits (2,000 runs); then what is the value function q(t) and what is Gt? Is Gt the cumulative sum of future rewards from action t on each run, while q(t) is the average cumulative sum of future rewards of action t over the 2,000 runs? Thank you so much.
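
A worked form of the relationship being asked about, following Sutton & Barto's notation (a general note, not specific to the 2,000-run testbed):

    The return is the (discounted) sum of rewards observed on one run:
    G_t = R_(t+1) + γR_(t+2) + γ²R_(t+3) + ...

    A value function is the expected return, i.e. the average of G_t over many runs,
    conditioned on the state or the state-action pair:
    v_π(s)    = E_π[ G_t | S_t = s ]
    q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]

So G_t is the realized return of a single run, and the value function is its expectation, which a sample average over many runs estimates.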

  • @sahand5277
    @sahand5277 5 years ago +8

    In addition to the actual video, I love the theme music at the end! Can you tell me the name of it?

    • @deeplizard
      @deeplizard  5 years ago +3

      Thanks, Sahand! The song is called Thinking Music by Kevin MacLeod.

    • @Alchemist10241
      @Alchemist10241 2 years ago

      And she has a very nice voice

  • @hazzaldo
    @hazzaldo 5 years ago +2

    Many thanks for the great video. I have one question. In the formula displaying how returns at successive time steps are related to each other, I noticed that R_(t+3) occurs twice. Is this a typo? Second question: I didn't understand how, in the third formula at the bottom, we changed it to γG_(t+1).

    • @deeplizard
      @deeplizard  5 years ago +2

      You're welcome, hazzaldo. For your first question, yes, that's a typo. Nice catch!
      For your second question, check out how Gt is defined a couple of lines above. Then think about how you would define G(t+1) given this definition of Gt. You should then see how G(t+1) is equal to the right-hand side (after γ) of the line above.

    • @hazzaldo
      @hazzaldo 5 years ago

      @@deeplizard Many thanks for the quick response. Makes sense :)
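
A minimal worked version of the step described in the reply above, using only the video's definition of the return:

    G_t = R_(t+1) + γR_(t+2) + γ²R_(t+3) + ...
        = R_(t+1) + γ( R_(t+2) + γR_(t+3) + ... )
        = R_(t+1) + γG_(t+1)

Everything after the first reward is just the return starting one step later, scaled by γ.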

  • @manuelkarner8746
    @manuelkarner8746 3 years ago +1

    {
    "question": "Why do we need to discount the expected return of rewards ?",
    "choices": [
    "because for continuing tasks the return itself could be infinite",
    "because what is happening later is not realy important",
    "this is due to the vanishing gradient problem",
    "because we deal with episodic tasks "
    ],
    "answer": "because for continuing tasks the return itself could be infinite",
    "creator": "Hivemind",
    "creationDate": "2020-12-18T13:36:55.041Z"
    }

    • @deeplizard
      @deeplizard  3 years ago +1

      Thanks, Manuel! Just added your question to deeplizard.com/learn/video/a-SnJtmBtyA :)
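
A short bound that makes the quiz answer concrete: if every reward satisfies |R| ≤ R_max and the discount factor satisfies 0 ≤ γ < 1, then the discounted return of a continuing task is bounded by a geometric series:

    |G_t| ≤ R_max + γR_max + γ²R_max + ... = R_max / (1 - γ)

so the return stays finite even though the sequence of rewards never ends.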

  • @alivecoding4995
    @alivecoding4995 7 months ago

    I still haven’t understood how this formula is of help when we cannot anticipate the specific future rewards without simulating all the various tasks that might lead us there.

  • @anupamkarn1009
    @anupamkarn1009 3 years ago +2

    Is the process of improving maths skills a continuing task?

  • @louerleseigneur4532
    @louerleseigneur4532 4 years ago

    Thanks

  • @kapilvishwakarma3250
    @kapilvishwakarma3250 5 years ago +2

    The Google Chrome dinosaur-and-cactus game could be a continuing task.

  • @shivamarora2712
    @shivamarora2712 4 years ago +1

    {
    "question": "In continuing tasks, How the expected return is finite even though the series of rewards is infinite?",
    "choices": [
    "Each reward in the series starting from the first one is multiplied by corresponding increasing powers of discount factor called gamma which lies between 0 and 1 (power starts from 0 for the first reward and so on), which allows the sum of the series of discounted rewards to converge.",
    "Agent just focuses on maximizing the immediate reward.",
    "Agent just maximizes the expected return of first n rewards in the inifinite series where n is an arbitrarily chosen number.",
    "Agent maximizes the return just like it does for an episodic task."
    ],
    "answer": "Each reward in the series starting from the first one is multiplied by corresponding increasing powers of discount factor called gamma which lies between 0 and 1 (power starts from 0 for the first reward and so on), which allows the sum of the series of discounted rewards to converge.",
    "creator": "Shivam Arora",
    "creationDate": "2020-02-23T13:13:22.623Z"
    }

    • @deeplizard
      @deeplizard  4 years ago

      Thanks, Shivam! Just added your question to deeplizard.com/learn/video/a-SnJtmBtyA :)
      I modified the wording just slightly. Great question!

  • @fordatageeks
    @fordatageeks 5 years ago +2

    A continuing task could be robot localization and fulfilling tasks in an environment? Or a stock prediction game. Not sure.

    • @deeplizard
      @deeplizard  5 years ago +1

      You've got it, Sanni!

  • @pranavdave6973
    @pranavdave6973 4 years ago

    Balancing a pole is a continuing task, right?

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 5 years ago +1

    Rather than a discrete number of states and rewards, what if you have a continuous case? Can that still be mathematically tractable to solve?

    • @deeplizard
      @deeplizard  5 years ago +2

      Hey James - Yes, there is a technique called "discretization" for solving continuous-state MDPs. It has its drawbacks though. Check out section 4, starting at the bottom of page 8 of Andrew Ng's lecture notes, for full detail: cs229.stanford.edu/notes/cs229-notes12.pdf
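
A minimal sketch of the discretization idea mentioned above, assuming a 1-D continuous state in [0, 1] split into a fixed number of equal-width bins (the bin count and state range are illustrative, not taken from the lecture notes):

    import numpy as np

    # Split the continuous state space [0, 1] into 10 equal-width bins.
    n_bins = 10
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # interior bin edges

    def discretize(state: float) -> int:
        # Map a continuous state to a discrete bin index in [0, n_bins - 1].
        return int(np.digitize(state, bin_edges))

    print(discretize(0.37))  # -> 3: the state 0.37 falls into the fourth bin

Once states are mapped to bin indices, the usual finite-MDP machinery (tables indexed by discrete state) applies, at the cost of approximation error near bin boundaries.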

  • @Bjarkediedrage
    @Bjarkediedrage 3 years ago +1

    What does the discounted return of rewards mean? I understood everything up until the word discounted was introduced. What's the difference between reward and discounted reward?
    Edit: Okay, I think I get it? This might be totally wrong^^ But instead of accumulating rewards and optimizing that (totalReward += newReward), you would maybe do (totalReward = totalReward * 0.95 + newReward), preventing totalReward from reaching infinity.

    • @quonxinquonyi8570
      @quonxinquonyi8570 2 years ago

      Very intuitive... convergence of the infinite series will only happen if the common ratio is less than 1; otherwise the series will always diverge.
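
A tiny numeric sketch of the two update rules discussed in this thread (the constant reward of 1.0 and γ = 0.95 are illustrative assumptions):

    # A continuing task that hands out a reward of 1.0 on every step.
    gamma = 0.95
    undiscounted = 0.0
    discounted = 0.0

    for step in range(10_000):
        reward = 1.0
        undiscounted += reward                    # grows without bound
        discounted = discounted * gamma + reward  # settles near 1 / (1 - gamma) = 20

    print(undiscounted)  # 10000.0
    print(discounted)    # ~20.0

The undiscounted total keeps growing with the number of steps, while the discounted running total converges toward 1 / (1 - γ), matching the geometric-series point above.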

  • @sarvagyagupta1744
    @sarvagyagupta1744 4 years ago +1

    Generating your own song could be a continuing task.

  • @tingnews7273
    @tingnews7273 5 years ago

    It's a bit hard for me. I watched twice and read the post.
    What I learned:
    1. What the return at t is: the reward from t to T.
    2. What episodic and continuing tasks are.
    3. The discounted reward.
    These are my questions:
    1. Episodic vs. continuing is not that clear to me based on the video example. The robot arm task can be divided into episodes, I thought. Is it that T is finite for episodic tasks and infinite for continuing ones?
    2. If T is infinite (continuing), the return is infinite too. So is the learning pointless (whatever action the agent chooses will lead to an infinite return)?
    3. Is the discount introduced just for infinite T?
    4. When we discount the reward, is the future reward the same? More formally: the first equation is Gt = Rt+1 + Rt+2 + ... + RT, and the second is Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... Is the Rt+2 in the first equation equal to γRt+2 or to Rt+2?
    5. If the answer to question 4 is Rt+2 = Rt+2, do we introduce the discounted reward just because we want to?

    • @aleksx05
      @aleksx05 5 years ago

      Hi,
      this is what I understood (I might be wrong, so correct me guys if that's the case).
      The discount takes place only when T is infinite. When the task is episodic we know the outcome (the best result we can achieve; for a Pong game it is to win a point, not the final score). But suppose the Pong game plays an endless number of points: to define the return G when there are no limits, and to make the agent understand it's on the right track, taking the right decision now and not doing it 100 steps later, we introduce the discount. Meaning, the reward closest to the current step has much more influence than the reward we'll get 5 steps later, because we know which is the best move at R_(t+1), and before getting the best result at R_(t+2) we need to go through R_(t+1).

    • @tingnews7273
      @tingnews7273 5 years ago

      @@aleksx05 After the next video, I think that besides an infinite T, an unknown T also fits. An episodic game may end after 3 or 90 episodes. Learning is like playing a game without a map: along the learning progress you gain more understanding.

  • @monirzaman337
    @monirzaman337 2 years ago

    I think at 4:59, the last term should be R_{t+4} instead of R_{t+3}.

    • @deeplizard
      @deeplizard  2 years ago

      You're right :D
      It's corrected in the corresponding blog:
      deeplizard.com/learn/video/a-SnJtmBtyA

  • @JustinMasayda
    @JustinMasayda 1 year ago

    4:19 You say "the discounted return," but don't you mean, "the discount rate times the discounted return," or, "the discounted discounted return?"

  • @aussietalks
    @aussietalks 5 years ago +3

    Driving an autonomous car is a continuing task. Right?

  • @rainfeedermusic
    @rainfeedermusic 3 years ago

    Why do we need the discounted return instead of the normal return?

    • @deeplizard
      @deeplizard  3 years ago

      Because future rewards matter less to the agent than current rewards.