An introduction to Policy Gradient methods - Deep Reinforcement Learning

  • Published Feb 3, 2025

Comments • 196

  • @paulstevenconyngham7880
    @paulstevenconyngham7880 6 years ago +218

    This is the best explanation of PPO on the net hands down

  • @Alex-gc2vo
    @Alex-gc2vo 6 years ago +10

    Easily the best explanation of PPO I've ever seen. Most papers and lectures get too tangled up in the probabilistic principles and advanced mathematical derivations and completely lose sight of what these models are doing in high-level terms.

  • @bigdreams5554
    @bigdreams5554 2 years ago +3

    This guy actually knows what he's talking about. Excellent video.

  • @DavidSaintloth
    @DavidSaintloth 6 years ago +74

    I actually understood your explanation cover to cover on first view and thought the 19 minutes felt more like 5.
    Outstanding work.

    • @oguretsagressive
      @oguretsagressive 6 years ago +5

      One view, no pauses?! Not trying to be mean, but how can you be sure you've truly understood and weren't conquering Mount Stupid the whole time?

  • @maloxi1472
    @maloxi1472 4 years ago +8

    The value you provide in these videos is insane!
    Thank you very much for guiding our learning process ;)

  • @arkoraa
    @arkoraa 6 years ago +18

    I'm loving this RL series. Keep it up!

  • @sarahjamal86
    @sarahjamal86 5 years ago +4

    As someone who is working in the RL field... you did a very good job.

  • @tyson96
    @tyson96 1 year ago

    Explained so well, and it was intuitive as well. I learnt more from this video than from all the articles I found on the internet. Great job.

  • @alializadeh8095
    @alializadeh8095 6 years ago +4

    Amazing! This was the best explanation of PPO I have seen so far

  • @akshatagrawal819
    @akshatagrawal819 5 years ago +325

    He is actually much better than Siraj Raval.

    • @oracletrading3000
      @oracletrading3000 5 years ago +4

      @@ahmadayazamin3313 what kind of scandal?

    • @oracletrading3000
      @oracletrading3000 5 years ago +1

      @@ahmadayazamin3313 I don't know; just watch one or two of his videos demonstrating RL for trading.

    • @DeanRKern
      @DeanRKern 4 years ago +1

      He seems to know what he's talking about.

    • @joirnpettersen
      @joirnpettersen 4 years ago +11

      He (Siraj) never explained any of what he was saying, and that was why I stopped watching him. He just rushed through the context and the results, explaining nothing.

    • @revimfadli4666
      @revimfadli4666 4 years ago +5

      @@oracletrading3000 'How to predict the stock market in 5 minutes'? More like how to expose oneself as a fraud and end a career in that time.

  • @yuktikaura
    @yuktikaura 1 year ago

    Keep it up. Brevity is the soul of wit; it is indeed a skill to summarize the crux of a concept in such a lucid way!

  • @4.0.4
    @4.0.4 6 years ago +2

    Thank you for including links for learning more on the description.

  • @ColinSkow
    @ColinSkow 6 years ago +3

    Great breakdown of PPO. You've simplified a lot of complex concepts to make them understandable! Hahaha... and you can't beat an octopus slap!!!

  • @BoltronRacingTeam
    @BoltronRacingTeam 2 years ago +1

    Excellent video! Wonderful resource for anyone participating in AWS DeepRacer competitions.

  • @BDEvans
    @BDEvans 4 years ago

    By far the best explanation on YouTube.

  • @berin4427
    @berin4427 4 years ago

    Fantastic review of policy gradients, and PPO as well! Best place for a refresh

  • @MShahbazKharal
    @MShahbazKharal 4 years ago

    It is a long video, no doubt, but once you finish watching it you realize it was much better than actually reading the paper. Thanks, man!

  • @Fireblazer41
    @Fireblazer41 5 years ago +1

    Thank you so much for this video! This is way more insightful and intuitive than simply reading the papers!

  • @xiguo2783
    @xiguo2783 5 years ago

    Great explanation with enough details! Thumbs up for all the free knowledge on the internet!

  • @scienceofart9121
    @scienceofart9121 4 years ago +2

    I watched this video more than 5 times, and this was the best video about PPO. Thank you for making great videos like this and keep up the good work. P.S.: Your explanation was even simpler than that of the algorithm's creator, Schulman.

  • @EddieSmolansky
    @EddieSmolansky 6 years ago +2

    This video was very well done, I definitely got a lot of value out of it. Thank you for your work!

  • @zeyudeng3223
    @zeyudeng3223 5 years ago +1

    I watched all your videos today, great work! Love them!

  • @Samuel-wl4fw
    @Samuel-wl4fw 3 years ago

    Coming back to this after thoroughly understanding Q-learning and looking into the advantage function in another network, this explanation is FAST. I wonder who could understand all that is happening without background knowledge.

    • @bigdreams5554
      @bigdreams5554 2 years ago

      Well, for AI/ML some background info is needed. If you're taking multivariable calculus, it's assumed you know calculus already. For those who already work in machine learning, this video is amazing. If I didn't get something I could research what he's talking about, because he's using the proper technical terms, not dumbing it down. It's a wake-up call for what I need to know to be knowledgeable. Great video.

  • @m33pr0r
    @m33pr0r 4 years ago +2

    Thank you for such a clear explanation. I was able to watch this at 2x speed and follow everything, which is a testament to your clarity. It really helped that you tied PPO back to previous work (e.g. TRPO).

  • @Rnjeazy
    @Rnjeazy 6 years ago +2

    Dude, your channel is awesome! So glad I found it!

  • @Խչո
    @Խչո 1 year ago

    Wonderful, this is the first video I've seen on this channel. I suspect it won't be the last!

  • @jeremydesmond638
    @jeremydesmond638 5 years ago +2

    Third video in a row. Really enjoy your work. Keep it up! And thank you!!!

  • @curumo_curunir
    @curumo_curunir 2 years ago

    Thank you for the video, it is very helpful. The key concepts have been explained in just 20min, bravo. I would like to see more videos from your channel. Thank you.

  • @fktudiablo9579
    @fktudiablo9579 4 years ago

    One of the best overviews of PPO; clean.

  • @DavidCH12345
    @DavidCH12345 5 years ago +1

    I love how you take the formula apart and look at it step by step. Great work!

  • @cherguioussama1611
    @cherguioussama1611 4 years ago

    best explanation of PPO I've found. Thanks

  • @labreynth
    @labreynth 5 months ago

    This topic is so far from my comprehension, and yet you got me to understand it within 3 minutes

  • @Navhkrin
    @Navhkrin 5 years ago +2

    12:19
    The min operator also steps in to undo a bad previous policy update:
    IF the advantage is positive but the probability of taking that action decreased, the min operator selects the unclipped objective, undoing the bad update.
    IF the advantage is negative but the probability of taking that action increased, the min operator again selects the unclipped objective to undo the bad update, just as mentioned in the video.
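    For reference, a small plain-Python sketch of the clipped term these two cases describe (ε = 0.2 assumed; illustrative only, not code from the video):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = max(1 - eps, min(ratio, 1 + eps)) * advantage
    return min(unclipped, clipped)

# Advantage positive but the action's probability decreased (ratio < 1 - eps):
# the unclipped term is smaller, so it is the one that gets optimized.
print(ppo_clip_term(ratio=0.5, advantage=+1.0))   #  0.5 (unclipped)

# Advantage negative but the action's probability increased (ratio > 1 + eps):
# again the unclipped term is selected, undoing the bad update.
print(ppo_clip_term(ratio=1.5, advantage=-1.0))   # -1.5 (unclipped)
```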

  • @junjieli9253
    @junjieli9253 5 years ago

    Thank you for helping me understand PPO faster; a good explanation with useful resources included.

  • @Navhkrin
    @Navhkrin 5 years ago

    Much cleaner than the deep learning bootcamp explanation.

  • @anonymous_user-s3s
    @anonymous_user-s3s 2 months ago

    Fantastic intuitive explanation, thank you.

  • @Bardent
    @Bardent 3 years ago

    This video is absolutely amazing!!

  • @umuti5ik
    @umuti5ik 4 years ago

    Excellent algorithm and explanation!

  • @arianvc8239
    @arianvc8239 6 years ago

    This is really great! Keep up the good work!

  • @antoinemathu7983
    @antoinemathu7983 5 years ago +5

    I watch Siraj Raval for the motivation, but I watch Arxiv Insights for the explanations

  • @jcdmb
    @jcdmb 4 years ago

    Amazing explanation. Keep up the good work.

  • @MyU2beCall
    @MyU2beCall 4 years ago

    Great video. Excellent intro to this topic.

  • @ConsultingjoeOnline
    @ConsultingjoeOnline 4 years ago

    *Great* video. *Great* explanation!

  • @francesco.messina88
    @francesco.messina88 5 years ago

    Congrats! You have a special skill to explain AI.

  • @pawanbhandarkar4199
    @pawanbhandarkar4199 6 years ago

    Hats off, mate. This is fantastic.

  • @abhishekkapoor7955
    @abhishekkapoor7955 6 years ago

    Keep up the good work, sir. Thanks for this awesome explanation!

  • @sainijagjit
    @sainijagjit 6 months ago

    Thank you for the clean explanation.

  • @connor-shorten
    @connor-shorten 5 years ago +1

    Thank you! Learned a lot from this!

  • @petersilie9702
    @petersilie9702 4 years ago +1

    Thank you so much. I'm watching this video for the 10th time :-D

  • @akramsystems
    @akramsystems 6 years ago +2

    Love your videos!!

  • @cherrysun7054
    @cherrysun7054 5 years ago

    I really love your video; professional and informative, thanks.

  • @maraoz
    @maraoz 4 years ago

    Thanks for this video! Really good teaching skills! :)

  • @YuZhang-f1z
    @YuZhang-f1z 5 years ago

    Great video on PPO! Thanks a lot for your work!

  • @alizerg
    @alizerg 2 years ago

    Thanks buddy, really appreciated!

  • @siddharthmittal9355
    @siddharthmittal9355 6 years ago

    more, just more videos. so well explained.

  • @suertem1
    @suertem1 3 years ago

    Great explanation and references

  • @anthonydawson9700
    @anthonydawson9700 3 years ago

    Pretty good explanation and very understandable thanks!

  • @vizart2045
    @vizart2045 2 years ago

    I need to dive deeper into this.

  • @arkasaha4412
    @arkasaha4412 6 years ago +2

    Great video as usual! Just a suggestion, maybe instead of diving directly into deep RL you can make videos (shorter if you don't have much time to devote) on simpler RL algorithms like DQN, Q-learning. That way someone who wants to know more about RL can build up the knowledge through more vanilla stuff. I admire your style and content like many others and would love to see it grow more :)

    • @ArxivInsights
      @ArxivInsights 6 years ago +11

      You're totally right that this is not easy to step into if you are new to RL, but I feel like there are tons and tons of good introduction resources out there for the simpler stuff.
      I'm really trying to focus on the niche of people that already have this background and want to go a bit further :p

    • @52VaultBoy
      @52VaultBoy 6 years ago +2

      And I think that makes this channel amazing. Practically following the state of the art is a fantastic concept. As I am just curious about AI as a hobby and doing science in a different sector, I am glad that I don't need to go through tons of articles by myself; you show me the direction where I should look to stay in the picture. Thank you.

    • @M0481
      @M0481 6 years ago +1

      I was about to say what @Arxiv Insights is saying. There's an amazing book called RL: An Introduction by Richard S. Sutton and Andrew G. Barto, which introduces all the basic concepts that you're talking about. The second edition of this book has been made publicly available as a PDF and will be available on Amazon next month (don't quote me on the release date please :P).

    • @ColinSkow
      @ColinSkow 6 years ago +1

      I'm doing beginner level videos on my channel and would love your feedback... th-cam.com/channels/rRTWfso9OS3D09-QSLA5jg.html

    • @arkasaha4412
      @arkasaha4412 6 years ago

      Sure, thanks for the videos :)

  • @tuliomoreira7494
    @tuliomoreira7494 4 years ago

    Amazing explanation.
    Also, I just noticed that at 9:12 the seal slaps the guy with an octopus o.O

  • @bikrammajhi3020
    @bikrammajhi3020 11 months ago

    This is gold!!

  • @ruslanuchan8880
    @ruslanuchan8880 6 years ago

    Subscribed because the topics are so cool!

  • @apetrenko_ai
    @apetrenko_ai 6 years ago +2

    It was a great explanation!
    Please do a video on Soft Actor-Critic and Maximum Entropy RL! That would be amazing!

  • @MdelaRE1
    @MdelaRE1 5 years ago +1

    Amazing work :D

  • @sarvagyagupta1744
    @sarvagyagupta1744 5 years ago

    I don't know if I'll get answers here, but I have some questions:
    1) Why are we taking the "min" in the loss function?
    2) Are we using 1 in 1-e and 1+e because the reward we give for each positive action is 1? My question here is about the scaling factor.
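    For reference on 2): the 1 in 1-e and 1+e comes from the probability ratio, not from the reward scale. As the surrogate is usually written,

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad r_t(\theta_{\text{old}}) = 1,
```

    so the interval [1-ε, 1+ε] is centered on the point where the new and old policies agree; and the min in 1) makes the clipped surrogate a pessimistic lower bound on the unclipped objective.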

  • @conlanrios
    @conlanrios 10 months ago

    Great breakdown and links for additional resources

  • @hadsaadat8283
    @hadsaadat8283 2 years ago

    simply the best ever

  • @阮雨迪
    @阮雨迪 3 years ago

    really good explanation!

  • @SG-tz7jj
    @SG-tz7jj 1 year ago

    Great explanation.

  • @jeffreylim5920
    @jeffreylim5920 5 years ago +1

    16:46 How do you make use of the GPU in PPO? For me, it was hard to implement experience collection with CUDA PyTorch; actually, it seems even OpenAI didn't use the GPU in the collection process.

    • @DVDmatt
      @DVDmatt 5 years ago +1

      You need to vectorize the environments. PPO2 now does this (see twitter.com/openai/status/931226402048811008?lang=en). In my experience GPU usage is still very low during collection, however.
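      A minimal sketch of what vectorizing the environments looks like, assuming a classic gym-style reset()/step() interface and placeholder make_env() and policy() functions (not PPO2's actual API): the point is that the policy network is called once per step on a whole batch of observations.

```python
import numpy as np

def collect_rollout(make_env, policy, num_envs=8, steps=128):
    """Step num_envs environments in lockstep; one batched policy call per step."""
    envs = [make_env() for _ in range(num_envs)]
    obs = np.stack([env.reset() for env in envs])              # (num_envs, obs_dim)
    trajectory = []
    for _ in range(steps):
        actions = policy(obs)                                  # single batched forward pass (GPU-friendly)
        outs = [env.step(a) for env, a in zip(envs, actions)]
        next_obs = np.stack([o[0] for o in outs])
        rewards = np.array([o[1] for o in outs])
        dones = np.array([o[2] for o in outs])
        trajectory.append((obs, actions, rewards, dones))
        # env.step() itself stays on the CPU, which is why GPU utilization
        # during collection remains low even with vectorized environments.
        obs = np.stack([env.reset() if done else nxt
                        for env, nxt, done in zip(envs, next_obs, dones)])
    return trajectory
```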

  • @chid3835
    @chid3835 3 years ago

    Very nice videos. FYI: Please watch at 0.75 speed for better understanding, LOL!

  • @idabagusdiaz
    @idabagusdiaz 5 years ago

    YOU ARE AWESOME!

  • @yinghaohu8784
    @yinghaohu8784 8 months ago

    very good explanations

  • @yonistoller1
    @yonistoller1 1 year ago

    Thanks for sharing this! I may be misunderstanding something, but it seems like there might be a mistake in the description. Specifically, the claim at 12:50 that "this is the only region where the unclipped part... has a lower value than the clipped version".
    I think this claim might be wrong, because there could be another case where the unclipped version would be selected:
    For example, if the ratio is e.g. 0.5 (and we assume epsilon is 0.2), the ratio would be smaller than the clipped version (which would be 0.8), and it would be selected.
    Is that not the case?
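    A quick numeric check of the case raised here (ε = 0.2): with a positive advantage and a decreased action probability the min does pick the unclipped term; with a negative advantage it picks the clipped one.

```python
r, eps = 0.5, 0.2
clipped_r = max(1 - eps, min(r, 1 + eps))            # 0.8

for advantage in (+1.0, -1.0):
    unclipped, clipped = r * advantage, clipped_r * advantage
    chosen = "unclipped" if unclipped < clipped else "clipped"
    print(f"A = {advantage:+}: min({unclipped}, {clipped}) -> {chosen}")
# A = +1.0: min(0.5, 0.8) -> unclipped
# A = -1.0: min(-0.5, -0.8) -> clipped
```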

  • @meddh1065
    @meddh1065 3 years ago

    There is something I didn't understand: if you clip r, then how will you do backpropagation? The gradient will be just zero in the case r > 1+epsilon or r < 1-epsilon.

  • @anshulpagariya6881
    @anshulpagariya6881 4 years ago

    A big thanks for the video :)

  • @benjaminf.3760
    @benjaminf.3760 5 years ago

    Very well explained, thank you

  • @kenfuliang
    @kenfuliang 4 years ago

    Thank you so much. Very helpful

  • @ravichunduru834
    @ravichunduru834 5 years ago +3

    Great video, but I have a couple of doubts:
    1. In PPO, how does changing the objective help in restricting the updates to the policy? Wouldn’t it make more sense to restrict the gradient so that we don’t update the policy too much in one go?
    2. In PPO, when A

  • @learningdaily4533
    @learningdaily4533 3 years ago +2

    Hi, I'm really interested in your videos and I find them really helpful for people who are learning RL in particular and AI in general. Your way of presenting the notions and key ideas behind these algorithms is amazing. It's so sad to find that you're not going to update videos for 2 years. Do you have any other channel or anything else I can learn from you? Please let me know :( It would be my pleasure.

    • @ademord
      @ademord 2 years ago

      I came here to say this

  • @Sherlockarim
    @Sherlockarim 6 years ago +1

    Great content, keep it up man!

  • @CommanderCraft98
    @CommanderCraft98 3 years ago

    At 10:27 he says that the first part in the min() expression is the "default policy gradient objective", but I do not really see how, since the objective function is usually J = E[R_t]. Does someone understand this?
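    One way to read that statement: the surrogate is built so that its gradient matches the usual policy gradient estimate of J(θ) = E[R_t]. By the policy gradient theorem, with Â_t treated as a fixed, sampled estimate,

```latex
\nabla_\theta J(\theta)
  \approx \hat{\mathbb{E}}_t\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]
  = \nabla_\theta \, \hat{\mathbb{E}}_t\!\left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]
  = \nabla_\theta L^{PG}(\theta),
```

    so differentiating L^PG recovers (an estimate of) the gradient of J; in that sense it is the "default" policy gradient objective.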

  • @fabiocescon3772
    @fabiocescon3772 5 years ago

    Thank you, it's really a good explanation

  • @thiyagutenysen8058
    @thiyagutenysen8058 2 years ago

    log(probabilities) will be -ve, right? So if we take a bad action the advantage function is -ve, so Lpg = -ve * -ve = +ve. So Lpg blows up when we take bad actions. L represents the objective and not the loss function, right?
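    L^PG is indeed an objective to be maximized (implementations typically minimize its negative). A cleaner way to check the signs is through the gradient rather than the value of a single term:

```latex
\nabla_\theta L^{PG} = \hat{\mathbb{E}}_t\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right],
```

    so for a sample with Â_t < 0 a gradient ascent step pushes log π_θ(a_t|s_t) down, making the bad action less likely, regardless of the sign of the log-probability itself.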

  • @samidelhi6150
    @samidelhi6150 5 years ago

    How do I tackle the moving-target problem using methods from RL, where I have more than one reward, 3 possible actions to take, and of course a state which includes many factors/sources of information about the environment?
    Your help is highly appreciated.

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 6 years ago

    Great video!

  • @joshuajohnson4339
    @joshuajohnson4339 6 years ago +6

    Just thought I would let you know that I just shared your video with my cohort in the Udacity Reinforcement Learning Nanodegree. We are going through PPO now and this video is relevant and timely - especially with respect to the clip-region explanations. Any ideas on how to convert the outputs from a discrete to a continuous action space?

    • @ArxivInsights
      @ArxivInsights 6 years ago +1

      Sounds great, thx for sharing! Well as I mentioned, the PPO policy head outputs the parameters of a Gaussian distribution (so means and variances) for each action. At runtime, you can then sample from these distributions to get continuous output values and use the reparametrization trick to backpropagate gradients through this non-differentiable block --> check out my video on Variational Autoencoders for all the details on this!
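      A minimal PyTorch-style sketch of that idea (layer sizes and a state-independent log-std are assumptions for illustration, not the video's code):

```python
import torch
import torch.nn as nn

class GaussianPolicyHead(nn.Module):
    """Maps a state encoding to a Gaussian over continuous actions."""
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # learned, state-independent std

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        # rsample() = reparametrized sample (mean + std * noise), so gradients
        # still flow back through mean and std despite the sampling step.
        return dist.rsample()
```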

  • @Corpsecreate
    @Corpsecreate 6 years ago

    I have some questions! Taking a quick step back to the policy gradient loss for a sec, we had:
    Loss = E([log prob] * advantage)
    If my understanding is correct, then we actually have two neural networks here: one that calculates the probabilities of each action (this is the policy network we are trying to optimise), and an entirely different neural network that tries to guess the value of being in the current state. Q1 - does the value network simply learn with mean-squared error by minimising ([actual discounted reward] - [value net prediction])^2? Is there no way to use policy gradient methods without running 2 networks?
    Q2 - How do we actually calculate the discounted reward for a neural network that only outputs the probabilities of each action? For example, if at time step 0, our NN produces:
    Act 1 : 20%
    Act 2 : 30%
    Act 4 : 50%
    I can only take one of these actions to end up in a new state. Do we take the highest one? Or do we, for each trajectory, randomly pick one based on their probability of being chosen? Do we do this for every time step t = 1 to T?
    After the trajectory of T timesteps, we get one 'actual' value for G, which is attributed to the timestep at t = 0. Does this mean we can only perform gradient descent on this single observation? If we do a minibatch, do we need multiple trajectories, say 50, each of length T, and then do gradient descent on the 50 where only the G value for t = 0 has been calculated for each of them?
    My apologies for the questions; hopefully they make sense. I'm just looking to confirm my understanding :)

    • @ArxivInsights
      @ArxivInsights 6 years ago

      Hi, really good questions!
      Q1: you can train a policy gradient method without using a value function by just training the policy network, but using a value function to estimate the expected return from the current state tends to make things much more stable.
      Q2: You're correct that this might seem a bit weird, but indeed you have to probabilistically sample an action at each timestep and then play out the episode along that specific path in state space. However, on average and over time, each action will in fact get selected according to its probability. So stochastically every action gets played!
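      A compact sketch of the two-network setup discussed in Q1/Q2 (dimensions and layer sizes are made up for illustration): the value net is fit with mean-squared error to the observed return, and actions are sampled from the categorical output rather than taking the argmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions = 4, 3
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

def act(obs):
    """Q2: sample one action per step according to the policy's probabilities."""
    dist = torch.distributions.Categorical(logits=policy_net(obs))
    action = dist.sample()
    return action, dist.log_prob(action)

def losses(obs, actions, returns):
    """Q1: the value net regresses the discounted return with MSE;
    the policy objective is log-prob * advantage (negated here for a minimizer)."""
    values = value_net(obs).squeeze(-1)
    value_loss = F.mse_loss(values, returns)
    advantages = (returns - values).detach()     # keep the policy loss from updating the value net
    log_probs = torch.distributions.Categorical(logits=policy_net(obs)).log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()
    return policy_loss, value_loss
```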

  • @JohnDoe-cq3ic
    @JohnDoe-cq3ic 5 years ago

    Very good, thorough walkthrough... but take a breath when you're talking. It takes a moment to catch up to what you're saying sometimes, but I guess that's the benefit of pause and play again. Nice job!

  • @СтепанТроешестов
    @СтепанТроешестов 6 years ago

    Great video! Thanks a lot!

  • @victor-iyi
    @victor-iyi 5 years ago +1

    Hi Andrew,
    Can you please make a video explaining OpenAI's transformer model, Google's BERT, and OpenAI's GPT & GPT-2 models?
    I can't seem to wrap my head around them.

  • @Guytron95
    @Guytron95 4 years ago

    3:54 to 4:10 or so, why does that section remind me of the method used for ray marching in image rendering?

  • @greysky1786
    @greysky1786 9 days ago

    Thank you.

  • @absimaldata
    @absimaldata 3 years ago

    Why do we take the log of the policy in the loss?
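    The short answer is the log-derivative (score function) trick: the gradient of an expectation under π_θ can be rewritten as an expectation involving ∇ log π_θ,

```latex
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[ R(a) \right]
  = \mathbb{E}_{a \sim \pi_\theta}\!\left[ R(a)\, \nabla_\theta \log \pi_\theta(a) \right],
```

    which can be estimated from sampled actions; writing the surrogate with log π_θ is exactly what makes its gradient equal to this estimator.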

  • @jeffreylim5920
    @jeffreylim5920 5 years ago +1

    12:56 The real power of clipping is that it automatically ignores outlier samples. Not just decreasing their influence, but totally ignoring them! This is because the gradient of outlier samples is 0.
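    A quick PyTorch check of that claim, taking the gradient with respect to the ratio itself for illustration (ε = 0.2, made-up numbers):

```python
import torch

def grad_wrt_ratio(ratio_value, advantage, eps=0.2):
    r = torch.tensor(ratio_value, requires_grad=True)
    surrogate = torch.min(r * advantage, torch.clamp(r, 1 - eps, 1 + eps) * advantage)
    surrogate.backward()
    return r.grad.item()

print(grad_wrt_ratio(1.5, +2.0))   #  0.0 -> clipped branch wins: the outlier sample is ignored
print(grad_wrt_ratio(1.5, -2.0))   # -2.0 -> unclipped branch wins: the bad update still gets corrected
```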

  • @tumaaatum
    @tumaaatum 3 years ago

    Can you do a video about DDPG?
    Also, the PPO I know simply uses the discounted future returns in the loss. Is the variant using the advantage instead the standard one?

  • @RishiPratap-om6kg
    @RishiPratap-om6kg 1 year ago

    Can I use this algorithm for "computation offloading in edge computing"?

  • @rutvikreddy772
    @rutvikreddy772 5 years ago

    Great video! I had a question though: at 6:50, the objective function, which you called a loss, is actually the function that we'd want to maximize, right? I mean, calling it a loss gave me the idea that we should minimize it. Correct me if I am wrong, please.

    • @gregh6586
      @gregh6586 4 years ago +1

      Yes, we are trying to maximise the objective. It is called a "loss" simply because it plays the same role as (true) loss functions in other domains. It might get tricky when you implement a multi-head neural network with actor-critic methods, where you combine different loss functions (GAE for the actor, lambda returns for the critic, entropy for exploration), as you have to keep track of which "loss" you aim to maximise and which to minimise.
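      For reference, the PPO paper's combined objective is one concrete version of this bookkeeping: a single quantity to maximise (so in code you typically minimise its negative),

```latex
L^{CLIP+VF+S}_t(\theta)
  = \hat{\mathbb{E}}_t\!\left[ L^{CLIP}_t(\theta) - c_1\, L^{VF}_t(\theta) + c_2\, S[\pi_\theta](s_t) \right],
```

      where the value-function error L^VF enters with a minus sign (it is minimised) and the entropy bonus S with a plus sign (it is maximised).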

  • @rayanelhelou2009
    @rayanelhelou2009 4 years ago

    A comment about the PPO paper, not this video: there's a minor typo in Eqs. (10) and (11).
    The exponent of the final term should read T-t-1 rather than T-t+1.
    Would you agree?
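    For reference, writing out the truncated advantage estimator makes the index arithmetic explicit:

```latex
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\, \delta_{t+l}
          = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \dots + (\gamma\lambda)^{T-t-1}\, \delta_{T-1},
```

    so the exponent on the final δ_{T-1} term is T-t-1, consistent with the comment.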

  • @buffergate
    @buffergate 4 years ago +7

    9:56
    "Looks surprising simple,,, .. right?"
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    :(