Does your PPO agent fail to learn?

Comments • 35

  • @philippk5446
    @philippk5446 1 year ago +16

    In this context, you should always track your KL divergence, since a high KL divergence may indicate over-exploration
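
A minimal sketch of how such tracking could look, assuming you have the log-probabilities of the sampled actions under both the old and the updated policy. The function name `approx_kl` and the choice of estimator (the mean of `(r - 1) - log r`, with `r` the probability ratio) are illustrative assumptions, not something the comment specifies. The value is zero when the policy has not changed and grows as the updated policy drifts from the one that collected the data.

```python
import math

def approx_kl(old_log_probs, new_log_probs):
    """Estimate KL(old || new) from log-probs of actions sampled under the old policy.

    Uses the estimator mean((r - 1) - log r), where r = new_prob / old_prob.
    Every term is non-negative, so the estimate never goes below zero.
    """
    total = 0.0
    for old_lp, new_lp in zip(old_log_probs, new_log_probs):
        log_ratio = new_lp - old_lp
        total += (math.exp(log_ratio) - 1.0) - log_ratio
    return total / len(old_log_probs)

# Hypothetical log-probs for two sampled actions.
old = [math.log(0.5), math.log(0.5)]
unchanged = approx_kl(old, old)
shifted = approx_kl(old, [math.log(0.9), math.log(0.1)])
```

Logging this number once per update makes it easy to spot updates where the policy moved much further than usual.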

  • @vladyslavkorenyak872
    @vladyslavkorenyak872 1 year ago +1

    Hello Sir. Do you have any insight into the "use_sde" variable of PPO in Stable Baselines3? It supposedly activates "generalized State Dependent Exploration", but I did not find any clear results about its pros and cons.

  • @tommygin7561
    @tommygin7561 4 months ago

    Awesome video! Can you also talk about how to tune these hyperparameters generally? It would be very helpful!

  • @remcopoelarends9888
    @remcopoelarends9888 1 year ago +2

    Very nice video! Could you maybe make a video explaining the hyperparameters of PPO in sb3 and how to set them? Keep up the good work!

    • @rlhugh
      @rlhugh 1 year ago

      Thanks! Any particular parameter(s) that you are most interested in?

    • @remcopoelarends9888
      @remcopoelarends9888 1 year ago +5

      @rlhugh The ones that are less self-explanatory, such as clip_range, normalize_advantage, ent_coef, max_grad_norm and use_sde.
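
Of the parameters listed in this thread, `clip_range` is perhaps the easiest to illustrate in isolation. The sketch below shows the standard PPO clipped surrogate objective for a single sample; the function name `clipped_surrogate` and the value 0.2 are assumptions chosen for illustration, and the other parameters (normalize_advantage, ent_coef, max_grad_norm, use_sde) are not covered here.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for one sample.

    ratio is new_prob / old_prob for the sampled action. The objective is
    min(ratio * A, clip(ratio, 1 - clip_range, 1 + clip_range) * A).
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain from raising the action's probability is capped.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)
# Negative advantage with a raised probability: the penalty is not capped.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

The effect is that a single update gets no extra credit for moving the probability ratio outside the interval `[1 - clip_range, 1 + clip_range]` in the direction the advantage favors.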

  • @petarulev9021
    @petarulev9021 1 year ago

    I have the exact same overfitting problem: my agent learns very useful behavior, but at some point it just overfits to one action. This is why I take the checkpoint from before the overfitting, but that is a nasty fix.
    I just incorporated entropy regularization and my model is training. The data is incredibly noisy; I will let you know about the result.
    In the meantime, I am wondering how kl_coeff influences the whole process, and what you think about the relation between entropy regularization and kl_coeff. I would appreciate a video or a comment.
    Cheers,
    petar
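
To make the link between "overfitting to one action" and entropy regularization concrete, here is a small sketch for a discrete action space. The function names and the coefficient value 0.01 are hypothetical; `ent_coef` is borrowed from the parameter name mentioned elsewhere in these comments. A policy that always picks one action has zero entropy, so it receives no bonus.

```python
import math

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def loss_with_entropy_bonus(policy_loss, probs, ent_coef=0.01):
    """Subtract a scaled entropy bonus, so higher entropy lowers the loss."""
    return policy_loss - ent_coef * categorical_entropy(probs)

collapsed = [1.0, 0.0, 0.0, 0.0]      # always the same action
uniform = [0.25, 0.25, 0.25, 0.25]    # maximally spread over four actions

collapsed_entropy = categorical_entropy(collapsed)
uniform_entropy = categorical_entropy(uniform)
```

With the same policy loss, the uniform policy ends up with the lower total loss, which is the pressure that discourages the policy from collapsing early.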

  • @willnutter1194
    @willnutter1194 2 years ago +3

    Great videos, really enjoy your style of communication and thoughts. Thanks for making them :)

  • @hoseashpm7810
    @hoseashpm7810 2 years ago +1

    My PPO agent's cumulative reward seems very "noisy" from episode to episode. That is, the average cumulative reward increases, but the instantaneous cumulative reward looks like a noisy signal. I tried the tips for designing a reward function with a gradient, and tried changing the entropy loss weight, yet it just does not converge to a consistent policy.
    I feel like pulling my hair now.

    • @rlhugh
      @rlhugh 2 years ago

      Somehow I missed this comment earlier. Yeah, the reward usually is very noisy. In TensorBoard, there is an option to smooth the graph. The same option exists in MLflow, and probably in Weights and Biases too. But... what do you mean by 'instantaneous cumulative reward'? Isn't the cumulative reward by definition the sum of all rewards from time 0 until some time T?

    • @hoseashpm7810
      @hoseashpm7810 2 years ago +1

      @rlhugh Hi Hugh. Thanks for the tip. By "instantaneous" I meant the cumulative reward at the end of every episode.
      I used MATLAB for designing the agent. I ended up using a double DQN with a discrete action space. It learned a lot faster and more smoothly. Maybe my knowledge of PPO sucks. I tried extending the training time, but the PPO agent gets stuck somehow.

    • @rlhugh
      @rlhugh 2 years ago

      Interesting. Good info. Thank you! Do you have any thoughts on what about your task might make it more amenable to value function learning? What are some of the characteristics of your input and output space that might be different from, e.g., playing Doom using the screen as input?
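
The graph smoothing mentioned in this thread can be approximated offline with an exponential moving average over per-episode returns. This is a sketch under assumptions: the function name `smooth` and the default weight are illustrative, and the exact formula used by any particular dashboard may differ.

```python
def smooth(values, weight=0.9):
    """Exponential moving average; a weight closer to 1 smooths more heavily."""
    if not values:
        return []
    smoothed = []
    last = values[0]  # start from the first value so the curve has no warm-up dip
    for v in values:
        last = weight * last + (1.0 - weight) * v
        smoothed.append(last)
    return smoothed

# Hypothetical episode returns that alternate between two extremes.
noisy_returns = [10.0, 0.0, 10.0, 0.0]
smoothed_returns = smooth(noisy_returns, weight=0.5)
```

The smoothed series swings less than the raw one, which makes an upward trend in the average easier to see even when individual episodes vary a lot.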

  • @Jolle_Gaming
    @Jolle_Gaming 10 months ago

    So the entreg is the same as ent_coef in PPO, or did I misunderstand you?

    • @rlhugh
      @rlhugh 10 months ago

      Yes, that's correct.

  • @SP-db6sh
    @SP-db6sh 1 year ago +1

    Make a video on using FinRL

  • 1 year ago

    It's a great video. I am tuning the Kp and Ki gains with PPO reinforcement learning. The result is also constant over the whole trajectory of the robot's movement, and I would like to know why. Am I doing something wrong, or is this fine? I would really appreciate your comments. Thanks!

  • @SP-db6sh
    @SP-db6sh 1 year ago

    I regret only seeing this 6 months later.
    Can you make a video on creating a custom env for systems such as the user experience of a new app, or a trading bot?

    • @rlhugh
      @rlhugh 1 year ago +1

      So, firstly, I don't have experience with using RL for trading. But secondly, my gut intuition is that one uses RL when one's actions affect the environment, or at least the current state. However, unless you are making giant trades, your trading actions will not much affect your environment, i.e. the price, I think? The state does include things like how much money you have and what stock you own. However, I'm not sure that how much stock you own and how much money you have will much affect an estimate of the value of a stock. I would imagine that supervised learning is all you need, and will be much more efficient? What makes you feel that RL could be appropriate for estimating the value of a stock, or taking actions on stock?

    • @rlhugh
      @rlhugh 1 year ago +1

      (I suppose one option could be to create a simulator, by using stock prices from a year or so ago, and assuming that one's stock trades do not affect market price?)

    • @rlhugh
      @rlhugh 1 year ago +1

      What timeframe were you thinking of using for each step of RL? E.g. 5 minutes? 1 day? 1 week? 1 month? Do you know where one could obtain prices for several stocks that you are interested in trading, from e.g. 1 year ago, at the level of granularity that you are interested in training RL on?

    • @p4ros960
      @p4ros960 1 year ago +1

      @rlhugh Keep in mind that price does not mean anything in trading.

    • @rlhugh
      @rlhugh 1 year ago

      @p4ros960 Can you elaborate on that? Afaik, all securities with stocks as the underlying asset do have a value that depends on the price of the underlying stock? For example, if you sell a call, the more the price of the underlying stock goes up, the more money you will lose when that call is exercised, I think?
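
The simulator idea floated in this thread (replay historical prices and assume your own trades do not move the market) could be sketched roughly as follows. Everything here is hypothetical: the class name `PriceReplayEnv`, the three-action scheme, the reward definition (change in net worth per step), and the made-up price series. It is a toy illustration, not a recommendation for how to build a trading environment.

```python
class PriceReplayEnv:
    """Toy environment that replays a fixed price series.

    Assumes trades never affect the price, so the price at each step
    comes straight from the recorded series.
    """

    def __init__(self, prices, cash=1000.0):
        self.prices = list(prices)
        self.initial_cash = cash
        self.reset()

    def reset(self):
        self.t = 0
        self.cash = self.initial_cash
        self.shares = 0
        return self._observation()

    def _observation(self):
        return (self.prices[self.t], self.cash, self.shares)

    def _net_worth(self):
        return self.cash + self.shares * self.prices[self.t]

    def step(self, action):
        """action: +1 buys one share, -1 sells one share, 0 holds."""
        price = self.prices[self.t]
        before = self._net_worth()
        if action == 1 and self.cash >= price:
            self.cash -= price
            self.shares += 1
        elif action == -1 and self.shares > 0:
            self.cash += price
            self.shares -= 1
        self.t += 1  # advance to the next recorded price
        reward = self._net_worth() - before
        done = self.t == len(self.prices) - 1
        return self._observation(), reward, done

# Made-up prices purely for illustration.
env = PriceReplayEnv([10.0, 12.0, 11.0], cash=100.0)
first_obs = env.reset()
obs_after_buy, reward_after_buy, done_after_buy = env.step(1)
obs_after_sell, reward_after_sell, done_after_sell = env.step(-1)
```

Each step here corresponds to one entry in the price series, so the timeframe question raised above (5 minutes, 1 day, and so on) is decided by the granularity of the data you load.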

  • @Bvic3
    @Bvic3 1 year ago

    What's 100k steps? You run 100 times 1 epoch of learning on 1000 frames?

    • @rlhugh
      @rlhugh 1 year ago +1

      Steps relate to the simulation, not to the learning. A step is one iteration of: receive an observation, take one action. Epochs of learning etc are configured separately. You can choose to run 5 epochs of learning over each batch of steps, for example, which would result in each step being used in 5 different training epochs.

    • @Bvic3
      @Bvic3 1 year ago +1

      @rlhugh Ok, thanks. That's what I expected but I just wanted a confirmation.
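
The distinction drawn in this thread between environment steps and learning epochs can be worked through with some arithmetic. The sketch below uses the numbers from the thread (100k steps, batches of 1000 steps, 5 epochs per batch) plus a minibatch size of 100, which is an assumption added for illustration; the function and parameter names are hypothetical.

```python
def training_counts(total_steps, rollout_steps, n_epochs, minibatch_size):
    """Count rollout batches and gradient updates for a given step budget."""
    n_rollouts = total_steps // rollout_steps
    minibatches_per_epoch = rollout_steps // minibatch_size
    gradient_updates = n_rollouts * n_epochs * minibatches_per_epoch
    return n_rollouts, gradient_updates

n_rollouts, gradient_updates = training_counts(
    total_steps=100_000,   # environment steps: one observation, one action each
    rollout_steps=1_000,   # steps collected before each round of learning
    n_epochs=5,            # passes over each collected batch
    minibatch_size=100,    # assumed value, not from the thread
)
```

Under these assumptions the step budget stays at 100k regardless of the epoch setting; raising the number of epochs only increases how many times each collected step is reused for learning.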

  • @Meditator80
    @Meditator80 1 year ago +1

    really fantastic videos 🎉

    • @rlhugh
      @rlhugh 1 year ago

      Thank you!

  • @RoboticusMusic
    @RoboticusMusic 11 months ago +3

    It might be more helpful to explain and demo what entropy regularization is, what it does, and the history of the concept and different forms of it. The rest would be pretty intuitive.

    • @rlhugh
      @rlhugh 11 months ago +1

      Thank you for the feedback. Very useful, and I appreciate it :)

  • @TwoThreeFour
    @TwoThreeFour 6 months ago +1

    Wait until we reach C-3PO, instead of 2PO, that would be very interesting. 😁

  • @vialomur__vialomur5682
    @vialomur__vialomur5682 1 year ago

    thanks!