NeurIPS 2020 Tutorial on Offline RL: Part 1

  • Published 16 Nov 2024

Comments • 5

  • @flipthecointwice • 2 years ago

    Can you please share a code example for data-driven RL, including data preparation?

  • @jeyasheelarakkinimj6534 • 2 years ago

    Is there Python code available? I have a dataset.

  • @dermitdembrot3091 • 3 years ago

    The equation at 17:40 is hard to understand. The LHS seems to weirdly discount the gradient at states that are later in the trajectory, while the RHS doesn't. Any indication why that happens?

    • @ruizhenliu9544 • 3 years ago

      1) Because of Wald's identity, we can directly take the summation and the discount factor outside.
      2) The randomness of the trajectory on the LHS is induced by $\pi(a|s)$ and the initial state distribution $\mu_0(s)$, and so is the joint distribution of the state marginal $d^\pi(s)$ and the policy $\pi(a|s)$ on the RHS, where $d^\pi(s) := (1-\gamma)\sum^{\infty}_{t=0} P^\pi(s_t = s)$.

    • @dermitdembrot3091 • 3 years ago

      @ruizhenliu9544 thanks for your reply! Maybe you meant to write $d^\pi(s) := (1-\gamma)\sum^{\infty}_{t=0} \gamma^t P^\pi(s_t=s)$ (note the added $\gamma^t$)? That would answer my question by saying that the discounting of future states can be interpreted probabilistically (with a probability $1-\gamma$ of termination after each time-step) and therefore incorporated into the state distribution.
      Maybe that is what happens here. However, I think what is usually done in practice is to sample $d^\pi(s)$ by choosing states from the collected experience uniformly at random, not discounting according to the time-step at which they were observed.
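
A minimal numerical sketch of the point discussed in this thread, assuming a small tabular MDP with made-up transition probabilities and a fixed policy (none of this is code from the tutorial): the discounted state marginal $d^\pi(s) = (1-\gamma)\sum_{t \ge 0} \gamma^t P^\pi(s_t = s)$ can be sampled by terminating each rollout with probability $1-\gamma$ at every step, which is where the per-time-step discount on the LHS goes once it is folded into the state distribution on the RHS.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small tabular MDP with S states, A actions, and a fixed stochastic policy.
# All numbers are made up; the only point is to compare the two views of d^pi.
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] = transition probabilities
pi = rng.dirichlet(np.ones(A), size=S)       # pi[s, a]    = policy probabilities
mu0 = rng.dirichlet(np.ones(S))              # initial state distribution

# State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
P_pi = np.einsum("sa,sax->sx", pi, P)

# Analytic discounted state marginal:
#   d_pi(s) = (1 - gamma) * sum_t gamma^t * P(s_t = s)
#           = (1 - gamma) * mu0^T (I - gamma * P_pi)^{-1}
d_analytic = (1 - gamma) * mu0 @ np.linalg.inv(np.eye(S) - gamma * P_pi)

# Monte Carlo: continue each rollout with probability gamma per step and
# record the state at which it terminates; the recorded states are then
# distributed according to d_pi, i.e. the discount is absorbed into the
# state distribution.
counts = np.zeros(S)
for _ in range(100_000):
    s = rng.choice(S, p=mu0)
    while rng.random() < gamma:
        a = rng.choice(A, p=pi[s])
        s = rng.choice(S, p=P[s, a])
    counts[s] += 1

print("analytic d_pi   :", np.round(d_analytic, 3))
print("Monte Carlo d_pi:", np.round(counts / counts.sum(), 3))
```

The two printed distributions should agree up to Monte Carlo noise. If states were instead drawn uniformly from the collected experience, as the last comment notes is common in practice, the estimate would correspond to the undiscounted state distribution rather than $d^\pi$.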
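
Regarding the code requests in the first two comments, here is a minimal data-preparation sketch, assuming the logged experience is available as flat NumPy arrays of observations, actions, rewards, and episode-end flags; the fabricated arrays and the `sample_batch` helper are hypothetical stand-ins, not code from the tutorial. Any offline RL method (behavior cloning, CQL, etc.) would then train on minibatches drawn from this buffer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged data. In practice, replace these with your own arrays
# (e.g. loaded from CSV / .npz); here they are fabricated so the script runs.
N, obs_dim, n_actions = 1_000, 4, 3
observations = rng.normal(size=(N, obs_dim)).astype(np.float32)
actions = rng.integers(0, n_actions, size=N)
rewards = rng.normal(size=N).astype(np.float32)
terminals = rng.random(N) < 0.02          # True where an episode ended

# Data preparation: turn the flat log into (s, a, r, s', done) transitions.
# next_observations[t] is simply the next logged row; where terminals[t] is
# True it belongs to the following episode, but the done flag masks any
# bootstrapping on it, which is the usual convention for offline buffers.
dataset = {
    "observations":      observations[:-1],
    "actions":           actions[:-1],
    "rewards":           rewards[:-1],
    "next_observations": observations[1:],
    "terminals":         terminals[:-1],
}

def sample_batch(data, batch_size=256):
    """Uniformly sample a minibatch of transitions, as offline RL training loops do."""
    idx = rng.integers(0, len(data["observations"]), size=batch_size)
    return {k: v[idx] for k, v in data.items()}

batch = sample_batch(dataset)
print({k: v.shape for k, v in batch.items()})
```

The done flags matter mainly for masking bootstrap targets, e.g. something like $r + \gamma\,(1-\text{done})\max_a Q(s', a)$ in a Q-learning-style offline method.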