The equation at 17:40 is hard to understand. The LHS seems to weirdly discount the gradient at states that are later in the trajectory, while the RHS doesn't. Any indication why that happens?
1) Because of Wald's identity, we can directly take the summand and the decay factor outside. 2) The randomness of the trajectory on the LHS is induced by $\pi(a|s)$ and the initial state distribution $\mu_0(s)$, and so is the joint distribution of the state marginal $d^\pi(s)$ and the policy $\pi(a|s)$ on the RHS, where $d^\pi(s) := (1-\gamma)\sum_{t=0}^{\infty} P^\pi(s_t = s)$.
@ruizhenliu9544 thanks for your reply! Maybe you meant to write $d^\pi(s) := (1-\gamma)\sum_{t=0}^{\infty} \gamma^t P^\pi(s_t = s)$ (note the added $\gamma^t$)? That would answer my question by saying that the discounting of future states can be interpreted probabilistically (with a probability $1-\gamma$ of termination after each time step) and therefore incorporated into the state distribution. Maybe that is what happens here. However, I think what is usually done in practice is to sample $d^\pi(s)$ by choosing states from the collected experience uniformly at random, not discounting according to the time step at which they were observed.
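For what it's worth, the two sampling schemes discussed above are easy to contrast in code. Below is a minimal Python sketch; the `env` and `policy` interfaces are hypothetical gym-style placeholders, not from the lecture. The first function samples a state from $d^\pi$ exactly by continuing with probability $\gamma$ after each step; the second is the uniform-over-experience shortcut used in practice.

```python
import numpy as np

def sample_state_from_d_pi(env, policy, gamma, rng):
    """Draw one state from d^pi(s) = (1-gamma) * sum_t gamma^t P^pi(s_t = s).

    Interpretation: run the policy, but after each step terminate with
    probability 1 - gamma; the state where we stop is a sample from d^pi.
    (env/policy are hypothetical interfaces, for illustration only.)
    """
    s = env.reset()
    while rng.random() < gamma:   # continue with probability gamma
        a = policy(s)
        s, _, done, _ = env.step(a)
        if done:
            s = env.reset()       # assumed: restart episodes while sampling
    return s

def sample_state_uniform(experience, rng):
    """The common practical shortcut: pick a stored state uniformly at
    random, ignoring the gamma^t weighting by time step."""
    return experience[rng.integers(len(experience))]
```

The geometric-termination trick works because the probability of stopping exactly at step $t$ is $(1-\gamma)\gamma^t$, which matches the weights in the definition of $d^\pi$; the uniform variant drops those weights, which is the mismatch the comment above points out.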
Can you please share a code example for data-driven RL, including data preparation?
Is there Python code available? I have a dataset.
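Not an official answer, but as a starting point, here is a minimal Python sketch of the data-preparation step for offline (data-driven) RL. It assumes a hypothetical flat layout where each logged row is one transition $(s, a, r, s', \text{done})$ with discrete actions; the function names and column order are illustrative, not from any particular library.

```python
import numpy as np

def prepare_offline_dataset(rows, state_dim):
    """Split logged rows into the arrays an offline RL algorithm consumes.

    Assumed row layout (hypothetical): state features (state_dim columns),
    action (1 column), reward (1 column), next-state features (state_dim
    columns), done flag (1 column).
    """
    data = np.asarray(rows, dtype=np.float32)
    states      = data[:, :state_dim]
    actions     = data[:, state_dim].astype(np.int64)   # discrete actions assumed
    rewards     = data[:, state_dim + 1]
    next_states = data[:, state_dim + 2 : 2 * state_dim + 2]
    dones       = data[:, -1].astype(bool)
    return states, actions, rewards, next_states, dones

def sample_minibatch(dataset, batch_size, rng):
    """Uniform minibatch sampling over stored transitions, as most
    offline algorithms do."""
    s, a, r, s2, d = dataset
    idx = rng.integers(len(s), size=batch_size)
    return s[idx], a[idx], r[idx], s2[idx], d[idx]
```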