The formula for the Q(s, a) update step at 7:06 doesn't match the code at 11:55. I believe the code is correct and the formula should read Q(s, a) = Q(s, a) + alpha * [r + gamma * max(Q(s')) - Q(s, a)]. Argmax -> max, V(s) -> Q(s, a).
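For anyone comparing against the video, here's a minimal sketch of that corrected update as code. It assumes a NumPy Q-table indexed as q_table[state, action]; the names q_table, alpha, and gamma are assumptions, not necessarily what the video uses:

    import numpy as np

    # Sketch of the corrected Q-learning update (names are hypothetical).
    def q_update(q_table, state, action, reward, new_state, alpha=0.1, gamma=0.99):
        # TD target: max over the next state's action values (not argmax,
        # which would return an action index rather than a value).
        td_target = reward + gamma * np.max(q_table[new_state, :])
        q_table[state, action] += alpha * (td_target - q_table[state, action])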
There are too many videos focused on explaining theory and equations, which are easy to research since there is plenty of theoretical material to help work through the math behind the equations. However, there are too few good-quality videos focused on practice and coding. Yours is an outstanding one!
Why do we reset the reward back to 0 every 1000 episodes?
I think it's because the rewards per thousand episodes are being plotted, and that figure should increase until it converges; after a point the increase should be very small, which signifies convergence. If we had not reset the reward, the new learning from the 1001st episode onward would just be added on top of the previously accumulated reward score, and the convergence (i.e., the leveling off of the per-thousand-episode reward) would not be visible.
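A minimal sketch of the bookkeeping being described (the variable names and dummy data here are assumptions, not necessarily the video's):

    import numpy as np

    # Placeholder for the list of per-episode rewards gathered during training.
    num_episodes = 10000
    rewards_all_episodes = list(np.random.rand(num_episodes))  # dummy data

    # Average each 1000-episode block separately. "Resetting" the running
    # total every thousand episodes means each block's average reflects only
    # that block, so the printout climbs while the agent is still learning
    # and flattens once the Q-values have converged.
    blocks = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
    for i, block in enumerate(blocks):
        print(f"Episodes {i * 1000 + 1}-{(i + 1) * 1000}: avg reward = {block.mean():.3f}")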
It is not useful if you just drop a formula out of nowhere.
The quality of your channel and its view count don't match :/
Your videos are really good, keep it up!