In the line used to define the returns, we use the GAE + values as the target for the critic to learn. Is this correct?

My intuition says no -- the target we are training towards does not represent the true value function; shouldn't the target for the value of the current state be the observed reward plus the value of the next state?

Thanks!
Hi, I just saw your comment. I think it is correct to use GAE + values as the target for the critic. Roughly speaking, the GAE is shown below; GAE_t + Value_t can then be used as an estimate of the value at time t.
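(The image referenced above did not survive; for reference, the standard GAE definition from Schulman et al. is:)

$$
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t \;=\; \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l},
\qquad
\delta_t \;=\; r_t + \gamma V(s_{t+1}) - V(s_t).
$$

Note that $\hat{A}^{\mathrm{GAE}}_t + V(s_t)$ is the $\lambda$-return estimate of $V(s_t)$; in the special case $\lambda = 0$ it reduces to exactly $r_t + \gamma V(s_{t+1})$, the one-step target proposed in the question.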
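To make the relationship concrete, here is a minimal sketch of GAE and the resulting critic target (the function name `compute_gae` is illustrative, not from the repo under discussion):

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages and critic targets ("returns").

    The critic target is advantages + values, i.e. GAE_t + V(s_t),
    which is the lambda-return estimate of V(s_t).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # bootstrap value V(s_T) for the final step
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    # Critic regression target: GAE_t + V(s_t)
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `lam=0` the target collapses to the one-step target `r_t + gamma * V(s_{t+1})` from the question, and with `lam=1` it becomes the discounted Monte Carlo return, so GAE + values interpolates between the two.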