In the line used to define the returns, we use the GAE + values as the target for the critic to learn. Is this correct?

My intuition says no -- the target we are training towards does not represent the true value function; shouldn't the target for the value of the current state be the observed reward plus the value of the next state?

Thanks!
Hi, I just saw your comment. I think it is correct to use GAE + values as the target for the critic. Roughly speaking, the GAE is shown below; GAE_t + Value_t can then be used as an estimate of the value at time t.
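(The image referenced above did not survive; for reference, the standard GAE definition from Schulman et al. is:)

$$
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t \;=\; \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l},
\qquad
\delta_t \;=\; r_t + \gamma V(s_{t+1}) - V(s_t).
$$

Note that $\hat{A}^{\mathrm{GAE}}_t + V(s_t)$ is the $\lambda$-return estimate of $V(s_t)$; in the special case $\lambda = 0$ it reduces to exactly $r_t + \gamma V(s_{t+1})$, the one-step target proposed in the question.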
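To make the relationship concrete, here is a minimal sketch of GAE and the resulting critic target (the function name `compute_gae` is illustrative, not from the repo under discussion):

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages and critic targets ("returns").

    The critic target is advantages + values, i.e. GAE_t + V(s_t),
    which is the lambda-return estimate of V(s_t).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # bootstrap value V(s_T) for the final step
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    # Critic regression target: GAE_t + V(s_t)
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `lam=0` the target collapses to the one-step target `r_t + gamma * V(s_{t+1})` from the question, and with `lam=1` it becomes the discounted Monte Carlo return, so GAE + values interpolates between the two.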