
PGQ: Combining policy gradient and Q-learning #5

Open
sotetsuk opened this issue Apr 7, 2017 · 2 comments
sotetsuk commented Apr 7, 2017

https://arxiv.org/abs/1611.01626

@sotetsuk sotetsuk self-assigned this Apr 7, 2017
sotetsuk commented Apr 7, 2017

Discussion / questions / comments

  • I didn't quite understand why it is valid, even at points far from a stationary point of the policy gradient, to construct Q̃ from the relation that holds near a stationary point and then update it with Bellman-optimality regularization.
  • Naively formulated, policy gradient methods cannot explore and the policy tends to become deterministic. It is interesting that this work can also be read as suggesting that entropy-regularized policy gradient, which encourages exploration, may in some sense be the more natural formulation.
  • I think the key point of PGQ is that Eq. 4 lets you compute a (reasonable) Q using only π and V.

Other papers to read
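A minimal sketch of that last point, assuming the entropy-regularized relation from the paper, Q̃(s,a) = α(log π(a|s) + H^π(s)) + V(s) (the exact form of Eq. 4 should be checked against the paper; the function name and shapes here are illustrative):

```python
import numpy as np

def q_from_pi_v(pi, v, alpha=0.1):
    """Recover action values for one state from the policy and state value.

    Assumed entropy-regularized relation (cf. PGQ, Eq. 4):
        Q~(s, a) = alpha * (log pi(a|s) + H^pi(s)) + V(s)

    pi    : action probabilities, shape (num_actions,)
    v     : scalar state value V(s)
    alpha : entropy-regularization temperature
    """
    log_pi = np.log(pi)
    entropy = -np.dot(pi, log_pi)          # H^pi(s)
    return alpha * (log_pi + entropy) + v

# Two sanity checks of the relation:
#  - the expectation of Q~ under pi equals V (the alpha terms cancel)
#  - softmax(Q~/alpha) recovers pi exactly
pi = np.array([0.2, 0.5, 0.3])
q = q_from_pi_v(pi, v=1.5, alpha=0.1)
expected_v = np.dot(pi, q)
recovered = np.exp(q / 0.1)
recovered /= recovered.sum()
```

These two invariants are what make the construction "reasonable": the Q̃ it produces is consistent with both the value estimate and the policy it was derived from.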

@sotetsuk sotetsuk mentioned this issue Apr 8, 2017