
Is the priority computation correct in prioritized DQN #205

Open

ethanluoyc opened this issue Feb 22, 2022 · 4 comments


@ethanluoyc
Contributor

Hi,

Cross-posting a related issue here:
google-deepmind/reverb#91

After looking at the DQN agent implementation in Acme, I am not sure whether the prioritized variant matches the original PER paper.

Specifically, in the original PER, the priority for newly inserted transitions is the maximum priority seen so far. In Acme, however, the priority is uniformly set to 1.0. This deviates from the original PER: newly inserted transitions will not be sampled promptly if 1.0 is small relative to the typical TD errors seen during training.
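For reference, here is a minimal sketch of the PER-style insertion rule from the paper, using a toy buffer (the class and its API are illustrative only, not Acme's or Reverb's):

```python
import numpy as np

class PrioritizedBuffer:
    """Toy PER buffer illustrating max-priority initialization (hypothetical)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.storage = []
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.max_priority = 1.0  # Starting value before any TD errors are observed.
        self.pos = 0

    def add(self, transition):
        # New transitions get the max priority seen so far, so they are
        # guaranteed to be sampled soon after insertion (Schaul et al., 2015).
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition
        self.priorities[self.pos] = self.max_priority
        self.pos = (self.pos + 1) % self.capacity

    def update_priorities(self, indices, td_errors, eps: float = 1e-6):
        # After a learning step, priorities become |TD error| + eps, and the
        # running max is updated so future insertions track it.
        new_priorities = np.abs(np.asarray(td_errors)) + eps
        self.priorities[indices] = new_priorities
        self.max_priority = max(self.max_priority, float(new_priorities.max()))
```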

I looked into whether it's possible to get the maximum priority in Reverb but haven't found a nice way to do so.

@nikolamomchev
Collaborator

Hi @ethanluoyc ,

Indeed, it seems that the TD error in Acme is NOT clipped. I am not sure whether this causes any performance issues; do you have an indication that it does?

I believe @nino-vieillard recently managed to successfully reproduce DQN results; maybe he can shed some light on whether this matters.

FWIW, it seems that it would be easy to clip the td_error for the priority update. Would this help?
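For concreteness, the clipping could look something like the sketch below (the function name and the 1.0 ceiling are assumptions, not the actual Acme code):

```python
import jax.numpy as jnp

def priorities_from_td_errors(td_errors: jnp.ndarray, ceiling: float = 1.0) -> jnp.ndarray:
    # Clip |TD error| to a fixed ceiling so that transitions inserted with
    # priority 1.0 are never dwarfed by very large training-time TD errors.
    return jnp.clip(jnp.abs(td_errors), 0.0, ceiling)
```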

@ethanluoyc
Contributor Author

I have not had the chance to investigate this.

Actually, I came across this problem in a different context, where I want to implement prioritized sampling in a continuous-control agent. I would like newly inserted experience to be sampled immediately after insertion. This can be implemented if the adder knows the current max priority, which seems tricky to do in Acme right now; see the sketch below.
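To illustrate what I mean, here is a hypothetical wrapper that tracks the running max priority client-side, since Reverb does not expose it (none of these names exist in Acme):

```python
class MaxPriorityAdder:
    """Hypothetical adder wrapper: tracks the running max priority so that
    fresh experience can be inserted at the top of the distribution."""

    def __init__(self, adder):
        self._adder = adder
        self._max_priority = 1.0  # Starting value before any priority updates.

    def observe_priorities(self, priorities):
        # Would need to be called with every batch of freshly computed priorities.
        self._max_priority = max(self._max_priority,
                                 max(float(p) for p in priorities))

    def add(self, *args, **kwargs):
        # Insert new experience at the current max so it is sampled promptly.
        self._adder.add(*args, priority=self._max_priority, **kwargs)
```

The catch is that priorities are computed by the learner while insertion happens in the actors, so in a distributed setup the running max would have to be shuttled between processes, which is exactly the awkward part.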

@nikolamomchev
Collaborator

I'm approaching this naively and without much knowledge of what you are trying to do, but could you squash the priority into some fixed range (e.g. (0, 1]) and insert fresh experience with a priority of 1?
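Something like this, e.g. squashing with tanh (purely illustrative; the scale would need tuning):

```python
import numpy as np

def squashed_priority(td_error: float, scale: float = 1.0, eps: float = 1e-6) -> float:
    # Squash |TD error| into (0, 1), so fresh experience inserted with
    # priority 1.0 always sits strictly above every replayed transition.
    return float(np.tanh((np.abs(td_error) + eps) / scale))
```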

@ethanluoyc
Contributor Author

Hmm, I guess squashing would cause issues by changing the relative importance of transitions. My solution for now is to just not use Reverb for this case :D
In case you are interested, I would like to replicate this paper with Acme: https://arxiv.org/abs/2107.00591. It uses a prioritized-replay setup that is currently difficult to reproduce.
