
Is it necessary to apply extra critic networks for evaluating the 'safety Q value'? #1

Open
ZhihanLee opened this issue Jul 9, 2022 · 3 comments

Comments

@ZhihanLee

Hello, Dr. Haydari. I am an undergraduate student working on safe RL, and I have also tried to implement CSAC/SAC-Lagrangian in PyTorch.
I was wondering:
① Is it necessary to apply extra critic networks for the 'safety Q value'? Does this perform better than constructing the actor loss from the cost collected in the off-policy data?
② Have you plotted the lambda training curve? I observed a monotonic curve that only rises (positive loss) or descends (negative loss). I have noticed that some papers adjust the gradient ascent with max(0, lambda).
I would appreciate it if you could help me.
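For reference on question ②, here is a minimal PyTorch sketch of the max(0, lambda) projection I mean; `qc_batch`, `cost_limit`, and the learning rate are placeholder names and values, not anything from this repository.

```python
import torch

# Hypothetical sketch of dual gradient ascent on lambda.
# `qc_batch` stands for sampled/estimated safety Q-values (or episode costs),
# `cost_limit` for the constraint threshold d; both are assumptions here.
lam = torch.ones(1, requires_grad=True)
lam_optimizer = torch.optim.Adam([lam], lr=3e-4)

def update_lambda(qc_batch: torch.Tensor, cost_limit: float) -> float:
    # Maximize lam * (E[Qc] - d) by minimizing the negated objective.
    lam_loss = -(lam * (qc_batch.detach().mean() - cost_limit))
    lam_optimizer.zero_grad()
    lam_loss.backward()
    lam_optimizer.step()
    # Projection step: max(0, lambda) keeps the multiplier non-negative,
    # so the monotone drift cannot push lambda below zero.
    with torch.no_grad():
        lam.clamp_(min=0.0)
    return lam.item()
```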

@ammarhydr
Owner

Thank you for the questions. I am also a safe-RL learner. Here are my comments on your questions.
1-) There are two constrained RL methods in general. The first is peak-constraint RL, which deals with constraints on the reward function itself; the other is average-constraint RL, which tries to minimize the cost with an extra value function while maximizing the reward. So for the average-constraint formulation, yes, it is required. I did not understand what you mean by "actor loss by the cost from off-policy data".

2-) I have not inspected the change of lambda, to be honest, but with a little modification to my code you can also inspect the lambda value. The reason for doing max(0, lambda) in the Lagrangian optimization is to keep lambda on a positive scale, but I would have to work on it to give you a proper answer. These days I am busy with other things.
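For illustration only, a rough sketch of what such an extra cost value function can look like (an additional critic trained on the per-step cost with its own Bellman target). `cost_critic`, `cost_critic_target`, and `policy` are placeholder modules with SAC-like interfaces, not code from this repository.

```python
import torch
import torch.nn.functional as F

# Average-constraint sketch: a second critic regresses the discounted cost.
def cost_critic_loss(cost_critic, cost_critic_target, policy,
                     state, action, cost, next_state, done, gamma=0.99):
    with torch.no_grad():
        next_action, _ = policy.sample(next_state)
        # Bellman target built from the cost signal instead of the reward.
        target_qc = cost + gamma * (1.0 - done) * cost_critic_target(next_state, next_action)
    qc = cost_critic(state, action)
    # The actor loss would then add lambda * cost_critic(state, new_action)
    # as a penalty term on top of the usual SAC objective.
    return F.mse_loss(qc, target_qc)
```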

@ZhihanLee
Author

Thank you so much for your reply.
Maybe "constructing the critic loss from the cost in the off-policy data" is the proper phrasing.
'The cost' is the c_i at each step (i is the constraint index).
Let me reorganize my words. The reason I say that is that we currently adopt an extra critic network to get the safety value, so the actor loss is: alpha * log(pi) - Q_critic + Q_safety, and the critic loss has two parts (Q_critic and Q_safety; both are the distance between the Q prediction and the real Q value computed from the sampled data).
However, I think Q_safety could perhaps be replaced by the cost we collected before, which means there would be no Q_safety in the actor loss. Instead, we add the safety consideration to the critic loss: the critic loss becomes the distance between the network prediction and (the real Q value minus lambda * cost), where the latter depends only on the sampled data. This is just like SAC with automatic temperature adjustment, which adjusts alpha without an extra network.
I'm new to safe RL, and I hope to receive your suggestions.
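To make the idea concrete, here is a rough sketch of what I mean, assuming the cost is folded into a single critic target with a multiplier `lam`; `critic`, `critic_target`, and `policy` are placeholder modules, and none of these names come from this repository.

```python
import torch
import torch.nn.functional as F

# Alternative sketch: no separate Q_safety network; lambda * cost is
# subtracted from the reward inside the critic target.
def combined_critic_loss(critic, critic_target, policy, alpha, lam,
                         state, action, reward, cost, next_state, done, gamma=0.99):
    with torch.no_grad():
        next_action, next_log_pi = policy.sample(next_state)
        next_q = critic_target(next_state, next_action) - alpha * next_log_pi
        # The target already carries the safety penalty, so the actor loss
        # keeps the plain SAC form alpha * log(pi) - Q.
        target_q = (reward - lam * cost) + gamma * (1.0 - done) * next_q
    return F.mse_loss(critic(state, action), target_q)
```

I think one practical difference is that lambda then gets baked into the bootstrapped targets, so when lambda changes the critic only reflects it after further training, whereas a separate Q_safety keeps the reward and cost estimates decoupled.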

@ammarhydr
Owner

ammarhydr commented Oct 11, 2022 via email
