
Is it necessary to apply extra critic networks for evaluating the 'safety Q value'? #1

Open
ZhihanLee opened this issue Jul 9, 2022 · 3 comments

Comments

@ZhihanLee

Hello, Dr. Haydari. I am an undergraduate student working on safe RL, and I have also tried to implement CSAC/SAC-Lagrangian in PyTorch.
I was wondering:
① Is it necessary to apply extra critic networks for the 'safety Q value'? Does this perform better than constructing the actor loss from the cost collected in the off-policy data?
② Have you plotted the lambda training curve? I observed a monotonic curve that only rises (positive loss) or descends (negative loss). I have noticed that some papers adjust the gradient ascent with max(0, lambda).
I would appreciate it if you could help me.
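For reference on question ②, here is a minimal PyTorch sketch of the max(0, lambda) projection I mean; `qc_batch`, `cost_limit`, and the learning rate are placeholder names and values, not anything from this repository.

```python
import torch

# Hypothetical sketch of dual gradient ascent on lambda.
# `qc_batch` stands for sampled/estimated safety Q-values (or episode costs),
# `cost_limit` for the constraint threshold d; both are assumptions here.
lam = torch.ones(1, requires_grad=True)
lam_optimizer = torch.optim.Adam([lam], lr=3e-4)

def update_lambda(qc_batch: torch.Tensor, cost_limit: float) -> float:
    # Maximize lam * (E[Qc] - d) by minimizing the negated objective.
    lam_loss = -(lam * (qc_batch.detach().mean() - cost_limit))
    lam_optimizer.zero_grad()
    lam_loss.backward()
    lam_optimizer.step()
    # Projection step: max(0, lambda) keeps the multiplier non-negative,
    # so the monotone drift cannot push lambda below zero.
    with torch.no_grad():
        lam.clamp_(min=0.0)
    return lam.item()
```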

@ammarhydr
Owner

Thank you for the questions. I am also a safe-RL learner. Here are my comments on your questions.
1-) There are two constrained RL methods in general. The first is peak-constraint RL, which deals with constraints on the reward function itself; the other is average-constraint RL, which tries to minimize the cost with an extra value function while maximizing the reward. So for the average-constraint formulation, yes, it is required. I did not understand what you mean by "actor loss by the cost from off-policy data".

2-) I have not inspected the change of lambda, to be honest, but with a little modification to my code you can also inspect the lambda value. The reason for doing max(0, lambda) in the Lagrangian optimization is to keep lambda on a positive scale, but I would have to work on it to give you a proper answer. These days I am busy with other things.
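For illustration only, a rough sketch of what such an extra cost value function can look like (an additional critic trained on the per-step cost with its own Bellman target). `cost_critic`, `cost_critic_target`, and `policy` are placeholder modules with SAC-like interfaces, not code from this repository.

```python
import torch
import torch.nn.functional as F

# Average-constraint sketch: a second critic regresses the discounted cost.
def cost_critic_loss(cost_critic, cost_critic_target, policy,
                     state, action, cost, next_state, done, gamma=0.99):
    with torch.no_grad():
        next_action, _ = policy.sample(next_state)
        # Bellman target built from the cost signal instead of the reward.
        target_qc = cost + gamma * (1.0 - done) * cost_critic_target(next_state, next_action)
    qc = cost_critic(state, action)
    # The actor loss would then add lambda * cost_critic(state, new_action)
    # as a penalty term on top of the usual SAC objective.
    return F.mse_loss(qc, target_qc)
```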

@ZhihanLee
Author

Thank you so much for your reply.
Maybe "constructing the critic loss from the cost in the off-policy data" is the proper phrasing.
'The cost' is the c_i at each step (i is the constraint index).
Let me reorganize my words. The reason I say that is that we currently adopt an extra critic network to get the safety value, so the actor loss is: alpha * log(pi) - Q_critic + Q_safety, and the critic loss has two parts (Q_critic and Q_safety; both are the distance between the Q prediction and the real Q value computed from the sampled data).
However, I think Q_safety could perhaps be replaced by the cost we collected before, which means there would be no Q_safety in the actor loss. Instead, we add the safety consideration to the critic loss: the critic loss becomes the distance between the network prediction and (the real Q value minus lambda * cost), where the latter depends only on the sampled data. This is just like SAC with automatic temperature adjustment, which adjusts alpha without an extra network.
I'm new to safe RL, and I hope to receive your suggestions.
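To make the idea concrete, here is a rough sketch of what I mean, assuming the cost is folded into a single critic target with a multiplier `lam`; `critic`, `critic_target`, and `policy` are placeholder modules, and none of these names come from this repository.

```python
import torch
import torch.nn.functional as F

# Alternative sketch: no separate Q_safety network; lambda * cost is
# subtracted from the reward inside the critic target.
def combined_critic_loss(critic, critic_target, policy, alpha, lam,
                         state, action, reward, cost, next_state, done, gamma=0.99):
    with torch.no_grad():
        next_action, next_log_pi = policy.sample(next_state)
        next_q = critic_target(next_state, next_action) - alpha * next_log_pi
        # The target already carries the safety penalty, so the actor loss
        # keeps the plain SAC form alpha * log(pi) - Q.
        target_q = (reward - lam * cost) + gamma * (1.0 - done) * next_q
    return F.mse_loss(critic(state, action), target_q)
```

I think one practical difference is that lambda then gets baked into the bootstrapped targets, so when lambda changes the critic only reflects it after further training, whereas a separate Q_safety keeps the reward and cost estimates decoupled.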

@ammarhydr
Owner

ammarhydr commented Oct 11, 2022 via email
