`keys` and `values` contain agent i's action, since their input is `sa_encoding`, but the `selector` uses only observations as input; I can't understand this.
I also can't understand what `s_encoding` is for, because the paper only uses `sa_encoding`, not `s_encoding`.
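For reference, here is a minimal sketch of the wiring being discussed, i.e. which encoding feeds which attention component. All module names, dimensions, and tensors below are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

# Illustrative sketch: keys/values are extracted from state-action encodings,
# while selectors (attention queries) are extracted from state-only encodings.
hidden_dim, batch, n_agents = 32, 8, 3
sa_encodings = [torch.randn(batch, hidden_dim) for _ in range(n_agents)]  # encode (o_j, a_j)
s_encodings = [torch.randn(batch, hidden_dim) for _ in range(n_agents)]   # encode o_j only

key_ext = nn.Linear(hidden_dim, hidden_dim, bias=False)
val_ext = nn.Linear(hidden_dim, hidden_dim, bias=False)
sel_ext = nn.Linear(hidden_dim, hidden_dim, bias=False)

keys = [key_ext(enc) for enc in sa_encodings]       # carry action information
values = [val_ext(enc) for enc in sa_encodings]     # carry action information
selectors = [sel_ext(enc) for enc in s_encodings]   # carry observation information only
```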
I have the same question; have you figured it out? What I also want to know is: in the PPO algorithm, when estimating the advantage function, do we only need state information and not action information, so that we could use `s_encoding` without `sa_encoding`?
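As a side note on the PPO part of the question: a PPO-style advantage estimate (e.g., GAE) is typically computed from rewards and a state-value function V(s) alone, so no action input is needed there. A generic sketch (not this repository's code; the function and argument names are mine):

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation from rewards and state values only.

    `values` has length T + 1 (it includes a bootstrap value for the final
    state); no action information is required anywhere in this computation.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last_adv = delta + gamma * lam * nonterminal * last_adv
        advantages[t] = last_adv
    return advantages
```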
I wonder how the gradient backpropagates from Q to $a_i$.
Trace from Q:

MAAC/utils/critics.py, lines 149 to 150 in 105d60e

Then trace `critic_in`:

MAAC/utils/critics.py, line 148 in 105d60e
Since `s_encoding` doesn't contain input from `other_all_values[i]`:

MAAC/utils/critics.py, lines 125 to 141 in 105d60e
`keys` and `values` don't contain agent i's action as input, and the `selector` uses only observations as input:

MAAC/utils/critics.py, lines 118 to 119 in 105d60e
So, is there a gradient from Q to the action $a_i$?
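To make the trace above concrete, here is a minimal end-to-end sketch of the data flow as I understand it. The module names, dimensions, and shapes are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

# Minimal sketch of the data flow traced above (illustrative assumptions only).
obs_dim, act_dim, hidden_dim, n_agents, batch = 10, 5, 32, 3, 8
obs = [torch.randn(batch, obs_dim) for _ in range(n_agents)]
acts = [torch.randn(batch, act_dim, requires_grad=True) for _ in range(n_agents)]

sa_encoder = nn.Linear(obs_dim + act_dim, hidden_dim)   # -> sa_encoding
s_encoder = nn.Linear(obs_dim, hidden_dim)              # -> s_encoding
key_ext = nn.Linear(hidden_dim, hidden_dim, bias=False)
val_ext = nn.Linear(hidden_dim, hidden_dim, bias=False)
sel_ext = nn.Linear(hidden_dim, hidden_dim, bias=False)
critic_head = nn.Linear(2 * hidden_dim, act_dim)        # -> all_q for agent i

sa_enc = [sa_encoder(torch.cat((o, a), dim=1)) for o, a in zip(obs, acts)]
s_enc = [s_encoder(o) for o in obs]

i = 0
others = [j for j in range(n_agents) if j != i]
# Keys/values come from the other agents' state-action encodings; the selector
# (query) comes from agent i's state-only encoding.
keys = torch.stack([key_ext(sa_enc[j]) for j in others], dim=1)
vals = torch.stack([val_ext(sa_enc[j]) for j in others], dim=1)
query = sel_ext(s_enc[i]).unsqueeze(1)

attn = torch.softmax(query @ keys.transpose(1, 2) / hidden_dim ** 0.5, dim=-1)
other_values_i = (attn @ vals).squeeze(1)               # plays the role of other_all_values[i]

critic_in = torch.cat((s_enc[i], other_values_i), dim=1)
all_q = critic_head(critic_in)
# In this sketch, backpropagating from all_q reaches acts[j] for j != i (through
# the keys/values), while acts[i] never enters the graph that produces all_q.
```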