
How does the gradient back-propagate from Q to the action $a_i$? #26

xihuai18 commented Aug 10, 2020

I wonder how the gradient back-propagates from Q to the action $a_i$.
Trace from Q:

MAAC/utils/critics.py

Lines 149 to 150 in 105d60e

```python
all_q = self.critics[a_i](critic_in)
int_acs = actions[a_i].max(dim=1, keepdim=True)[1]
```
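
For reference, a minimal sketch of the usual pattern here, assuming discrete actions and that the Q value of the taken action is then picked out by gather (shapes and names below are made up, not the repo's exact code):

```python
import torch

# Minimal sketch (not the repo's exact code): with discrete actions, the
# critic head outputs one Q value per possible action of agent i, and the
# taken action only selects a column via argmax/gather.
batch, n_actions = 4, 5
all_q = torch.randn(batch, n_actions, requires_grad=True)                  # stand-in for self.critics[a_i](critic_in)
actions_onehot = torch.eye(n_actions)[torch.randint(n_actions, (batch,))]  # one-hot actions of agent i

int_acs = actions_onehot.max(dim=1, keepdim=True)[1]  # argmax index, as in the snippet above
q_taken = all_q.gather(1, int_acs)                    # Q value of the sampled action

# argmax/gather indexing is not differentiable with respect to the one-hot
# action itself, which is exactly the path the question is about.
print(q_taken.shape)  # torch.Size([4, 1])
```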

Then trace critic_in:
critic_in = torch.cat((s_encodings[i], *other_all_values[i]), dim=1)

Since s_encoding doesn't contain input from $a_i$, I then trace other_all_values[i]:

MAAC/utils/critics.py

Lines 125 to 141 in 105d60e

```python
for curr_head_keys, curr_head_values, curr_head_selectors in zip(
        all_head_keys, all_head_values, all_head_selectors):
    # iterate over agents
    for i, a_i, selector in zip(range(len(agents)), agents, curr_head_selectors):
        keys = [k for j, k in enumerate(curr_head_keys) if j != a_i]
        values = [v for j, v in enumerate(curr_head_values) if j != a_i]
        # calculate attention across agents
        attend_logits = torch.matmul(selector.view(selector.shape[0], 1, -1),
                                     torch.stack(keys).permute(1, 2, 0))
        # scale dot-products by size of key (from Attention is All You Need)
        scaled_attend_logits = attend_logits / np.sqrt(keys[0].shape[1])
        attend_weights = F.softmax(scaled_attend_logits, dim=2)
        other_values = (torch.stack(values).permute(1, 2, 0) *
                        attend_weights).sum(dim=2)
        other_all_values[i].append(other_values)
        all_attend_logits[i].append(attend_logits)
        all_attend_probs[i].append(attend_weights)
```
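
For readability, the loop above is just scaled dot-product attention computed per head and per agent; a self-contained sketch with explicit shapes (sizes and names below are made up):

```python
import numpy as np
import torch
import torch.nn.functional as F

def attend_to_others(selector, other_keys, other_values):
    """Sketch of the per-head, per-agent attention in the loop above.

    selector:      (batch, attend_dim)         -- query from agent i's s_encoding
    other_keys:    list of (batch, attend_dim) -- one key per other agent
    other_values:  list of (batch, attend_dim) -- one value per other agent
    """
    keys = torch.stack(other_keys).permute(1, 2, 0)      # (batch, attend_dim, n_others)
    values = torch.stack(other_values).permute(1, 2, 0)  # (batch, attend_dim, n_others)

    # scaled dot-product attention (Attention Is All You Need)
    logits = torch.matmul(selector.view(selector.shape[0], 1, -1), keys)  # (batch, 1, n_others)
    logits = logits / np.sqrt(other_keys[0].shape[1])
    weights = F.softmax(logits, dim=2)                    # attention weights over the other agents

    # weighted sum of the other agents' values (x_i in the MAAC paper)
    return (values * weights).sum(dim=2)                  # (batch, attend_dim)

# toy usage with made-up sizes
batch, dim, n_others = 4, 8, 2
sel = torch.randn(batch, dim)
out = attend_to_others(sel,
                       [torch.randn(batch, dim) for _ in range(n_others)],
                       [torch.randn(batch, dim) for _ in range(n_others)])
print(out.shape)  # torch.Size([4, 8])
```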

keys and values don't contain agent i's action as input, and the selector uses only observations as input:

MAAC/utils/critics.py

Lines 118 to 119 in 105d60e

```python
all_head_selectors = [[sel_ext(enc) for i, enc in enumerate(s_encodings) if i in agents]
                      for sel_ext in self.selector_extractors]
```
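
And, as far as I can tell from critics.py, the two kinds of encodings feed the attention roughly like this (a compressed sketch with placeholder layers, not the exact modules in the repo):

```python
import torch
import torch.nn as nn

# Placeholder sketch: sa_encodings come from (obs, action) pairs, s_encodings
# from observations only. Layer names and sizes are made up.
obs_dim, act_dim, hidden = 10, 5, 32
sa_encoder = nn.Linear(obs_dim + act_dim, hidden)  # stand-in for critic_encoders[i]
s_encoder = nn.Linear(obs_dim, hidden)             # stand-in for state_encoders[i]
key_ext = nn.Linear(hidden, hidden, bias=False)    # stand-ins for the per-head extractors
val_ext = nn.Linear(hidden, hidden)
sel_ext = nn.Linear(hidden, hidden, bias=False)

obs, act = torch.randn(4, obs_dim), torch.randn(4, act_dim)
sa_enc = sa_encoder(torch.cat((obs, act), dim=1))  # carries the action -> keys and values
s_enc = s_encoder(obs)                             # observation only   -> selector (query)

key, value, selector = key_ext(sa_enc), val_ext(sa_enc), sel_ext(s_enc)
# Note: in the loop quoted above, agent i's own key/value is then excluded
# from its attention (the `if j != a_i` filters).
```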

So, is there a gradient from Q to the action $a_i$?

@DokinCui

keys and values contain agent i's action, since their input is "sa_encoding", but the selector uses only observations as input; I can't understand this.
I also can't understand the purpose of "s_encoding", because only "sa_encoding" is used in the paper, not "s_encoding".

zhl606 commented Aug 22, 2023

> keys and values contain agent i's action, since their input is "sa_encoding", but the selector uses only observations as input; I can't understand this. I also can't understand the purpose of "s_encoding", because only "sa_encoding" is used in the paper, not "s_encoding".

I have the same question; have you figured it out? What I also want to know: in the PPO algorithm, when estimating the advantage function, do we only need state information as input and not action information, so that we could use s_encoding without sa_encoding?
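
For what it's worth, this is the kind of state-only value head plus GAE I have in mind (a hypothetical sketch, not code from this repo or the MAAC paper):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: for PPO, the advantage is usually estimated with a
# state-only value function V(s) plus GAE, so an observation-only encoding
# (like s_encodings) would suffice for that head; a state-action input
# (like sa_encodings) is what a Q-function needs instead.
class ValueHead(nn.Module):
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs):  # no actions in the input
        return self.net(obs).squeeze(-1)

def gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over a single rollout (1-D tensors,
    no terminal handling)."""
    advantages, gae_t = torch.zeros_like(rewards), 0.0
    values = torch.cat([values, next_value.view(1)])
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae_t = delta + gamma * lam * gae_t
        advantages[t] = gae_t
    return advantages
```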
