
Performance on other PLM #1

Open
Hannibal046 opened this issue Oct 18, 2021 · 5 comments

@Hannibal046

Hello,
Amazing work! Did you ever try other PLMs (BERT, RoBERTa, ...) as the backbone model? Or did they not perform well in your preliminary experiments? Thanks so much!

@xycforgithub
Contributor

Hi @Hannibal046,
We only tested ALBERT and RoBERTa in our preliminary experiments, and ALBERT performed better.

@Hannibal046
Author

Hi,
Should this be torch.ones_like(values)?
https://github.com/microsoft/DEKCOR-CommonsenseQA/blob/499f82f55939e66416f506b5b7a51064e09abb6e/model/layers.py#L29
Also, these two lines look quite different from the usual attention operation and from the way it is described in your paper. Could you please explain why? Is this an improved version of attentive pooling?
https://github.com/microsoft/DEKCOR-CommonsenseQA/blob/499f82f55939e66416f506b5b7a51064e09abb6e/model/layers.py#L37
https://github.com/microsoft/DEKCOR-CommonsenseQA/blob/499f82f55939e66416f506b5b7a51064e09abb6e/model/layers.py#L40
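
For reference, my understanding of standard attentive pooling over token representations $V \in \mathbb{R}^{L \times d}$ with a learned query $q \in \mathbb{R}^{d}$ (my own notation, not necessarily the exact formulation in the paper) is

$$\alpha = \mathrm{softmax}(Vq) \in \mathbb{R}^{L}, \qquad \mathrm{pooled} = \alpha^{\top} V \in \mathbb{R}^{d},$$

i.e. the softmax weights multiply the values rather than being added to them.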

@Hannibal046
Author

Hannibal046 commented Dec 8, 2021

[image attached]

I don't understand why attention_probs + values makes sense here. Could you please help me understand? Thanks so much!
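
To make my confusion concrete, here is a toy sketch of the two operations (my own code, not the repo's): adding the probabilities broadcasts over the hidden dimension and keeps the shape [bs, length, d_model], whereas the usual attentive pooling collapses the length dimension into a weighted sum.

import torch
import torch.nn.functional as F

bs, length, d_model = 2, 4, 8
values = torch.randn(bs, length, d_model)
scores = torch.randn(bs, length, 1)        # e.g. values @ query
probs = F.softmax(scores, dim=1)           # [bs, length, 1]

added = probs + values                     # broadcast add -> [bs, length, d_model]
weighted = (probs * values).sum(dim=1)     # weighted sum  -> [bs, d_model]
print(added.shape, weighted.shape)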

@Hannibal046
Author

Hi,
Could you please help check if this is the original attentive pooling used in the paper?

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMerge(nn.Module):
    """Attentive pooling: a learned query scores every token and the output
    is the softmax-weighted sum of the token representations."""

    def __init__(self, input_size):
        super().__init__()
        self.query_ = nn.Parameter(torch.Tensor(input_size, 1))
        self.query_.data.normal_(mean=0.0, std=0.02)

    def forward(self, values, mask=None):
        # values: [bs, length, d_model]
        # mask:   [bs, length], 1 for non-pad tokens (Huggingface/Transformers convention)
        bs, length, d_model = values.shape
        if mask is None:
            # assume there is no pad token
            mask = torch.ones(bs, length, device=values.device, dtype=values.dtype)
        mask = mask.unsqueeze(-1).to(values.dtype)             # [bs, length, 1]
        inverted_mask = 1.0 - mask                             # 1 at pad positions
        additive_mask = inverted_mask.masked_fill(
            inverted_mask.bool(), torch.finfo(values.dtype).min
        )

        attention_scores = values @ self.query_                # [bs, length, 1]
        attention_scores = attention_scores + additive_mask    # pad tokens get -inf
        attention_probs = F.softmax(attention_scores, dim=1)   # [bs, length, 1]
        # weighted sum over the length dimension -> [bs, d_model]
        return torch.bmm(attention_probs.permute(0, 2, 1), values).squeeze(1)
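
A quick shape check of how I would call it (my own toy example, not from the repo):

bs, length, d_model = 2, 5, 16
pooler = AttentionMerge(input_size=d_model)
values = torch.randn(bs, length, d_model)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]])  # 1 = real token, 0 = pad
pooled = pooler(values, mask)
print(pooled.shape)  # torch.Size([2, 16])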

[image attached]

@Hannibal046
Author

[image attached]
