
Performance on other PLM #1

Open
Hannibal046 opened this issue Oct 18, 2021 · 5 comments

@Hannibal046

Hello,
Amazing work! Did you ever try other PLMs (BERT, RoBERTa, ...) as the backbone model? Or did they not perform well in your preliminary experiments? Thanks so much!

@xycforgithub
Contributor

Hi @Hannibal046,
We only tested ALBERT and RoBERTa in our preliminary experiments, and ALBERT performed better.

@Hannibal046
Author

Hi,
Should this be torch.ones_like(values)?
https://github.com/microsoft/DEKCOR-CommonsenseQA/blob/499f82f55939e66416f506b5b7a51064e09abb6e/model/layers.py#L29
Also, these two lines look quite different from the usual attention operation and from the way it is described in your paper. Could you please explain why? Is this an improved version of attentive pooling?
https://github.com/microsoft/DEKCOR-CommonsenseQA/blob/499f82f55939e66416f506b5b7a51064e09abb6e/model/layers.py#L37
https://github.com/microsoft/DEKCOR-CommonsenseQA/blob/499f82f55939e66416f506b5b7a51064e09abb6e/model/layers.py#L40
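
For reference, my understanding of standard attentive pooling over token representations $V \in \mathbb{R}^{L \times d}$ with a learned query $q \in \mathbb{R}^{d}$ (my own notation, not necessarily the exact formulation in the paper) is

$$\alpha = \mathrm{softmax}(Vq) \in \mathbb{R}^{L}, \qquad \mathrm{pooled} = \alpha^{\top} V \in \mathbb{R}^{d},$$

i.e. the softmax weights multiply the values rather than being added to them.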

@Hannibal046
Author

Hannibal046 commented Dec 8, 2021

[image attached]

I don't understand why attention_probs + values makes sense here. Could you please help me understand? Thanks so much!
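
To make my confusion concrete, here is a toy sketch of the two operations (my own code, not the repo's): adding the probabilities broadcasts over the hidden dimension and keeps the shape [bs, length, d_model], whereas the usual attentive pooling collapses the length dimension into a weighted sum.

import torch
import torch.nn.functional as F

bs, length, d_model = 2, 4, 8
values = torch.randn(bs, length, d_model)
scores = torch.randn(bs, length, 1)        # e.g. values @ query
probs = F.softmax(scores, dim=1)           # [bs, length, 1]

added = probs + values                     # broadcast add -> [bs, length, d_model]
weighted = (probs * values).sum(dim=1)     # weighted sum  -> [bs, d_model]
print(added.shape, weighted.shape)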

@Hannibal046
Author

Hi,
Could you please help check if this is the original attentive pooling used in the paper?

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMerge(nn.Module):
    """Attentive pooling: a learned query scores every token and the output
    is the softmax-weighted sum of the token representations."""

    def __init__(self, input_size):
        super().__init__()
        self.query_ = nn.Parameter(torch.Tensor(input_size, 1))
        self.query_.data.normal_(mean=0.0, std=0.02)

    def forward(self, values, mask=None):
        # values: [bs, length, d_model]
        # mask:   [bs, length], 1 for non-pad tokens (Huggingface/Transformers convention)
        bs, length, d_model = values.shape
        if mask is None:
            # assume there is no pad token
            mask = torch.ones(bs, length, device=values.device, dtype=values.dtype)
        mask = mask.unsqueeze(-1).to(values.dtype)             # [bs, length, 1]
        inverted_mask = 1.0 - mask                             # 1 at pad positions
        additive_mask = inverted_mask.masked_fill(
            inverted_mask.bool(), torch.finfo(values.dtype).min
        )

        attention_scores = values @ self.query_                # [bs, length, 1]
        attention_scores = attention_scores + additive_mask    # pad tokens get -inf
        attention_probs = F.softmax(attention_scores, dim=1)   # [bs, length, 1]
        # weighted sum over the length dimension -> [bs, d_model]
        return torch.bmm(attention_probs.permute(0, 2, 1), values).squeeze(1)
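
A quick shape check of how I would call it (my own toy example, not from the repo):

bs, length, d_model = 2, 5, 16
pooler = AttentionMerge(input_size=d_model)
values = torch.randn(bs, length, d_model)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]])  # 1 = real token, 0 = pad
pooled = pooler(values, mask)
print(pooled.shape)  # torch.Size([2, 16])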

[image attached]

@Hannibal046
Author

[image attached]
