
Why do you discard token-to-token attention in your model? #21

Open
leoozy opened this issue Nov 26, 2022 · 3 comments


leoozy commented Nov 26, 2022

Dear authors, thank you for your work. I found that in your model you split the attention heads into two modes (head index < num_head/2 and head index >= num_head/2), and both modes use the compressed sequence. Why do you discard the original N×N attention in your model? For example, you could split it into three modes. Thank you.


OliverRensu commented Nov 27, 2022 via email


leoozy commented Nov 29, 2022

Theoretically, we can split the H heads into H modes. The key is how to choose the down-sampling rate r. For example, we have two modes and choose r = 4, 8 at stage 1. We could instead take four modes with r = 1, 2, 4, 8 (C = 64, head = 4); r = 1 corresponds to the original N×N attention. However, the memory consumption and computation cost (especially for large inputs such as 512×512 in segmentation and 1000×1000 in detection) are unacceptable. The smaller r is, the heavier the computation cost becomes. In stage 3, where the computation cost of N×N attention (N = H/16 × W/16) is affordable, we take r = 1 and keep the original attention.

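To make the cost argument in the quoted reply concrete, here is a minimal PyTorch sketch of the two-mode idea, assuming strided convolutions as the token-compression step and made-up module/variable names; it is illustrative only, not the code in this repository.

```python
import torch
import torch.nn as nn


class TwoRateAttentionSketch(nn.Module):
    """Toy two-mode attention: half of the heads use down-sampling rate r1,
    the other half use r2. Each group's attention matrix has shape
    (N, N / r^2) instead of (N, N), which is why full token-to-token
    attention (r = 1) is only affordable at later stages where N is small."""

    def __init__(self, dim=64, num_heads=4, r1=4, r2=8):
        super().__init__()
        assert num_heads % 2 == 0 and dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q = nn.Linear(dim, dim)
        # One strided conv per mode produces the compressed key/value tokens.
        self.sr1 = nn.Conv2d(dim, dim, kernel_size=r1, stride=r1)
        self.sr2 = nn.Conv2d(dim, dim, kernel_size=r2, stride=r2)
        # K and V for the first half of the heads (mode 1) and the second half (mode 2).
        self.kv1 = nn.Linear(dim, dim)
        self.kv2 = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W; H and W must be divisible by r1 and r2.
        B, N, C = x.shape
        h, hd = self.num_heads, self.head_dim

        q = self.q(x).reshape(B, N, h, hd).permute(0, 2, 1, 3)  # (B, h, N, hd)
        q1, q2 = q[:, : h // 2], q[:, h // 2:]                  # split heads into two modes

        x_2d = x.permute(0, 2, 1).reshape(B, C, H, W)

        def compressed_kv(sr, kv_proj):
            t = sr(x_2d).reshape(B, C, -1).permute(0, 2, 1)     # (B, N / r^2, C)
            k, v = kv_proj(t).chunk(2, dim=-1)                  # each (B, N / r^2, C / 2)
            k = k.reshape(B, -1, h // 2, hd).permute(0, 2, 1, 3)
            v = v.reshape(B, -1, h // 2, hd).permute(0, 2, 1, 3)
            return k, v                                         # (B, h/2, N / r^2, hd)

        def attend(q, k, v):
            # Cost per group: N x (N / r^2) attention scores, vs. N x N for r = 1.
            attn = (q @ k.transpose(-2, -1)) * self.scale
            return attn.softmax(dim=-1) @ v                     # (B, h/2, N, hd)

        out1 = attend(q1, *compressed_kv(self.sr1, self.kv1))
        out2 = attend(q2, *compressed_kv(self.sr2, self.kv2))
        out = torch.cat([out1, out2], dim=1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Stage-1-like input: 56 x 56 = 3136 tokens, C = 64.
x = torch.randn(2, 56 * 56, 64)
y = TwoRateAttentionSketch()(x, 56, 56)
print(y.shape)  # torch.Size([2, 3136, 64])
```

With r = 4 and r = 8 the per-group score matrices are 3136×196 and 3136×49; an r = 1 group at this stage would need a 3136×3136 matrix per head, which is the memory blow-up described above.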

Thank you for your rapid reply. Your work is excellent, but I still have some confusion about how to design such an architecture.

1. I found that in Eq. 1 (screenshot in the original comment), the weights W^K and W^V are different from head to head, but in the traditional ViT the weights W^K and W^V are shared among heads. Will this lead to more parameters if I want to use more modes?
2. In your architecture, you use more parameters than ViT (e.g., the conv2d operation). Am I right?

OliverRensu (Owner) commented:

  1. In ViT, W is also different for different heads, but it is implemented as one linear layer, which makes it look like a shared weight (see the sketch below). For example, W is (·, 512) for 8 heads: there are really 8 W matrices of shape (·, 64), but they are implemented by one layer.
  2. We use a similar number of parameters and a similar computation cost (a little more or a little less) compared with the previous ViT and its variants.
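To illustrate point 1, here is a small hypothetical PyTorch check (not taken from this repository) that a single nn.Linear(dim, dim) is mathematically the same as 8 separate per-head projections of shape (dim, 64): the per-head weights are just row slices of the fused weight matrix, so one layer does not mean the heads share weights, and it adds no extra parameters.

```python
import torch
import torch.nn as nn

dim, num_heads = 512, 8
head_dim = dim // num_heads          # 64

# One fused linear layer, as in a standard ViT implementation.
fused = nn.Linear(dim, dim, bias=False)

# Eight per-head projections, each (dim -> 64), taken as slices of the fused weight.
per_head = [nn.Linear(dim, head_dim, bias=False) for _ in range(num_heads)]
for h, layer in enumerate(per_head):
    with torch.no_grad():
        layer.weight.copy_(fused.weight[h * head_dim:(h + 1) * head_dim])

x = torch.randn(4, 196, dim)         # (batch, tokens, channels)

# Fused output split into heads vs. explicit per-head outputs: numerically identical.
fused_heads = fused(x).reshape(4, 196, num_heads, head_dim)
explicit_heads = torch.stack([layer(x) for layer in per_head], dim=2)
print(torch.allclose(fused_heads, explicit_heads, atol=1e-5))   # True

# Parameter count is the same either way: 512 * 512 = 8 * (64 * 512).
print(sum(p.numel() for p in fused.parameters()),
      sum(p.numel() for l in per_head for p in l.parameters()))
```

So adding more modes only changes how the existing heads are grouped and which compressed sequence each group attends to; the per-head K/V projections themselves cost no more parameters than the fused ViT projection.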
