
Why do you discard token-to-token attention in your model? #21

Open
leoozy opened this issue Nov 26, 2022 · 3 comments


leoozy commented Nov 26, 2022

Dear authors, thank you for your work. I found that in your model you split the attention heads into two modes (head index < num_head/2 and head index >= num_head/2), and both modes use the compressed sequence. Why do you discard the original N×N attention in your model? For example, you could split it into three modes. Thank you.


OliverRensu commented Nov 27, 2022 via email


leoozy commented Nov 29, 2022

Theoretically, we can split the H heads into H modes. The key is how to choose the down-sampling rate r. For example, we have two modes and choose r = 4, 8 at stage 1. We could instead take four modes with r = 1, 2, 4, 8 (C = 64, head = 4); r = 1 corresponds to the original N×N attention. However, the memory consumption and computation cost (especially for large inputs such as 512×512 in segmentation and 1000×1000 in detection) are unacceptable. The smaller r is, the heavier the computation cost becomes. In stage 3, where the computation cost of N×N attention (N = H/16 × W/16) is affordable, we take r = 1 and keep the original attention.

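To make the cost argument in the quoted reply concrete, here is a minimal PyTorch sketch of the two-mode idea, assuming strided convolutions as the token-compression step and made-up module/variable names; it is illustrative only, not the code in this repository.

```python
import torch
import torch.nn as nn


class TwoRateAttentionSketch(nn.Module):
    """Toy two-mode attention: half of the heads use down-sampling rate r1,
    the other half use r2. Each group's attention matrix has shape
    (N, N / r^2) instead of (N, N), which is why full token-to-token
    attention (r = 1) is only affordable at later stages where N is small."""

    def __init__(self, dim=64, num_heads=4, r1=4, r2=8):
        super().__init__()
        assert num_heads % 2 == 0 and dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q = nn.Linear(dim, dim)
        # One strided conv per mode produces the compressed key/value tokens.
        self.sr1 = nn.Conv2d(dim, dim, kernel_size=r1, stride=r1)
        self.sr2 = nn.Conv2d(dim, dim, kernel_size=r2, stride=r2)
        # K and V for the first half of the heads (mode 1) and the second half (mode 2).
        self.kv1 = nn.Linear(dim, dim)
        self.kv2 = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W; H and W must be divisible by r1 and r2.
        B, N, C = x.shape
        h, hd = self.num_heads, self.head_dim

        q = self.q(x).reshape(B, N, h, hd).permute(0, 2, 1, 3)  # (B, h, N, hd)
        q1, q2 = q[:, : h // 2], q[:, h // 2:]                  # split heads into two modes

        x_2d = x.permute(0, 2, 1).reshape(B, C, H, W)

        def compressed_kv(sr, kv_proj):
            t = sr(x_2d).reshape(B, C, -1).permute(0, 2, 1)     # (B, N / r^2, C)
            k, v = kv_proj(t).chunk(2, dim=-1)                  # each (B, N / r^2, C / 2)
            k = k.reshape(B, -1, h // 2, hd).permute(0, 2, 1, 3)
            v = v.reshape(B, -1, h // 2, hd).permute(0, 2, 1, 3)
            return k, v                                         # (B, h/2, N / r^2, hd)

        def attend(q, k, v):
            # Cost per group: N x (N / r^2) attention scores, vs. N x N for r = 1.
            attn = (q @ k.transpose(-2, -1)) * self.scale
            return attn.softmax(dim=-1) @ v                     # (B, h/2, N, hd)

        out1 = attend(q1, *compressed_kv(self.sr1, self.kv1))
        out2 = attend(q2, *compressed_kv(self.sr2, self.kv2))
        out = torch.cat([out1, out2], dim=1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Stage-1-like input: 56 x 56 = 3136 tokens, C = 64.
x = torch.randn(2, 56 * 56, 64)
y = TwoRateAttentionSketch()(x, 56, 56)
print(y.shape)  # torch.Size([2, 3136, 64])
```

With r = 4 and r = 8 the per-group score matrices are 3136×196 and 3136×49; an r = 1 group at this stage would need a 3136×3136 matrix per head, which is the memory blow-up described above.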

Thank you for your rapid reply. Your work is excellent, but I still have some confusion about how to design such an architecture.

1. I found that in Eq. 1 (screenshot in the original comment), the weights W^K and W^V are different from head to head, but in the traditional ViT the weights W^K and W^V are shared among heads. Will this lead to more parameters if I want to use more modes?
2. In your architecture, you use more parameters than ViT (e.g., the conv2d operation). Am I right?

OliverRensu (Owner) commented:

  1. In ViT, W is also different for different heads, but it is implemented as one linear layer, which makes it look like a shared weight (see the sketch below). For example, W is (·, 512) for 8 heads: there are really 8 W matrices of shape (·, 64), but they are implemented by one layer.
  2. We use a similar number of parameters and a similar computation cost (a little more or a little less) compared with the previous ViT and its variants.
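To illustrate point 1, here is a small hypothetical PyTorch check (not taken from this repository) that a single nn.Linear(dim, dim) is mathematically the same as 8 separate per-head projections of shape (dim, 64): the per-head weights are just row slices of the fused weight matrix, so one layer does not mean the heads share weights, and it adds no extra parameters.

```python
import torch
import torch.nn as nn

dim, num_heads = 512, 8
head_dim = dim // num_heads          # 64

# One fused linear layer, as in a standard ViT implementation.
fused = nn.Linear(dim, dim, bias=False)

# Eight per-head projections, each (dim -> 64), taken as slices of the fused weight.
per_head = [nn.Linear(dim, head_dim, bias=False) for _ in range(num_heads)]
for h, layer in enumerate(per_head):
    with torch.no_grad():
        layer.weight.copy_(fused.weight[h * head_dim:(h + 1) * head_dim])

x = torch.randn(4, 196, dim)         # (batch, tokens, channels)

# Fused output split into heads vs. explicit per-head outputs: numerically identical.
fused_heads = fused(x).reshape(4, 196, num_heads, head_dim)
explicit_heads = torch.stack([layer(x) for layer in per_head], dim=2)
print(torch.allclose(fused_heads, explicit_heads, atol=1e-5))   # True

# Parameter count is the same either way: 512 * 512 = 8 * (64 * 512).
print(sum(p.numel() for p in fused.parameters()),
      sum(p.numel() for l in per_head for p in l.parameters()))
```

So adding more modes only changes how the existing heads are grouped and which compressed sequence each group attends to; the per-head K/V projections themselves cost no more parameters than the fused ViT projection.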
