Why do you discard token-to-token attention in your model? #21
Comments
Theoretically, we can split the H heads into H modes; the key is how to choose the down-sampling rate $r$. For example, at stage 1 we use two modes with r = 4, 8. We could instead take four modes with r = 1, 2, 4, 8 (C = 64, heads = 4), where r = 1 corresponds to the original N*N attention. However, its memory consumption and computation cost are unacceptable, especially for large inputs such as 512x512 in segmentation and 1000x1000 in detection: the smaller r is, the heavier the cost. In stage 3, where the cost of N*N attention (N = H/16 * W/16) is affordable, we take r = 1 and keep the original attention.
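For concreteness, here is a minimal sketch of the head-splitting idea described above, written in PyTorch. It is illustrative only: the class name `MultiRateAttention`, the `sr_ratios` argument, and the strided-conv compression are assumptions, not the repository's actual code. Each head group ("mode") attends over keys/values down-sampled by its own rate r, and r = 1 recovers the full N*N attention; the comments mark the score-matrix shape that makes r = 1 expensive at early stages.

```python
import torch
import torch.nn as nn


class MultiRateAttention(nn.Module):
    """Hypothetical sketch: heads are split into modes, each mode attending
    over a key/value sequence compressed by its own down-sampling rate r.
    r = 1 keeps the original N*N attention; larger r shrinks K/V to N/r^2."""

    def __init__(self, dim=64, num_heads=4, sr_ratios=(1, 2, 4, 8)):
        super().__init__()
        assert num_heads == len(sr_ratios), "one down-sampling rate per mode"
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.sr_ratios = sr_ratios

        self.q = nn.Linear(dim, dim)
        # One K/V projection per mode, acting on that mode's compressed tokens.
        self.kv = nn.ModuleList(nn.Linear(dim, 2 * self.head_dim) for _ in sr_ratios)
        # Strided conv performs the spatial reduction (identity when r == 1).
        self.sr = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=r, stride=r) if r > 1 else nn.Identity()
            for r in sr_ratios
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W tokens
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).permute(0, 2, 1, 3)

        outs = []
        for i, r in enumerate(self.sr_ratios):
            # Down-sample the token map for this mode: N -> N / r^2 tokens.
            x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
            x_ = self.sr[i](x_).reshape(B, C, -1).permute(0, 2, 1)
            k, v = self.kv[i](x_).chunk(2, dim=-1)
            # Score matrix for this mode is (B, N, N/r^2); with r = 1 it is the
            # full N*N map, which is what becomes too large at early stages.
            attn = (q[:, i] @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            outs.append(attn @ v)  # (B, N, head_dim)

        out = torch.cat(outs, dim=-1)  # concatenate the modes back to dim C
        return self.proj(out)


if __name__ == "__main__":
    B, H, W, C = 1, 32, 32, 64
    x = torch.randn(B, H * W, C)
    attn = MultiRateAttention(dim=C, num_heads=4, sr_ratios=(1, 2, 4, 8))
    print(attn(x, H, W).shape)  # torch.Size([1, 1024, 64])
```

As a rough cost check (assuming a stride-4 patch embedding at stage 1), a 512x512 input gives N = 128 * 128 = 16384 tokens, so an r = 1 mode would need a 16384 x 16384 score matrix per head, which is why full N*N attention is only kept at stage 3 where N = H/16 * W/16.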
Thank you for your rapid reply. Your work is excellent, but I still have some confusion about how to design such an architecture.
Dears, thank you for your work. I found that in your model, you split the attention into two modes (< num_head/2; > num_head/2), and both modes use the compressed sequence. Why do you discard the original N*N attention in your model? For example, you could split it into three modes. Thank you.