about MaxPool #10
Comments
Thanks @foralliance for the question. Your question is equivalent to Case 3 in Section 3.2 of the paper; please refer to it.
@rajatsaini0294 Another question: if an ordinary convolution whose output dimension is also G is used to replace the PW, this would not only ensure that each point and each channel has its own independent weight, but also ensure that there is interaction between the channels in each group. Have you tried such a design?
You mean, without partitioning the input into G groups, use a convolution to generate G output channels and produce G attention maps from that?
Sorry for not expressing it clearly. My idea is that everything stays exactly the same as in Figure 2; the only difference is to use an ordinary convolution whose output dimension is also G in place of the original PW. This replacement still ensures interaction between the channels within each group, i.e., it captures the cross-channel information you mention in Case 3 of Section 3.2. In addition, it brings an extra effect: each point and each channel gets its own independent weight, rather than all channels in a group sharing one weight.
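If it helps, here is a minimal sketch of the difference, assuming a PyTorch implementation and 1x1 convolutions; the variable names and the number of channels per subspace are only illustrative:

```python
import torch.nn as nn

G = 4  # hypothetical number of channels per subspace

# Original design: a pointwise conv collapses the G channels of a subspace
# into a single attention map, which is later expanded to all G channels.
pw_original = nn.Conv2d(G, 1, kernel_size=1)

# Proposed variant: an ordinary 1x1 conv keeps G output channels, so every
# channel of the subspace gets its own attention map (no Expand step needed),
# while the channels still interact through the convolution.
pw_variant = nn.Conv2d(G, G, kernel_size=1)
```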
I understand your point. We have not tried this design because it would increase the number of parameters. You can certainly try it and let us know how it works. :-)
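For a rough sense of that increase, assuming the replacement is also a 1x1 convolution and ignoring biases (illustrative numbers only):

```python
G = 4                    # hypothetical channels per subspace
params_pw = G * 1        # original PW: G channels -> 1 channel
params_variant = G * G   # ordinary 1x1 conv: G channels -> G channels
print(params_pw, params_variant)  # 4 vs. 16, i.e. a factor-of-G increase per subspace
```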
@Nandan91 @rajatsaini0294 Hi,
For each subspace, the input is HxWxG. After DW + MaxPool + PW, the intermediate attention map is HxWx1; after Softmax + Expand, the final attention map is HxWxG (see the sketch after this comment).
Because the output dimension of the PW operation is 1, the final attention map amounts to one weight shared by all channels. Why use this PW? Why is the design such that all channels share one weight?
If the PW operation were removed, i.e., the output of the MaxPool operation were treated as the final attention map, each point and each channel would have its own independent weight. Why not design it this way?
Many thanks!
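For reference, a minimal sketch of the per-subspace pipeline as described above, assuming PyTorch; the kernel sizes, padding, and how the attention map is combined with the input are assumptions, not necessarily the exact settings of the paper or this repo:

```python
import torch.nn as nn
import torch.nn.functional as F

class SubspaceAttention(nn.Module):
    """One subspace: input (N, G, H, W), output (N, G, H, W)."""
    def __init__(self, g):
        super().__init__()
        # DW: depthwise conv over the G channels of the subspace
        self.dw = nn.Conv2d(g, g, kernel_size=3, padding=1, groups=g)
        # MaxPool with stride 1 so the spatial size is preserved
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        # PW: pointwise conv collapsing the G channels into a single map
        self.pw = nn.Conv2d(g, 1, kernel_size=1)

    def forward(self, x):
        n, g, h, w = x.shape
        a = self.pw(self.pool(self.dw(x)))            # (N, 1, H, W)
        a = F.softmax(a.view(n, 1, -1), dim=-1)       # softmax over spatial positions
        a = a.view(n, 1, h, w).expand(-1, g, -1, -1)  # Expand: one map shared by all G channels
        return x * a                                  # the exact combination with the input is an assumption
```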