Commit: Merge pull request #77 from FilipKolodziejczyk/master
Showing 3 changed files with 10 additions and 1 deletion.
Binary file added (+3.38 MB): ...Supervised_Vision_Transformer/Positional_Label_for_Self-Supervised_Vision_Transformer.pdf
9 changes: 9 additions & 0 deletions in 2024/2024_10_21_Positional_Label_for_Self-Supervised_Vision_Transformer/README.md
@@ -0,0 +1,9 @@
# Positional Label for Self-Supervised Vision Transformer

## Abstract
Self-attention, the central element of the ViT architecture, is permutation-invariant, so by design it does not capture the spatial arrangement of its input. Valuable information is therefore lost, which is especially harmful in computer vision tasks. A common remedy is to add positional information to the input embeddings (element-wise) or to modify the attention layers to account for it (extending the attention score with the relative distance between query and key). The authors of Positional Label for Self-Supervised Vision Transformer propose an alternative that does not explicitly add any positional information. Instead, training is extended with an auxiliary task: classifying the positions of image patches. As a result, positional information is implicitly encoded in the patch representations themselves. Both absolute and relative variants are proposed, and both are plug-and-play with vanilla ViTs. The authors show that this solution improves ViT performance. Moreover, the method can be used in self-supervised training, which further enhances the training process.
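
To make the auxiliary task concrete, here is a minimal PyTorch sketch of the absolute-position variant as described above: each patch token is asked to classify its own grid position. This is an illustrative sketch, not the authors' implementation; the class name, shapes, and loss weighting are assumptions.

```python
# Illustrative sketch of the auxiliary positional-label task (absolute variant).
# Not the authors' code: names, shapes, and the loss weight are assumptions.
import torch
import torch.nn as nn

class PatchPositionHead(nn.Module):
    """Predicts, for every patch token, which of the num_patches grid
    positions it came from."""
    def __init__(self, embed_dim: int, num_patches: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_patches)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) from a ViT encoder
        return self.classifier(patch_tokens)  # (batch, num_patches, num_patches)

# Usage with a vanilla ViT: add the auxiliary loss to the main objective.
batch, num_patches, embed_dim = 8, 196, 768  # e.g. 14x14 patches, ViT-Base width
tokens = torch.randn(batch, num_patches, embed_dim)  # stand-in for encoder output
head = PatchPositionHead(embed_dim, num_patches)
logits = head(tokens)

# The ground-truth label of patch i is simply its grid index i.
targets = torch.arange(num_patches).expand(batch, -1)
aux_loss = nn.CrossEntropyLoss()(
    logits.reshape(-1, num_patches), targets.reshape(-1)
)
# total_loss = main_loss + aux_weight * aux_loss  (aux_weight is assumed)
```

The relative variant mentioned in the abstract would instead target relative positions between patches; see the source paper for its exact formulation.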
## Source paper
[Positional Label for Self-Supervised Vision Transformer](https://dl.acm.org/doi/10.1609/aaai.v37i3.25461)