FIRE Relative Positional Encodings #2325
Hi @kaddu341, thank you for bringing up this new positional embedding that extrapolates well to longer sequence lengths. Looks pretty exciting! We would love to have it in torchtune so that more users can benefit from it. Would you be open to drafting an RFC about the design of FIRE so people can review and comment? It would be helpful to include context, motivation, and details on the modules/files you plan to add or change. You can find example RFCs here: #102, #2105.
Hi,
I'm currently working on the length generalization capabilities of transformers. As shown by Zhou et al. (https://arxiv.org/abs/2402.09371), FIRE positional encodings are excellent for this purpose: in combination with other techniques, they can yield generalization to sequences up to 2.5x the training input length.
FIRE, which stands for Functional Interpolation for Relative Positional Encodings, was introduced by Li et al. (https://arxiv.org/pdf/2310.04418).
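For context, here is the core of the method as I understand it from that paper: the bias added to the attention logit between query position $i$ and key position $j$ is

$$
b(i, j) = f_\theta\left(\frac{\psi(i - j)}{\psi(\max(i, L))}\right), \qquad \psi(x) = \log(cx + 1),
$$

where $f_\theta$ is a small MLP, and $c > 0$ and the threshold $L$ are learnable. Normalizing by $\psi(\max(i, L))$ is what lets the learned function interpolate smoothly to positions beyond the training length.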
I am planning to implement the algorithm from this paper myself, but I thought it would be useful to turn it into a PyTorch module so that others can benefit too. (I originally posted this in the PyTorch Core repo, but they suggested bringing it here.) Therefore, I am proposing to add this feature to the torchtune library.
Please let me know what you think!
There are many other positional encoding schemes (sinusoidal, RoPE, learned, etc.), but for the specific task of length generalization, FIRE seems to be the most suitable based on several recent papers (e.g., Zhou et al., linked above), which is why I am proposing this feature addition.
Like other relative attention mechanisms, FIRE introduces positional information in the attention layers rather than adding it to the input.
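To make that concrete, here is a minimal sketch of what a FIRE bias module could look like in PyTorch. This is my own illustration of the paper's formulation, not torchtune code; the class name, hyperparameters, and defaults are all placeholders.

```python
import torch
import torch.nn as nn

class FIREBias(nn.Module):
    """Illustrative sketch of a FIRE bias module (Li et al., 2024).

    Produces an additive bias of shape (num_heads, seq_len, seq_len)
    for the pre-softmax attention logits. Names and defaults here are
    placeholders, not torchtune APIs.
    """

    def __init__(self, num_heads: int, hidden_dim: int = 32,
                 init_c: float = 0.1, init_L: float = 512.0, eps: float = 1e-6):
        super().__init__()
        # f_theta: a small MLP mapping a normalized scalar distance
        # to one bias value per attention head
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_heads),
        )
        self.c = nn.Parameter(torch.tensor(init_c))  # psi(x) = log(c*x + 1)
        self.L = nn.Parameter(torch.tensor(init_L))  # learned length threshold
        self.eps = eps

    def _psi(self, x: torch.Tensor) -> torch.Tensor:
        # monotonic transform that compresses large distances
        return torch.log(torch.abs(self.c) * x + 1)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len, dtype=torch.float32, device=self.c.device)
        # causal relative distance i - j, clamped to >= 0; entries with
        # j > i would be removed by the causal mask anyway
        rel = (pos[:, None] - pos[None, :]).clamp(min=0)
        # progressive interpolation: normalize by psi(max(i, L))
        denom = self._psi(torch.maximum(pos, torch.abs(self.L)))[:, None]
        normalized = self._psi(rel) / (denom + self.eps)
        bias = self.mlp(normalized.unsqueeze(-1))  # (seq_len, seq_len, heads)
        return bias.permute(2, 0, 1)               # (heads, seq_len, seq_len)
```

The returned tensor would then be added to the scaled dot-product logits before the softmax, e.g. `scores = q @ k.transpose(-2, -1) / math.sqrt(d) + bias`, which is what makes FIRE a relative attention mechanism rather than an input embedding.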
[Screenshot: evaluation results for FIRE from the original paper (Li et al., 2024)]