Question regarding RoPE implementation #10
Thanks for sharing the code of this interesting work!

I noticed an inconsistency between the paper's description of RoPE and its implementation in the code. According to the paper, the relative position should be calculated based on the temporal differences between patches. However, the code seems to use flattened 2D indices when applying RoPE.

Could you clarify the reasoning behind this discrepancy? Was this an intentional change, or might it affect the model's performance?
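To make the question concrete, here is a toy sketch of the two indexing schemes (hypothetical names and sizes, not the repository's actual code), assuming C variables of P patches each, flattened variable-major:

```python
import torch

C, P = 3, 4  # hypothetical: 3 variables, 4 patches each

# What the code appears to do: flattened 2D indices, one distinct
# position per token, so same-time tokens of different variables are
# rotated by different angles.
flat_ids = torch.arange(C * P)             # [0, 1, ..., 11]

# What the paper describes: the position depends only on the patch
# (time) index, repeated for every variable.
temporal_ids = torch.arange(P).repeat(C)   # [0, 1, 2, 3, 0, 1, 2, 3, ...]

print(flat_ids.reshape(C, P))      # rows differ across variables
print(temporal_ids.reshape(C, P))  # every variable shares the same row
```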
After our discussion, we believe it is reasonable to flatten out the one-dimensional (temporal) indices of RoPE to ensure permutation equivariance between variables. We are arranging the relevant experiments, and we'll see how that affects the performance. Please stay tuned, and thanks again for your insightful question.
Thanks for your response! I am looking forward to seeing the relevant experiments :)
Thank you again for the detailed response! Could you elaborate on how this flattened RoPE achieves permutation invariance between variables?
This is a very interesting and important question! Note that permutation equivariance between variables means that shuffling the input order of variables should not affect anything other than the output order of variables. For example, if we feed the variables in the order (A, B, C) and then in the order (B, A, C), the two sets of per-variable outputs should be identical up to the same reordering.

Further, to mark the variable position of these tokens, we do not rely on the RoPE indices; we use two scalars instead (elaborated below).
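As a concrete check of this property, here is a toy example (assuming the same variable-major flattening as above; not the repository's code). Permuting the variables permutes whole blocks of tokens, and with repeated temporal indices each token keeps its position id, so its RoPE rotation is unchanged:

```python
import torch

C, P = 3, 4
perm = torch.tensor([2, 0, 1])  # a hypothetical shuffle of the three variables

temporal_ids = torch.arange(P).repeat(C).reshape(C, P)  # paper-style: per-patch
flat_ids = torch.arange(C * P).reshape(C, P)            # flattened 2D indices

# Repeated temporal indices: every token keeps its angle after the shuffle.
assert torch.equal(temporal_ids[perm], temporal_ids)

# Flattened indices: tokens of a moved variable get new angles.
assert not torch.equal(flat_ids[perm], flat_ids)
```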
Thank you for your response! I agree that using 1D RoPE and repeating it is a more reasonable approach compared to the flattened 2D RoPE.
I have some confusion about the size of the RoPE matrix in the code implementation. First, I want to figure one thing out: should tokens of different variables that share the same patch index maintain the same rotation angle? If so, how is the angle for each token determined in the current implementation?
Hi, it's so nice to see you. I think it is right to ensure that the tokens of any variable with the same patch index should maintain the same angle. Note that RoPE here is intended to keep the temporal order only. For different variates, we use two scalars instead (see the code snippets below):
OpenLTM/layers/SelfAttention_Family.py, lines 64–68 (commit 8bfebe2)
OpenLTM/layers/Attn_Projection.py, lines 54–58 (commit 8bfebe2)
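To illustrate "temporal order only", here is a minimal rotary application in which the angle is a function of the patch index alone (a sketch with assumed shapes and helper names, not the snippets linked above):

```python
import torch

def apply_rope(x, position_ids, base=10000.0):
    """Rotate token features by angles that depend only on position_ids.

    x: [batch, seq_len, head_dim]; position_ids: [seq_len] patch indices.
    """
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = position_ids.float()[:, None] * inv_freq[None, :]  # [seq_len, head_dim/2]
    cos = torch.cos(angles).repeat_interleave(2, dim=-1)        # [seq_len, head_dim]
    sin = torch.sin(angles).repeat_interleave(2, dim=-1)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((-x2, x1), dim=-1).flatten(-2)        # pairwise (x, y) -> (-y, x)
    return x * cos + rotated * sin

# Tokens of different variables sharing a patch index get the same angle:
C, P, D = 2, 4, 8
q = torch.randn(1, C * P, D)
patch_ids = torch.arange(P).repeat(C)  # temporal indices, repeated per variable
q_rot = apply_rope(q, patch_ids)
```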
Thank you for the reply. I can understand the design of the scalars now.
Thanks for your prompt answers :) I think there is an unsolved bug in the involved part of the code snippets, since we intend to reveal the sequential order of the tokens based only on their temporal index (so the rotation angle should depend on the patch index rather than the flattened 2D index).
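If I read the thread correctly, the fix amounts to building the position indices from the patch index alone; a hedged before/after sketch with hypothetical names:

```python
import torch

n_vars, n_patches = 3, 4  # hypothetical sizes

# Current behaviour (flattened 2D indices): every token gets a distinct
# position, so a relative position i - j mixes variable and time offsets.
position_ids = torch.arange(n_vars * n_patches)

# Intended behaviour (temporal index only): the patch index repeats per
# variable, so the relative position reduces to a patch-index difference.
position_ids = torch.arange(n_patches).repeat(n_vars)
```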
Got it. Thanks for the reply.