I think here, d = num_heads * head_dim; it doesn't actually split the dimension of a single head, but rather distributes the heads across different GPUs. The computation between the two all-to-all operations is essentially the same as in tensor parallelism, and outside of the all-to-all it is equivalent to sequence-level data parallelism (DP). Regarding the use of sequence parallelism (SP), I have a question as well: DS-SP is used in mega-ds, but it seems to be incompatible with PP (pipeline parallelism) and EP (expert parallelism), working only with TP? Additionally, Megatron's native SP doesn't seem to be well validated in mega-ds.
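To make the shape bookkeeping concrete, here is a rough sketch of the regrouping that the first all-to-all performs (numpy splits stand in for the real all-to-all, and the sizes are only illustrative):

```python
# Shape-level sketch of the regrouping done by the first all-to-all in
# DS-Ulysses: before it, each rank holds N/P tokens with the full hidden
# size d = num_heads * head_dim; after it, each rank holds all N tokens
# for num_heads/P whole heads. numpy splits stand in for the real
# all-to-all; the sizes below are only illustrative.
import numpy as np

N, P, num_heads, head_dim = 8, 2, 4, 16
d = num_heads * head_dim

x = np.arange(N * d, dtype=np.float32).reshape(N, d)   # the full activation

# before: rank p owns a sequence shard, all heads
seq_shards = np.split(x, P, axis=0)                     # each [N/P, d]

# after: rank p owns all tokens, a shard of whole heads
head_shards = np.split(x.reshape(N, num_heads, head_dim), P, axis=1)  # each [N, num_heads/P, head_dim]

print(seq_shards[0].shape, "->", head_shards[0].shape)  # (4, 64) -> (8, 2, 16)
```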
---
@inkcherry I noticed now that this is mentioned in the introduction, while the Methods section, where these details should be present, only includes vague information.

So you are right: it's sequence parallelism throughout most of the run, except inside the blue rectangle in the figure, where it is head parallelism. Thank you!
---
I don't understand why the attention matrix produced by the sequence parallelism in DS-Ulysses is correct. Looking at Figure 2 from the paper:
**DS-Ulysses implementation**
There are three steps in the computation of the attention matrix.

Total comm cost: 3 all-to-all of $Nh$ elements (for $Q$, $K$ and $V$) + 1 all-to-all of $Nh$ elements (for the output).
Here's a demo in Python. Assume $K$ and $Q$ to be of shape $[2,2]$. In a serial run:
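For concreteness, a minimal sketch of such a serial run (the values of $Q$ and $K$ are arbitrary):

```python
# Serial reference run: softmax(Q K^T) for two arbitrary 2x2 matrices.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q = np.array([[1.0, 2.0],
              [3.0, 4.0]])
K = np.array([[0.5, 1.5],
              [2.5, 3.5]])

print(softmax(Q @ K.T))   # full [2, 2] attention matrix, softmax taken row-wise
```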
Now assume $P=2$, and split $K$ and $Q$ across 2 processes as $Q_0$, $Q_1$, $K_0$ and $K_1$. You can't recover the softmax above from the partial softmaxes below:
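A minimal sketch of the split case, assuming each process applies the softmax to its local $Q_p K_p^T$ block:

```python
# Same Q and K as above, now split row-wise (along the sequence) over P = 2
# "processes"; each process applies softmax to its local Q_p K_p^T block.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q = np.array([[1.0, 2.0],
              [3.0, 4.0]])
K = np.array([[0.5, 1.5],
              [2.5, 3.5]])

Q0, Q1 = Q[:1], Q[1:]          # row-wise (sequence) split of Q
K0, K1 = K[:1], K[1:]          # row-wise (sequence) split of K

print(softmax(Q0 @ K0.T), softmax(Q1 @ K1.T))   # [[1.]] [[1.]]
print(softmax(Q @ K.T))                          # not recoverable from the blocks above:
                                                 # each row's softmax needs the full row of logits
```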
So I believe that the sequence-parallel $QK^T$ in DS-Ulysses is not equivalent to a serial implementation, and that different numbers of processes give different $\mathrm{softmax}(QK^T)$ values. What am I missing?
**Alternative implementation**
I think the correct approach would be to add an "all-reduce sum" as mentioned above, or to do a regular block-distributed matrix-matrix multiplication, sketched below.
Total cost: 1 all-scatter of $Nh$ elements + $P-1$ all-to-all of $Nh$ elements.
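A minimal sketch of this alternative, simulated serially with numpy (loop indices stand in for process ranks and for the $P-1$ block-exchange steps; no real communication is performed):

```python
# Sketch of the proposed alternative: each of the P "processes" owns N/P rows
# of Q and N/P rows of K; in P-1 exchange steps it receives the other K blocks,
# builds its full [N/P, N] slice of Q K^T, and takes a row-wise softmax locally.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, d, P = 4, 2, 2
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))

Q_blocks = np.split(Q, P)      # Q_p: local [N/P, d] slice on "process" p
K_blocks = np.split(K, P)      # K_p: local [N/P, d] slice on "process" p

rows = []
for p in range(P):
    # process p fills its [N/P, N] slice block by block; the inner loop
    # stands in for the P-1 communication steps that pass K blocks around
    local = np.concatenate([Q_blocks[p] @ K_blocks[q].T for q in range(P)], axis=1)
    rows.append(softmax(local))  # row-wise softmax is exact: the full row is local

A_dist = np.concatenate(rows, axis=0)
assert np.allclose(A_dist, softmax(Q @ K.T))   # matches the serial result
```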
**Remarks about memory consumption and max sequence length**
The DS-Ulysses implementation yields a $P$-fold linear reduction of memory for the embeddings, but still requires quadratic memory in the sequence length in order to store the full attention matrix $[N,N]$ on every process, which is the big memory culprit in transformer models. This seems illogical to me: an implementation of sequence parallelism that does not parallelize the largest sequence-related tensor (the attention matrix of size $[N,N]$). Long story short, if Ulysses requires each process to be able to hold $[N,N]$ in memory, then you are better off never using Ulysses at all (or any combination of sequence and data parallelism), because doing only data parallelism will give less communication, higher samples/sec, and has the same maximum memory requirement ($[N,N]$).
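To put rough numbers on this (illustrative values only: fp16 activations, a single head of a single layer):

```python
# Back-of-the-envelope memory for a full [N, N] attention matrix in fp16,
# for a single head of a single layer; values are purely illustrative.
N = 65_536                      # sequence length
bytes_per_elem = 2              # fp16
full_attn = N * N * bytes_per_elem            # [N, N] kept on every process
print(full_attn / 2**30, "GiB")               # 8.0 GiB

P = 8
sharded_attn = (N // P) * N * bytes_per_elem  # [N/P, N] slice per process
print(sharded_attn / 2**30, "GiB")            # 1.0 GiB
```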
The alternative implementation I propose yields a $P$-fold reduction of the attention matrix and of all but one embedding. This allows for a longer sequence length and can be combined with data parallelism. $P$ is generally small, so the algorithm runs fast, and the matmul algorithm proposed can be improved to run in fewer than $P-1$ communication steps.
Does my thinking make sense?
Thank you
(cc: @bm-synth, my work alias)