I think here, d = num_heads * head_dim; it doesn't actually split the dimension of a single head, but rather distributes the heads across different GPUs. The computation between the two all-to-all operations is essentially the same as in tensor parallelism, and outside of the all-to-all it is equivalent to sequence-level data parallelism (DP). Regarding the use of sequence parallelism (SP), I have a question as well: DS-SP is used in mega-ds, but it seems to be incompatible with PP (pipeline parallelism) and EP (expert parallelism), working only with TP? Additionally, Megatron's native SP doesn't seem to be well validated in mega-ds.
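To make the shape bookkeeping concrete, here is a rough sketch of the regrouping that the first all-to-all performs (numpy splits stand in for the real all-to-all, and the sizes are only illustrative):

```python
# Shape-level sketch of the regrouping done by the first all-to-all in
# DS-Ulysses: before it, each rank holds N/P tokens with the full hidden
# size d = num_heads * head_dim; after it, each rank holds all N tokens
# for num_heads/P whole heads. numpy splits stand in for the real
# all-to-all; the sizes below are only illustrative.
import numpy as np

N, P, num_heads, head_dim = 8, 2, 4, 16
d = num_heads * head_dim

x = np.arange(N * d, dtype=np.float32).reshape(N, d)   # the full activation

# before: rank p owns a sequence shard, all heads
seq_shards = np.split(x, P, axis=0)                     # each [N/P, d]

# after: rank p owns all tokens, a shard of whole heads
head_shards = np.split(x.reshape(N, num_heads, head_dim), P, axis=1)  # each [N, num_heads/P, head_dim]

print(seq_shards[0].shape, "->", head_shards[0].shape)  # (4, 64) -> (8, 2, 16)
```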
---
@inkcherry I noticed now that this is mentioned in the introduction, while the Methods section, where these details should be present, only includes vague information.

So you are right: it's sequence parallelism throughout most of the run, except inside the blue rectangle in the figure, where it is head parallelism. Thank you!
---
I don't understand why the attention matrix produced by the sequence parallelism in DS-Ulysses is correct. Looking at Figure 2 from the paper:
**DS-Ulysses implementation**
There are three steps in the computation of the attention matrix.

Total comm cost: 3 all-to-all of $Nh$ elements (for $Q$, $K$ and $V$) + 1 all-to-all of $Nh$ elements (for the output).
Here's a demo in Python. Assume $K$ and $Q$ to be of shape $[2,2]$. In a serial run:
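For concreteness, a minimal sketch of such a serial run (the values of $Q$ and $K$ are arbitrary):

```python
# Serial reference run: softmax(Q K^T) for two arbitrary 2x2 matrices.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q = np.array([[1.0, 2.0],
              [3.0, 4.0]])
K = np.array([[0.5, 1.5],
              [2.5, 3.5]])

print(softmax(Q @ K.T))   # full [2, 2] attention matrix, softmax taken row-wise
```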
Now assume $P=2$, and split $K$ and $Q$ across 2 processes as $Q_0$, $Q_1$, $K_0$ and $K_1$. You can't recover the softmax above from the partial softmaxes below:
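A minimal sketch of the split case, assuming each process applies the softmax to its local $Q_p K_p^T$ block:

```python
# Same Q and K as above, now split row-wise (along the sequence) over P = 2
# "processes"; each process applies softmax to its local Q_p K_p^T block.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q = np.array([[1.0, 2.0],
              [3.0, 4.0]])
K = np.array([[0.5, 1.5],
              [2.5, 3.5]])

Q0, Q1 = Q[:1], Q[1:]          # row-wise (sequence) split of Q
K0, K1 = K[:1], K[1:]          # row-wise (sequence) split of K

print(softmax(Q0 @ K0.T), softmax(Q1 @ K1.T))   # [[1.]] [[1.]]
print(softmax(Q @ K.T))                          # not recoverable from the blocks above:
                                                 # each row's softmax needs the full row of logits
```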
So I believe that the sequence-parallel $QK^T$ in DS-Ulysses is not equivalent to a serial implementation, and that different numbers of processes give different $\mathrm{softmax}(QK^T)$ values. What am I missing?
**Alternative implementation**
I think the correct approach would be to add an "all-reduce sum" as mentioned above, or to do a regular block-distributed matrix-matrix multiplication, sketched below.
Total cost: 1 all-scatter of $Nh$ elements + $P-1$ all-to-all of $Nh$ elements.
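A minimal sketch of this alternative, simulated serially with numpy (loop indices stand in for process ranks and for the $P-1$ block-exchange steps; no real communication is performed):

```python
# Sketch of the proposed alternative: each of the P "processes" owns N/P rows
# of Q and N/P rows of K; in P-1 exchange steps it receives the other K blocks,
# builds its full [N/P, N] slice of Q K^T, and takes a row-wise softmax locally.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, d, P = 4, 2, 2
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))

Q_blocks = np.split(Q, P)      # Q_p: local [N/P, d] slice on "process" p
K_blocks = np.split(K, P)      # K_p: local [N/P, d] slice on "process" p

rows = []
for p in range(P):
    # process p fills its [N/P, N] slice block by block; the inner loop
    # stands in for the P-1 communication steps that pass K blocks around
    local = np.concatenate([Q_blocks[p] @ K_blocks[q].T for q in range(P)], axis=1)
    rows.append(softmax(local))  # row-wise softmax is exact: the full row is local

A_dist = np.concatenate(rows, axis=0)
assert np.allclose(A_dist, softmax(Q @ K.T))   # matches the serial result
```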
**Remarks about memory consumption and max sequence length**
The DS-Ulysses implementation yields a $P$-fold linear reduction of memory for the embeddings, but still requires quadratic memory in the sequence length in order to store the full attention matrix $[N,N]$ on every process, which is the big memory culprit in transformer models. This seems illogical to me: an implementation of sequence parallelism that does not parallelize the largest sequence-related tensor (the attention matrix of size $[N,N]$). Long story short, if Ulysses requires each process to be able to hold $[N,N]$ in memory, then you are better off never using Ulysses at all (or any combination of sequence and data parallelism), because doing only data parallelism will give less communication, higher samples/sec, and has the same maximum memory requirement ($[N,N]$).
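To put rough numbers on this (illustrative values only: fp16 activations, a single head of a single layer):

```python
# Back-of-the-envelope memory for a full [N, N] attention matrix in fp16,
# for a single head of a single layer; values are purely illustrative.
N = 65_536                      # sequence length
bytes_per_elem = 2              # fp16
full_attn = N * N * bytes_per_elem            # [N, N] kept on every process
print(full_attn / 2**30, "GiB")               # 8.0 GiB

P = 8
sharded_attn = (N // P) * N * bytes_per_elem  # [N/P, N] slice per process
print(sharded_attn / 2**30, "GiB")            # 1.0 GiB
```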
The alternative implementation I propose yields a $P$-fold reduction of the attention matrix and of all but one embedding. This allows for a longer sequence length and can be combined with data parallelism. $P$ is generally small, so the algorithm runs fast, and the matmul algorithm proposed can be improved to run in fewer than $P-1$ communication steps.
Does my thinking make sense?
Thank you
(cc: @bm-synth, my work alias)