
Fix Mochi Quality Issues #10033

Open · wants to merge 41 commits into base: main
Conversation

@DN6 (Collaborator) commented Nov 27, 2024

What does this PR do?

We're seeing some quality issues with Mochi due to missing upcasts and differences from how attention is handled in the original repo.

This PR:

  1. Matches the transformer implementation 1:1 so that norms are upcast and run in the same precision as the original repo.
  2. Changes the MochiAttnProcessor to match the original approach of dropping padding tokens.
  3. Runs the CFG and sampling steps in FP32 (a rough sketch follows below).
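
A rough sketch of what item 3 refers to (hypothetical and simplified, not the pipeline's exact code): the classifier-free guidance combination is done in FP32 before casting back to the working dtype.

import torch

# Hedged sketch of item 3 (not the pipeline's exact code): do the CFG combination
# (and, in the pipeline, the scheduler step) in float32, then cast back to the
# working dtype. Shapes and values here are toys.
guidance_scale = 4.5
latents_dtype = torch.bfloat16

noise_pred = torch.randn(2, 4, 8, 8, dtype=latents_dtype)  # [uncond, text] stacked on dim 0
noise_pred = noise_pred.to(torch.float32)
noise_uncond, noise_text = noise_pred.chunk(2)
noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)

# ... the scheduler step would also run on the float32 tensors here ...
noise_pred = noise_pred.to(latents_dtype)
print(noise_pred.dtype)  # torch.bfloat16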

Once this PR is merged, I'll update the docs PR (#9934) with a guide on how to exactly reproduce the results of the original repo.

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6 DN6 requested a review from a-r-r-o-w November 27, 2024 07:05
@a-r-r-o-w (Member) left a comment

Thank you for all the fixes Dhruv!

Very grateful to @YanzuoLu and @Ednaordinary for their continuous help with testing different things out - thank you!

logger = logging.get_logger(__name__) # pylint: disable=invalid-n


class MochiModulatedRMSNorm(nn.Module):
Member

Love seeing the single file format for modeling!



logger = logging.get_logger(__name__) # pylint: disable=invalid-name
logger = logging.get_logger(__name__) # pylint: disable=invalid-n
Member

Suggested change
logger = logging.get_logger(__name__) # pylint: disable=invalid-n
logger = logging.get_logger(__name__) # pylint: disable=invalid-name

Comment on lines 46 to 47
variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states.to(torch.float32) * torch.rsqrt(variance + self.eps)
Member

I think we could do one cast for the hidden states instead of two here?

Comment on lines 70 to 71
variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states.to(torch.float32) * torch.rsqrt(variance + self.eps)
Member

Same here, it's a bit neater to cast hidden_states before the following statements (not really a problem though)

Comment on lines +104 to +110
input_dtype = x.dtype

# convert back to the original dtype in case `conditioning_embedding`` is upcasted to float32 (needed for hunyuanDiT)
scale = self.linear_1(self.silu(conditioning_embedding).to(x.dtype))
x = self.norm(x, (1 + scale.unsqueeze(1).to(torch.float32)))

return x.to(input_dtype)
Member

I think this pattern is very common for some kinds of layers. Do you think we could work on a refactor in the future where we decorate the forward methods with something like @upcast_to_fp32 so that the type conversions occur outside of the forward and they look like clean mathematical input-to-output mapping? This way, it could also be disabled with a global context manager if upcasting is not required for certain models, but be enabled by default.
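
A rough sketch of the kind of decorator being suggested (hypothetical; upcast_to_fp32 is not an existing diffusers utility):

import functools

import torch

# Hypothetical sketch of the suggested decorator, not an existing diffusers utility.
# It assumes the wrapped forward takes tensors and returns a single tensor.
_UPCAST_ENABLED = True  # a global context manager could toggle this


def upcast_to_fp32(forward):
    @functools.wraps(forward)
    def wrapper(self, *args, **kwargs):
        if not _UPCAST_ENABLED:
            return forward(self, *args, **kwargs)
        # Remember the incoming dtype, run the forward in float32, cast the result back.
        input_dtype = next(a.dtype for a in args if torch.is_tensor(a))
        args = tuple(a.to(torch.float32) if torch.is_tensor(a) else a for a in args)
        kwargs = {k: v.to(torch.float32) if torch.is_tensor(v) else v for k, v in kwargs.items()}
        return forward(self, *args, **kwargs).to(input_dtype)

    return wrapper


class ToyNorm(torch.nn.Module):
    @upcast_to_fp32
    def forward(self, hidden_states):
        # The body now reads like a clean FP32 input-to-output mapping.
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        return hidden_states * torch.rsqrt(variance + 1e-6)


print(ToyNorm()(torch.randn(2, 4, dtype=torch.bfloat16)).dtype)  # torch.bfloat16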

return hidden_states, gate_msa, scale_mlp, gate_mlp


class MochiAttention(nn.Module):
Member

This is interesting to see! I did not know we wanted to move from the central Attention class as well -- only thought we would be breaking up BasicTransformerBlock. Super cool and clean!

hidden_states = hidden_states + self.norm2(attn_hidden_states) * torch.tanh(gate_msa).unsqueeze(1)
norm_hidden_states = self.norm3(hidden_states) * (1 + scale_mlp.unsqueeze(1))
hidden_states = hidden_states + self.norm2(attn_hidden_states, torch.tanh(gate_msa).unsqueeze(1))
norm_hidden_states = self.norm3(hidden_states, (1 + scale_mlp.unsqueeze(1).to(torch.float32)))
Member

I think I'm probably missing something here, but can you point me to where we handle the downcast for this upcast? Because otherwise all successive computations would be in float32, no?

Collaborator Author

Downcast here

hidden_states = hidden_states.to(hidden_states_dtype)

Comment on lines 276 to 293
for idx in range(batch_size):
    mask = attention_mask[idx][None, :]
    valid_prompt_token_indices = torch.nonzero(mask.flatten(), as_tuple=False).flatten()

    valid_encoder_query = torch.index_select(encoder_query[idx][None, :], 2, valid_prompt_token_indices)
    valid_encoder_key = torch.index_select(encoder_key[idx][None, :], 2, valid_prompt_token_indices)
    valid_encoder_value = torch.index_select(encoder_value[idx][None, :], 2, valid_prompt_token_indices)

    valid_query = torch.cat([query[idx][None, :], valid_encoder_query], dim=2)
    valid_key = torch.cat([key[idx][None, :], valid_encoder_key], dim=2)
    valid_value = torch.cat([value[idx][None, :], valid_encoder_value], dim=2)

    attn_output = F.scaled_dot_product_attention(
        valid_query, valid_key, valid_value, dropout_p=0.0, is_causal=False
    )
    valid_sequence_length = attn_output.size(2)
    attn_output = F.pad(attn_output, (0, 0, 0, total_length - valid_sequence_length))
    attn_outputs.append(attn_output)
Member

Awesome catch and fix!

I think this might not play well with data-parallel implementations due to the loop. I will profile this and we can try a different implementation in the future.

Member

For data parallelism, we're replicating the model on each worker, though.

Member

Maybe I used the wrong terminology; what I meant was parallelizing across the batch dimension.

@DN6 (Collaborator Author) Nov 29, 2024

Torch 2.5 has nested tensors as a prototype feature that allows variable sequence lengths in a batch:
https://pytorch.org/docs/stable/nested.html

We could do a version check and use that here to parallelize across the batch dimension. The docs say the API is subject to change, so I avoided using it.
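
A rough sketch of what that could look like (hedged; the nested tensor API is a prototype, and SDPA support for nested inputs may depend on the torch version, device, and backend):

import torch
import torch.nn.functional as F

# Hedged sketch, not the PR's implementation: pack per-sample sequences of different
# lengths into nested tensors and call SDPA once instead of looping over the batch.
num_heads, head_dim = 2, 8
seq_lens = [5, 3]  # valid (unpadded) token counts for two samples

query = torch.nested.nested_tensor([torch.randn(num_heads, s, head_dim) for s in seq_lens])
key = torch.nested.nested_tensor([torch.randn(num_heads, s, head_dim) for s in seq_lens])
value = torch.nested.nested_tensor([torch.randn(num_heads, s, head_dim) for s in seq_lens])

attn_output = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
print([t.shape for t in attn_output.unbind()])  # each sample keeps its own sequence length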

@jzhang38

@DN6 Hey, can you give example code to run your fixed branch? I am really curious about the quality changes with those precision fixes.

@yiyixuxu (Collaborator)

@DN6 can we get this merged now? :) The tests need to be fixed.

@yiyixuxu (Collaborator) left a comment

OK, I went over the PR, and it turns out there are a lot of refactorings (in addition to the fix). I left some comments; I think it is not ready to merge yet and there is some cleanup to be done :)

Love that we are starting to take action on having model-specific attention classes now! And +1 on @a-r-r-o-w's comment here about thinking of a better way to manage precision (can be future PRs, does not have to be done here): https://github.com/huggingface/diffusers/pull/10033/files#r1860070716

Also, do we know what actually caused the issue? All of them together?

return hidden_states, gate_msa, scale_mlp, gate_mlp


class MochiAttention(nn.Module):
Collaborator

Should this have a basic attention class (e.g. AttentionMixin) that everything inherits from, no? So that some methods are available: set_processor, get_processor, etc.

Collaborator Author

Agreed. Should I add it in this PR or in a follow-up? I don't think Mochi needs those methods since it just has a single processor.

Collaborator

A follow-up PR is OK! But let's have a rough plan for the follow-up PR before merging. We only have one attention processor for Mochi, but these methods are for people to use custom attention processors. Also, methods like attn_processors depend on this, and they are useful:

def attn_processors(self) -> Dict[str, AttentionProcessor]:

Also, let's make sure we don't have something we need that relies on this kind of logic:

if isinstance(module, Attention):

out_dim: int = None,
out_context_dim: int = None,
out_bias: bool = True,
context_pre_only: bool = False,
Collaborator

We can probably clean up the arguments a bit now, e.g. remove the ones not used by Mochi?

@DN6 (Collaborator Author) Nov 29, 2024

I think these arguments are all used in the Mochi Attention layer.

variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states.to(torch.float32) * torch.rsqrt(variance + self.eps)

if scale is not None:
Collaborator

I looked at how this layer is used; I don't think scale should be passed down here. This should just be a regular RMSNorm; the scale part should be part of another operation (e.g. a variation of AdaLayerNorm).

@DN6 (Collaborator Author) Nov 29, 2024

The way it's used is a bit different in Mochi. The outputs from the linear layer in MochiRMSNormZero (gate_msa, gate_mlp etc) are passed to the modulation after attention.

norm_hidden_states, gate_msa, scale_mlp, gate_mlp = self.norm1(hidden_states, temb)

There is also a difference in which parts of the modulation are upcast. The parts that have tanh applied to them remain in BF16, while those that don't are upcast to FP32. The hidden states are always upcast to FP32 before norming and multiplying by the modulation, and then the output is cast back to BF16.
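
A simplified sketch of the dtype pattern described above (hypothetical, not the PR's exact code):

import torch

# Simplified sketch of the dtype pattern described above, not the PR's exact code.
hidden_states = torch.randn(1, 4, 8, dtype=torch.bfloat16)
scale_msa = torch.randn(1, 8, dtype=torch.bfloat16)
gate_msa = torch.randn(1, 8, dtype=torch.bfloat16)

# Scale path: hidden states and (1 + scale) are upcast to FP32 for the norm/multiply,
# then the result is cast back to BF16.
hs = hidden_states.to(torch.float32)
hs = hs * torch.rsqrt(hs.pow(2).mean(-1, keepdim=True) + 1e-6)
modulated = (hs * (1.0 + scale_msa.unsqueeze(1).to(torch.float32))).to(torch.bfloat16)

# Gate path: tanh(gate) stays in BF16 and is applied right before the residual add.
hidden_states = hidden_states + modulated * torch.tanh(gate_msa).unsqueeze(1)
print(hidden_states.dtype)  # torch.bfloat16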

Collaborator

for this comment

The way it's used is a bit different in Mochi. The outputs from the linear layer in MochiRMSNormZero (gate_msa, gate_mlp etc) are passed to the modulation after attention.

I don't think it's different; I think this is the case for all our DiT models, maybe with a slight difference in implementation from model to model, but we always apply the "modulate" part explicitly inside the transformer blocks, and we should do the same for Mochi too. See some examples here:
flux:

norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]

sd3:
norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]

cogvideo
norm_encoder_hidden_states = norm_encoder_hidden_states * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None]

MochiModulatedRMSNorm you defined from scratch, but it is actually just MochiRMSNorm with a scale, and can be written as:

class MochiModulatedRMSNorm(nn.Module):
    def __init__(self, eps: float):
        super().__init__()
        self.norm = MochiRMSNorm(dim=None, eps=eps, elementwise_affine=False)

    def forward(self, hidden_states, scale=None):
        hidden_states = self.norm(hidden_states)
        
        if scale is not None:
            hidden_states = hidden_states * scale

        return hidden_states

So for layers that are currently using MochiModulatedRMSNorm, we should be able to easily rewrite them with RMSNorm too.

Collaborator Author

So in this example, the output of self.norm would be torch.bfloat16.

When applying modulation, the hidden state has to be in FP32. Just upcasting the output of the norm unfortunately isn't enough to reproduce the original result.

Downcasting to BF16 only happens after modulation
https://github.com/genmoai/mochi/blob/f3a800aea5862b4af13e66ff77eea1967c8c3a7f/src/genmo/mochi_preview/dit/joint_model/mod_rmsnorm.py#L15

And in the case of tanh modulation, right before adding to the residual
https://github.com/genmoai/mochi/blob/f3a800aea5862b4af13e66ff77eea1967c8c3a7f/src/genmo/mochi_preview/dit/joint_model/residual_tanh_gated_rmsnorm.py#L6

Collaborator Author

Oh, I see what you mean. Upcast the hidden states in MochiModulatedRMSNorm and the output is equivalent. Got it.

return hidden_states


class MochiRMSNorm(nn.Module):
Collaborator

This looks identical to RMSNorm, no? Did I miss something?

Collaborator Author

In our RMSNorm the hidden_states are not upcast for the entire operation:

input_dtype = hidden_states.dtype
variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.eps)

We only upcast when computing the variance. For Mochi the hidden_states are in FP32 throughout and only downcast at the end.

Changing RMSNorm to upcast throughout might affect other models, so I added a new class here.
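
For contrast, a minimal sketch of the upcast-throughout behaviour described here (simplified; the learnable weight is omitted):

import torch

def mochi_style_rms_norm(hidden_states: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Sketch of the behaviour described above (not the exact class): keep the hidden
    # states in FP32 for the whole operation and only downcast at the very end.
    input_dtype = hidden_states.dtype
    hidden_states = hidden_states.to(torch.float32)
    variance = hidden_states.pow(2).mean(-1, keepdim=True)
    hidden_states = hidden_states * torch.rsqrt(variance + eps)
    return hidden_states.to(input_dtype)

print(mochi_style_rms_norm(torch.randn(2, 3, dtype=torch.bfloat16)).dtype)  # torch.bfloat16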

Collaborator

Why can't we upcast the input instead?

@yiyixuxu (Collaborator) Nov 29, 2024

In fact, for this operation

hidden_states = hidden_states * torch.rsqrt(variance + self.eps)

if the variance is in float32 and hidden_states is in a lower precision, PyTorch will automatically upcast to float32 anyway.

A little demo script:

import torch

# Create a float16 tensor (simulating hidden_states)
hidden_states = torch.ones(2, 3, dtype=torch.float16)
print("Initial hidden_states dtype:", hidden_states.dtype)  # float16

# Calculate variance in float32 (like the code)
variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
print("Variance dtype:", variance.dtype)  # float32

# Perform the multiplication (this is the line we're testing)
eps = 1e-5
result = hidden_states * torch.rsqrt(variance + eps)
print("Result dtype:", result.dtype)  # float32

# Print all values to verify
print("\nValues:")
print("hidden_states:", hidden_states)
print("variance:", variance)
print("result:", result)
Output:

Initial hidden_states dtype: torch.float16
Variance dtype: torch.float32
Result dtype: torch.float32

Values:
hidden_states: tensor([[1., 1., 1.],
        [1., 1., 1.]], dtype=torch.float16)
variance: tensor([[1.],
        [1.]])
result: tensor([[1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000]])

Collaborator

So I really don't think we need a new class here.

Collaborator Author

Oh yes, the RMSNorm in the Attentions can be replaced.


# convert back to the original dtype in case `conditioning_embedding`` is upcasted to float32 (needed for hunyuanDiT)
scale = self.linear_1(self.silu(conditioning_embedding).to(x.dtype))
x = self.norm(x, (1 + scale.unsqueeze(1).to(torch.float32)))
Collaborator

I commented on MochiModulatedRMSNorm - we should apply scale here, instead of passing it to MochiModulatedRMSNorm

Collaborator Author

So the reason for passing this in is how upcasts are handled in Mochi. The input hidden state is always upcast, but the scaling parameter isn't always upcast, and the scaling/modulation happens between the upcast hidden state and the scale parameter.

emb = self.linear(self.silu(emb))
scale_msa, gate_msa, scale_mlp, gate_mlp = emb.chunk(4, dim=1)

hidden_states = self.norm(hidden_states, (1 + scale_msa[:, None].to(torch.float32)))
Collaborator

Same comment here; scale should be applied here.

@@ -202,7 +478,10 @@ def _get_positions(
return positions

def _create_rope(self, freqs: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
freqs = torch.einsum("nd,dhf->nhf", pos, freqs.float())
with torch.autocast(freqs.device.type, enabled=False):
Collaborator

What does this do here? Are we disabling autocast here in case it is enabled?
It's OK to keep here if this is what was causing the quality issue, but we should come up with something better :)

@DN6 (Collaborator Author) Nov 29, 2024

Hmm actually I think with all the manual casts we have in place, just setting torch_dtype=torch.bfloat16 should allow us to effectively run Mochi as if autocast is enabled. We could just remove this autocast context manager. Will test.

The reason I put this here is that under the autocast context, einsum would always return bfloat16 (which causes numerical differences in the output).
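
A small demo of the behaviour being described (assumes a CUDA device is available, where einsum is on autocast's low-precision op list):

import torch

# Demo of the behaviour described above; assumes a CUDA device is available.
pos = torch.randn(16, 3, device="cuda", dtype=torch.float32)
freqs = torch.randn(3, 4, 8, device="cuda", dtype=torch.float32)

with torch.autocast("cuda", dtype=torch.bfloat16):
    out_autocast = torch.einsum("nd,dhf->nhf", pos, freqs)
    with torch.autocast("cuda", enabled=False):
        out_fp32 = torch.einsum("nd,dhf->nhf", pos.float(), freqs.float())

print(out_autocast.dtype)  # torch.bfloat16 -- autocast runs the einsum in low precision
print(out_fp32.dtype)      # torch.float32  -- disabling autocast preserves full precision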

@DN6 DN6 closed this Nov 28, 2024
@DN6 DN6 reopened this Nov 28, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
