
Expand layerwise upcasting with optional white-list to allow Torch/GPU to perform native fp8 ops where possible #10635

Open
vladmandic opened this issue Jan 23, 2025 · 4 comments
Labels: enhancement, performance, wip

Comments

@vladmandic
Contributor

PR #10347 adds native torch fp8 as a storage dtype and performs upcasting/downcasting to the compute dtype in pre-forward/post-forward hooks as needed.

However, modern GPU architectures (starting with Hopper in 2022) actually implement many ops natively in fp8, and torch is extending the number of supported ops with each release.
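
For reference, a minimal sketch of what native fp8 compute looks like in torch today, via the private torch._scaled_mm API (its exact signature and return type have changed across releases, and it needs an Ada/Hopper-class GPU), purely as an illustration:

    import torch

    # illustration only: scaled fp8 matmul via the private torch._scaled_mm API
    # (signature/return type differ across torch releases; needs Ada/Hopper hardware)
    a = torch.randn(64, 128, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
    b = torch.randn(64, 128, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn).t()  # second operand must be column-major
    scale = torch.tensor(1.0, device="cuda")  # per-tensor scales in fp32
    out = torch._scaled_mm(a, b, scale_a=scale, scale_b=scale, out_dtype=torch.bfloat16)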

Right now we're still pretty far from being able to execute everything natively in fp8, but the request here is to allow a white-list of layers for which the upcast/downcast can be skipped, since their operations can actually be executed natively.

That list would need to be per GPU architecture, so it's unlikely to be a static one - but just being able to specify a couple of the most common layers that take most of the compute time would be very beneficial.

cc @a-r-r-o-w @sayakpaul @DN6

@a-r-r-o-w added the enhancement, wip and performance labels on Jan 23, 2025
@a-r-r-o-w
Member

Thanks for starting the discussion @vladmandic! I think we will definitely be looking into fp8 matmul support at least. It has been known to work quite well on Ada and Hopper for a while now, so there is a good signal that it will be a nice feature.

Is something like this what you're referring to?

With the current diffusers.hooks.layerwise_casting.apply_layerwise_casting implementation, it should be possible to directly specify both storage_dtype and compute_dtype as torch.float8*, since it can take any torch.nn.Module and apply the pre/post-forward casting (given that the layer it is applied to only uses supported fp8 ops). This would require digging through the modeling code yourself to find the layers to apply this to, which does seem inconvenient for end users - we can give this some thought after the current release schedule is complete.
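
For example, something along these lines should already be possible (a minimal sketch; block is just a placeholder for a hand-picked submodule believed to use only fp8-capable ops):

    import torch
    from diffusers.hooks.layerwise_casting import apply_layerwise_casting

    # placeholder for a hand-picked submodule believed to use only fp8-capable ops
    block = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)

    # storage and compute dtype are both fp8, so the pre-forward upcast effectively
    # becomes a no-op and the layer runs in fp8 (this only works if every op the
    # layer uses has an fp8 kernel on the target GPU/torch version)
    apply_layerwise_casting(
        block,
        storage_dtype=torch.float8_e4m3fn,
        compute_dtype=torch.float8_e4m3fn,
    )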

@vladmandic
Contributor Author

Is something like this what you're referring to?

Yes, but that's the long run.
In the short run, I was thinking that some layers might just work as-is in fp8 without the upcast in pre-forward.
I don't want to specify a compute dtype per layer, that's a nightmare.

Instead, something like this?
We can already control which layers should not get converted using:

    skip_modules_pattern: Union[str, Tuple[str, ...]] = "auto",
    skip_modules_classes: Optional[Tuple[Type[torch.nn.Module], ...]] = None,

The ask here is to implement something similar for layers that should get converted but not upcast during compute.
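
For illustration only, a rough sketch of how such a white-list could be matched (the parameter names native_fp8_modules_pattern / native_fp8_modules_classes are made up here, mirroring the existing skip_modules_* ones, and are not part of diffusers):

    import re
    from typing import Optional, Tuple, Type, Union

    import torch

    # hypothetical helper mirroring the existing skip_modules_* matching; decides whether
    # a submodule should keep fp8 as its compute dtype (i.e. no pre-forward upcast)
    def should_skip_upcast(
        name: str,
        module: torch.nn.Module,
        native_fp8_modules_pattern: Union[str, Tuple[str, ...]] = (),
        native_fp8_modules_classes: Optional[Tuple[Type[torch.nn.Module], ...]] = None,
    ) -> bool:
        patterns = (native_fp8_modules_pattern,) if isinstance(native_fp8_modules_pattern, str) else native_fp8_modules_pattern
        if any(re.search(pattern, name) for pattern in patterns):
            return True
        if native_fp8_modules_classes is not None and isinstance(module, native_fp8_modules_classes):
            return True
        return False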

@a-r-r-o-w
Member

a-r-r-o-w commented Jan 23, 2025

I see, and that should be more doable for this release. Do you have certain layers in mind where this would be beneficial/work with fp8 ops? I could give it a try and check the impact on quality/speed (generally, we want to be careful about the features exposed if they can have a negative impact on quality).

@vladmandic
Contributor Author

Do you have certain layers in mind where this would be beneficial/work with fp8 ops?

Not really, but I was thinking of doing some profiling with torch and simply trying it on the costliest ones.
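
Something like this, presumably, to surface the costliest layers/ops (a minimal sketch with a stand-in model; the real transformer forward would go inside the profiler context):

    import torch
    from torch.profiler import ProfilerActivity, profile

    # stand-in for the actual transformer forward pass being profiled
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
    ).to(device="cuda", dtype=torch.float16)
    x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
        model(x)

    # sort by accumulated GPU time to find the costliest ops, i.e. the first candidates
    # to try running natively in fp8 (the sort key is named device_time_total on newer torch)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))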
