remove call to `F.pad`, improved calculation of `memory_count` #10620

bm-synth · 2025-01-21T11:59:39Z

Removes one call to F.pad that applied symmetric padding when running with a non-replicate pad mode, and lets Conv3d do that padding instead, for more efficient execution.
The computation of memory_count no longer extends dimensions, which may allow torch.compile to optimise it better (?). This part is by @ic-synth.
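
As a rough illustration of why the F.pad call can be dropped, the sketch below checks that zero-padding height and width explicitly and then convolving without padding gives the same result as handing the padding to nn.Conv3d directly. The shapes, channel counts, and variable names are assumptions made for illustration and are not taken from the PR diff.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen small so the check runs quickly on CPU.
x = torch.randn(1, 4, 5, 8, 8)  # (batch, channels, frames, height, width)
height_pad, width_pad = 1, 1

# Variant A: pad explicitly with F.pad, then convolve without padding.
conv_explicit = nn.Conv3d(4, 6, kernel_size=3, padding=0)
# Variant B: let Conv3d zero-pad height and width itself (time stays unpadded).
conv_builtin = nn.Conv3d(4, 6, kernel_size=3, padding=(0, height_pad, width_pad))
conv_builtin.load_state_dict(conv_explicit.state_dict())  # same weights in both

with torch.no_grad():
    # F.pad lists pads for the last dimension first:
    # (W_left, W_right, H_top, H_bottom, T_front, T_back)
    padded = F.pad(
        x,
        (width_pad, width_pad, height_pad, height_pad, 0, 0),
        mode="constant",
        value=0,
    )
    out_explicit = conv_explicit(padded)
    out_builtin = conv_builtin(x)

print("max abs diff:", (out_explicit - out_builtin).abs().max().item())
```

Variant B avoids materialising a padded copy of the input. The tuple form of Conv3d's padding argument pads both sides of a dimension equally, so one-sided (causal) padding along the time axis still has to be handled separately.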

cc: @jamesbriggs-synth


hlky · 2025-01-22T08:25:11Z

Hi @bm-synth. Thanks for your contribution. Can you share some figures on the memory and performance improvements?

brunomaga · 2025-01-24T18:01:23Z

Hi @hlky.

I ran the following test_autoencoder.py:

import time
import torch
import torch.nn as nn
import torch.nn.functional as F

from diffusers.models.autoencoders.autoencoder_kl_cogvideox import CogVideoXCausalConv3d

torch.manual_seed(42)

def train(model: nn.Module, video_input: torch.Tensor):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    model.train()
    start_train = time.time()
    for iteration in range(100):  # Simulate 100 training iterations
        optimizer.zero_grad()
        output = model(video_input)[0]
        loss = F.mse_loss(output, output+iteration) # sum iteration to fake different grads per iteration
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
    train_time = time.time() - start_train
    print("train_time", train_time, "secs")
    return output.to("cpu")


def eval(model: nn.Module, video_input: torch.Tensor):
    model.eval()
    start_eval = time.time()
    with torch.no_grad():
        for _ in range(300):  # Simulate 300 inference iterations
            model(video_input)
            torch.cuda.synchronize()
    eval_time = time.time() - start_eval
    print("eval_time", eval_time, "secs")

Running it with input shape [1, 128, 8, 544, 960] on the main branch gives:

$ PYTHONPATH=./diffusers_main/src/ python test_autoencoder.py
input size:  0.498046875 GBs
eval_time 33.06385564804077 secs
train_time 34.33984375 secs
Max memory 22.18018913269043 GBs

Running it on this PR branch gives:

$ PYTHONPATH=./diffusers_PR/src/ python test_autoencoder.py
input size:  0.498046875 GBs
eval_time 31.588099241256714 secs
train_time 34.1251916885376 secs
Max memory 22.17398452758789 GBs

With input shape (1, 3, 300, 544, 960), the main branch gives:

$ PYTHONPATH=./diffusers_main/src/ python test_autoencoder.py
input size:  0.43773651123046875 GBs
eval_time 17.759469032287598 secs
train_time 96.50320744514465 secs
Max memory 16.353439331054688 GBs

and this PR:

$ PYTHONPATH=./diffusers_PR/src/ python test_autoencoder.py
input size:  0.43773651123046875 GBs
eval_time 16.8880774974823 secs
train_time 96.04004764556885 secs
Max memory 16.34803009033203 GBs

I'll try to test more dimensions.
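
The part of the script that prints "input size" is not shown above. As a sanity check, both reported values match the element count of the corresponding shape divided by 1024**3, that is, one byte per element. Whether the script computes it this way is an assumption, and the helper name below is hypothetical.

```python
import math


def size_in_gib(shape, bytes_per_element=1):
    """Size of a dense tensor with the given shape, in GiB."""
    return math.prod(shape) * bytes_per_element / 1024**3


first = size_in_gib((1, 128, 8, 544, 960))
second = size_in_gib((1, 3, 300, 544, 960))
print(first)   # 0.498046875
print(second)  # 0.43773651123046875
```

For a 2-byte dtype such as bfloat16, passing bytes_per_element=2 would double these figures.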

hlky · 2025-01-27T08:07:07Z

@bm-synth Great, thanks. Would it also be possible to verify numerical accuracy between the two versions? For a change like this we would expect a difference between 0 and 1e-6.

brunomaga · 2025-01-27T10:07:27Z

@hlky I updated the code above to fix the seed (torch.manual_seed(42)) and to save the model output tensor after 100 training iterations. I then ran the following to compare the two output_*.pt files:

if __name__=='__main__':
    output_main: torch.Tensor = torch.load("output_main.pt")
    output_PR: torch.Tensor = torch.load("output_PR.pt")
    print("mean:", output_main.mean().item(), "vs", output_PR.mean().item())
    print("std:", output_main.std().item(), "vs", output_PR.std().item())
    print("max abs diff:", (output_PR - output_main).abs().max().item())
    assert torch.allclose(output_main, output_PR)

output:

mean: -8.058547973632812e-05 vs -8.058547973632812e-05
std: 0.578125 vs 0.578125
max abs diff: 0.0

bm-synth added 3 commits January 21, 2025 11:51

rewrite memory count without implicitly using dimensions by @ic-synth

c5ce24f

replace F.pad by built-in padding in Conv3D

c586b4b

in-place sums to reduce memory allocations

272537b

bm-synth changed the title ~~Inplace sums, remove call to F.pad and better memory count~~ Inplace sums, remove call to F.pad, improved calculation of memory Jan 21, 2025

bm-synth changed the title ~~Inplace sums, remove call to F.pad, improved calculation of memory~~ Inplace sums, remove call to F.pad, improved calculation of memory_count Jan 21, 2025

bm-synth marked this pull request as ready for review January 21, 2025 12:01

Merge branch 'main' into inplace_sum_and_remove_padding_and_better_me…

153ca9a

…mory_count

bm-synth changed the title ~~Inplace sums, remove call to F.pad, improved calculation of memory_count~~ in-place sums, remove call to F.pad, improved calculation of memory_count Jan 21, 2025

bm-synth added 2 commits January 21, 2025 20:56

fixed trailing whitespace

b0c826f

file reformatted

249a8a2

in-place sums

a0c5ab2

bm-synth added 4 commits January 25, 2025 11:14

simpler in-place expressions

7747007

removed in-place sum, may affect backward propagation logic

4bd9e99

removed in-place sum, may affect backward propagation logic

2ce09f8

removed in-place sum, may affect backward propagation logic

8db178f

bm-synth changed the title ~~in-place sums, remove call to F.pad, improved calculation of memory_count~~ remove call to F.pad, improved calculation of memory_count Jan 25, 2025
