
General Post-Training with 4 RTX 4090 GPUs #33

Open
ztianlin opened this issue Jan 10, 2025 · 7 comments
Labels
enhancement New feature or request

Comments

@ztianlin

ztianlin commented Jan 10, 2025

Hello, I wonder if it is possible to do the general post-training for the diffusion WFM with 4 GeForce RTX 4090 GPUs.
My dad can't afford 8 A100 GPUs. Please show mercy to poor people!

@ztianlin ztianlin changed the title General post-training on server with 4 RTX 4090 gpus General Post-Training on Server with 4 RTX 4090 GPUs Jan 10, 2025
@ztianlin ztianlin changed the title General Post-Training on Server with 4 RTX 4090 GPUs General Post-Training with 4 RTX 4090 GPUs Jan 10, 2025
@ethanhe42
Member

Hi @ztianlin, autoregressive fine-tuning only requires 2 A100/H100 GPUs: https://github.com/NVIDIA/Cosmos/tree/main/cosmos1/models/autoregressive/nemo/post_training

@ymcki

ymcki commented Jan 11, 2025

> Hi @ztianlin, autoregressive fine-tuning only requires 2 A100/H100 GPUs: https://github.com/NVIDIA/Cosmos/tree/main/cosmos1/models/autoregressive/nemo/post_training

Two A100s is still 160 GB. Will DIGITS's 128 GB be enough?

@ztianlin
Author

> Hi @ztianlin, autoregressive fine-tuning only requires 2 A100/H100 GPUs: https://github.com/NVIDIA/Cosmos/tree/main/cosmos1/models/autoregressive/nemo/post_training

Thanks! And what about the diffusion models? I really wish one could post-train diffusion models on a 4090.

@jpenningCA

@ztianlin I'm a PM at NVIDIA for COSMOS. Can you share why post-training on a 4090 is important?

@monko9j1

@jpenningCA @ethanhe42
Do accurate benchmarks exist for VRAM usage across different models? For instance, could a setup with seven RTX 4090s work effectively? Specifically, I’d like to know how much VRAM is required to train the 7B and 14B models, both for Text2World and the upcoming Video2World model post training.

Currently, the documentation states:
https://github.com/NVIDIA/Cosmos/blob/main/cosmos1/models/diffusion/nemo/post_training/README.md

“8 NVIDIA GPUs*”

However, this doesn't provide much clarity. It seems reasonable to infer that 8 A100s or H100s were used for the 7B and 14B models, but is that level of hardware strictly necessary from a VRAM perspective? What are the minimum recommended VRAM requirements?

Additionally, the README describes that training uses NeMo Framework's data and model parallelism capabilities, specifically mentioning Fully Sharded Data Parallel (FSDP) and Tensor Parallelism. This suggests that parameters, optimizer states, and activations are distributed across all GPUs, and individual layer parameter tensors are also spread across GPUs.

Given this information:

  1. Is the requirement for 8 GPUs primarily driven by the need to distribute computational tasks (via FSDP and Tensor Parallelism), or is it mainly due to VRAM limitations per GPU?
  2. Could configurations with fewer GPUs, such as seven RTX 4090s, be viable if VRAM is sufficient, or is the parallelism tightly integrated with the current 8-GPU setup recommendations?

Understanding these details would help determine if alternative hardware configurations could work for post-training these models.
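For a first-order sense of the VRAM question, here is a back-of-the-envelope sketch of sharded training state per GPU. It assumes bf16 weights and gradients plus fp32 Adam state (master weights and two moments, i.e. 16 bytes per parameter total) sharded evenly across GPUs, FSDP/ZeRO-3 style; activations, temporary buffers, and framework overhead are ignored, and the actual NeMo configuration may differ, so treat the numbers as a lower bound rather than a benchmark:

```python
def fsdp_state_gib_per_gpu(n_params: float, n_gpus: int) -> float:
    """Rough GiB of model + optimizer state per GPU when fully sharded.

    Assumes bf16 params (2 B) + bf16 grads (2 B) + fp32 Adam state
    (master weights + two moments, 12 B) = 16 B per parameter,
    divided evenly across all GPUs. Activations are NOT included.
    """
    bytes_per_param = 2 + 2 + 12
    return n_params * bytes_per_param / n_gpus / 1024**3

# 7B model on 8 GPUs: ~13 GiB of sharded state per GPU
print(round(fsdp_state_gib_per_gpu(7e9, 8), 1))

# 14B model on 4 GPUs: ~52 GiB per GPU, already over a 4090's 24 GB
# before any activation memory is counted
print(round(fsdp_state_gib_per_gpu(14e9, 4), 1))
```

By this estimate the sharded state alone fits comfortably on 8×80 GB cards, while on 24 GB cards the activation memory of long video sequences, not just parameter state, is likely the binding constraint.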

@StarsTesla

> @ztianlin I'm a PM at NVIDIA for COSMOS. Can you share why post-training on a 4090 is important?

It's like asking why we need LLMs on an iPhone. I think most customers have a 4090, not an expensive A100/H100.

@ztianlin
Author

ztianlin commented Jan 24, 2025

> @ztianlin I'm a PM at NVIDIA for COSMOS. Can you share why post-training on a 4090 is important?

As @StarsTesla put it metaphorically, my research resources are limited. I believe that if one could easily train on a 4090, the Cosmos community would become larger and more active.

@mharrim mharrim added the enhancement New feature or request label Jan 24, 2025

7 participants