-
Notifications
You must be signed in to change notification settings - Fork 878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CogvideoX1.5 I2V training patch_size_t = 2, makes num_frame=49 cannot be divided by 2 #532
Comments
You need to set it to 81 and clone the source code of the diffusers library, as this is where the implementation of version 1.5 is located. |
81 still not work, because 81 is transformed to 21 |
No, it is not 21 in the calculation process, but 22, the first frame will be copied once. |
I install diffusers by pip install git+https://github.com/zRzRzRzRzRzRzR/diffusers.git but in my code, the calculation result is still 21, you can reproduce this bug by disney dataset, and the following scripts #!/bin/bash
# max batch-size is 2.
DEFAULT_GPUS="0,1,2,3,4,5,6,7"
CUDA_DEVICES=${2:-$DEFAULT_GPUS}
export CUDA_VISIBLE_DEVICES=$CUDA_DEVICES
NUM_PROCESSES=${1:-8}
accelerate launch \
--config_file finetune/accelerate_config_machine_single.yaml \
--multi_gpu \
--num_processes $NUM_PROCESSES \
--machine_rank 0 \
--main_process_port 29501 \
finetune/train_cogvideox_image_to_video_lora.py \
--gradient_checkpointing \
--pretrained_model_name_or_path 'models' \
--enable_tiling \
--enable_slicing \
--instance_data_root "data/disney" \
--caption_column prompt.txt \
--video_column videos.txt \
--validation_prompt "A kitchen scene unfolds with Mickey Mouse-like character on the left, startled, while a female mouse character, wearing a hat, holds a gun to her head. Above, a 'KITCHEN' sign is visible. The scene shifts to a black goat character standing alone against a plain background, initially facing away, then turning with a content smile and raised hoof near a guitar. The scene returns to the kitchen, with both mice interacting near the goat, looking surprised or curious about the situation.:::A cartoon version of Mickey Mouse is seen in a cozy kitchen setting, surrounded by hanging pots and utensils. The character initially bends to interact with a wooden crate on the floor, then pulls out a pot and places it on the stove. Mickey manipulates a ladle above the pot, stirring vigorously, before raising the ladle in a dynamic pose, indicating completion of their task. The background remains consistent, emphasizing the domestic environment, as Mickey's actions progress from preparation to execution. ::: A black and white animated kitchen scene unfolds with two characters: one hit on the back of the head with a frying pan, causing musical notes to scatter. The hit character stands upright, while the other lies on the ground, indicating a comedic interaction. The hit character then appears distressed, but soon adopts a mischievous grin, lifting a heavy basket of bread in preparation for a prank, before realizing the potential consequences and freezing in surprise." \
--validation_prompt_separator ::: \
--validation_images "data/disney/validation/train_disney.png:::data/disney/validation/val_disney.png:::data/disney/validation/val_disney_2.png" \
--num_validation_videos 1 \
--validation_epochs 10 \
--seed 42 \
--rank 128 \
--lora_alpha 128 \
--mixed_precision bf16 \
--output_dir "output_models" \
--height 480 \
--width 720 \
--fps 8 \
--max_num_frames 81 \
--skip_frames_start 0 \
--skip_frames_end 0 \
--train_batch_size 4 \
--max_train_steps 10000 \
--num_train_epochs 300 \
--checkpointing_steps 500 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-3 \
--lr_scheduler cosine_with_restarts \
--lr_warmup_steps 200 \
--lr_num_cycles 1 \
--optimizer AdamW \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--max_grad_norm 1.0 \
--allow_tf32 \
--report_to wandb |
I found that in the inference pipeline, the frame does duplicated to 22, # 5. Prepare latents
latent_frames = (num_frames - 1) // self.vae_scale_factor_temporal + 1
# For CogVideoX 1.5, the latent frames should be padded to make it divisible by patch_size_t
patch_size_t = self.transformer.config.patch_size_t
additional_frames = 0
if patch_size_t is not None and latent_frames % patch_size_t != 0:
additional_frames = patch_size_t - latent_frames % patch_size_t
num_frames += additional_frames * self.vae_scale_factor_temporal but in training process, relevant action is not taken |
@zhangch 能看一下吗 |
Is there any quick method to fix this bug, any suggestions? |
Hey :) if self.transformer.config.patch_size_t is not None:
shape = shape[:1] + (shape[1] + shape[1] % self.transformer.config.patch_size_t,) + shape[2:] there 13 frames transforms to 14. I think that for training it is need to increase the num frames. (49 -> 53 | 81 -> 85). And then for rotary_positional_embeddings decrease num_frames as |
@xilanhua12138 Setting @zRzRzRzRzRzRzR but then would hit this error: [rank5]: File "/workspace/public/users/lichang93/mydocker/cogvx/host_folder/diffusers/src/diffusers/models/transformers/cogvideox_transformer_3d.py", line 132, in forward |
|
@TheDenk
@xilanhua12138 @zRzRzRzRzRzRzR
FYI, it should based on patch_size_t |
@spacegoing yep, but patch_size_t = 2 :) |
yes that's a perfect match:D |
System Info / 系統信息
When training num_frames=49 and it will be compressed into 13 by vae, but in this code
p = self.patch_size
p_t = self.patch_size_t
Information / 问题信息
Reproduction / 复现过程
{
// 使用 IntelliSense 了解相关属性。
// 悬停以查看现有属性的描述。
// 欲了解更多信息,请访问: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Train CogVideoX-5b-I2V",
"type": "debugpy",
"request": "launch",
"program": "training/cogvideox_image_to_video_lora.py",
"console": "integratedTerminal",
"justMyCode": false,
"args": [
"--gradient_checkpointing",
"--pretrained_model_name_or_path", "/mnt2/shuiyunhao/task/CogVideo/models",
"--data_root", "data/disney",
"--caption_column", "prompt.txt",
"--video_column", "videos.txt",
"--validation_prompt", "BW_STYLE, A cartoon version of Mickey Mouse is seen in a cozy kitchen setting, surrounded by hanging pots and utensils. The character initially bends to interact with a wooden crate on the floor, then pulls out a pot and places it on the stove. Mickey manipulates a ladle above the pot, stirring vigorously, before raising the ladle in a dynamic pose, indicating completion of their task. The background remains consistent, emphasizing the domestic environment, as Mickey's actions progress from preparation to execution. ::: BW_STYLE, A black and white animated kitchen scene unfolds with two characters: one hit on the back of the head with a frying pan, causing musical notes to scatter. The hit character stands upright, while the other lies on the ground, indicating a comedic interaction. The hit character then appears distressed, but soon adopts a mischievous grin, lifting a heavy basket of bread in preparation for a prank, before realizing the potential consequences and freezing in surprise.",
"--validation_images", "data/disney/validation/val_disney.png:::data/disney/validation/val_disney_2.png",
"--validation_prompt_separator", ":::",
"--num_validation_videos", "1",
"--validation_epochs", "10",
"--seed", "42",
"--rank", "128",
"--lora_alpha", "128",
"--mixed_precision", "bf16",
"--output_dir", "output_models",
"--id_token", "BW_STYLE",
"--height_buckets", "480",
"--width_buckets", "720",
"--frame_buckets", "49",
"--max_num_frames", "49",
"--train_batch_size", "1",
"--num_train_epochs", "30",
"--checkpointing_steps", "1000",
"--gradient_accumulation_steps", "2",
"--learning_rate", "1e-3",
"--lr_scheduler", "cosine_with_restarts",
"--lr_warmup_steps", "400",
"--lr_num_cycles", "1",
"--enable_slicing",
"--enable_tiling",
"--optimizer", "AdamW",
"--beta1", "0.9",
"--beta2", "0.95",
"--max_grad_norm", "1.0",
"--allow_tf32",
"--report_to", "tensorboard"
],
}
]
}
Expected behavior / 期待表现
Should be divided by 2
The text was updated successfully, but these errors were encountered: