Official PyTorch Implementation of SIGGRAPH Asia 2024 Paper "I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models".
I2VEdit propagates edits from a single frame to the entire video.
(Teaser video: `teaser.mp4`)
- [2024-11-24] Code and pre-trained weights are released!
- [2024-09-12] Our paper is accepted to SIGGRAPH Asia 2024 (Conference Track)!
- [2024-05-26] Our project homepage is online!
- Gradio interface
- Optimize time and resource costs
- Release code and weights
Our framework has two pipelines: Coarse Motion Extraction (training) learns motion LoRAs, and Appearance Refinement (inference) refines motion and appearance consistency via fine-grained attention matching during EDM inversion and denoising.
git clone https://github.com/Vicky0522/I2VEdit.git
cd I2VEdit
# install required packages
conda env create -f environment.yml
# install i2vedit
python setup.py develop
Our current pipeline is resource-intensive. If you wish to fine-tune motion LoRAs to customize your edits, we recommend a machine equipped with at least one A100 GPU and 300GB of disk space. If you do not have access to a GPU with large memory, you can still edit the provided videos using pre-saved motion LoRAs and other cached data, which requires approximately 20GB of GPU memory and 100GB of disk space (for videos with one chunk). We warmly welcome contributions that improve resource and time efficiency.
- Choose your preferred editing samples and click the links below to download the pre-trained weights for motion LoRAs, intermediate attention maps, inverted latents, source videos, edited first frames, and configuration files. Downloading the whole dataset is time-consuming and requires substantial storage space, so we recommend downloading only the parts you need. Examples:
# login with your huggingface token
huggingface-cli login
# download motion LoRAs
huggingface-cli download Vicky0522/I2VEdit-pretrained-videos --include "item<item id>/train_motion_lora*" --repo-type=dataset --local-dir <path to your local dir>
# download attention maps and inverted latents
huggingface-cli download Vicky0522/I2VEdit-pretrained-videos --include "item<item id>/attention_store*" --repo-type=dataset --local-dir <path to your local dir>
# download the attention maps saved for skip-interval cross-attention only if they
# are provided and you need to edit multiple chunks. For short videos, editing can
# be performed without skip-interval cross-attention to save time.
huggingface-cli download Vicky0522/I2VEdit-pretrained-videos --include "item<item id>/consistency_attention_store*" --repo-type=dataset --local-dir <path to your local dir>
# download the source video and the edited first frames
huggingface-cli download Vicky0522/I2VEdit-pretrained-videos --include "item<item id>/source_and_edits*" --repo-type=dataset --local-dir <path to your local dir>
# download configuration files
huggingface-cli download Vicky0522/I2VEdit-pretrained-videos --include "item<item id>/*.yaml" --repo-type=dataset --local-dir <path to your local dir>
# download all the necessary data for an item
huggingface-cli download Vicky0522/I2VEdit-pretrained-videos --include "item<item id>/*" --repo-type=dataset --local-dir <path to your local dir>
- Edit the configuration file and update the following parameters to their respective paths (a minimal sketch follows this list):
  - `pretrained_model_path`: Path to the pre-trained SVD model, download here.
  - `attention_matching_params.lora_dir`: Path to the motion LoRAs.
  - `attention_matching_params.load_attention_store`: Path to the attention maps and inverted latents.
  - `data_params.video_path`: Path to the source video.
  - `data_params.keyframe_paths`: Paths to the edited first frames.
  - `attention_matching_params.load_consistency_attention_store`: Path to the attention maps saved for skip-interval cross-attention, required only if `use_consistency_attention_control` is enabled.
Once updated, run:
CUDA_VISIBLE_DEVICES=<gpu id> python main.py --config <path to your config file>
Editing results will be saved under `./outputs/`.
- Find your preferred videos, for example from Pexels. Use powerful image editing tools such as EditAnything, AnyDoor, Instruct-Pix2pix, InstantStyle, InstantID, and IDM-VTON to create first-frame edits.
- Set the parameters of the configuration file. An example is provided in `./config/customize_train.yaml`. The key parameters are described in the table below, followed by a condensed config sketch:
| Parameter | Description | Note |
|---|---|---|
| `pretrained_model_path` | Path to the pre-trained SVD weights | Change this to your own path. |
| `output_dir` | Where the output folders will be created | Default: `./outputs/`. |
| `use_sarp` | Whether to use smooth area random perturbation (SARP) | If set to `true`, SARP will be enabled. |
| `use_motion_lora` | Whether to use motion LoRA | If set to `true`, motion LoRA will be enabled. |
| `retrain_motion_lora` | Whether to retrain the motion LoRA | If set to `true`, motion LoRA will be retrained and any provided pre-trained weights will be ignored. |
| `use_inversed_latents` | Whether to use inverted latents as the initial noisy latents for editing | If set to `true`, EDM inversion will be performed to obtain the initial latents; otherwise random Gaussian noise will be used. |
| `use_attention_matching` | Whether to use attention matching | If set to `true`, attention matching will be used. |
| `use_consistency_attention_control` | Whether to use skip-interval cross-attention | If set to `true`, skip-interval cross-attention will be used. This significantly increases time and disk-space costs when editing longer videos. |
| `data_params.video_path` | Path to the source video | |
| `data_params.keyframe_paths` | Paths to the edited first frames | List of paths to the edited first frames. |
| `data_params.begin_clip_id` | Clip id at which editing starts | Together with `end_clip_id`, this controls the number of generated clips. |
| `data_params.end_clip_id` | Clip id at which editing ends | Together with `begin_clip_id`, this controls the number of generated clips. |
| `data_params.pad_to_fit` | Whether to pad the source video to a 16:9 aspect ratio | Recommended to set to `true` if the source video's aspect ratio differs substantially from 16:9. Keep this value in sync with `train_motion_lora_params.train_data.pad_to_fit` and `train_motion_lora_params.validation_data.pad_to_fit`. |
| `train_motion_lora_params.lora_rank` | LoRA rank | Default: `32`. Reducing this value lowers resource usage but makes motion preservation less stable. |
| `train_motion_lora_params.train_data.height` & `width` | Training resolution | Default: `896` & `512`. Increasing these values raises resource usage and the risk of overfitting. |
| `sarp_params.sarp_noise_scale` | Noise scale for SARP | |
| `attention_matching_params.best_checkpoint_index` | Best motion LoRA checkpoint number to use | |
| `attention_matching_params.disk_store` | Whether to store the attention maps on disk | Currently this should be `true`; attention maps for one timestep take approximately 3GB. |
| `attention_matching_params.params.edit<edit id>.temporal_step_thr` | $\beta_1$ and $\beta_2$ for temporal attention injection | `[0.5, 0.8]` works for most cases; if not, try `[0.4, 0.5]`. The order of these entries should match the sequence of the edited first frames. |
- Run
CUDA_VISIBLE_DEVICES=<gpu id> python main.py --config config/customize_train_<your video>.yaml
The output folder will look like this once the run finishes:
- outputs
- i2vedit-xxx
- train_motion_lora
- clip_0
...
- attention_store
- clip_0
...
- consistency_train_attention_store (if use_consistency_attention_control is enabled)
- clip_0
- consistency_attention_store (if use_consistency_attention_control is enabled)
- clip_0
- consistency_edit_attention_store (if use_consistency_attention_control is enabled)
- clip_0
- *.pt
- *.png
- *.mp4
👇 You can save the following data and reuse it later when editing the same source video (see the config sketch after this list):
- `train_motion_lora`: Set `attention_matching_params.lora_dir` to use the motion LoRAs saved for each clip. If you want to load previously saved LoRAs, set `retrain_motion_lora` to `false`.
- `attention_store`: Set `attention_matching_params.load_attention_store` to use the attention maps saved for each timestep and the inverted latents. If this path is specified, the pre-saved attention maps and inverted latents will be loaded instead of recomputing EDM inversion.
- `consistency_attention_store`: Set `attention_matching_params.load_consistency_attention_store` to load this folder. It contains attention maps for the first clip of the source video, which are required for skip-interval cross-attention during inference when `use_consistency_attention_control` is enabled.
- `consistency_train_attention_store`: Set `attention_matching_params.load_consistency_train_attention_store` to load this folder. It contains attention maps for the first clip of the resized source videos, which are required for training when `use_consistency_attention_control` is enabled.
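For example, to reuse previously saved data from a finished run instead of retraining and re-inverting, the relevant entries might look like this (the `i2vedit-xxx` folder name follows the output layout above, and the paths are placeholders; the provided example configs remain authoritative):

```yaml
# Reusing saved data from a previous run on the same source video (paths are placeholders).
retrain_motion_lora: false                  # load saved motion LoRAs instead of retraining
attention_matching_params:
  lora_dir: ./outputs/i2vedit-xxx/train_motion_lora
  load_attention_store: ./outputs/i2vedit-xxx/attention_store
  # Only when use_consistency_attention_control is enabled:
  load_consistency_attention_store: ./outputs/i2vedit-xxx/consistency_attention_store
  load_consistency_train_attention_store: ./outputs/i2vedit-xxx/consistency_train_attention_store
```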
👇 In some scenarios, you may want to apply the same edits to subsequent clips using data saved from edits of previous clips:
- `*.pt`: Set `load_from_last_frames_latents` to use the last frame of the previously edited clip as the initial frame for the current edit. Save and reuse this data if the same edits are applied to subsequent clips.
- `consistency_edit_attention_store`: Set `load_from_previous_consistency_edit_controller` to load this folder. It stores attention maps for the first clip of the edited video, which are required for skip-interval cross-attention during inference when `use_consistency_attention_control` is enabled. Save this folder and reuse it for consistent edits on subsequent clips.
- `consistency_attention_store`: Set `load_from_previous_consistency_store_controller` to load this folder. It contains attention maps for the first clip of the source video, which are required for skip-interval cross-attention during inference when `use_consistency_attention_control` is enabled. Save this folder and reuse it for subsequent edits. Note: the configuration key here differs from the one listed above, even though the data is the same; this discrepancy may be unified in the future.
An example is provided: `./config/customize_subsequent_edit.yaml`. Set `data_params.begin_clip_id` and `data_params.end_clip_id` to define the starting and ending indices of the clips, respectively.
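As a rough sketch, the extra entries for such a subsequent edit might look like the following. The placement and value types of the `load_from_*` keys are assumptions here, and the `.pt` filename is hypothetical; treat `./config/customize_subsequent_edit.yaml` as the authoritative layout:

```yaml
# Sketch only: key placement and value types are assumptions;
# ./config/customize_subsequent_edit.yaml is the authoritative example.
data_params:
  begin_clip_id: 1      # first clip to edit in this run
  end_clip_id: 2        # last clip to edit in this run

# Data saved from edits of previous clips (paths are placeholders):
load_from_last_frames_latents:
  - ./outputs/i2vedit-xxx/clip_0.pt       # hypothetical path to the saved *.pt latents
load_from_previous_consistency_edit_controller: ./outputs/i2vedit-xxx/consistency_edit_attention_store
load_from_previous_consistency_store_controller: ./outputs/i2vedit-xxx/consistency_attention_store
```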
If our work is helpful for your research, please consider citing:
@article{ouyang2024i2vedit,
title={I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models},
author={Ouyang, Wenqi and Dong, Yi and Yang, Lei and Si, Jianlou and Pan, Xingang},
journal={arXiv preprint arXiv:2405.16537},
year={2024}
}
- Our code incorporates components from MotionDirector, FateZero, and SVD-Temporal-ControlNet. We sincerely appreciate their outstanding work!
- Some of the edited first frames are directly sourced from AnyV2V. The source videos are from Pexels Video, UBC Dress, DAVIS Dataset, TikTok Dataset, and SORA. Contact us for any copyright concerns.