
X-Dyna: Expressive Dynamic Human Image Animation

Di Chang1,2 · Hongyi Xu2* · You Xie2* · Yipeng Gao1* · Zhengfei Kuang3* · Shengqu Cai3* · Chenxu Zhang2*
Guoxian Song2 · Chao Wang2 · Yichun Shi2 · Zeyuan Chen2,5 · Shijie Zhou4 · Linjie Luo2
Gordon Wetzstein3 · Mohammad Soleymani1
1University of Southern California  2ByteDance Inc.  3Stanford University
4University of California Los Angeles  5University of California San Diego

* denotes equal contribution

Paper PDF Project Page


This repo is the official PyTorch implementation of X-Dyna, which generates temporally consistent human motion with expressive dynamics.

📹 Teaser Video

demo_compress.mp4

The video is compressed to low quality due to GitHub's size limit. The high-quality version can be viewed here.

📑 Open-source Plan

  • Project Page
  • Paper
  • Inference code for Dynamics Adapter
  • Checkpoints for Dynamics Adapter
  • Inference code for S-Face ControlNet
  • Checkpoints for S-Face ControlNet
  • Evaluation code (DTFVD, Face-Cos, Face-Det, FID, etc.)
  • Dynamic Texture Eval Data (self-collected from Pexels)
  • Alignment code for inference
  • Gradio Demo

Abstract

We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, generating realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses the key factors underlying the loss of dynamic details, enhancing the lifelike quality of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of the motion modules to synthesize fluid and intricate dynamic details. Beyond body pose control, we connect a local control module to our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations.

Architecture

We leverage a pretrained diffusion UNet backbone for controlled human image animation, enabling expressive dynamic details and precise motion control. Specifically, we introduce a dynamics adapter that seamlessly integrates the reference image context as a trainable residual to the spatial attention, in parallel with the denoising process, while preserving the original spatial and temporal attention mechanisms within the UNet. In addition to body pose control via a ControlNet, we introduce a local face control module that implicitly learns facial expression control from a synthesized cross-identity face patch. We train our model on a diverse dataset of human motion videos and natural scene videos simultaneously.

Dynamics Adapter

Architecture Designs for Human Video Animation

a) IP-Adapter encodes the reference image as an image CLIP embedding and injects the information into the cross-attention layers of SD as a residual.
b) ReferenceNet is a trainable parallel UNet that feeds semantic information into SD via concatenation of self-attention features.
c) Dynamics-Adapter encodes the reference image with a partially shared-weight UNet. Appearance control is realized by learning a residual in the self-attention with trainable query and output linear layers, while all other components share the same frozen weights with SD.

architecture_comparison.mp4
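
To make design c) concrete, below is a minimal PyTorch sketch of the residual self-attention idea: the reference features are attended to through a trainable query projection and fed back through a trainable output projection, while the original SD projections stay frozen. The class and tensor names (ResidualReferenceAttention, ref_hidden, etc.) are ours for illustration and are not the modules used in this repository.

# Illustrative sketch only -- class and tensor names are our assumptions, not this repo's modules.
import torch.nn as nn
import torch.nn.functional as F


class ResidualReferenceAttention(nn.Module):
    """Frozen SD self-attention plus a trainable residual branch over reference features."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        # Projections shared with (and frozen from) the pretrained SD UNet.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)
        for module in (self.to_q, self.to_k, self.to_v, self.to_out):
            module.requires_grad_(False)
        # Trainable query / output projections for the reference residual.
        self.ref_to_q = nn.Linear(dim, dim, bias=False)
        self.ref_to_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.ref_to_out.weight)  # residual starts at zero, preserving SD behavior
        nn.init.zeros_(self.ref_to_out.bias)

    def _attend(self, q, k, v):
        b, n, d = q.shape
        h = self.heads
        q, k, v = (t.reshape(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, hidden, ref_hidden):
        # Frozen SD self-attention over the denoising features.
        base = self.to_out(self._attend(self.to_q(hidden), self.to_k(hidden), self.to_v(hidden)))
        # Trainable residual: query the reference-image features (keys/values reuse frozen weights).
        residual = self.ref_to_out(
            self._attend(self.ref_to_q(hidden), self.to_k(ref_hidden), self.to_v(ref_hidden))
        )
        return base + residual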

📈 Results

Comparison

To evaluate the dynamic-texture generation performance of X-Dyna for human video animation, we compare its results with MagicPose (a ReferenceNet-based method) and MimicMotion (an SVD-based method). For a fair comparison, all generated videos share the same resolution of 896 x 512 (height x width).

comp_1.mp4
comp_2.mp4
comp_3.mp4
comp_4.mp4
comp_5.mp4

Ablation

To evaluate the effectiveness of mixed-data training in our pipeline, we present a visual ablation study.

ablation_1.mp4

🎥 More Demos

demo_1.mp4
demo_2.mp4
short_1.mp4
short_2.mp4
short_3.mp4
short_4.mp4
short_5.mp4
short_6.mp4

📜 Requirements

  • An NVIDIA GPU with CUDA support is required.
    • We have tested on a single A100 GPU.
    • In our experiment, we used CUDA 11.8.
    • Minimum: 20GB of GPU memory is required to generate a single video of 16 frames (batch_size=1).
    • Recommended: a GPU with 80GB of memory.
  • Operating system: Linux Debian 11 (bullseye)
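
Before launching anything heavy, you can sanity-check that your environment roughly matches the setup above with a short snippet like the one below (ours, not part of the repository):

# Quick environment sanity check (not part of the repository).
import torch

print(f"PyTorch version : {torch.__version__}")
print(f"CUDA available  : {torch.cuda.is_available()}")
print(f"Built with CUDA : {torch.version.cuda}")  # we used CUDA 11.8

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    mem_gb = props.total_memory / 1024 ** 3
    print(f"GPU             : {props.name} ({mem_gb:.1f} GB)")
    if mem_gb < 20:
        print("Warning: below the 20GB minimum needed for a 16-frame video at batch_size=1.")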

🛠️ Dependencies and Installation

Clone the repository:

git clone https://github.com/Boese0601/X-Dyna
cd X-Dyna

Installation Guide

We provide a requirements.txt file for setting up the environment.

Run the following commands in your terminal:

# 1. Prepare conda environment
conda create -n xdyna python==3.10 

# 2. Activate the environment
conda activate xdyna

# 3. Install dependencies
bash env_torch2_install.sh

# Note: env_torch2_install.sh installs PyTorch twice with different versions. In our testing, directly
# installing the final versions (torch==2.0.1+cu118 torchaudio==2.0.2+cu118 torchvision==0.15.2+cu118)
# did not work, and we have not tracked down why. If you find a cleaner fix, please open an issue.

🧱 Download Pretrained Models

Due to restrictions, we are not able to release the model pretrained on in-house data. Instead, we re-train our model on public datasets, e.g. HumanVid, and other human video data available for research use, e.g. Pexels.

We follow the implementation details in our paper and release the pretrained weights and other network modules in this Hugging Face repository. After downloading, please put all of them under the pretrained_weights folder.

The Stable Diffusion 1.5 UNet can be found here; place it under pretrained_weights/initialization/unet_initialization/SD.
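
If you prefer to script the download, something along the following lines works with huggingface_hub; the repository IDs are placeholders, so substitute the Hugging Face repository linked above and the Stable Diffusion 1.5 source you use:

# Download sketch -- the repo IDs below are placeholders, not real repository names.
from huggingface_hub import snapshot_download

# X-Dyna checkpoints and initialization modules.
snapshot_download(
    repo_id="<x-dyna-huggingface-repo>",  # replace with the repository linked in this README
    local_dir="pretrained_weights",
)

# Stable Diffusion 1.5, placed under the initialization folder expected by the configs.
snapshot_download(
    repo_id="<stable-diffusion-v1-5-repo>",  # replace with your SD 1.5 source
    local_dir="pretrained_weights/initialization/unet_initialization/SD/stable-diffusion-v1-5",
)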

Your file structure should look like this:

X-Dyna
|----...
|----pretrained_weights
  |----controlnet
    |----controlnet-checkpoint-epoch-5.ckpt
  |----controlnet_face
    |----controlnet-face-checkpoint-epoch-2.ckpt
  |----unet 
    |----unet-checkpoint-epoch-5.ckpt
  
  |----initialization
    |----controlnets_initialization
      |----controlnet
        |----control_v11p_sd15_openpose
      |----controlnet_face
        |----controlnet2
    |----unet_initialization
      |----IP-Adapter
        |----IP-Adapter
      |----SD
        |----stable-diffusion-v1-5
|----...
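
Before running inference, you can confirm the layout matches the tree above with a small check like this (our own helper, not shipped with the repo):

# Verify the expected checkpoint and initialization paths from the tree above.
from pathlib import Path

ROOT = Path("pretrained_weights")
EXPECTED = [
    "controlnet/controlnet-checkpoint-epoch-5.ckpt",
    "controlnet_face/controlnet-face-checkpoint-epoch-2.ckpt",
    "unet/unet-checkpoint-epoch-5.ckpt",
    "initialization/controlnets_initialization/controlnet/control_v11p_sd15_openpose",
    "initialization/controlnets_initialization/controlnet_face/controlnet2",
    "initialization/unet_initialization/IP-Adapter/IP-Adapter",
    "initialization/unet_initialization/SD/stable-diffusion-v1-5",
]

missing = [p for p in EXPECTED if not (ROOT / p).exists()]
if missing:
    print("Missing entries:")
    for p in missing:
        print(f"  {ROOT / p}")
else:
    print("All expected pretrained weights are in place.")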

Inference

Using Command Line

cd X-Dyna

bash scripts/inference.sh

More Configurations

We explain the main configuration arguments below:

Argument Default Description
--gpus 0 GPU ID for inference
--output ./output Path to save the generated video
--test_data_file ./examples/example.json Path to reference and driving data
--cfg 7.5 Classifier-free guidance scale
--height 896 Height of the generated video
--width 512 Width of the generated video
--infer_config ./configs/x_dyna.yaml Path to inference model config file
--neg_prompt None Negative prompt for generation
--length 192 Length of the generated video
--stride 1 Stride of driving pose and video
--save_fps 15 FPS of the generated video
--global_seed 42 Random seed
--face_controlnet False Use Face ControlNet for inference
--cross_id False Cross-identity animation
--no_head_skeleton False Do not visualize head skeletons
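
As a reading aid, the table corresponds to an argument parser roughly like the one below; this is an illustrative reconstruction of the interface and its defaults, not the repository's actual parser:

# Illustrative reconstruction of the inference options listed above (not the actual parser).
import argparse

parser = argparse.ArgumentParser(description="X-Dyna inference options (as listed above)")
parser.add_argument("--gpus", type=int, default=0, help="GPU ID for inference")
parser.add_argument("--output", default="./output", help="Path to save the generated video")
parser.add_argument("--test_data_file", default="./examples/example.json",
                    help="Path to reference and driving data")
parser.add_argument("--cfg", type=float, default=7.5, help="Classifier-free guidance scale")
parser.add_argument("--height", type=int, default=896, help="Height of the generated video")
parser.add_argument("--width", type=int, default=512, help="Width of the generated video")
parser.add_argument("--infer_config", default="./configs/x_dyna.yaml",
                    help="Path to the inference model config file")
parser.add_argument("--neg_prompt", default=None, help="Negative prompt for generation")
parser.add_argument("--length", type=int, default=192, help="Length of the generated video")
parser.add_argument("--stride", type=int, default=1, help="Stride over the driving pose and video")
parser.add_argument("--save_fps", type=int, default=15, help="FPS of the generated video")
parser.add_argument("--global_seed", type=int, default=42, help="Random seed")
parser.add_argument("--face_controlnet", action="store_true", help="Use the Face ControlNet for inference")
parser.add_argument("--cross_id", action="store_true", help="Cross-identity animation")
parser.add_argument("--no_head_skeleton", action="store_true", help="Do not visualize head skeletons")
args = parser.parse_args()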

Alignment

Appropriate alignment between the driving video and the reference image is necessary for good generation quality; see the examples below. From left to right: reference image, extracted pose from the reference image, driving video, aligned driving pose. A rough illustration of the alignment idea follows the clips.

align_1.mp4
align_2.mp4
align_3.mp4
align_4.mp4
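
The repository's alignment code is listed in the open-source plan above; as a rough illustration of the idea, the sketch below rescales and re-centers 2D driving keypoints so their vertical extent matches the reference pose. The keypoint format and function name are assumptions for illustration only.

# Rough illustration of pose alignment (assumed 2D keypoints; not the repository's alignment code).
import numpy as np


def align_driving_pose(driving_kpts: np.ndarray, reference_kpts: np.ndarray) -> np.ndarray:
    """Scale and translate driving keypoints (T, J, 2) to match a reference pose (J, 2).

    The scale is estimated from the vertical extent of the first driving frame versus
    the reference pose; the translation recenters the sequence on the reference.
    """
    ref_center = reference_kpts.mean(axis=0)
    drv_center = driving_kpts[0].mean(axis=0)

    ref_extent = np.ptp(reference_kpts, axis=0)  # (width, height) of the reference pose
    drv_extent = np.ptp(driving_kpts[0], axis=0)
    scale = ref_extent[1] / max(drv_extent[1], 1e-6)  # match vertical extent

    return (driving_kpts - drv_center) * scale + ref_center

The same transform is applied to every driving frame via broadcasting; in practice you would also clip the result to the output canvas (896 x 512 here).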

Examples

We provide some examples of aligned driving videos, human poses, and reference images here. If you would like to try X-Dyna on your own data, please specify the paths in this file.

🔗 BibTeX

If you find X-Dyna useful for your research and applications, please cite X-Dyna using this BibTeX:

@misc{
}

License

Our code is distributed under the Apache-2.0 license. See LICENSE.txt file for more information.

Acknowledgements

We appreciate the open-source contributions of AnimateDiff, MagicPose, MimicMotion, Moore-AnimateAnyone, MagicAnimate, IP-Adapter, ControlNet, HumanVid, and I2V-Adapter. We also thank Quankai Gao, Qiangeng Xu, Shen Sang, and Tiancheng Zhi for their suggestions and discussions.