This is the repository for the paper CapHuman: Capture Your Moments in Parallel Universes.
Chao Liang, Fan Ma, Linchao Zhu, Yingying Deng, Yi Yang.
We concentrate on a novel human-centric image synthesis task: given only one reference facial photograph, the goal is to generate images of that specific individual with diverse head positions, poses, and facial expressions in different contexts. To accomplish this goal, we argue that our generative model should have the following favorable characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation; (2) generalizable identity preservation ability; (3) flexible and fine-grained head control. Recently, large pre-trained text-to-image diffusion models have shown remarkable results, serving as a powerful generative foundation. As a basis, we aim to unleash the above two capabilities of the pre-trained model. In this work, we present a new framework named CapHuman. We embrace the "encode then learn to align" paradigm, which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover, we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate that CapHuman can produce well-identity-preserved, photo-realistic, and high-fidelity portraits with content-rich representations and various head renditions, superior to established baselines.
- [2024/04/26] We release the code and checkpoint.
- [2024/02/27] Our paper was accepted to CVPR 2024.
- [2024/02/01] We release the Project Page.
git clone https://github.com/VamosC/CapHuman.git
cd CapHuman
conda create -n caphuman python=3.7
conda activate caphuman
pip install -r requirements.txt
wget -c https://huggingface.co/VamosC/CapHuman/resolve/main/pytorch3d-0.7.6-cp37-cp37m-linux_x86_64.whl
pip install pytorch3d-0.7.6-cp37-cp37m-linux_x86_64.whl
To install pytorch3d (e.g. 0.7.4 or 0.7.6) yourself, follow INSTALL. The commands above use the whl file we provide.
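The provided wheel is built for one specific interpreter, which is why the conda environment above pins python=3.7. As a minimal sketch (the helper names `parse_wheel_tags` and `interpreter_matches` are ours, not part of this repository), the tags can be read from the wheel filename and compared with the running interpreter before calling pip:

```python
import sys


def parse_wheel_tags(filename):
    """Split a wheel filename into (name, version, python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # A wheel filename has 5 dash-separated parts, or 6 with an optional build tag.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    name, version = parts[0], parts[1]
    python_tag, abi_tag, platform_tag = parts[-3:]
    return name, version, python_tag, abi_tag, platform_tag


def interpreter_matches(filename, version_info=None):
    """Return True if the CPython major.minor version matches the wheel's python tag."""
    version_info = version_info or sys.version_info
    python_tag = parse_wheel_tags(filename)[2]
    return python_tag == "cp{}{}".format(version_info[0], version_info[1])


WHEEL = "pytorch3d-0.7.6-cp37-cp37m-linux_x86_64.whl"
print(parse_wheel_tags(WHEEL))
print("wheel matches this interpreter:", interpreter_matches(WHEEL))
```

This only compares the Python tag. The platform tag (linux_x86_64) still has to match the machine, and pip performs the full compatibility check itself.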
We provide a script that downloads the data and models. You must first register at https://flame.is.tue.mpg.de/ and agree to the FLAME license terms.
bash tools/setup.sh
Otherwise, follow adobe-research/diffusion-rig for DECA setup.
The file structure looks like:
data/
    deca_model.tar
    generic_model.pkl
    FLAME_texture.npz
    fixed_displacement_256.npy
    head_template.obj
    landmark_embedding.npy
    mean_texture.jpg
    texture_data_256.npy
    uv_face_eye_mask.png
    uv_face_mask.png
Then download our checkpoint caphuman.ckpt, along with vae-ft-mse-840000-ema-pruned.ckpt, Realistic_Vision_V3.0.ckpt, and 79999_iter.pth, and put them into ckpts.
The file structure looks like:
ckpts/
    face-parsing/
        79999_iter.pth
    caphuman.ckpt
    Realistic_Vision_V3.0.ckpt
    vae-ft-mse-840000-ema-pruned.ckpt
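Before running inference, it can help to confirm that both directories are complete. The following is a small sketch, not a script shipped with this repository: the helper `missing_files` is hypothetical, and the nesting of 79999_iter.pth under face-parsing/ is our reading of the listing above.

```python
from pathlib import Path

# File names copied from the data/ listing above.
DATA_FILES = [
    "deca_model.tar",
    "generic_model.pkl",
    "FLAME_texture.npz",
    "fixed_displacement_256.npy",
    "head_template.obj",
    "landmark_embedding.npy",
    "mean_texture.jpg",
    "texture_data_256.npy",
    "uv_face_eye_mask.png",
    "uv_face_mask.png",
]

# File names copied from the ckpts/ listing above.
CKPT_FILES = [
    "face-parsing/79999_iter.pth",  # assumed to sit inside face-parsing/
    "caphuman.ckpt",
    "Realistic_Vision_V3.0.ckpt",
    "vae-ft-mse-840000-ema-pruned.ckpt",
]


def missing_files(root, relative_paths):
    """Return the entries of relative_paths that are not regular files under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).is_file()]


# Run from the repository root so that data/ and ckpts/ resolve correctly.
for root, expected in (("data", DATA_FILES), ("ckpts", CKPT_FILES)):
    missing = missing_files(root, expected)
    if missing:
        print("{}/ is missing: {}".format(root, ", ".join(missing)))
    else:
        print("{}/ looks complete".format(root))
```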
Note: you can also download comic-babes, disney-pixar-cartoon-type-a, or toonyou for different styles.
Note: clip-vit-large-patch14 is downloaded automatically if you specify openai/clip-vit-large-patch14 in the version field, as we do in the config file models/cldm_v15.yaml (line 29 and line 92). If the automatic download fails, one alternative is to download the files manually, put them in ckpts/clip-vit-large-patch14, and then update the version field to the path ckpts/clip-vit-large-patch14.
In this case, the file structure will look like:
ckpts/
    face-parsing/
        79999_iter.pth
    caphuman.ckpt
    Realistic_Vision_V3.0.ckpt
    vae-ft-mse-840000-ema-pruned.ckpt
    clip-vit-large-patch14/
        merges.txt
        model.safetensors
        vocab.json
        tokenizer_config.json
        config.json
        tokenizer.json
        special_tokens_map.json
        preprocessor_config.json
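If you switch to the local CLIP copy, the version field has to change in both places in the config. The sketch below does this with a plain text substitution so the rest of the file (comments, ordering) stays untouched. The function `point_clip_to_local` is a hypothetical helper, and the sample fragment is made up for illustration; it is not the real content of models/cldm_v15.yaml.

```python
import re

HUB_ID = "openai/clip-vit-large-patch14"
LOCAL_DIR = "ckpts/clip-vit-large-patch14"


def point_clip_to_local(config_text, hub_id=HUB_ID, local_dir=LOCAL_DIR):
    """Rewrite every `version: <hub_id>` line to `version: <local_dir>`.

    Returns (new_text, number_of_replacements). Optional quotes around the
    value are dropped; indentation and all other lines are left unchanged.
    """
    pattern = re.compile(
        r"^([ \t]*version:[ \t]*)(['\"]?)" + re.escape(hub_id) + r"\2[ \t]*$",
        re.MULTILINE,
    )
    return pattern.subn(lambda m: m.group(1) + local_dir, config_text)


# Made-up fragment with two version fields, mirroring the two places the note mentions.
sample = (
    "first_block:\n"
    "  version: openai/clip-vit-large-patch14\n"
    "second_block:\n"
    "  version: openai/clip-vit-large-patch14\n"
)
new_text, count = point_clip_to_local(sample)
print(count)
print(new_text)
```

To apply it to the real config, read models/cldm_v15.yaml as text, pass it through the function, check that the count is what you expect, and write the result back. Keeping a backup of the original file is a sensible precaution.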
python inference.py --ckpt ckpts/caphuman.ckpt --vae_ckpt ckpts/vae-ft-mse-840000-ema-pruned.ckpt --model models/cldm_v15.yaml --sd_ckpt ckpts/Realistic_Vision_V3.0.ckpt --input_image examples/input_images/196251.png --pose_image examples/pose_images/pose1.png --prompt "a photo of a man wearing a suit in front of Space Needle"
Note: you can replace the sd backbone for different styles, e.g. --sd_ckpt disneyPixarCartoon_v10.safetensors.
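To run the same identity over several pose images or backbones, the command line above can be assembled programmatically. This is a rough sketch under stated assumptions: `build_command` is a hypothetical helper, it only uses flags that appear in the commands in this README, and the output file names are made up for illustration.

```python
import shlex

# Arguments shared by every run, copied from the command above.
COMMON_ARGS = {
    "--ckpt": "ckpts/caphuman.ckpt",
    "--vae_ckpt": "ckpts/vae-ft-mse-840000-ema-pruned.ckpt",
    "--model": "models/cldm_v15.yaml",
    "--sd_ckpt": "ckpts/Realistic_Vision_V3.0.ckpt",
}


def build_command(input_image, pose_image, prompt, output_image=None, **overrides):
    """Return the argument list for one inference.py run.

    Keyword overrides replace shared arguments, e.g. sd_ckpt="other.safetensors".
    """
    args = dict(COMMON_ARGS)
    args.update({"--" + key: value for key, value in overrides.items()})
    args["--input_image"] = input_image
    args["--pose_image"] = pose_image
    args["--prompt"] = prompt
    if output_image is not None:
        args["--output_image"] = output_image
    cmd = ["python", "inference.py"]
    for flag, value in args.items():
        cmd += [flag, str(value)]
    return cmd


for pose in ("pose1", "pose2"):
    cmd = build_command(
        "examples/input_images/196251.png",
        "examples/pose_images/{}.png".format(pose),
        "a photo of a man wearing a suit in front of Space Needle",
        # Hypothetical output names, chosen only for this example.
        output_image="examples/output_images/out_{}.png".format(pose),
    )
    # Print a copy-pasteable shell line instead of executing it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the checkpoints are in place.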
If you prefer gradio, you can try the following command:
python -m gradios.gradio_visualization --ckpt ckpts/caphuman.ckpt --vae_ckpt ckpts/vae-ft-mse-840000-ema-pruned.ckpt --model models/cldm_v15.yaml --sd_ckpt ckpts/Realistic_Vision_V3.0.ckpt
If you are familiar with stable-diffusion-webui, please refer to the extension sd-webui-controlnet. Note: we make some modifications to support CapHuman.
Download the checkpoint control_v11p_sd15_openpose.pth and put it in ckpts.
python inference.py --ckpt ckpts/caphuman.ckpt --vae_ckpt ckpts/vae-ft-mse-840000-ema-pruned.ckpt --model models/cldm_v15.yaml --sd_ckpt ckpts/Realistic_Vision_V3.0.ckpt --input_image examples/input_images/196251.png --pose_image examples/pose_images/pose2.png --prompt "a photo of a man raising the hand, cyberpunk" --output_image examples/output_images/out2.png --control_ckpt ckpts/control_v11p_sd15_openpose.pth --controlnet_strength 1.0 --controlnet_mode "face,body,hand" --n_prompt "missing fingers"
The file structure looks like:
ckpts/
    face-parsing/
        79999_iter.pth
    caphuman.ckpt
    Realistic_Vision_V3.0.ckpt
    vae-ft-mse-840000-ema-pruned.ckpt
    control_v11p_sd15_openpose.pth
    body_pose_model.pth
    hand_pose_model.pth
Note: body_pose_model.pth and hand_pose_model.pth will be automatically downloaded.
@inproceedings{liang2024caphuman,
author={Liang, Chao and Ma, Fan and Zhu, Linchao and Deng, Yingying and Yang, Yi},
title={CapHuman: Capture Your Moments in Parallel Universes},
booktitle={CVPR},
pages={6400--6409},
year={2024}
}
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
- lllyasviel/ControlNet
- adobe-research/diffusion-rig
- yfeng95/DECA
- CompVis/stable-diffusion
- openai/CLIP
- VamosC/CLIP4STR
- mzhaoshuai/RLCF
- mzhaoshuai/CenterCLIP
- FreeformRobotics/Divide-and-Co-training
- Realistic_Vision_V3.0.ckpt
- comic-babes
- disney-pixar-cartoon-type-a
- toonyou
We sincerely thank Zongxin Yang for valuable discussions.