Yasheng Sun, Wenqing Chu, Zhiliang Xu, Dongliang He, Hideki Koike
Our goal is to directly leverage the inherent style information conveyed by human speech for generating an expressive talking face that aligns with the speaking status. In this paper, we propose AVI-Talking, an Audio-Visual Instruction system for expressive Talking face generation.
- [2024/02]: Paper is available on arXiv.
- [2024/04]: Paper has been accepted by IEEE Access.
a. Create a conda virtual environment and activate it. Python >= 3.8 is required as the base environment.
conda create -n sssp python=3.8 -y
conda activate sssp
b. Install PyTorch and torchvision following the official instructions.
conda install pytorch==1.9.0 torchvision==0.10.0 -c pytorch -c conda-forge
c. Install other dependencies. We simply freeze our environment; other environments might also work. A requirements.txt file is provided for reference.
pip install -r requirements.txt
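After installation, a quick sanity check can confirm that PyTorch and torchvision are importable and match the pinned versions (on a CPU-only machine the CUDA check simply prints `False`):

```bash
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"
```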
- Download the pre-trained model and put it under `train_logs/` accordingly.
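For reference, a plausible layout is sketched below; the `align_emote` subdirectory and the checkpoint filenames are assumptions based on the experiment name used in the scripts, and the actual names come from the released archive:

```bash
# Hypothetical layout (directory and file names are assumptions):
mkdir -p train_logs/align_emote
# train_logs/
# └── align_emote/
#     └── <downloaded checkpoint files>
```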
Once the pre-trained model is prepared, you can test the model by running the following command:
bash experiments/diffusion_test.sh align_emote
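If you want to keep a record of the inference run, the script output can be captured with standard shell redirection (the log filename below is arbitrary):

```bash
bash experiments/diffusion_test.sh align_emote 2>&1 | tee diffusion_test.log
```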
If you are interested in training the model yourself, please set up the environment accordingly and run the command below.
bash experiments/diffusion_train.sh align_emote
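Training can take a while; one common pattern is to launch it in the background and follow the log (the log filename below is arbitrary):

```bash
# Launch training detached from the terminal and stream the log.
nohup bash experiments/diffusion_train.sh align_emote > diffusion_train.log 2>&1 &
tail -f diffusion_train.log
```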
Many thanks to these excellent open source projects:
- [DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch)
- [INFERNO](https://github.com/radekd91/inferno)
- [PIRender](https://github.com/RenYurui/PIRender)
- [PD-FGC-inference](https://github.com/Dorniwang/PD-FGC-inference)
If you find our paper and code useful for your research, please consider citing:
@article{sun2024avi,
title={AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation},
author={Sun, Yasheng and Chu, Wenqing and Zhou, Hang and Wang, Kaisiyuan and Koike, Hideki},
journal={IEEE Access},
year={2024},
publisher={IEEE}
}