This is a repository for organizing papers and code on Talking Face Generation (TFG) in computer vision.
In addition, the commonly used datasets and evaluation metrics for TFG are introduced.
💫 This project is constantly being updated, and any suggestions are welcome!
## 2024

Title | Venue | Dataset | CODE
---|---|---|---
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | arXiv 2024 | HDTF | CODE
EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model | ICASSP 2024 | MEAD & CREMA-D | CODE
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis | ICLR 2024 | CelebV-HQ & VoxCeleb2 | -
G4G: A Generic Framework for High Fidelity Talking Face Generation with Fine-grained Intra-modal Alignment | arXiv 2024 | HDTF & LRS2 | -
Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis | CVPR 2024 | - | CODE
SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis | CVPR 2024 | - | CODE
## 2023

Title | Venue | Dataset | CODE
---|---|---|---
DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models | arXiv 2023 | MEAD & HDTF & VoxCeleb2 | -
GMTalker: Gaussian Mixture Based Emotional Talking Video Portraits | arXiv 2023 | MEAD & LSP | -
DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers | arXiv 2023 | HDTF | -
R2-Talker: Realistic Real-Time Talking Head Synthesis with Hash Grid Landmarks Encoding and Progressive Multilayer Conditioning | arXiv 2023 | - | -
FT2TF: First-Person Statement Text-To-Talking Face Generation | arXiv 2023 | LRS2 & LRS3 | -
VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior | arXiv 2023 | HDTF & VoxCeleb | -
SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis | arXiv 2023 | LRS3 | -
GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis | ICLR 2023 | LRS3 | CODE
GAIA: Zero-shot Talking Avatar Generation | arXiv 2023 | dataset from diverse sources | -
Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis | ICCV 2023 | - | CODE
Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation | ICCV 2023 | VoxCeleb1 & CelebV | CODE
MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions | ICCV 2023 | HDTF & LSP | -
Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation | ICCV 2023 | VoxCeleb2 & MEAD & LRW | CODE
EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation | ICCV 2023 | MEAD & LRW | -
Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation | ICCV 2023 | ViCo & the Learning2Listen dataset | -
MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation | CVPR 2023 | VoxCeleb2 & HDTF | CODE
Implicit Neural Head Synthesis via Controllable Local Deformation Fields | CVPR 2023 | - | -
LipFormer: High-fidelity and Generalizable Talking Face Generation with A Pre-learned Facial Codebook | CVPR 2023 | LRS2 & FFHQ | -
GANHead: Towards Generative Animatable Neural Head Avatars | CVPR 2023 | FaceVerse-Dataset | CODE
Parametric Implicit Face Representation for Audio-Driven Facial Reenactment | CVPR 2023 | HDTF & Testset 1 & Testset 2 | -
Identity-Preserving Talking Face Generation with Landmark and Appearance Priors | CVPR 2023 | LRS2 & LRS3 | CODE
StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator | CVPR 2023 | LRW & VoxCeleb | -
High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning | CVPR 2023 | MEAD | -
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert | CVPR 2023 | LRS2 & LRW | CODE
OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering | CVPR 2023 | HDTF & Multiface | CODE
Style Transfer for 2D Talking Head Animation | arXiv 2023 | VoxCeleb2 | -
StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles | AAAI 2023 | MEAD & HDTF | CODE
## 2022

Title | Venue | Dataset | CODE
---|---|---|---
SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory | AAAI 2022 | LRW & LRS2 & BBC News | -
Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis | CVPR 2022 | VoxCeleb2 & MEAD | -
Compressing Video Calls using Synthetic Talking Heads | BMVC 2022 | - | -
Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement | arXiv 2022 | - | -
StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation | arXiv 2022 | VoxCeleb2 | -
Talking Head from Speech Audio using a Pre-trained Image Generator | ACM MM 2022 | TCD-TIMIT & GRID | -
Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis | ECCV 2022 | - | CODE
Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation | ECCV 2022 | - | CODE
Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary | ICASSP 2022 | VidTIMIT | CODE
Emotion-Controllable Generalized Talking Face Generation | IJCAI 2022 | MEAD & CREMA-D & RAVDESS | -
Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning | CVPR 2022 | Shapes & MUG & iPER & Multimodal VoxCeleb | CODE
Depth-Aware Generative Adversarial Network for Talking Head Video Generation | CVPR 2022 | VoxCeleb1 & CelebV | CODE
Expressive Talking Head Generation with Granular Audio-Visual Control | CVPR 2022 | VoxCeleb2 & MEAD | -
## 2021

Title | Venue | Dataset | CODE
---|---|---|---
Audio-Driven Emotional Video Portraits | CVPR 2021 | MEAD & LRW | CODE
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation | CVPR 2021 | VoxCeleb2 & LRW | CODE
Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset | CVPR 2021 | HDTF | CODE
Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation | AAAI 2021 | Mocap dataset | -
Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion | IJCAI 2021 | VoxCeleb & GRID & LRW | CODE
Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis | ACM MM 2021 | Ted-HD & LRW | CODE
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis | ICCV 2021 | - | CODE
FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning | ICCV 2021 | - | CODE
Learned Spatial Representations for Few-shot Talking-Head Synthesis | ICCV 2021 | VoxCeleb | -
One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing | CVPR 2021 | VoxCeleb2 & TalkingHead-1KH | -
Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary | arXiv 2021 | VidTIMIT | CODE
## 2020

Title | Venue | Dataset | CODE
---|---|---|---
Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose | AAAI 2020 | VoxCeleb | -
Robust One Shot Audio to Video Generation | CVPR 2020 | GRID & LOMBARD GRID | -
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis | CVPR 2020 | GRID & TCD-TIMIT | -
Neural Voice Puppetry: Audio-driven Facial Reenactment | ECCV 2020 | - | CODE
Talking-head Generation with Rhythmic Head Motion | ECCV 2020 | CREMA-D & GRID & VoxCeleb & LRS3 | -
A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors | ICPR 2020 | - | -
Talking Face Generation with Expression-Tailored Generative Adversarial Network | ACM MM 2020 | - | -
A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild | ACM MM 2020 | LRS2 | CODE
## 2019 and Earlier

Title | Venue | Dataset | CODE
---|---|---|---
Few-Shot Adversarial Learning of Realistic Neural Talking Head Models | ICCV 2019 | VoxCeleb | CODE
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss | CVPR 2019 | LRW & GRID | CODE
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation | AAAI 2019 | LRW | CODE
Realistic Speech-Driven Facial Animation with GANs | IJCV 2019 | GRID & TCD-TIMIT & CREMA-D & LRW | -
Talking Face Generation by Conditional Recurrent Adversarial Network | IJCAI 2019 | TCD-TIMIT & LRW & VoxCeleb | CODE
Lip Movements Generation at a Glance | ECCV 2018 | GRID & LRW & LDC | -
You said that? | BMVC 2017 | VoxCeleb & LRW | -
## Datasets

- LRS2
- LRW
- GRID
- MEAD
- VoxCeleb
- HDTF
- SAVEE
- VOCA
- CREMA-D
## Metrics

- PSNR (Peak Signal-to-Noise Ratio): Measures the peak signal-to-noise ratio between a generated image and the reference image; commonly used to compare the similarity of two images. Higher PSNR indicates better image quality (see the computation sketch after this list).
- SSIM (Structural Similarity Index): Evaluates the structural similarity between a generated image and the reference image in terms of luminance, contrast, and structure. SSIM lies in [-1, 1], with values closer to 1 indicating better image quality.
- LMD (Landmark Distance): Measures the average Euclidean distance between facial landmarks (typically around the mouth) detected in generated and ground-truth frames. Lower LMD indicates more accurate lip movements.
- LRA (Lip-Reading Accuracy): Feeds the generated video to a pretrained lip-reading model and measures how accurately the spoken content is recognized. Higher LRA indicates more intelligible lip movements.
- FID (Fréchet Inception Distance): Measures the quality of generated images by comparing the feature statistics of generated and real images extracted with an Inception network. Lower FID indicates that the generated and real distributions are more similar.
- LSE-D (Lip Sync Error - Distance): The average distance between the audio and visual embeddings produced by a pretrained SyncNet, measuring audio-visual synchronization. Lower LSE-D indicates better lip sync.
- LSE-C (Lip Sync Error - Confidence): The average synchronization confidence score produced by the same pretrained SyncNet. Higher LSE-C indicates stronger audio-visual correlation.
- LPIPS (Learned Perceptual Image Patch Similarity): Uses features from a deep network to measure perceptual similarity between images, which correlates well with human perception of local image structure. Lower LPIPS indicates better image generation quality.
- NIQE (Natural Image Quality Evaluator): A no-reference metric that evaluates the naturalness of an image based on statistical properties of natural images. Lower NIQE indicates better image quality.
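For the two full-reference image metrics above (PSNR and SSIM), here is a minimal computation sketch using scikit-image. It assumes both frames are aligned uint8 RGB arrays of the same size; the `frame_quality` helper and the synthetic test frames are illustrative, not taken from any specific TFG codebase:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(generated: np.ndarray, reference: np.ndarray) -> dict:
    """Score one generated frame against its ground-truth frame.

    Both inputs are assumed to be aligned uint8 RGB arrays of equal shape.
    """
    # PSNR: higher is better; data_range is the dynamic range of uint8 pixels.
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    # SSIM: closer to 1 is better; channel_axis=-1 marks the color dimension
    # (scikit-image >= 0.19; older releases used multichannel=True instead).
    ssim = structural_similarity(reference, generated,
                                 data_range=255, channel_axis=-1)
    return {"PSNR": psnr, "SSIM": ssim}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
    # Simulate a generated frame as a lightly perturbed copy of the reference.
    noise = rng.integers(-10, 11, size=reference.shape)
    generated = np.clip(reference.astype(np.int16) + noise, 0, 255).astype(np.uint8)
    print(frame_quality(generated, reference))
```

In practice, both metrics are computed per frame and averaged over all frames of the generated video.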
This page was created by Dan Zhao, a graduate student at Dalian University of Technology.