Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap Peng Tan, Weipeng Hu
Project Page | Paper | arXiv | WebUI | Demo | Video | Diffusers | Colab
- Existing methods lack the ability to control the interactions between objects in the generated content.
- We propose a pluggable interaction control model, called InteractDiffusion, that extends existing pre-trained T2I diffusion models so that they can be better conditioned on interactions.
- [2024.3.13] Diffusers code is available here.
- [2024.3.8] Demo is available at Huggingface Spaces.
- [2024.3.6] Code is released.
- [2024.2.27] The InteractDiffusion paper is accepted at CVPR 2024.
- [2023.12.12] The InteractDiffusion paper is released. The InteractDiffusion WebUI extension is available as an alpha version.
Model | Interaction Controllability (Tiny) | Interaction Controllability (Large) | FID | KID |
---|---|---|---|---|
v1.0 | 29.53 | 31.56 | 18.69 | 0.00676 |
v1.1 | 30.20 | 31.96 | 17.90 | 0.00635 |
v1.2 | 30.73 | 33.10 | 17.32 | 0.00585 |
Interaction Controllability is measured using the FGAHOI detection score. In this table, we evaluate the Full subset in the Default setting on the Swin-Tiny and Swin-Large backbones. More details on the protocol are in the paper.
We provide three checkpoints with different training strategies.
Version | Dataset | SD | Download |
---|---|---|---|
v1.0 | HICO-DET | v1.4 | HF Hub |
v1.1 | HICO-DET | v1.5 | HF Hub |
v1.2 | HICO-DET + VisualGenome | v1.5 | HF Hub |
Note that the experimental results in our paper refer to v1.0.
- v1.0 is based on Stable Diffusion v1.4 and GLIGEN. We train at a batch size of 16 for 250k steps on HICO-DET. Our paper is based on this checkpoint.
- v1.1 is based on Stable Diffusion v1.5 and GLIGEN. We train at a batch size of 32 for 250k steps on HICO-DET.
- v1.2 is based on InteractDiffusion v1.1. We train further at a batch size of 32 for 172.5k steps on HICO-DET and VisualGenome.
We develop an extension for AUTOMATIC1111's Stable Diffusion WebUI that allows InteractDiffusion to be used with existing SD models. Check out the extension at sd-webui-interactdiffusion. Note that it is still an alpha version.
Some examples generated with InteractDiffusion, together with other DreamBooth and LoRA models.
```python
from diffusers import DiffusionPipeline
import torch

# Load the InteractDiffusion pipeline from the Hub (custom pipeline code,
# hence trust_remote_code=True) in fp16.
pipeline = DiffusionPipeline.from_pretrained(
    "interactdiffusion/diffusers-v1-2",
    trust_remote_code=True,
    variant="fp16",
    torch_dtype=torch.float16,
)
pipeline = pipeline.to("cuda")

# Each interaction is a (subject, action, object) triplet; the subject and
# object boxes are given as [x0, y0, x1, y1] values normalized to [0, 1].
images = pipeline(
    prompt="a person is feeding a cat",
    interactdiffusion_subject_phrases=["person"],
    interactdiffusion_object_phrases=["cat"],
    interactdiffusion_action_phrases=["feeding"],
    interactdiffusion_subject_boxes=[[0.0332, 0.1660, 0.3359, 0.7305]],
    interactdiffusion_object_boxes=[[0.2891, 0.4766, 0.6680, 0.7930]],
    interactdiffusion_scheduled_sampling_beta=1,
    output_type="pil",
    num_inference_steps=50,
).images

images[0].save('out.jpg')
```
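The boxes in the example above are already normalized. If your layout comes in pixel coordinates, a small helper can convert it; the sketch below is illustrative only (`normalize_box` is a hypothetical helper, assuming the pipeline expects [x0, y0, x1, y1] values scaled by the image width and height to [0, 1]).

```python
# Hypothetical helper (not part of the InteractDiffusion API): convert a
# pixel-space [x0, y0, x1, y1] box to the normalized form used above,
# assuming coordinates are scaled by the image width and height.
def normalize_box(box, image_width, image_height):
    x0, y0, x1, y1 = box
    return [x0 / image_width, y0 / image_height,
            x1 / image_width, y1 / image_height]

# Example: a subject box given in pixels on a 512x512 canvas.
subject_box = normalize_box([17, 85, 172, 374], 512, 512)
# -> approximately [0.0332, 0.1660, 0.3359, 0.7305], as in the example above
```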
1. Change `ckpt.pth` in `inference_batch.py` to the selected checkpoint.
2. Run inference with InteractDiffusion to synthesize the HICO-DET test set based on the ground truth.
   ```bash
   python inference_batch.py --batch_size 1 --folder generated_output --seed 489 --scheduled-sampling 1.0 --half
   ```
3. Set up FGAHOI at `../FGAHOI`. See the FGAHOI repo on how to set up FGAHOI and the HICO-DET dataset in `data/hico_20160224_det`.
4. Prepare for evaluation on FGAHOI. See `id_prepare_inference.ipynb`.
5. Evaluate on FGAHOI.
   ```bash
   python main.py --backbone swin_tiny --dataset_file hico --resume weights/FGAHOI_Tiny.pth --num_verb_classes 117 --num_obj_classes 80 --output_dir logs --merge --hierarchical_merge --task_merge --eval --hoi_path data/id_generated_output --pretrain_model_path "" --output_dir logs/id-generated-output-t
   ```
6. Evaluate FID and KID. For a fair comparison, we recommend resizing the HICO-DET test set to 512x512 before performing the image quality evaluation (a resize sketch is given after this list). We use torch-fidelity.
   ```bash
   fidelity --gpu 0 --fid --isc --kid --input2 ~/data/hico_det_test_resize --input1 ~/FGAHOI/data/data/id_generated_output/images/test2015
   ```

This should provide a brief overview of how the evaluation process works.
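For the resize step above, a minimal sketch is shown below. It is not part of the released scripts; the input and output paths are placeholders that should be adapted to your local HICO-DET layout, and it assumes Pillow is installed.

```python
# Minimal sketch (not part of the released code): resize HICO-DET test images
# to 512x512 before computing FID/KID. Paths are placeholders.
from pathlib import Path
from PIL import Image

src = Path("~/data/hico_20160224_det/images/test2015").expanduser()  # assumed source path
dst = Path("~/data/hico_det_test_resize").expanduser()               # matches --input2 above
dst.mkdir(parents=True, exist_ok=True)

for img_path in sorted(src.glob("*.jpg")):
    img = Image.open(img_path).convert("RGB")
    img.resize((512, 512)).save(dst / img_path.name)
```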
1. Prepare the necessary dataset and pretrained models; see DATA.
2. Run the following command:
   ```bash
   CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --yaml_file configs/hoi_hico_text.yaml --ckpt <existing_gligen_checkpoint> --name test --batch_size=4 --gradient_accumulation_step 2 --total_iters 500000 --amp true --disable_inference_in_training true --official_ckpt_name <existing SD v1.4/v1.5 checkpoint>
   ```
- Code Release
- HuggingFace demo
- WebUI extension
- Diffuser
@InProceedings{Hoe_2024_CVPR,
author = {Hoe, Jiun Tian and Jiang, Xudong and Chan, Chee Seng and Tan, Yap-Peng and Hu, Weipeng},
title = {InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {6180-6189}
}
This work is developed based on the codebase of GLIGEN and LDM.