Reproduce the reported values in the paper #42
Comments
@Caesarhhh Thanks for your attention to Infinity. The provided Infinity-2B checkpoint was tested with ImageReward = 0.94, HPSv2.1 = 32.2, and GenEval = 0.73. I think your GenEval result is roughly aligned, since prompt rewriting can differ between runs and influence the final score, but the ImageReward and HPSv2.1 numbers are strange.
My tested HPSv2.1 results: [screenshot attached]
My tested ImageReward results: [screenshot attached]
Note that the provided checkpoint may yield slightly different evaluation results from those reported in the paper (especially for ImageReward). This is because, after the paper was released, we slightly changed the data recipe in the last fine-tuning stage to improve Infinity's ability in text rendering, and we also added some regularization to the VAE. These modifications make the results slightly different from the paper; however, we only see ImageReward change from 0.96 to 0.94, while HPSv2.1 and GenEval are unchanged.
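For reference, a minimal sketch of how the two metrics can be computed standalone, assuming the official `image-reward` and `hpsv2` packages and their documented entry points; the prompt and image paths are placeholders, not files from this repo:

```python
# Sketch of standalone metric calls, assuming: pip install image-reward hpsv2
# The prompt and image paths below are placeholders for illustration only.
import ImageReward as RM
import hpsv2

prompt = "a photo of a red apple on a wooden table"
images = ["sample_0.png", "sample_1.png"]

# ImageReward: load the released reward model and score each image against the prompt.
rm = RM.load("ImageReward-v1.0")
image_rewards = rm.score(prompt, images)

# HPSv2.1: score the same images with the v2.1 human-preference model.
hps_scores = [hpsv2.score(img, prompt, hps_version="v2.1") for img in images]

print(image_rewards, hps_scores)
```

Comparing such standalone scores against the numbers produced by `eval.sh` can help localize whether a discrepancy comes from generation or from the metric setup.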
Thanks for your time and assistance! Could you please provide the versions of the libraries used for evaluation (e.g., `transformers`)?
I used `transformers==4.38.2`.
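A quick, minimal way to check that a local environment matches the version quoted above (the expected version string comes from the reply; everything else is just an illustrative check):

```python
# Verify the installed transformers version against the one reported above.
import transformers

EXPECTED = "4.38.2"
installed = transformers.__version__
if installed != EXPECTED:
    print(f"Warning: transformers=={installed} is installed, "
          f"but transformers=={EXPECTED} was used for the reported numbers.")
else:
    print(f"transformers=={installed} matches the reported setup.")
```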
Hi,
Thank you for sharing your excellent work. I have been trying to reproduce the results reported in the paper using the provided `eval.sh`. However, I noticed some discrepancies between my results and those in the paper:
ImageReward: Paper reports 0.962, I reproduced 0.9212.
HPSv2.1: Paper reports 32.25, I reproduced 30.36.
GenEval: After rewriting prompts with the provided script, my results are as follows:
position = 41.75% (167 / 400)
colors = 84.31% (317 / 376)
color_attr = 55.25% (221 / 400)
counting = 68.44% (219 / 320)
single_object = 100.00% (320 / 320)
two_object = 85.86% (340 / 396)
Overall score: 0.72601
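For completeness, the overall score above is consistent with the unweighted mean of the six per-category accuracies. A quick sanity check (assuming that is how GenEval aggregates, which the numbers above suggest):

```python
# Recompute the overall GenEval score as the unweighted mean of per-category accuracies.
counts = {
    "position":      (167, 400),
    "colors":        (317, 376),
    "color_attr":    (221, 400),
    "counting":      (219, 320),
    "single_object": (320, 320),
    "two_object":    (340, 396),
}
per_category = {name: ok / total for name, (ok, total) in counts.items()}
overall = sum(per_category.values()) / len(per_category)
print(f"{overall:.5f}")  # 0.72601
```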
I am wondering whether these differences could be due to:
Randomness in evaluation (e.g., sampling seeds; see the sketch after this list) or implementation details that may vary across different machines or setups.
Additional hyperparameters, configurations, or preprocessing steps not explicitly mentioned in the repository or paper.
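To rule out the first point, one option is to pin all random seeds before generation. A minimal sketch, assuming a PyTorch-based pipeline as used in this repo (the helper name is hypothetical):

```python
# Hypothetical helper to reduce run-to-run variance during sampling.
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # CPU RNG
    torch.cuda.manual_seed_all(seed)  # all CUDA devices (no-op without a GPU)
```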
Could you please clarify:
If there are any additional steps or configurations required to reproduce the exact results?
Whether there are known factors that could cause these discrepancies?
Additionally, do you have any suggestions for bringing my reproduction closer to the reported results?
Thank you for your time and assistance!