Reproduce the reported values in the paper #42

Open
Caesarhhh opened this issue Jan 13, 2025 · 3 comments

@Caesarhhh

Hi,

Thank you for sharing your excellent work. I have been trying to reproduce the results reported in the paper using the provided eval.sh, but I noticed some discrepancies between my results and the reported values:

ImageReward: Paper reports 0.962, I reproduced 0.9212.
HPSv2.1: Paper reports 32.25, I reproduced 30.36.
GenEval: After rewriting prompts with the provided script, my results are as follows:
position = 41.75% (167 / 400)
colors = 84.31% (317 / 376)
color_attr = 55.25% (221 / 400)
counting = 68.44% (219 / 320)
single_object = 100.00% (320 / 320)
two_object = 85.86% (340 / 396)
Overall score: 0.72601 (see the sanity check below)
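
For reference, a quick sanity check (a minimal sketch, assuming the overall GenEval score is the unweighted mean of the six per-task accuracies) that reproduces the 0.72601 above from the per-task counts:

```python
# Per-task correct/total counts copied from the breakdown above.
per_task = {
    "single_object": (320, 320),
    "two_object":    (340, 396),
    "counting":      (219, 320),
    "colors":        (317, 376),
    "position":      (167, 400),
    "color_attr":    (221, 400),
}

accuracies = [correct / total for correct, total in per_task.values()]
overall = sum(accuracies) / len(accuracies)  # unweighted mean over tasks
print(f"Overall score: {overall:.5f}")       # -> 0.72601
```
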
I am wondering whether these differences could be due to:

Randomness in evaluation or implementation details that may vary across different machines or setups.
Additional hyperparameters, configurations, or preprocessing steps not explicitly mentioned in the repository or paper.
Could you please clarify:

If there are any additional steps or configurations required to reproduce the exact results?
Whether there are known factors that could cause these discrepancies?
Additionally, do you have any suggestions to further improve the reproduction to align with the reported results?

Thank you for your time and assistance!

@JeyesHan (Collaborator) commented Jan 13, 2025

@Caesarhhh Thanks for your attention to Infinity. The provided Infinity-2B checkpoint was tested with ImageReward=0.94, HPSv2.1=32.2, and GenEval score=0.73. I think your GenEval result is roughly aligned, since prompt rewriting can differ between runs and influence the final score. But your ImageReward and HPSv2.1 numbers are strange.

Below are my tested HPS v2.1 results:
-----------benchmark score ----------------
images anime 33.58 0.2726
images concept-art 32.12 0.3408
images paintings 31.88 0.2737
images photo 31.18 0.5194
images Average 32.19

My tested ImageReward results:
{"prompts": 100, "images": 1000, "average_image_reward": 0.9360224117940233, "average_clip_scores": 0.2676745575889945}

Note that the provided checkpoint may yield slightly different evaluation results from those reported in the paper (especially ImageReward). This is because, after the paper was released, we slightly changed the data recipe in the last fine-tuning stage to improve Infinity's text rendering, and we also added some regularization to the VAE. These modifications make the results slightly different from the reported ones; however, we only see ImageReward change from 0.96 to 0.94, while HPS v2.1 and GenEval are unchanged.

@Caesarhhh (Author)

Thanks for your time and assistance! Could you please provide the versions of the libraries (e.g., transformers) listed in requirements.txt? I suspect a mismatch in library versions might be causing this discrepancy.

@JeyesHan (Collaborator)

I used "transformers==4.38.2".
