Reproduce the reported values in the paper #42
Comments
@Caesarhhh Thanks for your attention to Infinity. The provided Infinity-2B checkpoint was tested with ImageReward = 0.94, HPSv2.1 = 32.2, and GenEval = 0.73. I think your GenEval result is roughly aligned, since prompt rewriting can differ between runs and influence the final score, but the ImageReward and HPSv2.1 numbers are strange.
My tested HPSv2.1 results: [screenshot attached]
My tested ImageReward results: [screenshot attached]
Note that the provided checkpoint may yield slightly different evaluation results from those reported in the paper (especially for ImageReward). This is because, after the paper was released, we slightly changed the data recipe in the last fine-tuning stage to improve Infinity's ability in text rendering, and we also added some regularization to the VAE. These modifications make the results slightly different from the paper; however, we only see ImageReward change from 0.96 to 0.94, while HPSv2.1 and GenEval are unchanged.
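For reference, a minimal sketch of how the two metrics can be computed standalone, assuming the official `image-reward` and `hpsv2` packages and their documented entry points; the prompt and image paths are placeholders, not files from this repo:

```python
# Sketch of standalone metric calls, assuming: pip install image-reward hpsv2
# The prompt and image paths below are placeholders for illustration only.
import ImageReward as RM
import hpsv2

prompt = "a photo of a red apple on a wooden table"
images = ["sample_0.png", "sample_1.png"]

# ImageReward: load the released reward model and score each image against the prompt.
rm = RM.load("ImageReward-v1.0")
image_rewards = rm.score(prompt, images)

# HPSv2.1: score the same images with the v2.1 human-preference model.
hps_scores = [hpsv2.score(img, prompt, hps_version="v2.1") for img in images]

print(image_rewards, hps_scores)
```

Comparing such standalone scores against the numbers produced by `eval.sh` can help localize whether a discrepancy comes from generation or from the metric setup.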
Thanks for your time and assistance! Could you please provide the versions of the libraries used for evaluation (e.g., `transformers`)?
I used `transformers==4.38.2`.
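A quick, minimal way to check that a local environment matches the version quoted above (the expected version string comes from the reply; everything else is just an illustrative check):

```python
# Verify the installed transformers version against the one reported above.
import transformers

EXPECTED = "4.38.2"
installed = transformers.__version__
if installed != EXPECTED:
    print(f"Warning: transformers=={installed} is installed, "
          f"but transformers=={EXPECTED} was used for the reported numbers.")
else:
    print(f"transformers=={installed} matches the reported setup.")
```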
Hi,
Thank you for sharing your excellent work. I have been trying to reproduce the results reported in the paper using the provided `eval.sh`. However, I noticed some discrepancies between my results and those in the paper:
ImageReward: Paper reports 0.962, I reproduced 0.9212.
HPSv2.1: Paper reports 32.25, I reproduced 30.36.
GenEval: After rewriting prompts with the provided script, my results are as follows:
position = 41.75% (167 / 400)
colors = 84.31% (317 / 376)
color_attr = 55.25% (221 / 400)
counting = 68.44% (219 / 320)
single_object = 100.00% (320 / 320)
two_object = 85.86% (340 / 396)
Overall score: 0.72601
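For completeness, the overall score above is consistent with the unweighted mean of the six per-category accuracies. A quick sanity check (assuming that is how GenEval aggregates, which the numbers above suggest):

```python
# Recompute the overall GenEval score as the unweighted mean of per-category accuracies.
counts = {
    "position":      (167, 400),
    "colors":        (317, 376),
    "color_attr":    (221, 400),
    "counting":      (219, 320),
    "single_object": (320, 320),
    "two_object":    (340, 396),
}
per_category = {name: ok / total for name, (ok, total) in counts.items()}
overall = sum(per_category.values()) / len(per_category)
print(f"{overall:.5f}")  # 0.72601
```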
I am wondering whether these differences could be due to:
Randomness in evaluation (e.g., sampling seeds; see the sketch after this list) or implementation details that may vary across different machines or setups.
Additional hyperparameters, configurations, or preprocessing steps not explicitly mentioned in the repository or paper.
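To rule out the first point, one option is to pin all random seeds before generation. A minimal sketch, assuming a PyTorch-based pipeline as used in this repo (the helper name is hypothetical):

```python
# Hypothetical helper to reduce run-to-run variance during sampling.
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # CPU RNG
    torch.cuda.manual_seed_all(seed)  # all CUDA devices (no-op without a GPU)
```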
Could you please clarify:
If there are any additional steps or configurations required to reproduce the exact results?
Whether there are known factors that could cause these discrepancies?
Additionally, do you have any suggestions for bringing my reproduction closer to the reported results?
Thank you for your time and assistance!