[BFCL] Mismatch between local evaluation and leaderboard numbers #773
Comments
Hi @sileix,
Thanks for the quick response. Please see below:
Hi,
@HuanzhiMao Thank you for the update. It seems the numbers on the leaderboard were updated in the last few days.
As you can see in the PR description, that leaderboard update contains the effect of 12 PRs; some led to score boosts, others to score decreases. I think #733 likely caused the biggest drop, as it introduced new evaluation metrics.
I see. Thanks again for the detailed info! @HuanzhiMao |
Describe the issue
Local evaluation on an A100 yields lower accuracy than the numbers reported on the leaderboard.
ID datapoint
Gorilla repo commit #: 5a42197
We followed the instructions to set up the environment and API keys, and ran the evaluation with sglang. We have evaluated several models; the numbers we obtained locally are consistently lower than those reported on the official leaderboard.
Some examples:
Is this expected? What could be the potential reason for the difference? Thanks in advance.
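For anyone triaging a similar report, a minimal sketch for tabulating the local-vs-leaderboard gap in percentage points. All model names and scores below are hypothetical placeholders (the reporter's actual comparison table is not reproduced above):

```python
# Sketch: tabulate the gap between local and leaderboard accuracy.
# All model names and scores are HYPOTHETICAL placeholders, not the
# reporter's actual numbers.

def score_delta(local: float, leaderboard: float) -> float:
    """Return local minus leaderboard accuracy, in percentage points."""
    return round(local - leaderboard, 2)

# (local accuracy %, leaderboard accuracy %) -- placeholder values
examples = {
    "model-a": (58.10, 61.25),
    "model-b": (70.00, 72.40),
}

for model, (local, lb) in examples.items():
    gap = score_delta(local, lb)
    print(f"{model}: local={local:.2f}, leaderboard={lb:.2f}, gap={gap:+.2f} pp")
```

Reporting the gap in percentage points per model makes it easier to see whether the discrepancy is uniform (suggesting an evaluation-pipeline difference, e.g. a dataset or metric change between commits) or model-specific.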