
[BFCL] Mismatch between local evaluation and leaderboard numbers #773

Closed
sileix opened this issue Nov 19, 2024 · 6 comments

sileix commented Nov 19, 2024

Describe the issue
Local evaluation on an A100 yields lower accuracy than the numbers reported on the leaderboard.

ID datapoint
Gorilla repo commit #: 5a42197

We followed the instructions to set up the environment and API keys to evaluate with sglang. We evaluated several models, and the numbers we obtained locally are consistently lower than those reported on the official leaderboard.
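(For context, the local runs roughly followed the standard BFCL flow. This is a sketch assuming the `bfcl` CLI documented in the repo README; the model handle and the `--backend` flag are illustrative assumptions and may differ between versions.)

```sh
# Sketch of the local evaluation flow; flags and model handle are assumptions.
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .

# Generate model responses locally, then score them.
bfcl generate --model MadeAgents/Hammer2.0-1.5b --test-category all --backend sglang
bfcl evaluate --model MadeAgents/Hammer2.0-1.5b --test-category all
```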

Some examples:

| Model | Local Evaluation Acc | Leaderboard Acc | Diff |
| --- | --- | --- | --- |
| Hammer2.0-1.5B (FC) | 49.30% | 51.59% | -2.29% |
| Qwen2.5-1.5B-Instruct (Prompt) | 46.61% | 48.82% | -2.21% |
| Qwen2-1.5B-Instruct (Prompt) | 29.31% | 32.08% | -2.77% |
| xLAM-1b-fc-r (FC) | 24.58% | 25.14% | -0.56% |

Is this expected? What could be the potential reason for the difference? Thanks in advance.

HuanzhiMao (Collaborator) commented:

Hi @sileix,
Could you provide the detailed score breakdown for your local evaluation results (e.g., the data_overall.csv)?


sileix commented Nov 22, 2024

Thanks for the quick response. Please see below:

```
Rank,Overall Acc,Model,Model Link,Cost ($ Per 1k Function Calls),Latency Mean (s),Latency Standard Deviation (s),Latency 95th Percentile (s),Non-Live AST Acc,Non-Live Simple AST,Non-Live Multiple AST,Non-Live Parallel AST,Non-Live Parallel Multiple AST,Non-Live Exec Acc,Non-Live Simple Exec,Non-Live Multiple Exec,Non-Live Parallel Exec,Non-Live Parallel Multiple Exec,Live Acc,Live Simple AST,Live Multiple AST,Live Parallel AST,Live Parallel Multiple AST,Multi Turn Acc,Multi Turn Base,Multi Turn Miss Func,Multi Turn Miss Param,Multi Turn Long Context,Multi Turn Composite,Relevance Detection,Irrelevance Detection,Organization,License
1,49.30%,Hammer2.0-1.5b (FC),https://huggingface.co/MadeAgents/Hammer2.0-1.5b,N/A,N/A,N/A,N/A,84.29%,75.17%,92.00%,88.00%,82.00%,87.07%,93.29%,92.00%,88.00%,75.00%,62.86%,69.38%,68.08%,56.25%,70.83%,1.38%,3.00%,0.50%,1.00%,1.00%,N/A,92.68%,60.38%,MadeAgents,cc-by-nc-4.0
4,46.61%,Qwen2.5-1.5B-Instruct (Prompt),https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct,N/A,N/A,N/A,N/A,75.81%,72.25%,85.50%,75.50%,70.00%,83.29%,75.14%,92.00%,86.00%,80.00%,60.06%,66.28%,59.31%,50.00%,50.00%,1.88%,2.50%,2.00%,1.50%,1.50%,N/A,73.17%,61.78%,Qwen,apache-2.0
5,29.31%,Qwen2-1.5B-Instruct (Prompt),https://huggingface.co/Qwen/Qwen2-1.5B-Instruct,N/A,N/A,N/A,N/A,53.67%,50.67%,77.50%,45.50%,41.00%,55.66%,45.64%,76.00%,56.00%,45.00%,37.85%,44.96%,36.26%,18.75%,25.00%,0.25%,0.50%,0.00%,0.50%,0.00%,N/A,78.05%,23.85%,Qwen,apache-2.0
6,24.58%,xLAM-1b-fc-r (FC),https://huggingface.co/Salesforce/xLAM-1b-fc-r,N/A,N/A,N/A,N/A,40.92%,75.17%,86.50%,1.00%,1.00%,40.55%,68.21%,88.00%,6.00%,0.00%,37.41%,61.24%,54.68%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,N/A,97.56%,5.02%,Salesforce,cc-by-nc-4.0
```
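(For readability, a small shell helper can pull out just the headline columns of this dump; the column positions assume the header above, and this is only an inspection aid, not part of BFCL itself.)

```sh
# Print Overall, Non-Live AST, Non-Live Exec, Live, and Multi Turn accuracy per model.
# Naive comma split is safe here because no field in this file contains a comma.
cut -d, -f2,3,9,14,19,24 data_overall.csv | column -s, -t
```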

HuanzhiMao (Collaborator) commented:

Hi,
I think the numbers are expected. Looking at the most recent leaderboard scores we have in #748, Hammer2.0-1.5B (FC) has 49.68% while you reported 49.30%. Perhaps you were looking at an old score from before we updated the leaderboard?


sileix commented Nov 22, 2024

@HuanzhiMao Thank you for the update. It seems the leaderboard numbers were updated in the last few days.
After the update, all my numbers are within 0.5% of the leaderboard, except for Qwen2-1.5B-Instruct (Prompt), which is still 1.73% lower.
Is there any particular update that leads to a global drop in accuracy numbers on the leaderboard?

HuanzhiMao (Collaborator) commented:

As you can see in the PR description, that leaderboard update contains the combined effect of 12 PRs; some boost scores, some decrease them. I think #733 might be responsible for the biggest drop, as it introduced new evaluation metrics.
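(One way to rule out this kind of version skew when comparing against a leaderboard snapshot is to evaluate at the exact commit the snapshot was generated from. A sketch, using the commit hash from the issue description purely as an example; the actual leaderboard snapshot may correspond to a different commit.)

```sh
# Pin the checkout to a known commit before evaluating; 5a42197 is the commit
# from the issue description, used here only as an example.
cd gorilla
git checkout 5a42197
cd berkeley-function-call-leaderboard
pip install -e .
```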


sileix commented Nov 22, 2024

I see. Thanks again for the detailed info! @HuanzhiMao

@sileix sileix closed this as completed Nov 22, 2024