
[BFCL] Mismatch between local evaluation and leaderboard numbers #773

Closed
sileix opened this issue Nov 19, 2024 · 6 comments

sileix commented Nov 19, 2024

Describe the issue
Local evaluation on an A100 yields lower accuracy than the numbers reported on the leaderboard.

ID datapoint
Gorilla repo commit #: 5a42197

We followed the instructions to set up the environment and API keys to evaluate with sglang. We evaluated several models, and the numbers we obtained locally are consistently lower than those reported on the official leaderboard.
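(For context, the local runs roughly followed the standard BFCL flow. This is a sketch assuming the `bfcl` CLI documented in the repo README; the model handle and the `--backend` flag are illustrative assumptions and may differ between versions.)

```sh
# Sketch of the local evaluation flow; flags and model handle are assumptions.
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .

# Generate model responses locally, then score them.
bfcl generate --model MadeAgents/Hammer2.0-1.5b --test-category all --backend sglang
bfcl evaluate --model MadeAgents/Hammer2.0-1.5b --test-category all
```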

Some examples:

| Model | Local Evaluation Acc | Leaderboard Acc | Diff |
| --- | --- | --- | --- |
| Hammer2.0-1.5B (FC) | 49.30% | 51.59% | -2.29% |
| Qwen2.5-1.5B-Instruct (Prompt) | 46.61% | 48.82% | -2.21% |
| Qwen2-1.5B-Instruct (Prompt) | 29.31% | 32.08% | -2.77% |
| xLAM-1b-fc-r (FC) | 24.58% | 25.14% | -0.56% |

Is this expected? What could be the potential reason for the difference? Thanks in advance.

HuanzhiMao (Collaborator) commented:

Hi @sileix,
Could you provide the detailed score breakdown for your local evaluation results (e.g., the data_overall.csv)?


sileix commented Nov 22, 2024

Thanks for the quick response. Please see below:

```
Rank,Overall Acc,Model,Model Link,Cost ($ Per 1k Function Calls),Latency Mean (s),Latency Standard Deviation (s),Latency 95th Percentile (s),Non-Live AST Acc,Non-Live Simple AST,Non-Live Multiple AST,Non-Live Parallel AST,Non-Live Parallel Multiple AST,Non-Live Exec Acc,Non-Live Simple Exec,Non-Live Multiple Exec,Non-Live Parallel Exec,Non-Live Parallel Multiple Exec,Live Acc,Live Simple AST,Live Multiple AST,Live Parallel AST,Live Parallel Multiple AST,Multi Turn Acc,Multi Turn Base,Multi Turn Miss Func,Multi Turn Miss Param,Multi Turn Long Context,Multi Turn Composite,Relevance Detection,Irrelevance Detection,Organization,License
1,49.30%,Hammer2.0-1.5b (FC),https://huggingface.co/MadeAgents/Hammer2.0-1.5b,N/A,N/A,N/A,N/A,84.29%,75.17%,92.00%,88.00%,82.00%,87.07%,93.29%,92.00%,88.00%,75.00%,62.86%,69.38%,68.08%,56.25%,70.83%,1.38%,3.00%,0.50%,1.00%,1.00%,N/A,92.68%,60.38%,MadeAgents,cc-by-nc-4.0
4,46.61%,Qwen2.5-1.5B-Instruct (Prompt),https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct,N/A,N/A,N/A,N/A,75.81%,72.25%,85.50%,75.50%,70.00%,83.29%,75.14%,92.00%,86.00%,80.00%,60.06%,66.28%,59.31%,50.00%,50.00%,1.88%,2.50%,2.00%,1.50%,1.50%,N/A,73.17%,61.78%,Qwen,apache-2.0
5,29.31%,Qwen2-1.5B-Instruct (Prompt),https://huggingface.co/Qwen/Qwen2-1.5B-Instruct,N/A,N/A,N/A,N/A,53.67%,50.67%,77.50%,45.50%,41.00%,55.66%,45.64%,76.00%,56.00%,45.00%,37.85%,44.96%,36.26%,18.75%,25.00%,0.25%,0.50%,0.00%,0.50%,0.00%,N/A,78.05%,23.85%,Qwen,apache-2.0
6,24.58%,xLAM-1b-fc-r (FC),https://huggingface.co/Salesforce/xLAM-1b-fc-r,N/A,N/A,N/A,N/A,40.92%,75.17%,86.50%,1.00%,1.00%,40.55%,68.21%,88.00%,6.00%,0.00%,37.41%,61.24%,54.68%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,N/A,97.56%,5.02%,Salesforce,cc-by-nc-4.0
```
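(For readability, a small shell helper can pull out just the headline columns of this dump; the column positions assume the header above, and this is only an inspection aid, not part of BFCL itself.)

```sh
# Print Overall, Non-Live AST, Non-Live Exec, Live, and Multi Turn accuracy per model.
# Naive comma split is safe here because no field in this file contains a comma.
cut -d, -f2,3,9,14,19,24 data_overall.csv | column -s, -t
```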

HuanzhiMao (Collaborator) commented:

Hi,
I think the numbers are expected. Looking at the most recent leaderboard scores we have in #748, Hammer2.0-1.5B (FC) has 49.68% while you reported 49.30%. Perhaps you were looking at an old score from before we updated the leaderboard?


sileix commented Nov 22, 2024

@HuanzhiMao Thank you for the update. It seems the leaderboard numbers were updated in the last few days.
After the update, all my numbers are within 0.5% of the leaderboard, except for Qwen2-1.5B-Instruct (Prompt), which is still 1.73% lower.
Is there any particular update that leads to a global drop in accuracy numbers on the leaderboard?

HuanzhiMao (Collaborator) commented:

As you can see in the PR description, that leaderboard update contains the combined effect of 12 PRs; some boost scores, some decrease them. I think #733 might be responsible for the biggest drop, as it introduced new evaluation metrics.
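(One way to rule out this kind of version skew when comparing against a leaderboard snapshot is to evaluate at the exact commit the snapshot was generated from. A sketch, using the commit hash from the issue description purely as an example; the actual leaderboard snapshot may correspond to a different commit.)

```sh
# Pin the checkout to a known commit before evaluating; 5a42197 is the commit
# from the issue description, used here only as an example.
cd gorilla
git checkout 5a42197
cd berkeley-function-call-leaderboard
pip install -e .
```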


sileix commented Nov 22, 2024

I see. Thanks again for the detailed info! @HuanzhiMao

@sileix sileix closed this as completed Nov 22, 2024