Describe the issue
This is actually not an issue but a simple question. I ran bfcl generate on llama3.1 with the python_ast test category and got a list of results from the model. Next, if I want to get the score and run bfcl evaluate, will it directly compare the model generations against the ground-truth answers, or will it run the model again to produce new answers (i.e., re-run generation inside evaluation)?
bfcl generate generates the model responses; bfcl evaluate takes the output from bfcl generate and compares it with the ground truth. bfcl evaluate will not run generation again.
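For reference, a typical end-to-end run looks like the sketch below. The model identifier is a placeholder, and exact flag names may vary between BFCL versions, so check `bfcl generate --help` for your install:

```shell
# Step 1: query the model and save its responses (no scoring happens here)
bfcl generate --model meta-llama/Llama-3.1-8B-Instruct --test-category python_ast

# Step 2: score the saved responses against the ground truth (no new generation is run)
bfcl evaluate --model meta-llama/Llama-3.1-8B-Instruct --test-category python_ast
```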
Thanks @HuanzhiMao for the reply. I ran it (although there was a problem when generating the CSV) and got the accuracy for each dataset (I ran the python_ast test category). All the subsets score more or less the same as the leaderboard, except BFCL_V3_simple_score.json (I got 24.5, while the leaderboard shows 49.58). Any ideas? Thanks
The simple category on the leaderboard is an unweighted average of BFCL_V3_simple, BFCL_V3_java, and BFCL_V3_javascript, so the number in BFCL_V3_simple_score.json alone is not expected to match the leaderboard figure.
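In other words (with acc_* as placeholders, not the actual leaderboard values): leaderboard simple = (acc_simple + acc_java + acc_javascript) / 3, so a per-file accuracy of 24.5 in BFCL_V3_simple_score.json can still correspond to a higher averaged number once the java and javascript scores are folded in.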