Describe the issue
This is actually not an issue but a simple question. I ran bfcl generate on llama3.1 with the python_ast test category and got a list of results from the model. Next, if I want to get the score and run bfcl evaluate, will it directly compare the model generations against the ground-truth answers, or will it run the model again to produce new answers (i.e., re-run generation inside evaluation)?
bfcl generate generates the model responses; bfcl evaluate takes the output from bfcl generate and compares it with the ground truth. bfcl evaluate will not run generation again.
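For reference, a typical end-to-end run looks like the sketch below. The model identifier is a placeholder, and exact flag names may vary between BFCL versions, so check `bfcl generate --help` for your install:

```shell
# Step 1: query the model and save its responses (no scoring happens here)
bfcl generate --model meta-llama/Llama-3.1-8B-Instruct --test-category python_ast

# Step 2: score the saved responses against the ground truth (no new generation is run)
bfcl evaluate --model meta-llama/Llama-3.1-8B-Instruct --test-category python_ast
```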
Thanks @HuanzhiMao for the reply. I ran it (although there was a problem when generating the CSV) and got the accuracy for each dataset (I ran the python_ast test category). All the subsets score more or less the same as the leaderboard, except BFCL_V3_simple_score.json (I got 24.5, while the leaderboard shows 49.58). Any ideas? Thanks
The simple category on the leaderboard is an unweighted average of BFCL_V3_simple, BFCL_V3_java, and BFCL_V3_javascript, so the number in BFCL_V3_simple_score.json alone is not expected to match the leaderboard figure.
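In other words (with acc_* as placeholders, not the actual leaderboard values): leaderboard simple = (acc_simple + acc_java + acc_javascript) / 3, so a per-file accuracy of 24.5 in BFCL_V3_simple_score.json can still correspond to a higher averaged number once the java and javascript scores are folded in.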