[BFCL] Are there any dependencies between 'bfcl generate' and 'bfcl evaluate'? #734

Open
TurboMa opened this issue Nov 4, 2024 · 3 comments
Labels: BFCL-General (General BFCL Issue)

Comments

TurboMa commented Nov 4, 2024

Describe the issue
This is not really an issue, just a simple question. I have run `bfcl generate` on Llama 3.1 with the `python_ast` test category and got a list of results from the model. Next, to get the score: when I run `bfcl evaluate`, will it directly compare the model generations against the reference answers, or will it run the model again to produce new answers (i.e., run generation again inside evaluation)?

@HuanzhiMao (Collaborator)

`bfcl generate` produces the model responses; `bfcl evaluate` takes the output of `bfcl generate` and compares it against the ground truth. `bfcl evaluate` will not run generation again.
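
For readers following along, here is a minimal sketch of that data flow: `bfcl evaluate` only reads the result files that `bfcl generate` already wrote to disk plus the bundled ground-truth answers, and it never calls the model. The file layout, the field names (`id`, `result`, `ground_truth`), and the `matches` helper below are illustrative placeholders, not BFCL's actual internals.

```python
import json

def load_jsonl(path):
    """Load a JSON Lines file into a dict keyed by entry id."""
    entries = {}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            entries[record["id"]] = record
    return entries

def matches(generated, answer):
    """Placeholder for BFCL's AST-based checker (simplified to equality here)."""
    return generated.get("result") == answer.get("ground_truth")

def evaluate(result_path, answer_path):
    """Score stored generations against ground truth; no model calls happen here."""
    results = load_jsonl(result_path)   # output written earlier by `bfcl generate`
    answers = load_jsonl(answer_path)   # reference answers shipped with the benchmark
    correct = sum(
        1
        for entry_id, answer in answers.items()
        if entry_id in results and matches(results[entry_id], answer)
    )
    return correct / len(answers)
```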

TurboMa commented Nov 4, 2024

Thanks @HuanzhiMao for the reply. I ran it (although there was a problem when generating the CSV) and got the accuracy for each dataset in the `python_ast` test category. All the subsets come out with more or less the same accuracy as the leaderboard, except `BFCL_V3_simple_score.json` (I got 24.5, while the leaderboard shows 49.58). Any ideas? Thanks.

@HuanzhiMao (Collaborator)

The simple category on the leaderboard is an unweighted average of `BFCL_V3_simple`, `BFCL_V3_java`, and `BFCL_V3_javascript`.
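
To make that concrete: the leaderboard number is just the arithmetic mean of the three per-category accuracies, so the `BFCL_V3_simple_score.json` figure alone (24.5 here) is not directly comparable to the leaderboard's 49.58; the Java and JavaScript accuracies also feed into it. A tiny sketch (the function name is just for illustration):

```python
def leaderboard_simple_score(simple_acc: float, java_acc: float, javascript_acc: float) -> float:
    """Unweighted average of the three simple-style categories, mirroring
    how the leaderboard's 'simple' column is reported."""
    return (simple_acc + java_acc + javascript_acc) / 3
```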

@HuanzhiMao added the BFCL-General (General BFCL Issue) label Nov 7, 2024