Add swiss legal evals as new community tasks #389
@hynky1999 tagging you if you've got a couple minutes to check the templating when back from the offsite
Re templates: @JoelNiklaus I can quickly make a PR for the translation template and we can convert it to that.
I haven't experimented with prompts yet. Yes, going with variant A sounds good. Thanks so much!
Btw, what is the reason you are not using the metrics from evaluate?
Evaluate is no longer actively maintained (this is indicated in its GitHub README). We also wanted lighteval to be light and not rely on a heap of dependencies.
I see. I used the direct implementations of COMET and METEOR rather than evaluate.
PR looks great! Do the results on your evals look sound?
Great, thanks! Couldn't run the evals yet because of the judge prompt. Hope to do that soon.
* implement translation prompt
* add small comment about translation prompt
* change formatting to reformat language-dependent parts
---------
Co-authored-by: Clémentine Fourrier <[email protected]>
For some reason, bleurt_large, wmt22-comet-da, and judge_score_gpt-4o are saved to separate, duplicated rows in the details. Also, judge_score_gpt-4o does not show up in the overview:
I am running this command:
@clefourrier @NathanHB Do you know why this is happening and how I can fix it?
Duplicated rows: yes, each "metric type" leads to its own row, since they are not parsed the same way (to make sure each comes with its own correct logprob-related info, for example). This is a feature, not a bug.
Found the issue: the corpus_level_fn name did not match the metric name.
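For anyone hitting the same symptom, here is a minimal sketch of how a name mismatch like this can silently drop a metric from an overview. It is not lighteval's actual code: `aggregate`, `find_unaggregated_metrics`, and the dictionary layout are hypothetical, and the sketch only assumes that per-sample scores and corpus-level aggregation functions are looked up by the same metric name.

```python
from statistics import mean


def aggregate(sample_scores, corpus_level_fn):
    """Aggregate per-sample scores using the aggregator registered under the same name.

    Hypothetical helper: scores whose name has no matching aggregator are skipped,
    which is how a metric can vanish from the overview without raising an error.
    """
    return {
        name: fn(sample_scores[name])
        for name, fn in corpus_level_fn.items()
        if name in sample_scores
    }


def find_unaggregated_metrics(sample_scores, corpus_level_fn):
    """Return the metric names that have sample scores but no aggregator."""
    return sorted(set(sample_scores) - set(corpus_level_fn))


# Illustrative per-sample scores (made-up values, not real results).
sample_scores = {
    "judge_score_gpt-4o": [1.0, 0.0, 1.0],
    "bleurt_large": [0.5, 0.75, 0.25],
}

# Mismatch: the aggregator is registered as "judge_score", not "judge_score_gpt-4o".
mismatched = {"judge_score": mean, "bleurt_large": mean}
overview = aggregate(sample_scores, mismatched)
missing = find_unaggregated_metrics(sample_scores, mismatched)

# Fix: register the aggregator under exactly the metric name.
matched = {"judge_score_gpt-4o": mean, "bleurt_large": mean}
fixed_overview = aggregate(sample_scores, matched)
```

With the mismatched names, `missing` lists `judge_score_gpt-4o` and the metric is absent from `overview`; once the keys agree, it appears in `fixed_overview`. A check like `find_unaggregated_metrics` could be run before an evaluation to catch this early.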
Adds new community tasks with Swiss legal evaluations. Currently, translation tasks are supported, but others may follow in the future.