
Add Swiss legal evals as new community tasks #389

Open
wants to merge 23 commits into main
Conversation

JoelNiklaus
Contributor

Adds new community tasks with Swiss legal evaluations. Currently, translation tasks are supported, but others may follow in the future.

@clefourrier
Member

clefourrier commented Nov 12, 2024

@hynky1999 tagging you if you've got a couple of minutes to check the templating when you're back from the offsite

@hynky1999
Collaborator

Re templates:
We don't have a template for translation tasks at the moment.
There are many variants to choose from (see the image below), but I would prefer going with the [src]: [input] [tgt]: format (variant A). Since translation is inherently a cross-lingual task and it's not clear which language we should use (target or source?), this template lets us stay independent of the language (the language labels are fairly standardized, though they will be in Latin script).

@JoelNiklaus
Have you experimented with different prompt formats?

[Image: comparison of translation prompt-format variants] Source: https://arxiv.org/pdf/2301.07069

I can quickly make a PR for the translation template and we can convert it to that.
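
A minimal sketch of such a variant-A prompt function, assuming lighteval's Doc-based prompt-function interface and hypothetical dataset columns (source, target, src_lang, tgt_lang); this is not the template that was ultimately added upstream:

```python
# Minimal sketch of a variant-A ("[src]: [input] [tgt]:") prompt function.
# Assumes lighteval's Doc-based prompt-function interface and hypothetical
# dataset columns (source, target, src_lang, tgt_lang).
from lighteval.tasks.requests import Doc


def translation_prompt_fn(line: dict, task_name: str = None) -> Doc:
    # Language labels stay in Latin script (e.g. "DE", "FR"), so the prompt
    # layout itself is independent of the source/target language.
    src_label = line["src_lang"].upper()
    tgt_label = line["tgt_lang"].upper()
    return Doc(
        task_name=task_name,
        query=f"{src_label}: {line['source']} {tgt_label}:",
        choices=[line["target"]],  # reference translation used by the metrics
        gold_index=0,
    )
```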

@JoelNiklaus
Contributor Author

I haven't experimented with prompts yet. Yes, going with variant A sounds good.

Thanks so much!

@JoelNiklaus
Contributor Author

Btw, what is the reason you are not using the metrics from evaluate?

@clefourrier
Member

clefourrier commented Nov 13, 2024

Evaluate is no longer actively maintained (it's indicated in the GitHub README). We also wanted lighteval to be light and not rely on a heap of dependencies.

@JoelNiklaus
Contributor Author

JoelNiklaus commented Nov 13, 2024

I see. I used the direct implementations of COMET and METEOR, rather than evaluate.
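
Calling those libraries directly looks roughly like this; a minimal sketch, where the example sentences, batch size, and gpus=0 setting are placeholders:

```python
# Minimal sketch of using the COMET (unbabel-comet) and METEOR (nltk)
# implementations directly, instead of going through evaluate.
import nltk
from comet import download_model, load_from_checkpoint
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

# COMET (wmt22-comet-da) scores dicts with source (src), hypothesis (mt), reference (ref).
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet_out = comet_model.predict(
    [{"src": "Das Gericht weist die Beschwerde ab.",
      "mt": "The court dismisses the appeal.",
      "ref": "The court rejects the appeal."}],
    batch_size=8,
    gpus=0,
)

# METEOR operates on pre-tokenized reference(s) and hypothesis.
meteor = meteor_score(
    ["The court rejects the appeal .".split()],
    "The court dismisses the appeal .".split(),
)
print(comet_out.system_score, meteor)
```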

@NathanHB
Member

PR looks great! Do the results on your evals look sound?
Also, you can use the pre-commit hooks to format the files and fix the CI :)

pip install pre-commit
pre-commit install
pre-commit run --all-files

@JoelNiklaus
Contributor Author

Great, thanks!
Just ran the pre-commit hooks.

Couldn't run the evals yet because of the judge prompt. Hope to do that soon.

JoelNiklaus referenced this pull request Nov 20, 2024
* implement translation prompt

* add small comment about translation prompt

* change formatting to reformat language-dependent parts

---------

Co-authored-by: Clémentine Fourrier <[email protected]>
@JoelNiklaus
Contributor Author

JoelNiklaus commented Nov 26, 2024

For some reason, bleurt_large, wmt22-comet-da, and judge_score_gpt-4o are saved to separate, duplicated rows in the details. Also, judge_score_gpt-4o does not show up in the overview:

|community:sdst-text_level:de-fr:3|      0|bleu          |44.8267|±  | 0.5706|
|                                 |       |chrf          |77.1781|±  | 0.5906|
|                                 |       |ter           |54.5455|±  | 0.3742|
|                                 |       |meteor        |62.1061|±  |18.7888|
|                                 |       |BERTScore-P   |98.6389|±  | 0.4811|
|                                 |       |BERTScore-R   |98.6984|±  | 0.6839|
|                                 |       |BERTScore-F   |98.6685|±  | 0.5824|
|                                 |       |bleurt_large  |14.3253|±  | 0.5687|
|                                 |       |wmt22-comet-da|84.5221|±  | 0.4793|

I am running this command:

python -m lighteval accelerate \
  --model_args openai,model=gpt-4o-mini \
  --tasks "community|sdst-text_level:de-fr|3|0" \
  --custom_tasks lighteval/community_tasks/swiss_legal_evals.py \
  --output_dir outputs \
  --override_batch_size 1 \
  --save_details \
  --max_samples 2

@clefourrier @NathanHB Do you know why this is happening and how I can fix it?

@clefourrier
Member

Duplicated rows: yes, each "metric type" leads to its own row, since they are not parsed the same way (to make sure each comes with its own correct logprob-related info, for example). This is a feature, not a bug.
However, I have no idea why the judge eval is not there.

@JoelNiklaus
Contributor Author

JoelNiklaus commented Nov 26, 2024

Found the issue: the corpus_level_fn name did not match the metric name.
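
Roughly, the constraint looks like this; a sketch assuming lighteval's SampleLevelMetricGrouping API, where the import path and the placeholder judge callable are hypothetical and may differ across lighteval versions:

```python
# For a SampleLevelMetricGrouping, the keys of corpus_level_fn (and
# higher_is_better) have to match the entries in metric_name; otherwise the
# aggregated value never shows up in the results overview.
import numpy as np
from lighteval.metrics.utils.metric_utils import (
    MetricCategory,
    MetricUseCase,
    SampleLevelMetricGrouping,
)

judge_metric = SampleLevelMetricGrouping(
    metric_name=["judge_score_gpt-4o"],
    # Placeholder judge; the real task uses an LLM-as-judge callable.
    sample_level_fn=lambda *args, **kwargs: {"judge_score_gpt-4o": 1.0},
    category=MetricCategory.LLM_AS_JUDGE,
    use_case=MetricUseCase.TRANSLATION,
    corpus_level_fn={"judge_score_gpt-4o": np.mean},  # key must equal metric_name
    higher_is_better={"judge_score_gpt-4o": True},
)
```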
