
Add Swiss legal evals as new community tasks #389

Open
wants to merge 23 commits into main
Conversation

JoelNiklaus
Contributor

Adds new community tasks with Swiss legal evaluations. Currently, translation tasks are supported, but others may follow in the future.

@clefourrier
Member

clefourrier commented Nov 12, 2024

@hynky1999 tagging you if you've got a couple of minutes to check the templating when you're back from the offsite

@hynky1999
Collaborator

Re templates:
We don't have a template for translation tasks at the moment.
There are many variants to choose from (see the image below), but I would prefer going with the [src]: [input] [tgt]: format (variant A). Since translation is inherently a cross-lingual task and it's not clear which language we should use (target or source?), this template lets us stay independent of the language (the language labels are fairly standardized, though they will be in Latin script).

@JoelNiklaus
Have you experimented with different prompt formats?

[Image: comparison of translation prompt-format variants] Source: https://arxiv.org/pdf/2301.07069

I can quickly make a PR for the translation template and we can convert it to that.
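
A minimal sketch of such a variant-A prompt function, assuming lighteval's Doc-based prompt-function interface and hypothetical dataset columns (source, target, src_lang, tgt_lang); this is not the template that was ultimately added upstream:

```python
# Minimal sketch of a variant-A ("[src]: [input] [tgt]:") prompt function.
# Assumes lighteval's Doc-based prompt-function interface and hypothetical
# dataset columns (source, target, src_lang, tgt_lang).
from lighteval.tasks.requests import Doc


def translation_prompt_fn(line: dict, task_name: str = None) -> Doc:
    # Language labels stay in Latin script (e.g. "DE", "FR"), so the prompt
    # layout itself is independent of the source/target language.
    src_label = line["src_lang"].upper()
    tgt_label = line["tgt_lang"].upper()
    return Doc(
        task_name=task_name,
        query=f"{src_label}: {line['source']} {tgt_label}:",
        choices=[line["target"]],  # reference translation used by the metrics
        gold_index=0,
    )
```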

@JoelNiklaus
Contributor Author

I haven't experimented with prompts yet. Yes, going with variant A sounds good.

Thanks so much!

@JoelNiklaus
Contributor Author

Btw, what is the reason you are not using the metrics from evaluate?

@clefourrier
Member

clefourrier commented Nov 13, 2024

Evaluate is no longer actively maintained (it's indicated in the GitHub README). We also wanted lighteval to be light and not rely on a heap of dependencies.

@JoelNiklaus
Contributor Author

JoelNiklaus commented Nov 13, 2024

I see. I used the direct implementations of COMET and METEOR, rather than evaluate.
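
Calling those libraries directly looks roughly like this; a minimal sketch, where the example sentences, batch size, and gpus=0 setting are placeholders:

```python
# Minimal sketch of using the COMET (unbabel-comet) and METEOR (nltk)
# implementations directly, instead of going through evaluate.
import nltk
from comet import download_model, load_from_checkpoint
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

# COMET (wmt22-comet-da) scores dicts with source (src), hypothesis (mt), reference (ref).
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet_out = comet_model.predict(
    [{"src": "Das Gericht weist die Beschwerde ab.",
      "mt": "The court dismisses the appeal.",
      "ref": "The court rejects the appeal."}],
    batch_size=8,
    gpus=0,
)

# METEOR operates on pre-tokenized reference(s) and hypothesis.
meteor = meteor_score(
    ["The court rejects the appeal .".split()],
    "The court dismisses the appeal .".split(),
)
print(comet_out.system_score, meteor)
```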

@NathanHB
Member

PR looks great! Do the results on your evals look sound?
Also, you can use the pre-commit hooks to format the files and fix the CI :)

pip install pre-commit
pre-commit install
pre-commit run --all-files

@JoelNiklaus
Contributor Author

Great, thanks!
Just ran the pre-commit hooks.

Couldn't run the evals yet because of the judge prompt. Hope to do that soon.

JoelNiklaus referenced this pull request Nov 20, 2024
* implement translation prompt

* add small comment about translation prompt

* change formatting to reformat language-dependent parts

---------

Co-authored-by: Clémentine Fourrier <[email protected]>
@JoelNiklaus
Contributor Author

JoelNiklaus commented Nov 26, 2024

For some reason, bleurt_large, wmt22-comet-da, and judge_score_gpt-4o are saved to separate, duplicated rows in the details. Also, judge_score_gpt-4o does not show up in the overview:

|community:sdst-text_level:de-fr:3|      0|bleu          |44.8267|±  | 0.5706|
|                                 |       |chrf          |77.1781|±  | 0.5906|
|                                 |       |ter           |54.5455|±  | 0.3742|
|                                 |       |meteor        |62.1061|±  |18.7888|
|                                 |       |BERTScore-P   |98.6389|±  | 0.4811|
|                                 |       |BERTScore-R   |98.6984|±  | 0.6839|
|                                 |       |BERTScore-F   |98.6685|±  | 0.5824|
|                                 |       |bleurt_large  |14.3253|±  | 0.5687|
|                                 |       |wmt22-comet-da|84.5221|±  | 0.4793|

I am running this command:

python -m lighteval accelerate \
  --model_args openai,model=gpt-4o-mini \
  --tasks "community|sdst-text_level:de-fr|3|0" \
  --custom_tasks lighteval/community_tasks/swiss_legal_evals.py \
  --output_dir outputs \
  --override_batch_size 1 \
  --save_details \
  --max_samples 2

@clefourrier @NathanHB Do you know why this is happening and how I can fix it?

@clefourrier
Member

Duplicated rows: yes, each "metric type" leads to its own row, since they are not parsed the same way (to make sure each comes with its own correct logprob-related info, for example). This is a feature, not a bug.
However, I have no idea why the judge eval is not there.

@JoelNiklaus
Contributor Author

JoelNiklaus commented Nov 26, 2024

Found the issue: the corpus_level_fn name did not match the metric name.
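
Roughly, the constraint looks like this; a sketch assuming lighteval's SampleLevelMetricGrouping API, where the import path and the placeholder judge callable are hypothetical and may differ across lighteval versions:

```python
# For a SampleLevelMetricGrouping, the keys of corpus_level_fn (and
# higher_is_better) have to match the entries in metric_name; otherwise the
# aggregated value never shows up in the results overview.
import numpy as np
from lighteval.metrics.utils.metric_utils import (
    MetricCategory,
    MetricUseCase,
    SampleLevelMetricGrouping,
)

judge_metric = SampleLevelMetricGrouping(
    metric_name=["judge_score_gpt-4o"],
    # Placeholder judge; the real task uses an LLM-as-judge callable.
    sample_level_fn=lambda *args, **kwargs: {"judge_score_gpt-4o": 1.0},
    category=MetricCategory.LLM_AS_JUDGE,
    use_case=MetricUseCase.TRANSLATION,
    corpus_level_fn={"judge_score_gpt-4o": np.mean},  # key must equal metric_name
    higher_is_better={"judge_score_gpt-4o": True},
)
```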
