Commit c2c8e23

Add Latxa paper evaluation tasks for Basque (EleutherAI#1654)
* add basqueglue
* add eus_exams
* add eus_proficiency
* add eus_reading
* add eus_trivia
* run pre-commit
1 parent ab7cc6b commit c2c8e23

85 files changed (+933, -0 lines)


lm_eval/tasks/basqueglue/README.md

+72
@@ -0,0 +1,72 @@
# BasqueGLUE

### Paper

Title: `BasqueGLUE: A Natural Language Understanding Benchmark for Basque`

Abstract: `https://aclanthology.org/2022.lrec-1.172/`

Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.

Homepage: `https://github.com/orai-nlp/BasqueGLUE`

Title: `Latxa: An Open Language Model and Evaluation Suite for Basque`

Abstract: `https://arxiv.org/abs/2403.20266`

This paper presents the use of BasqueGLUE for evaluating decoder models in Basque.

Homepage: `https://github.com/hitz-zentroa/latxa`

### Citation

```
@InProceedings{urbizu2022basqueglue,
  author = {Urbizu, Gorka and San Vicente, Iñaki and Saralegi, Xabier and Agerri, Rodrigo and Soroa, Aitor},
  title = {BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month = {June},
  year = {2022},
  address = {Marseille, France},
  publisher = {European Language Resources Association},
  pages = {1603--1612},
  url = {https://aclanthology.org/2022.lrec-1.172}
}

@misc{etxaniz2024latxa,
  title={Latxa: An Open Language Model and Evaluation Suite for Basque},
  author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
  year={2024},
  eprint={2403.20266},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Groups and Tasks

#### Groups

* `basque-glue`: First version of the implementation

#### Tasks

* `bhtc_v2`: Topic classification of news extracts with 12 categories.
* `bec2016eu`: Sentiment analysis on tweets about the campaign for the 2016 Basque elections.
* `vaxx_stance`: Stance detection on tweets around the anti-vaccine movement.
* `qnlieu`: Q&A NLI as in [glue/qnli](../glue/qnli).
* `wiceu`: Word-in-Context as in [super_glue/wic](../super_glue/wic).
* `epec_koref_bin`: Coreference detection as in [super_glue/wsc](../super_glue/wsc).

### Checklist

For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?

lm_eval/tasks/basqueglue/bec.yaml

+16
@@ -0,0 +1,16 @@
group: basque-glue
task: bec2016eu
dataset_path: orai-nlp/basqueGLUE
dataset_name: bec
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "Testua: {{text}}\nGaldera: Nolako jarrera agertzen du aurreko testuak?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['negatiboa', 'neutrala', 'positiboa']
metric_list:
  - metric: f1
    aggregation: !function utils.micro_f1_score
    higher_is_better: true
metadata:
  - version: 1.0
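
As a rough illustration of how a `multiple_choice` config like this one pairs a prompt with its candidate answers, the sketch below fills the `doc_to_text` template by hand and uses the integer `label` as an index into `doc_to_choice`. The helper name `render_bec` and the sample document are hypothetical; the harness renders the template with Jinja2 and scores each choice by log-likelihood, which this sketch does not attempt.

```python
# Hypothetical stand-in for the template in bec.yaml; not harness code.
CHOICES = ["negatiboa", "neutrala", "positiboa"]  # copied from doc_to_choice


def render_bec(doc):
    """Return (prompt, candidate continuations, gold continuation)."""
    prompt = (
        f"Testua: {doc['text']}\n"
        "Galdera: Nolako jarrera agertzen du aurreko testuak?\n"
        "Erantzuna:"
    )
    # doc_to_target is the integer label, used as an index into CHOICES.
    return prompt, CHOICES, CHOICES[doc["label"]]


# Made-up document with the two fields the config reads.
prompt, choices, gold = render_bec({"text": "Oso hitzaldi ona!", "label": 2})
print(prompt)
print(gold)
```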

lm_eval/tasks/basqueglue/bhtc.yaml

+16
@@ -0,0 +1,16 @@
group: basque-glue
task: bhtc_v2
dataset_path: orai-nlp/basqueGLUE
dataset_name: bhtc
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "Testua: {{text}}\nGaldera: Zein da aurreko testuaren gaia?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['Ekonomia', 'Euskal Herria', 'Euskara', 'Gizartea', 'Historia', 'Ingurumena', 'Iritzia', 'Komunikazioa', 'Kultura', 'Nazioartea', 'Politika', 'Zientzia']
metric_list:
  - metric: f1
    aggregation: !function utils.micro_f1_score
    higher_is_better: true
metadata:
  - version: 1.0

lm_eval/tasks/basqueglue/coref.yaml

+16
@@ -0,0 +1,16 @@
group: basque-glue
task: epec_koref_bin
dataset_path: orai-nlp/basqueGLUE
dataset_name: coref
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: !function utils.coref_doc_to_text
doc_to_target: label
doc_to_choice: ['ez', 'bai']
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  - version: 1.0
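
Here `doc_to_text` delegates to `utils.coref_doc_to_text`, which wraps each mention in asterisks before building the prompt. A minimal sketch of that marking step on a made-up sentence follows; `mark_span` is a hypothetical standalone rewrite of the inner helper, and it expects 0-based indices (the original subtracts 1 from `span2_index` before calling it).

```python
def mark_span(tokens, start, span_text):
    """Star the first and last token of a span, in place (0-based start)."""
    end = start + len(span_text.split(" ")) - 1
    tokens[start] = f"*{tokens[start]}"
    tokens[end] = f"{tokens[end]}*"


# Made-up example sentence, not taken from the dataset.
tokens = "Mikel Landa etorri da eta bera pozik dago".split(" ")
mark_span(tokens, 0, "Mikel Landa")  # two-token span
mark_span(tokens, 5, "bera")  # single-token span
marked = " ".join(tokens)
print(marked)
```

Because the span length is counted by splitting on single spaces, the marking is only right when `span_text` is tokenized the same way as the context.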

lm_eval/tasks/basqueglue/qnli.yaml

+16
@@ -0,0 +1,16 @@
group: basque-glue
task: qnlieu
dataset_path: orai-nlp/basqueGLUE
dataset_name: qnli
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "{{question}}\n{{sentence}}\nGaldera: aurreko galderari erantzuten al dio emandako testuak?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['bai', 'ez']
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  - version: 1.0

lm_eval/tasks/basqueglue/utils.py

+78
@@ -0,0 +1,78 @@
import html
import re

from datasets import load_metric


def general_detokenize(string):
    """Remove tokenizer-introduced spaces around punctuation, brackets and quotes."""
    string = re.sub(r"\s+([.,;:!?)])", r"\1", string)
    string = re.sub(r"(\s+|^)\(\s+([^)]+)\s+\)", r"\1(\2)", string)
    string = re.sub(r"(\s+|^)\[\s+([^)]+)\s+\]", r"\1[\2]", string)
    string = re.sub(r'(\s+|^)"\s+([^"]+)\s+"', r'\1"\2"', string)
    string = re.sub(r"(\s+|^)'\s+([^']+)\s+'", r"\1'\2'", string)
    return string


def process_doc(string):
    """Unescape HTML entities, then detokenize."""
    string = html.unescape(string)
    string = general_detokenize(string)
    return string


def process_wic_docs(dataset):
    def _helper(doc):
        # The sentences have encoding issues: re-encode as Latin-1 and
        # decode as UTF-8 to repair them.
        doc["sentence1"] = (
            process_doc(doc["sentence1"]).encode("latin-1").decode("utf-8")
        )
        doc["sentence2"] = (
            process_doc(doc["sentence2"]).encode("latin-1").decode("utf-8")
        )
        return doc

    return dataset.map(_helper)


def coref_doc_to_text(x):
    """Build the coreference prompt, marking both spans with asterisks in the context."""

    def _span_in_context(span_index, span_text):
        span_start = span_index
        span_end = span_start + len(span_text.split(" ")) - 1
        tokens[span_start] = f"*{tokens[span_start]}"
        tokens[span_end] = f"{tokens[span_end]}*"

    tokens = x["text"].split(" ")
    _span_in_context(x["span1_index"], x["span1_text"])
    _span_in_context(
        x["span2_index"] - 1, x["span2_text"]
    )  # span1_index is 0-based but span2_index is 1-based ??
    context = process_doc(" ".join(tokens))
    span_1 = process_doc(x["span1_text"])
    span_2 = process_doc(x["span2_text"])
    text = (
        f"Testua: {context}\n"
        + f'Galdera: Aurreko testuan, "*{span_1}*" eta "*{span_2}*" gauza bera dira?\n'
        + "Erantzuna:"
    )
    return text


# Measure F1 as in the benchmark repo: https://github.com/orai-nlp/BasqueGLUE/blob/main/eval_basqueglue.py


def micro_f1_score(items):
    """Micro-averaged F1 over (gold, prediction) pairs."""
    f1_metric = load_metric("f1")
    golds, preds = list(zip(*items))
    f1_score = f1_metric.compute(references=golds, predictions=preds, average="micro")[
        "f1"
    ]
    return f1_score


def vaxx_f1_score(items):
    """Mean of the per-class F1 scores for labels 0 and 2."""
    f1_metric = load_metric("f1")
    golds, preds = list(zip(*items))
    f1_class = f1_metric.compute(
        references=golds, predictions=preds, labels=[0, 2], average=None
    )["f1"]
    f1_score = sum(f1_class) / len(f1_class)
    return f1_score
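
`micro_f1_score` receives `(gold, prediction)` pairs and delegates to the `f1` metric with `average="micro"`. For single-label classification, where every example has exactly one gold and one predicted class, micro-averaged F1 over all classes equals plain accuracy, which gives a quick sanity check. The sketch below computes it from pooled counts in pure Python; `micro_f1` is a hypothetical name and the items are invented.

```python
def micro_f1(items):
    """Micro-averaged F1 from (gold, pred) pairs, pooling counts over classes."""
    golds, preds = zip(*items)
    labels = set(golds) | set(preds)
    tp = sum(g == c and p == c for c in labels for g, p in items)
    fp = sum(p == c and g != c for c in labels for g, p in items)
    fn = sum(g == c and p != c for c in labels for g, p in items)
    return 2 * tp / (2 * tp + fp + fn)


items = [(0, 0), (1, 2), (2, 2), (1, 1)]  # invented (gold, pred) pairs
accuracy = sum(g == p for g, p in items) / len(items)
print(micro_f1(items), accuracy)
```

Each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the gold class), which is why the pooled score collapses to accuracy.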

lm_eval/tasks/basqueglue/vaxx.yaml

+16
@@ -0,0 +1,16 @@
group: basque-glue
task: vaxx_stance
dataset_path: orai-nlp/basqueGLUE
dataset_name: vaxx
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "Testua: {{text}}\nGaldera: Nolako jarrera agertzen du aurreko testuak txertoei buruz?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['aurka', 'neutrala', 'alde']
metric_list:
  - metric: f1
    aggregation: !function utils.vaxx_f1_score
    higher_is_better: true
metadata:
  - version: 1.0
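
The `vaxx_f1_score` aggregation used by this task differs from the micro average: per `utils.py`, it takes the per-class F1 for labels 0 and 2 only (`aurka` and `alde` in `doc_to_choice`) and averages the two, so the `neutrala` class contributes only through the errors it causes. A pure-Python sketch with invented predictions; `class_f1` and `favg` are hypothetical names.

```python
def class_f1(items, label):
    """F1 for one class from (gold, pred) pairs; 0.0 when the class never appears."""
    tp = sum(g == label and p == label for g, p in items)
    fp = sum(g != label and p == label for g, p in items)
    fn = sum(g == label and p != label for g, p in items)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def favg(items, labels=(0, 2)):
    """Unweighted mean of per-class F1 over the selected labels."""
    return sum(class_f1(items, c) for c in labels) / len(labels)


# Invented pairs: 0 = aurka, 1 = neutrala, 2 = alde
items = [(0, 0), (0, 1), (1, 1), (2, 2), (1, 2), (2, 2)]
print(class_f1(items, 0), class_f1(items, 2), favg(items))
```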

lm_eval/tasks/basqueglue/wic.yaml

+17
@@ -0,0 +1,17 @@
group: basque-glue
task: wiceu
dataset_path: orai-nlp/basqueGLUE
dataset_name: wic
output_type: multiple_choice
validation_split: validation
test_split: test
process_docs: !function utils.process_wic_docs
doc_to_text: "1. esaldia: {{sentence1}}\n2. esaldia: {{sentence2}}\nGaldera: Aurreko bi esaldietan, \"{{word}}\" hitzak esanahi berdina du?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['ez', 'bai']
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  - version: 1.0
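
`process_docs` points at `utils.process_wic_docs`, which round-trips each sentence through `.encode("latin-1").decode("utf-8")`. That idiom undoes one common kind of mojibake: UTF-8 bytes that were mistakenly decoded as Latin-1. A minimal sketch on an invented string (the garbled input is produced here on purpose, not taken from the dataset):

```python
original = "Iñaki etxean dago"

# Simulate the corruption: encode as UTF-8, then decode the bytes as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # 'ñ' now shows up as two characters

# The repair used in process_wic_docs reverses those two steps.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

Text that is already clean but contains non-ASCII characters will usually raise `UnicodeDecodeError` under this round trip (or `UnicodeEncodeError` for characters outside Latin-1), so the repair is only safe when the whole column is known to be garbled.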

lm_eval/tasks/eus_exams/README.md

+49
@@ -0,0 +1,49 @@
# EusExams

### Paper

Title: Latxa: An Open Language Model and Evaluation Suite for Basque

Abstract: https://arxiv.org/abs/2403.20266

EusExams is a collection of tests designed to prepare individuals for Public Service examinations conducted by several Basque institutions, including the public health system Osakidetza, the Basque Government, the City Councils of Bilbao and Gasteiz, and the University of the Basque Country (UPV/EHU). Within each of these groups, there are different exams for public positions, such as administrative and assistant roles. Each multiple-choice question contains 2 to 4 choices (3.90 on average) and one correct answer. The dataset is mostly parallel with 16k questions in Basque and 18k in Spanish.

Homepage: https://github.com/hitz-zentroa/latxa


### Citation

```
@misc{etxaniz2024latxa,
  title={Latxa: An Open Language Model and Evaluation Suite for Basque},
  author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
  year={2024},
  eprint={2403.20266},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Groups and Tasks

#### Groups

* `eus_exams_eu`: The Basque version of the exams.
* `eus_exams_es`: The Spanish version of the exams.

#### Tasks

Basque and Spanish versions of the exams are available as separate tasks starting with `eus_exams_eu` and `eus_exams_es` respectively.

### Checklist

For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?

lm_eval/tasks/eus_exams/configs.py

+67
@@ -0,0 +1,67 @@
import argparse
import json

import requests
import yaml


# get configs from huggingface datasets server by doing a request
response = requests.get(
    "https://datasets-server.huggingface.co/splits?dataset=HiTZ%2FEusExams", timeout=5
)
response_json = json.loads(response.text)
CONFIGS = [split["config"] for split in response_json["splits"]]


def gen_config_yamls(output_dir: str, overwrite: bool) -> None:
    """
    Generate a yaml file for each config.

    :param output_dir: The directory to output the files to.
    :param overwrite: Whether to overwrite files if they already exist.
    """
    err = []
    for config in CONFIGS:
        file_name = f"eus_exams_{config}.yaml"
        try:
            with open(f"{output_dir}/{file_name}", "w" if overwrite else "x") as f:
                f.write("# Generated by utils.py\n")
                yaml.dump(
                    {
                        "include": "eus_exams_es"
                        if "eus_exams_es" in config
                        else "eus_exams_eu",
                        "dataset_name": config,
                        "task": f"eus_exams_{config}",
                    },
                    f,
                )
        except FileExistsError:
            err.append(file_name)

    if len(err) > 0:
        raise FileExistsError(
            "Files were not created because they already exist (use --overwrite flag):"
            f" {', '.join(err)}"
        )


def main() -> None:
    """Parse CLI args and generate config-specific yaml files."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--overwrite",
        default=False,
        action="store_true",
        help="Overwrite files if they already exist",
    )
    parser.add_argument(
        "--output-dir", default=".", help="Directory to write yaml files to"
    )
    args = parser.parse_args()

    gen_config_yamls(output_dir=args.output_dir, overwrite=args.overwrite)


if __name__ == "__main__":
    main()
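
The `"w" if overwrite else "x"` expression relies on Python's exclusive-creation mode: opening a file with `"x"` raises `FileExistsError` if the path already exists, which is what lets the script collect skipped files instead of clobbering them. A small sketch of that behaviour in a temporary directory; `write_once` and the file name are hypothetical.

```python
import os
import tempfile


def write_once(path, text, overwrite=False):
    """Write text to path; return False instead of overwriting unless asked to."""
    try:
        with open(path, "w" if overwrite else "x") as f:
            f.write(text)
        return True
    except FileExistsError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "eus_exams_demo.yaml")  # made-up file name
    first = write_once(path, "task: demo\n")
    second = write_once(path, "task: changed\n")  # refused: file exists
    third = write_once(path, "task: changed\n", overwrite=True)
    with open(path) as f:
        final = f.read()

print(first, second, third, final.strip())
```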

lm_eval/tasks/eus_exams/eus_exams

+18
@@ -0,0 +1,18 @@
dataset_path: HiTZ/EusExams
dataset_name: null
validation_split: null
test_split: test
fewshot_split: test
process_docs: !function utils.process_docs
output_type: multiple_choice
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0

lm_eval/tasks/eus_exams/eus_exams_es

+4
@@ -0,0 +1,4 @@
include: eus_exams
group:
  - eus_exams_es
doc_to_text: "Pregunta: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nD: {{candidates[3]}}\nRespuesta:"
