Commit 432d445

Merge pull request #228 from EvolvingLMMs-Lab/dev/fewshot

add more language tasks and fix fewshot evaluation bugs

Luodian authored Sep 6, 2024
2 parents 21d0fdf + 3ecdde7 commit 432d445

Showing 24 changed files with 339 additions and 1 deletion.
6 changes: 5 additions & 1 deletion lmms_eval/api/task.py
@@ -394,7 +394,9 @@ def build_all_requests(self, limit=None, rank=None, world_size=None) -> None:
         pbar = tqdm(total=total_docs, desc=f"Building context", disable=(rank != 0))
         for doc_id in doc_id_iterator:
             # sample fewshot context #TODO: need to offset doc_id by rank now!
-            fewshot_ctx = self.fewshot_context(doc_id, 0 if self.config.num_fewshot is None else self.config.num_fewshot, self.config.training_split if self.has_training_docs() else split)
+            fewshot_ctx = self.fewshot_context(
+                doc_id, 0 if self.config.num_fewshot is None else self.config.num_fewshot, split
+            )  # TODO: avoid doc_id inconsistency between test and train, but wondering why selecting docs from test set, not train set

             # TODO: we should override self.config.repeats if doing greedy gen so users don't waste time+compute
             per_task_metadata = {"task": self.config["task"], "doc_id": doc_id, "repeats": self.config.repeats}
@@ -1026,11 +1028,13 @@ def fewshot_context(self, doc_id, num_fewshot, split):
             The fewshot context.
         """
         doc = self.dataset_no_image[split][doc_id]
+
         if num_fewshot == 0:
             # always prepend the (possibly empty) task description
             labeled_examples = self.config.description
         else:
             labeled_examples = self.config.description + self.sampler.get_context(doc, num_fewshot)
+
         example = self.doc_to_text(doc)
         if type(example) == str:
             return labeled_examples + example
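For reference, this is roughly what the patched call assembles: the (possibly empty) task description, then the sampled few-shot examples, then the current question. A minimal sketch, with illustrative names that are not the library's API:

```python
# Simplified sketch of few-shot context assembly; not the actual lmms_eval code.
def build_fewshot_context(description: str, fewshot_examples: list[str], question: str) -> str:
    # With num_fewshot == 0 the examples list is empty, so only the
    # description (which may itself be empty) is prepended.
    labeled_examples = description + "".join(fewshot_examples)
    return labeled_examples + question


ctx = build_fewshot_context(
    "Answer the following science questions.\n\n",
    ["Question: What is the chemical formula of water?\nAnswer: H2O\n\n"],
    "Question: Which gas do plants absorb during photosynthesis?\nAnswer:",
)
print(ctx)
```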
4 changes: 4 additions & 0 deletions lmms_eval/evaluator.py
@@ -609,6 +609,10 @@ def evaluate(

     if hasattr(lm, "accelerator"):
         lm.accelerator.wait_for_everyone()
+
+    if not isinstance(lm, lm_eval.api.model.LM):
+        del lm
+
     return results_dict


58 changes: 58 additions & 0 deletions lmms_eval/tasks/arc/README.md
@@ -0,0 +1,58 @@
# ARC

### Paper

Title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Abstract: https://arxiv.org/abs/1803.05457

The ARC dataset consists of 7,787 science exam questions drawn from a variety
of sources, including science questions provided under license by a research
partner affiliated with AI2. These are text-only, English-language exam questions
spanning several grade levels, as indicated in the files. Each question has a
multiple-choice structure (typically four answer options). The questions are sorted
into a Challenge Set of 2,590 “hard” questions (those that both a retrieval method and
a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions.

Homepage: https://allenai.org/data/arc


### Citation

```
@article{Clark2018ThinkYH,
  title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
  author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
  journal={ArXiv},
  year={2018},
  volume={abs/1803.05457}
}
```

### Groups, Tags, and Tasks

#### Groups

None.

#### Tags

* `ai2_arc`: Evaluates `arc_easy` and `arc_challenge`

#### Tasks

* `arc_easy`
* `arc_challenge`

### Checklist

For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
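
For a quick look at the underlying data, the snippet below loads the ARC-Easy test split; the dataset path and config name are taken from `arc_easy.yaml` below, and the `datasets` package is assumed to be installed.

```python
# Inspect one ARC-Easy test question; dataset path and config name match
# dataset_path / dataset_name in arc_easy.yaml.
from datasets import load_dataset

arc_easy = load_dataset("allenai/ai2_arc", "ARC-Easy", split="test")
doc = arc_easy[0]
print(doc["question"])
print(doc["choices"]["label"], doc["choices"]["text"])
print("gold:", doc["answerKey"])
```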
3 changes: 3 additions & 0 deletions lmms_eval/tasks/arc/arc_challenge.yaml
@@ -0,0 +1,3 @@
include: arc_easy.yaml
task: arc_challenge
dataset_name: ARC-Challenge
23 changes: 23 additions & 0 deletions lmms_eval/tasks/arc/arc_easy.yaml
@@ -0,0 +1,23 @@
tag:
  - ai2_arc
task: arc_easy
dataset_path: allenai/ai2_arc
dataset_name: ARC-Easy
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{choices.label.index(answerKey)}}"
doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: "Question: {{question}}\nAnswer:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
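
To make the templates above concrete, here is what they resolve to for one record. The record is hand-written in the dataset's field layout, and the harness itself applies these as Jinja templates rather than Python f-strings.

```python
# Illustrative only: what doc_to_text / doc_to_choice / doc_to_target in
# arc_easy.yaml evaluate to for a single ARC-style record.
doc = {
    "question": "Which gas do plants absorb from the air during photosynthesis?",
    "choices": {
        "text": ["oxygen", "carbon dioxide", "nitrogen", "hydrogen"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}

prompt = f"Question: {doc['question']}\nAnswer:"          # doc_to_text
options = doc["choices"]["text"]                          # doc_to_choice
target = doc["choices"]["label"].index(doc["answerKey"])  # doc_to_target -> 1

print(prompt)
print("gold:", options[target])  # "carbon dioxide"
```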
59 changes: 59 additions & 0 deletions lmms_eval/tasks/mmlu_pro/README.md
@@ -0,0 +1,59 @@
# mmlu_pro

### Paper

Title: `MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark`

Abstract: `In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.`

Homepage: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro

### Citation

```bibtex
@misc{wang2024mmlupro,
  title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
  author={Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
  year={2024},
  eprint={2406.01574},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Groups and Tasks

#### Groups

* `mmlu_pro`: all 14 subjects of the MMLU-Pro dataset, evaluated following the methodology of MMLU's original implementation.

#### Tasks

The following tasks evaluate individual subjects in the MMLU-Pro dataset:
- `mmlu_pro_biology`
- `mmlu_pro_business`
- `mmlu_pro_chemistry`
- `mmlu_pro_computer_science`
- `mmlu_pro_economics`
- `mmlu_pro_engineering`
- `mmlu_pro_health`
- `mmlu_pro_history`
- `mmlu_pro_law`
- `mmlu_pro_math`
- `mmlu_pro_other`
- `mmlu_pro_philosophy`
- `mmlu_pro_physics`
- `mmlu_pro_psychology`

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
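
As a sanity check on the subject list above, the snippet below enumerates the categories present in the dataset; the dataset path comes from `_default_template_yaml` below, and the `datasets` package is assumed to be installed.

```python
# List the subject categories in MMLU-Pro; these should correspond to the
# 14 mmlu_pro_* tasks defined in this directory.
from datasets import load_dataset

mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
print(sorted(set(mmlu_pro["category"])))
```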
33 changes: 33 additions & 0 deletions lmms_eval/tasks/mmlu_pro/_default_template_yaml
@@ -0,0 +1,33 @@
dataset_path: TIGER-Lab/MMLU-Pro
test_split: test
fewshot_split: validation
fewshot_config:
  sampler: first_n
  doc_to_text: !function utils.fewshot_to_text
  doc_to_target: ""
output_type: generate_until
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
  - name: "custom-extract"
    filter:
      - function: "regex"
        regex_pattern: 'answer is \(?([ABCDEFGHIJ])\)?'
        # regex_pattern: r".*[aA]nswer:\s*([A-J])",
      - function: "take_first"
generation_kwargs:
  until:
    - "</s>"
    - "Q:"
    - "<|im_end|>"
  do_sample: false
  temperature: 0.0
num_fewshot: 5
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 0.0
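
The `custom-extract` filter above pulls the final letter choice out of the generated chain of thought. A quick way to exercise the regex against a typical completion:

```python
# Not part of the config: checks the regex_pattern from the "custom-extract"
# filter on a sample chain-of-thought completion.
import re

pattern = re.compile(r"answer is \(?([ABCDEFGHIJ])\)?")

completion = (
    "Let's think step by step. Photosynthesis fixes carbon dioxide in the "
    "chloroplast, so the answer is (B)."
)
match = pattern.search(completion)
print(match.group(1) if match else "<no match>")  # -> B
```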
23 changes: 23 additions & 0 deletions lmms_eval/tasks/mmlu_pro/_mmlu_pro.yaml
@@ -0,0 +1,23 @@
group: mmlu_pro
task:
  - mmlu_pro_biology
  - mmlu_pro_business
  - mmlu_pro_chemistry
  - mmlu_pro_computer_science
  - mmlu_pro_economics
  - mmlu_pro_engineering
  - mmlu_pro_health
  - mmlu_pro_history
  - mmlu_pro_law
  - mmlu_pro_math
  - mmlu_pro_other
  - mmlu_pro_philosophy
  - mmlu_pro_physics
  - mmlu_pro_psychology
aggregate_metric_list:
  - aggregation: mean
    metric: exact_match
    weight_by_size: true
    filter_list: custom-extract
metadata:
  version: 1.0
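
With `weight_by_size: true`, the group score amounts, roughly, to a mean over subjects weighted by how many documents each subject contributes; a sketch with made-up numbers:

```python
# Illustrative only: size-weighted aggregation of per-subject exact_match
# scores (the scores and sizes below are invented, not real results).
subject_results = {
    "mmlu_pro_math": {"exact_match": 0.41, "size": 1351},
    "mmlu_pro_law": {"exact_match": 0.28, "size": 1101},
    "mmlu_pro_biology": {"exact_match": 0.63, "size": 717},
}

total_docs = sum(r["size"] for r in subject_results.values())
weighted_mean = sum(r["exact_match"] * r["size"] for r in subject_results.values()) / total_docs
print(f"mmlu_pro exact_match (size-weighted): {weighted_mean:.4f}")
```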
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_biology.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_biology"
task_alias: "biology"
process_docs: !function utils.process_biology
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_business.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_business"
task_alias: "business"
process_docs: !function utils.process_business
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_chemistry.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_chemistry"
task_alias: "chemistry"
process_docs: !function utils.process_chemistry
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_computer_science.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_computer_science"
task_alias: "computer_science"
process_docs: !function utils.process_computer_science
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_economics.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_economics"
task_alias: "economics"
process_docs: !function utils.process_economics
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_engineering.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_engineering"
task_alias: "engineering"
process_docs: !function utils.process_engineering
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_health.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_health"
task_alias: "health"
process_docs: !function utils.process_health
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_history.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_history"
task_alias: "history"
process_docs: !function utils.process_history
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_law.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_law"
task_alias: "law"
process_docs: !function utils.process_law
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_math.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_math"
task_alias: "math"
process_docs: !function utils.process_math
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_other.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_other"
task_alias: "other"
process_docs: !function utils.process_other
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_philosophy.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_philosophy"
task_alias: "philosophy"
process_docs: !function utils.process_philosophy
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_physics.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_physics"
task_alias: "physics"
process_docs: !function utils.process_physics
5 changes: 5 additions & 0 deletions lmms_eval/tasks/mmlu_pro/mmlu_pro_psychology.yaml
@@ -0,0 +1,5 @@
description: "The following are multiple choice questions (with answers) about psychology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice."
include: "_default_template_yaml"
task: "mmlu_pro_psychology"
task_alias: "psychology"
process_docs: !function utils.process_psychology
60 changes: 60 additions & 0 deletions lmms_eval/tasks/mmlu_pro/utils.py
@@ -0,0 +1,60 @@
from functools import partial

choices = [
    "A",
    "B",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "J",
    "K",
    "L",
    "M",
    "N",
    "O",
    "P",
]


def format_cot_example(example, including_answer=True):
    prompt = "Question:\n"
    question = example["question"]
    options = example["options"]
    prompt += question + "\n"
    prompt += "Options:\n"
    for i, opt in enumerate(options):
        prompt += "{}. {}\n".format(choices[i], opt)
    if including_answer:
        cot_content = example["cot_content"].replace("A: Let's think step by step.", "Answer: Let's think step by step.")
        prompt += cot_content + "\n\n"
    else:
        prompt += "Answer: Let's think step by step."
    return prompt


doc_to_text = partial(format_cot_example, including_answer=False)
fewshot_to_text = partial(format_cot_example, including_answer=True)


def process_docs(dataset, subject):
return dataset.filter(lambda x: x["category"] == subject)


process_biology = partial(process_docs, subject="biology")
process_business = partial(process_docs, subject="business")
process_chemistry = partial(process_docs, subject="chemistry")
process_computer_science = partial(process_docs, subject="computer science")
process_economics = partial(process_docs, subject="economics")
process_engineering = partial(process_docs, subject="engineering")
process_health = partial(process_docs, subject="health")
process_history = partial(process_docs, subject="history")
process_law = partial(process_docs, subject="law")
process_math = partial(process_docs, subject="math")
process_other = partial(process_docs, subject="other")
process_philosophy = partial(process_docs, subject="philosophy")
process_physics = partial(process_docs, subject="physics")
process_psychology = partial(process_docs, subject="psychology")
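
A quick usage illustration of the helpers above; the record is hand-made but follows the TIGER-Lab/MMLU-Pro field names used in `format_cot_example`, and the snippet assumes the `lmms_eval` package (with this `utils.py`) is importable.

```python
from lmms_eval.tasks.mmlu_pro.utils import doc_to_text, fewshot_to_text

example = {
    "question": "Which molecule carries amino acids to the ribosome during translation?",
    "options": ["mRNA", "tRNA", "rRNA", "DNA"],
    "cot_content": "A: Let's think step by step. Transfer RNA delivers amino acids, so the answer is (B).",
    "category": "biology",
}

# Prompt for the document under evaluation: ends with "Answer: Let's think step by step."
print(doc_to_text(example))

# Few-shot version: same prompt, followed by the chain of thought with the
# leading "A:" rewritten to "Answer:".
print(fewshot_to_text(example))
```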
1 change: 1 addition & 0 deletions pyproject.toml
@@ -95,6 +95,7 @@ all = [
     "vila",
     "gemini",
     "reka",
+    "metrics",
 ]

[tool.setuptools.packages.find]
