adding documentation #282

Closed
wants to merge 29 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
- a324d63 adding documentation (NathanHB, Aug 28, 2024)
- 26d8402 adding documentation nanotron (NathanHB, Aug 28, 2024)
- 203045a commit (NathanHB, Sep 3, 2024)
- cbdcf1b commit (NathanHB, Sep 3, 2024)
- dd67ce4 Merge branch 'main' into nathan-add-doc (NathanHB, Sep 3, 2024)
- 015e924 undo unecessary changes (NathanHB, Sep 3, 2024)
- 4e9c30e Merge branch 'main' into nathan-add-doc (NathanHB, Sep 3, 2024)
- 8aabbc8 still working on docs (NathanHB, Sep 5, 2024)
- 3a74186 Merge branch 'nathan-add-doc' of github.com:huggingface/lighteval int… (NathanHB, Sep 5, 2024)
- db0c06d Merge remote-tracking branch 'origin/main' into nathan-add-doc (NathanHB, Sep 6, 2024)
- 57b0cd4 commit (NathanHB, Sep 9, 2024)
- 7e4d56d commit (NathanHB, Sep 11, 2024)
- e533074 commit (NathanHB, Sep 11, 2024)
- 2f1c7f5 Update docs/source/installation.md (NathanHB, Sep 17, 2024)
- 0d1da5d Update docs/source/saving_results.md (NathanHB, Sep 17, 2024)
- 7a8782a Update docs/source/saving_results.md (NathanHB, Sep 17, 2024)
- 1c7454b Update docs/source/saving_results.md (NathanHB, Sep 17, 2024)
- 2539035 Update docs/source/saving_results.md (NathanHB, Sep 17, 2024)
- b5f2942 Update docs/source/saving_results.md (NathanHB, Sep 17, 2024)
- 9825950 Update docs/source/adding_new_metric.md (NathanHB, Sep 17, 2024)
- fa67cf0 Update docs/source/adding_new_metric.md (NathanHB, Sep 17, 2024)
- f17ce92 Update docs/source/adding_new_metric.md (NathanHB, Sep 17, 2024)
- f3c319d Update docs/source/adding_new_metric.md (NathanHB, Sep 18, 2024)
- bcd6f50 Update docs/source/adding_new_task.md (NathanHB, Sep 18, 2024)
- 33c1e7f Update docs/source/adding_new_task.md (NathanHB, Sep 18, 2024)
- 016cea4 fix (NathanHB, Sep 18, 2024)
- e86912a Merge branch 'nathan-add-doc' of github.com:huggingface/lighteval int… (NathanHB, Sep 18, 2024)
- 3aba2a1 fix (NathanHB, Sep 18, 2024)
- af1ad13 commit (NathanHB, Sep 18, 2024)
19 changes: 19 additions & 0 deletions .github/workflows/build_main_documentation.yml
@@ -0,0 +1,19 @@
name: Build documentation

on:
  push:
    branches:
      - main
      - doc-builder*
      - v*-release
      - v*-alpha

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: lighteval
      custom_container: huggingface/transformers-doc-builder
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
17 changes: 17 additions & 0 deletions .github/workflows/build_pr_documentation.yml
@@ -0,0 +1,17 @@
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: lighteval
      custom_container: huggingface/transformers-doc-builder
16 changes: 16 additions & 0 deletions .github/workflows/upload_pr_documentation.yml
@@ -0,0 +1,16 @@
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: lighteval
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
28 changes: 28 additions & 0 deletions docs/source/_toctree.yml
@@ -0,0 +1,28 @@
- local: index
  title: 🌤️ Lighteval
- title: "Getting Started"
  sections:
    - local: installation
      title: Installation
    - local: quicktour
      title: Quicktour
- title: "Guides"
  sections:
    - local: saving_results
      title: Saving Results
    - local: use_python_api
      title: Use The Python API
    - local: adding_new_task
      title: Adding a Custom Task
    - local: adding_new_metric
      title: Adding a Custom Metric
    - local: use_vllm
      title: Using VLLM as backend
    - local: use_tgi
      title: Evaluate on Server
- title: "API Reference"
  sections:
    - local: metric_list
      title: Available Metrics
    - local: tasks
      title: Available Tasks
87 changes: 87 additions & 0 deletions docs/source/adding_new_metric.md
@@ -0,0 +1,87 @@
# Adding a New Metric

First, check if you can use one of the parametrized functions in
[src.lighteval.metrics.metrics_corpus]() or
[src.lighteval.metrics.metrics_sample]().

If not, you can use the `custom_task` system to register your new metric:

<Tip>
To see an example of a custom metric added along with a custom task, look at
<a href="">the IFEval custom task</a>.
</Tip>

- Create a new Python file, which should contain the full logic of your metric.
- The file also needs to start with these imports:

```python
from aenum import extend_enum
from lighteval.metrics import Metrics
```
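
The snippets below also rely on a few additional names. Here is a minimal sketch of the extra imports, assuming the usual lighteval module layout (double-check the paths against your installed version):

```python
# Assumed import paths; adjust them if your lighteval version organises these modules differently.
import numpy as np

from lighteval.metrics.utils import (
    MetricCategory,
    MetricUseCase,
    SampleLevelMetric,
    SampleLevelMetricGrouping,
)
from lighteval.tasks.requests import Doc
```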

You need to define a sample-level metric:

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]
```

Here, the sample-level metric returns only one value per sample. If you want to return multiple metrics per sample, return a dictionary with the metric names as keys and the scores as values.

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    response = predictions[0]
    return {"accuracy": response == formatted_doc.choices[formatted_doc.gold_index], "other_metric": 0.5}
```

Then, you can define an aggregation function if needed; a common aggregation function is `np.mean`.

```python
def agg_function(items):
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
```

Finally, you can define your metric. If it's a sample-level metric, you can use the following code:

```python
my_custom_metric = SampleLevelMetric(
    metric_name={custom_metric_name},
    higher_is_better={either True or False},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
```
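
For instance, a filled-in version of the template above could look like the sketch below. The values are illustrative: the name is arbitrary, and the `MetricCategory`/`MetricUseCase` members shown are assumptions to replace with the ones matching your task.

```python
# Illustrative values only; pick the category and use case that fit your task.
my_custom_metric = SampleLevelMetric(
    metric_name="my_exact_match",        # name used in the results files
    higher_is_better=True,               # a higher score means a better model
    category=MetricCategory.GENERATIVE,  # assumed member; adapt to your task type
    use_case=MetricUseCase.ACCURACY,     # assumed member; adapt to your use case
    sample_level_fn=custom_metric,       # the per-sample function defined above
    corpus_level_fn=agg_function,        # or np.mean
)
```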

If your metric defines multiple metrics per sample, you can use the following code:

```python
custom_metric = SampleLevelMetricGrouping(
    metric_name={submetric_names},
    higher_is_better={n: {True or False} for n in submetric_names},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)
```
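
As a hedged, concrete sketch that ties this template to the dict-returning `custom_metric` and the `agg_function` defined earlier (the submetric names and enum members are illustrative):

```python
# Illustrative sketch: the submetric names match the keys returned by custom_metric.
submetric_names = ["accuracy", "other_metric"]
my_grouped_metric = SampleLevelMetricGrouping(
    metric_name=submetric_names,
    higher_is_better={n: True for n in submetric_names},
    category=MetricCategory.GENERATIVE,  # assumed member; adapt to your task type
    use_case=MetricUseCase.ACCURACY,     # assumed member; adapt to your use case
    sample_level_fn=custom_metric,       # the dict-returning function above
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)
```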

Finally, add the following so that your metric is added to our metrics list
when your file is loaded as a module.

```python
# Adds the metric to the metric list!
extend_enum(Metrics, "my_custom_metric", my_custom_metric)
if __name__ == "__main__":
    print("Imported metric")
```

You can then give your custom metric to lighteval by passing `--custom_tasks
path_to_your_file` when launching the evaluation.
194 changes: 194 additions & 0 deletions docs/source/adding_new_task.md
@@ -0,0 +1,194 @@
# Adding a Custom Task

To add a new task, first open an issue to determine whether it should be
integrated into lighteval's core evaluations, the extended tasks, or the
community tasks, and add its dataset to the Hugging Face Hub.

- Core evaluations are evaluations that only require standard logic in their
  metrics and processing, and that we will add to our test suite to ensure
  non-regression over time. They already see high usage in the community.
- Extended evaluations are evaluations that require custom logic in their
  metrics (complex normalisation, an LLM as a judge, ...), which we added to
  make users' lives easier. They already see high usage in the community.
- Community evaluations are new tasks submitted by the community.

A popular community evaluation can move to become an extended or core evaluation over time.

<Tip>
You can find examples of custom tasks in the <a
href="https://github.com/huggingface/lighteval/tree/main/community_tasks">community_tasks</a>
directory.
</Tip>

## Step-by-step creation of a custom task

First, create a Python file under the `community_tasks` directory.

You need to define a prompt function that will convert a line from your
dataset to a document to be used for evaluation.

```python
# Define as many as you need for your different tasks
def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/default_prompts.py, or get more info
    about what this function should do in the README.
    """
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
        instruction="",
    )
```
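
The snippets in this guide assume a handful of imports at the top of your file. Here is a minimal sketch, with module paths assumed from the lighteval source layout (verify them against your installed version):

```python
# Assumed import paths; adjust them if your lighteval version organises these modules differently.
import numpy as np

from lighteval.metrics.utils import MetricCategory, MetricUseCase, SampleLevelMetric
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
```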

Then you need to choose a metric: you can either use an existing one (defined
in `lighteval/metrics/metrics.py`) or [create a custom one](./adding_new_metric).

```python
custom_metric = SampleLevelMetric(
metric_name="my_custom_metric_name",
higher_is_better=True,
category=MetricCategory.IGNORED,
use_case=MetricUseCase.NONE,
sample_level_fn=lambda x: x, # how to compute score for one sample
corpus_level_fn=np.mean, # How to aggreagte the samples metrics
)
```

Then, you need to define your task. You can define a task with or without subsets.
To define a task with no subsets:

```python
# This is how you create a simple task (like hellaswag) which has one single subset
# attached to it, and one evaluation possible.
task = LightevalTaskConfig(
name="myothertask",
prompt_function=prompt_fn, # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
suite=["community"],
hf_repo="",
hf_subset="default",
hf_avail_splits=[],
evaluation_splits=[],
few_shots_split=None,
few_shots_select=None,
metric=[], # select your metric in Metrics
)
```

If you want to create a task with multiple subsets, add them to the
`SAMPLE_SUBSETS` list and create a task for each subset.

```python
SAMPLE_SUBSETS = [] # list of all the subsets to use for this eval


class CustomSubsetTask(LightevalTaskConfig):
    def __init__(
        self,
        name,
        hf_subset,
    ):
        super().__init__(
            name=name,
            hf_subset=hf_subset,
            prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
            hf_repo="",
            metric=[custom_metric],  # select your metric in Metrics or use your custom_metric
            hf_avail_splits=[],
            evaluation_splits=[],
            few_shots_split=None,
            few_shots_select=None,
            suite=["community"],
            generation_size=-1,
            stop_sequence=None,
            output_regex=None,
            frozen=False,
        )


SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
```

Here is a list of the parameters and their meaning (a filled-in sketch follows the list):

- `name` (str), your evaluation name
- `suite` (list), the suite(s) to which your evaluation should belong. This
field allows us to compare different task implementations and is used as a
task selection to differentiate the versions to launch. At the moment, you'll
find the keywords ["helm", "bigbench", "original", "lighteval", "community",
"custom"]; for core evals, please choose `lighteval`.
- `prompt_function` (Callable), the prompt function you defined in the step
above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation
(note: when the dataset has no subset, fill this field with `"default"`, not
with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train,
valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you
want to select samples for your few-shot examples. It should be different
from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method that you will use to
  select items for your few-shot examples. Can be `null`, or one of:
  - `balanced` selects examples from the `few_shots_split` with balanced
    labels, to avoid skewing the few-shot examples (and hence the model
    generations) toward one specific label
  - `random` selects examples at random from the `few_shots_split`
  - `random_sampling` selects new examples at random from the
    `few_shots_split` for every new item, but if a sampled item is equal to
    the current one, it is removed from the available samples
  - `random_sampling_from_train` selects new examples at random from the
    `few_shots_split` for every new item, but if a sampled item is equal to
    the current one, it is kept! Only use this if you know what you are
    doing.
  - `sequential` selects the first `n` examples of the `few_shots_split`
- `generation_size` (int), the maximum number of tokens allowed for a
generative evaluation. If your evaluation is a log likelihood evaluation
(multi-choice), this value should be -1
- `stop_sequence` (list), a list of strings acting as end of sentence tokens
for your generation
- `metric` (list), the metrics you want to use for your evaluation (see next
section for a detailed explanation)
- `output_regex` (str), a regex string that will be used to filter your
  generation. (Generative metrics will only select tokens that are between the
  first and the second sequence matched by the regex. For example, for a regex
  matching `\n` and a generation `\nModel generation output\nSome other text`,
  the metric will only be fed with `Model generation output`)
- `frozen` (bool), for now is set to `False`, but we will progressively move all
  stable tasks to `True`.
- `trust_dataset` (bool), set to `True` if you trust the dataset.
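
As a filled-in sketch, a configuration for a hypothetical multiple-choice dataset could look like the following. Every value is illustrative (the repo id, splits, and metric are assumptions, not a real dataset):

```python
# Hypothetical example; replace the repo id, splits, and metric with your own.
my_task = LightevalTaskConfig(
    name="mytask",
    suite=["community"],
    prompt_function=prompt_fn,
    hf_repo="your_org/your_dataset",  # illustrative dataset id on the Hub
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    few_shots_select="random",
    generation_size=-1,               # log likelihood (multi-choice) evaluation
    stop_sequence=["\n"],
    metric=[custom_metric],
    trust_dataset=True,
)
```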


Then you need to add your task to the `TASKS_TABLE` list.

```python
# STORE YOUR EVALS

# tasks with subset:
TASKS_TABLE = SUBSET_TASKS

# tasks without subset:
# TASKS_TABLE = [task]
```

Finally, you need to add some module logic to convert your task into a dict for lighteval.

```python
# MODULE LOGIC
# You should not need to touch this
# Convert to dict for lighteval
if __name__ == "__main__":
    print([t.name for t in TASKS_TABLE])
    print(len(TASKS_TABLE))
```

Once your file is created, you can run the evaluation with the following command:

```bash
lighteval accelerate \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --tasks "community|{custom_task}|{fewshots}|{truncate_few_shot}" \
    --custom_tasks {path_to_your_custom_task_file} \
    --output_dir "./evals"
```
13 changes: 13 additions & 0 deletions docs/source/index.md
@@ -0,0 +1,13 @@
# 🌤️ Lighteval

A lightweight framework for LLM evaluation

LightEval is a lightweight LLM evaluation suite that Hugging Face has been
using internally with the recently released LLM data processing library
datatrove and LLM training library nanotron.

We're releasing it with the community in the spirit of building in the open.

Even though it has been used in a variety of projects, keep in mind that parts
of lighteval are still unstable and might break! In case of any problem or
question, feel free to open an issue.