adding documentation #282

Closed
wants to merge 29 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
- a324d63 adding documentation (NathanHB, Aug 28, 2024)
- 26d8402 adding documentation nanotron (NathanHB, Aug 28, 2024)
- 203045a commit (NathanHB, Sep 3, 2024)
- cbdcf1b commit (NathanHB, Sep 3, 2024)
- dd67ce4 Merge branch 'main' into nathan-add-doc (NathanHB, Sep 3, 2024)
- 015e924 undo unecessary changes (NathanHB, Sep 3, 2024)
- 4e9c30e Merge branch 'main' into nathan-add-doc (NathanHB, Sep 3, 2024)
- 8aabbc8 still working on docs (NathanHB, Sep 5, 2024)
- 3a74186 Merge branch 'nathan-add-doc' of github.com:huggingface/lighteval int… (NathanHB, Sep 5, 2024)
- db0c06d Merge remote-tracking branch 'origin/main' into nathan-add-doc (NathanHB, Sep 6, 2024)
- 57b0cd4 commit (NathanHB, Sep 9, 2024)
- 7e4d56d commit (NathanHB, Sep 11, 2024)
- e533074 commit (NathanHB, Sep 11, 2024)
- 2f1c7f5 Update docs/source/installation.md (NathanHB, Sep 17, 2024)
- 0d1da5d Update docs/source/saving_results.md (NathanHB, Sep 17, 2024)
- 7a8782a Update docs/source/saving_results.md (NathanHB, Sep 17, 2024)
- 1c7454b Update docs/source/saving_results.md (NathanHB, Sep 17, 2024)
- 2539035 Update docs/source/saving_results.md (NathanHB, Sep 17, 2024)
- b5f2942 Update docs/source/saving_results.md (NathanHB, Sep 17, 2024)
- 9825950 Update docs/source/adding_new_metric.md (NathanHB, Sep 17, 2024)
- fa67cf0 Update docs/source/adding_new_metric.md (NathanHB, Sep 17, 2024)
- f17ce92 Update docs/source/adding_new_metric.md (NathanHB, Sep 17, 2024)
- f3c319d Update docs/source/adding_new_metric.md (NathanHB, Sep 18, 2024)
- bcd6f50 Update docs/source/adding_new_task.md (NathanHB, Sep 18, 2024)
- 33c1e7f Update docs/source/adding_new_task.md (NathanHB, Sep 18, 2024)
- 016cea4 fix (NathanHB, Sep 18, 2024)
- e86912a Merge branch 'nathan-add-doc' of github.com:huggingface/lighteval int… (NathanHB, Sep 18, 2024)
- 3aba2a1 fix (NathanHB, Sep 18, 2024)
- af1ad13 commit (NathanHB, Sep 18, 2024)
19 changes: 19 additions & 0 deletions .github/workflows/build_main_documentation.yml
@@ -0,0 +1,19 @@
name: Build documentation

on:
  push:
    branches:
      - main
      - doc-builder*
      - v*-release
      - v*-alpha

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: lighteval
      custom_container: huggingface/transformers-doc-builder
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
17 changes: 17 additions & 0 deletions .github/workflows/build_pr_documentation.yml
@@ -0,0 +1,17 @@
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: lighteval
      custom_container: huggingface/transformers-doc-builder
16 changes: 16 additions & 0 deletions .github/workflows/upload_pr_documentation.yml
@@ -0,0 +1,16 @@
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: lighteval
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
28 changes: 28 additions & 0 deletions docs/source/_toctree.yml
@@ -0,0 +1,28 @@
- local: index
  title: 🌤️ Lighteval
- title: "Getting Started"
  sections:
    - local: installation
      title: Installation
    - local: quicktour
      title: Quicktour
- title: "Guides"
  sections:
    - local: saving_results
      title: Saving Results
    - local: use_python_api
      title: Use The Python API
    - local: adding_new_task
      title: Adding a Custom Task
    - local: adding_new_metric
      title: Adding a Custom Metric
    - local: use_vllm
      title: Using VLLM as backend
    - local: use_tgi
      title: Evaluate on Server
- title: "API Reference"
  sections:
    - local: metric_list
      title: Available Metrics
    - local: tasks
      title: Available Tasks
87 changes: 87 additions & 0 deletions docs/source/adding_new_metric.md
@@ -0,0 +1,87 @@
# Adding a New Metric

First, check if you can use one of the parametrized functions in
[src.lighteval.metrics.metrics_corpus]() or
[src.lighteval.metrics.metrics_sample]().

If not, you can use the `custom_task` system to register your new metric:

<Tip>
To see an example of a custom metric added along with a custom task, look at
<a href="">the IFEval custom task</a>.
</Tip>

- Create a new Python file, which should contain the full logic of your metric.
- The file also needs to start with these imports:

```python
from aenum import extend_enum
from lighteval.metrics import Metrics
```
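
The snippets below also rely on a few additional names. Here is a minimal sketch of the extra imports, assuming the usual lighteval module layout (double-check the paths against your installed version):

```python
# Assumed import paths; adjust them if your lighteval version organises these modules differently.
import numpy as np

from lighteval.metrics.utils import (
    MetricCategory,
    MetricUseCase,
    SampleLevelMetric,
    SampleLevelMetricGrouping,
)
from lighteval.tasks.requests import Doc
```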

You need to define a sample-level metric:

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]
```

Here, the sample-level metric returns only one value per sample. If you want to return multiple metrics per sample, return a dictionary with the metric names as keys and the scores as values.

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    response = predictions[0]
    return {"accuracy": response == formatted_doc.choices[formatted_doc.gold_index], "other_metric": 0.5}
```

Then, you can define an aggregation function if needed; a common aggregation function is `np.mean`.

```python
def agg_function(items):
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
```

Finally, you can define your metric. If it's a sample-level metric, you can use the following code:

```python
my_custom_metric = SampleLevelMetric(
    metric_name={custom_metric_name},
    higher_is_better={either True or False},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
```
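
For instance, a filled-in version of the template above could look like the sketch below. The values are illustrative: the name is arbitrary, and the `MetricCategory`/`MetricUseCase` members shown are assumptions to replace with the ones matching your task.

```python
# Illustrative values only; pick the category and use case that fit your task.
my_custom_metric = SampleLevelMetric(
    metric_name="my_exact_match",        # name used in the results files
    higher_is_better=True,               # a higher score means a better model
    category=MetricCategory.GENERATIVE,  # assumed member; adapt to your task type
    use_case=MetricUseCase.ACCURACY,     # assumed member; adapt to your use case
    sample_level_fn=custom_metric,       # the per-sample function defined above
    corpus_level_fn=agg_function,        # or np.mean
)
```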

If your metric defines multiple metrics per sample, you can use the following code:

```python
custom_metric = SampleLevelMetricGrouping(
    metric_name={submetric_names},
    higher_is_better={n: {True or False} for n in submetric_names},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)
```
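
As a hedged, concrete sketch that ties this template to the dict-returning `custom_metric` and the `agg_function` defined earlier (the submetric names and enum members are illustrative):

```python
# Illustrative sketch: the submetric names match the keys returned by custom_metric.
submetric_names = ["accuracy", "other_metric"]
my_grouped_metric = SampleLevelMetricGrouping(
    metric_name=submetric_names,
    higher_is_better={n: True for n in submetric_names},
    category=MetricCategory.GENERATIVE,  # assumed member; adapt to your task type
    use_case=MetricUseCase.ACCURACY,     # assumed member; adapt to your use case
    sample_level_fn=custom_metric,       # the dict-returning function above
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)
```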

Finally, add the following so that your metric is added to our metrics list
when your file is loaded as a module.

```python
# Adds the metric to the metric list!
extend_enum(Metrics, "my_custom_metric", my_custom_metric)
if __name__ == "__main__":
    print("Imported metric")
```

You can then give your custom metric to lighteval by passing `--custom_tasks
path_to_your_file` when launching the evaluation.
194 changes: 194 additions & 0 deletions docs/source/adding_new_task.md
@@ -0,0 +1,194 @@
# Adding a Custom Task

To add a new task, first open an issue to determine whether it should be
integrated into lighteval's core evaluations, the extended tasks, or the
community tasks, and add its dataset to the Hugging Face Hub.

- Core evaluations are evaluations that only require standard logic in their
  metrics and processing, and that we will add to our test suite to ensure
  non-regression over time. They already see high usage in the community.
- Extended evaluations are evaluations that require custom logic in their
  metrics (complex normalisation, an LLM as a judge, ...), which we added to
  make users' lives easier. They already see high usage in the community.
- Community evaluations are new tasks submitted by the community.

A popular community evaluation can move to become an extended or core evaluation over time.

<Tip>
You can find examples of custom tasks in the <a
href="https://github.com/huggingface/lighteval/tree/main/community_tasks">community_tasks</a>
directory.
</Tip>

## Step-by-step creation of a custom task

First, create a Python file under the `community_tasks` directory.

You need to define a prompt function that will convert a line from your
dataset to a document to be used for evaluation.

```python
# Define as many as you need for your different tasks
def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/default_prompts.py, or get more info
    about what this function should do in the README.
    """
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
        instruction="",
    )
```
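
The snippets in this guide assume a handful of imports at the top of your file. Here is a minimal sketch, with module paths assumed from the lighteval source layout (verify them against your installed version):

```python
# Assumed import paths; adjust them if your lighteval version organises these modules differently.
import numpy as np

from lighteval.metrics.utils import MetricCategory, MetricUseCase, SampleLevelMetric
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
```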

Then you need to choose a metric: you can either use an existing one (defined
in `lighteval/metrics/metrics.py`) or [create a custom one](./adding_new_metric).

```python
custom_metric = SampleLevelMetric(
metric_name="my_custom_metric_name",
higher_is_better=True,
category=MetricCategory.IGNORED,
use_case=MetricUseCase.NONE,
sample_level_fn=lambda x: x, # how to compute score for one sample
corpus_level_fn=np.mean, # How to aggreagte the samples metrics
)
```

Then, you need to define your task. You can define a task with or without subsets.
To define a task with no subsets:

```python
# This is how you create a simple task (like hellaswag) which has one single subset
# attached to it, and one evaluation possible.
task = LightevalTaskConfig(
name="myothertask",
prompt_function=prompt_fn, # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
suite=["community"],
hf_repo="",
hf_subset="default",
hf_avail_splits=[],
evaluation_splits=[],
few_shots_split=None,
few_shots_select=None,
metric=[], # select your metric in Metrics
)
```

If you want to create a task with multiple subsets, add them to the
`SAMPLE_SUBSETS` list and create a task for each subset.

```python
SAMPLE_SUBSETS = [] # list of all the subsets to use for this eval


class CustomSubsetTask(LightevalTaskConfig):
    def __init__(
        self,
        name,
        hf_subset,
    ):
        super().__init__(
            name=name,
            hf_subset=hf_subset,
            prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
            hf_repo="",
            metric=[custom_metric],  # select your metric in Metrics or use your custom_metric
            hf_avail_splits=[],
            evaluation_splits=[],
            few_shots_split=None,
            few_shots_select=None,
            suite=["community"],
            generation_size=-1,
            stop_sequence=None,
            output_regex=None,
            frozen=False,
        )


SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
```

Here is a list of the parameters and their meaning (a filled-in sketch follows the list):

- `name` (str), your evaluation name
- `suite` (list), the suite(s) to which your evaluation should belong. This
field allows us to compare different task implementations and is used as a
task selection to differentiate the versions to launch. At the moment, you'll
find the keywords ["helm", "bigbench", "original", "lighteval", "community",
"custom"]; for core evals, please choose `lighteval`.
- `prompt_function` (Callable), the prompt function you defined in the step
above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation
(note: when the dataset has no subset, fill this field with `"default"`, not
with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train,
valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you
want to select samples for your few-shot examples. It should be different
from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method that you will use to
  select items for your few-shot examples. Can be `null`, or one of:
  - `balanced` selects examples from the `few_shots_split` with balanced
    labels, to avoid skewing the few-shot examples (and hence the model
    generations) toward one specific label
  - `random` selects examples at random from the `few_shots_split`
  - `random_sampling` selects new examples at random from the
    `few_shots_split` for every new item, but if a sampled item is equal to
    the current one, it is removed from the available samples
  - `random_sampling_from_train` selects new examples at random from the
    `few_shots_split` for every new item, but if a sampled item is equal to
    the current one, it is kept! Only use this if you know what you are
    doing.
  - `sequential` selects the first `n` examples of the `few_shots_split`
- `generation_size` (int), the maximum number of tokens allowed for a
generative evaluation. If your evaluation is a log likelihood evaluation
(multi-choice), this value should be -1
- `stop_sequence` (list), a list of strings acting as end of sentence tokens
for your generation
- `metric` (list), the metrics you want to use for your evaluation (see next
section for a detailed explanation)
- `output_regex` (str), a regex string that will be used to filter your
  generation. (Generative metrics will only select tokens that are between the
  first and the second sequence matched by the regex. For example, for a regex
  matching `\n` and a generation `\nModel generation output\nSome other text`,
  the metric will only be fed with `Model generation output`)
- `frozen` (bool), for now is set to `False`, but we will progressively move all
  stable tasks to `True`.
- `trust_dataset` (bool), set to `True` if you trust the dataset.
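
As a filled-in sketch, a configuration for a hypothetical multiple-choice dataset could look like the following. Every value is illustrative (the repo id, splits, and metric are assumptions, not a real dataset):

```python
# Hypothetical example; replace the repo id, splits, and metric with your own.
my_task = LightevalTaskConfig(
    name="mytask",
    suite=["community"],
    prompt_function=prompt_fn,
    hf_repo="your_org/your_dataset",  # illustrative dataset id on the Hub
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    few_shots_select="random",
    generation_size=-1,               # log likelihood (multi-choice) evaluation
    stop_sequence=["\n"],
    metric=[custom_metric],
    trust_dataset=True,
)
```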


Then you need to add your task to the `TASKS_TABLE` list.

```python
# STORE YOUR EVALS

# tasks with subset:
TASKS_TABLE = SUBSET_TASKS

# tasks without subset:
# TASKS_TABLE = [task]
```

Finally, you need to add some module logic to convert your task into a dict for lighteval.

```python
# MODULE LOGIC
# You should not need to touch this
# Convert to dict for lighteval
if __name__ == "__main__":
    print([t.name for t in TASKS_TABLE])
    print(len(TASKS_TABLE))
```

Once your file is created, you can run the evaluation with the following command:

```bash
lighteval accelerate \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --tasks "community|{custom_task}|{fewshots}|{truncate_few_shot}" \
    --custom_tasks {path_to_your_custom_task_file} \
    --output_dir "./evals"
```
13 changes: 13 additions & 0 deletions docs/source/index.md
@@ -0,0 +1,13 @@
# 🌤️ Lighteval

A lightweight framework for LLM evaluation

LightEval is a lightweight LLM evaluation suite that Hugging Face has been
using internally with the recently released LLM data processing library
datatrove and LLM training library nanotron.

We're releasing it with the community in the spirit of building in the open.

Even though it has been used in a variety of projects, keep in mind that parts
of lighteval are still unstable and might break! In case of any problem or
question, feel free to open an issue.