
Method comparison evaluation suite #2395

Merged 53 commits into huggingface:main on Mar 27, 2025

Conversation

githubnemo (Collaborator) commented Feb 24, 2025

Introduction of a method evaluation suite.

We generally face the problem that there is little knowledge about which PEFT methods perform best. To this end we decided to build an evaluation suite that has defined tasks and shared hyper-parameters, and that can be extended with new tasks and new method configurations over time.

For the sake of comparability we decided not to incorporate user-submitted results, but we encourage users to inspect the results, suggest new experiments, and improve the configuration of methods if they're deemed unfavorable.

As of now there's only one task, based on the MetaMathQA dataset, which has the benefit of being complex while still fitting on a consumer GPU.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Still WIP:

- no sacred or other experiment tracking framework
- only toy task
- Fixed an error in data processing; the model now learns properly
- Use tqdm
- Improve console logging
- Adjust data filtering criteria
- Add option for cosine lr scheduler. Off by default, as it gives OOM without any speed benefit. Also, the model never predicts EOS.
The experiment-specific training params use the default training params but can override any parameter from them if needed. This way it's easier to make a change to all experiments at once (say, changing the base model without editing each individual training_parameters.json).
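The override mechanism described above can be sketched as a simple merge of two JSON files (a minimal sketch with hypothetical file names; the actual suite's layout may differ):

```python
import json

def load_training_params(default_path, experiment_path):
    """Merge experiment-specific overrides into the shared defaults.

    Keys present in the experiment file win; everything else falls back
    to the defaults, so a change to the defaults (e.g. the base model)
    propagates to all experiments automatically.
    """
    with open(default_path) as f:
        params = json.load(f)
    with open(experiment_path) as f:
        overrides = json.load(f)
    params.update(overrides)
    return params
```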
However, both flash attention 2 and flex attention are slower on my system, so we stay with the default None (i.e. SDPA).
Allows more easily using, say, static cache, which is the new default, as it's faster (apart from the first pass).
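The opt-in attention implementation could be wired up roughly like this (a hypothetical helper, not the suite's actual code): a None value omits the kwarg so transformers falls back to its own default (SDPA on supported hardware), while an experiment config can opt into "flash_attention_2" or "flex_attention".

```python
def from_pretrained_kwargs(attn_implementation=None):
    """Build kwargs for AutoModelForCausalLM.from_pretrained.

    None omits the key entirely, leaving the attention backend choice
    to transformers' default (SDPA where available).
    """
    kwargs = {}
    if attn_implementation is not None:
        kwargs["attn_implementation"] = attn_implementation
    return kwargs
```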
Type annotations, a training param, comments.
No longer log the sha256sum of them.
Also, print the number of (trainable) parameters.
- Clarify dataset split for the MetaMathQA task
- Add section on contributing
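The cosine lr scheduler mentioned in the checklist above decays the learning rate along a half-cosine from the base rate down to a minimum. A minimal sketch of the decay formula itself (not the suite's implementation, which would typically use a scheduler from transformers or torch):

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine learning-rate decay from base_lr down to min_lr."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```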
BenjaminBossan and others added 24 commits March 10, 2025 17:48
E.g. 1/2 == 0.5
But add --clean to delete it.

Keeping the adapter can be useful if the user wants to run further tests
with the trained model.
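A --clean flag like the one described could be declared with argparse as follows (a sketch; the flag name is from the commit message, the surrounding parser is hypothetical):

```python
import argparse

def build_parser():
    """CLI sketch: keep the trained adapter unless --clean is given."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--clean",
        action="store_true",
        help="Delete the trained adapter after the run instead of keeping "
             "it for further tests with the trained model.",
    )
    return parser
```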
Otherwise, if one of these packages is missing, we could get an error after the model training and lose all progress.
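Checking imports up front can be done by probing for each package before training starts, so a package needed only after training can't fail the run late (a minimal sketch; the package list is a placeholder, not the suite's actual requirements):

```python
import importlib.util

def check_imports(packages):
    """Fail fast if any required package is missing, instead of erroring
    after model training has finished and losing all progress."""
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    if missing:
        raise ImportError(f"Missing required packages: {', '.join(missing)}")
```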
It's true by default, same as in PEFT in general.
…adapter-dtype-option

Add option to change autocast_adapter_dtype
@githubnemo githubnemo marked this pull request as ready for review March 26, 2025 14:31
@githubnemo githubnemo changed the title Draft: Method comparison evaluation suite Method comparison evaluation suite Mar 26, 2025
BenjaminBossan (Member) left a comment


LGTM.

@githubnemo githubnemo merged commit 4192101 into huggingface:main Mar 27, 2025
14 checks passed
3 participants