
Method comparison evaluation suite #2395

Merged 53 commits into huggingface:main on Mar 27, 2025

Conversation

githubnemo (Collaborator) commented Feb 24, 2025

Introduction of a method evaluation suite.

We generally face the problem that there is little knowledge about which PEFT methods perform best. To this end we decided to build an evaluation suite that has defined tasks and shared hyper-parameters, and that can be extended with new tasks and new method configurations over time.

For the sake of comparability we decided not to incorporate user-submitted results, but we encourage users to inspect the results, suggest new experiments, and improve the configuration of methods if they're deemed unfavorable.

As of now there's only one task, based on the MetaMathQA dataset, which has the benefit of being complex while still fitting on a consumer GPU.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Still WIP:

- no sacred or other experiment tracking framework
- only toy task
- Fixed an error in data processing; the model now learns properly
- Use tqdm
- Improve console logging
- Adjust data filtering criteria
- Add option for cosine lr scheduler. Off by default, as it gives OOM without any speed benefit. Also, the model never predicts EOS.
The experiment-specific training params use the default training params but can override any parameter from them if needed. This way it's easier to make a change to all experiments at once (say, changing the base model without editing each individual training_parameters.json).
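The override mechanism described above can be sketched as a simple merge of two JSON files (a minimal sketch with hypothetical file names; the actual suite's layout may differ):

```python
import json

def load_training_params(default_path, experiment_path):
    """Merge experiment-specific overrides into the shared defaults.

    Keys present in the experiment file win; everything else falls back
    to the defaults, so a change to the defaults (e.g. the base model)
    propagates to all experiments automatically.
    """
    with open(default_path) as f:
        params = json.load(f)
    with open(experiment_path) as f:
        overrides = json.load(f)
    params.update(overrides)
    return params
```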
However, both flash attention 2 and flex attention are slower on my system, so we stay with the default None (i.e. SDPA).
Allows more easily using, say, static cache, which is the new default, as it's faster (apart from the first pass).
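The opt-in attention implementation could be wired up roughly like this (a hypothetical helper, not the suite's actual code): a None value omits the kwarg so transformers falls back to its own default (SDPA on supported hardware), while an experiment config can opt into "flash_attention_2" or "flex_attention".

```python
def from_pretrained_kwargs(attn_implementation=None):
    """Build kwargs for AutoModelForCausalLM.from_pretrained.

    None omits the key entirely, leaving the attention backend choice
    to transformers' default (SDPA where available).
    """
    kwargs = {}
    if attn_implementation is not None:
        kwargs["attn_implementation"] = attn_implementation
    return kwargs
```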
Type annotations, a training param, comments.
No longer log the sha256sum of them.
Also, print the number of (trainable) parameters.
- Clarify dataset split for the MetaMathQA task
- Add section on contributing
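The cosine lr scheduler mentioned in the checklist above decays the learning rate along a half-cosine from the base rate down to a minimum. A minimal sketch of the decay formula itself (not the suite's implementation, which would typically use a scheduler from transformers or torch):

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine learning-rate decay from base_lr down to min_lr."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```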
BenjaminBossan and others added 24 commits March 10, 2025 17:48
E.g. 1/2 == 0.5
But add --clean to delete it.

Keeping the adapter can be useful if the user wants to run further tests
with the trained model.
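A --clean flag like the one described could be declared with argparse as follows (a sketch; the flag name is from the commit message, the surrounding parser is hypothetical):

```python
import argparse

def build_parser():
    """CLI sketch: keep the trained adapter unless --clean is given."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--clean",
        action="store_true",
        help="Delete the trained adapter after the run instead of keeping "
             "it for further tests with the trained model.",
    )
    return parser
```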
Otherwise, if one of these packages is missing, we could get an error after the model training and lose all progress.
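Checking imports up front can be done by probing for each package before training starts, so a package needed only after training can't fail the run late (a minimal sketch; the package list is a placeholder, not the suite's actual requirements):

```python
import importlib.util

def check_imports(packages):
    """Fail fast if any required package is missing, instead of erroring
    after model training has finished and losing all progress."""
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    if missing:
        raise ImportError(f"Missing required packages: {', '.join(missing)}")
```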
It's true by default, same as in PEFT in general.
…adapter-dtype-option

Add option to change autocast_adapter_dtype
@githubnemo githubnemo marked this pull request as ready for review March 26, 2025 14:31
@githubnemo githubnemo changed the title Draft: Method comparison evaluation suite Method comparison evaluation suite Mar 26, 2025
BenjaminBossan (Member) left a comment


LGTM.

@githubnemo githubnemo merged commit 4192101 into huggingface:main Mar 27, 2025
14 checks passed
3 participants