Method comparison evaluation suite #2395
Merged
githubnemo merged 53 commits into huggingface:main from githubnemo:feature/method-comparison on Mar 27, 2025
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Still WIP:
- no sacred or other experiment tracking framework
- only toy task
- Fix error with data processing, so the model now learns properly
- Use tqdm
- Improve console logging
- Adjust data filtering criteria
- Add option for cosine lr scheduler (see the sketch below)
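For context, a cosine learning-rate schedule could be wired up roughly like this; a minimal sketch using transformers' scheduler helper, where the config value and the constant-schedule fallback are assumptions rather than the exact code in this PR:

```python
import torch
from transformers import get_cosine_schedule_with_warmup


def build_lr_scheduler(optimizer: torch.optim.Optimizer, num_training_steps: int, kind: str | None):
    # "cosine" anneals the learning rate over training;
    # anything else falls back to a constant schedule.
    if kind == "cosine":
        return get_cosine_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(0.1 * num_training_steps),  # 10% warmup, illustrative
            num_training_steps=num_training_steps,
        )
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lambda _: 1.0)
```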
Off by default, as it gives OOM without any speed benefit.
Also, the model never predicts EOS.
The experiment-specific training params use the default training params but can override any parameter from them if needed. This way it's easier to make a change across all experiments (say, if I want to change the base model, I don't need to edit each individual training_parameters.json).
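A minimal sketch of how such an override scheme could work, assuming JSON files; the file names and layout here are illustrative, not necessarily those used in the PR:

```python
import json


def load_training_params(default_path: str, experiment_path: str) -> dict:
    # Start from the shared defaults, then let the experiment-specific
    # file override individual keys.
    with open(default_path) as f:
        params = json.load(f)
    with open(experiment_path) as f:
        params.update(json.load(f))
    return params


# e.g. load_training_params("default_training_params.json",
#                           "experiments/lora/training_parameters.json")
```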
However, both flash attention 2 and flex attention are slower on my system, so we stay with the default None (-> SDPA).
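For illustration, the attention backend is typically selected when loading the model; the model id below is a placeholder and the exact plumbing in this PR may differ:

```python
import torch
from transformers import AutoModelForCausalLM

# attn_implementation=None lets transformers pick its default (SDPA when
# available); alternatives include "eager", "flash_attention_2" and, in
# recent versions, "flex_attention".
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",  # placeholder model id
    torch_dtype=torch.bfloat16,
    attn_implementation=None,
)
```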
Makes it easier to use, say, the static cache, which is the new default, as it's faster (apart from the first pass).
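As a hedged example, the cache implementation can be chosen per generate call via transformers' cache_implementation argument; the model id and prompt are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("What is 2 + 2?", return_tensors="pt")
# "static" pre-allocates the KV cache, which per the note above is
# faster apart from the first pass.
outputs = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```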
Type annotations, a training param, comments
No longer log the sha256sum of them
Also, print the number of (trainable) parameters.
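Counting (trainable) parameters is straightforward in plain PyTorch; a small sketch of what such a print could look like:

```python
import torch.nn as nn


def print_param_counts(model: nn.Module) -> None:
    # Total parameters vs. the subset that will actually be updated.
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {trainable:,} || all params: {total:,} "
          f"|| trainable%: {100 * trainable / total:.4f}")
```

PEFT models also expose print_trainable_parameters(), which reports the same information.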
- Clarify dataset split for the MetaMathQA task
- Add section on contributing
E.g. 1/2 == 0.5
But add --clean to delete it. Keeping the adapter can be useful if the user wants to run further tests with the trained model.
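A rough sketch of such a --clean flag, assuming argparse and a local output directory (both the flag handling and the path are illustrative):

```python
import argparse
import shutil

parser = argparse.ArgumentParser()
parser.add_argument(
    "--clean",
    action="store_true",
    help="delete the trained adapter after evaluation instead of keeping it",
)
args = parser.parse_args()

output_dir = "output/adapter"  # placeholder path
# ... training and evaluation happen here ...
if args.clean:
    # The adapter is kept by default so it can be reused for further tests.
    shutil.rmtree(output_dir, ignore_errors=True)
```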
Otherwise, if one of these packages is missing, we could get an error after the model training and lose all progress.
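One way to guard against that, sketched here with importlib (the package list is illustrative):

```python
import importlib.util
import sys

# Fail fast: check result-processing dependencies before any training starts,
# so a missing package cannot surface only after hours of compute.
REQUIRED_PACKAGES = ["datasets", "pandas"]

missing = [pkg for pkg in REQUIRED_PACKAGES if importlib.util.find_spec(pkg) is None]
if missing:
    sys.exit(f"Missing required packages: {', '.join(missing)}")
```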
Adding run.py
…t into feature/method-comparison
Improve Pareto plot
It's true by default, same as in PEFT in general.
Add more experiments
…adapter-dtype-option Add option to change autocast_adapter_dtype
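For reference, autocast_adapter_dtype is an argument of PEFT's get_peft_model and defaults to True; a minimal sketch, with the base model and LoRA config as placeholders:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",  # placeholder model id
    torch_dtype=torch.bfloat16,
)
config = LoraConfig(task_type="CAUSAL_LM")
# autocast_adapter_dtype=True (the default) casts adapter weights to float32
# even when the base model is loaded in a reduced dtype.
model = get_peft_model(base_model, config, autocast_adapter_dtype=True)
```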
BenjaminBossan approved these changes on Mar 27, 2025
LGTM.
Introduction of a method evaluation suite.
We generally face the problem that there is little knowledge of which PEFT methods perform best. To this end, we decided to build an evaluation suite that has defined tasks and shared hyper-parameters and can be extended with new tasks and new method configurations over time.
For the sake of comparability, we've decided not to incorporate user-submitted results, but we encourage users to inspect the results, suggest new experiments, and improve the configuration of methods if they're deemed unfavorable.
As of now there's only one task, based on the MetaMathQA dataset, which has the benefit of being complex while still fitting on a consumer GPU.