Make StaticCache configurable at model construct time #32830
Conversation
Is the test failure relevant? I don't get what this test does and what the error means.
Hey! Allowing configuration of the `StaticCache` is most welcome! The other changes IMO do not belong in transformers and are way too heavy; in optimum we could simply wrap around transformers models, as that is something we are more likely to push than changing each modeling file.
As we've discussed in other places, the cache config should live inside `GenerationConfig`, not in `Config` 🤗 There should be no changes in the model architecture files (e.g. `modeling_gemma.py`), otherwise we expose ourselves to a very lengthy feature propagation cycle -- we would have to edit hundreds of files.
In a nutshell, to enable this feature:
1. `generation_config` can hold and serialize the desired cache config
2. `PreTrainedModel.from_pretrained` should accept a `generation_config` of type `GenerationConfig`, and overwrite the default `generation_config`
3. `generate` uses that field when it is set
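For illustration, a minimal sketch of how this could look from the user's side, assuming `GenerationConfig` grows cache-related fields -- the field names `cache_implementation` and `cache_config` here are illustrative, not a settled API:

```python
import torch
from transformers import AutoModelForCausalLM, GenerationConfig

# Sketch only: assumes GenerationConfig can hold and serialize cache settings.
generation_config = GenerationConfig(
    use_cache=True,
    cache_implementation="static",
    cache_config={"batch_size": 1, "max_cache_len": 32},
)

# from_pretrained overwrites the default generation_config with this one,
# so generate() and export code can pick up the cache settings later.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    torch_dtype=torch.float32,
    generation_config=generation_config,
)
```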
I have no preference where the cache config lives. I explained why some modeling code change is necessary in order to support export.
It makes sense to have the cache config live in `GenerationConfig`.
Totally understand the motivations!
@guangy10 my apologies for the acute review above -- the PR already had a positive review, I wanted to make sure it didn't go in at least without exploring alternatives 🤗 As Arthur wrote, let's try to explore options that do not require modeling changes. We've learned from recent experience that seemingly harmless changes can quickly escalate into difficult issues -- for us and for our users.
Thank you for reviewing this PR. All feedback is appreciated!
@ArthurZucker @gante This PR is updated according to the comments about the generation config, plus what we discussed in the meeting last week. I was planning to split this PR into two parts, then I realized it's easier to keep it in one so that you can see the full picture of how the model is configured and exported.
cc: @amyeroberts @qubvel for additional eyes on reviewing.
Okay, all CIs are green 🚀 🚀
Looks good! Would be nice to have a real-world example.
Yeah, there is an e2e demo in the ExecuTorch repo with PR pytorch/executorch#4723; you can see the details there. After this PR is merged, I can start adding an integration test for gemma-2b utilizing the exported program (not lowered to ExecuTorch, since there is no ExecuTorch integration in the transformers repo) to generate sequences, roughly as sketched below.
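For context, a rough sketch of what such an integration test could do with the exported program alone, no ExecuTorch lowering. The greedy decoding loop and the `(input_ids, cache_position)` calling convention are assumptions for illustration, not the final test:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
# Integration point added by this PR; import path per the PR description.
from transformers.integrations.executorch import convert_and_export_with_cache

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
# Assumes the static cache settings travel via generation_config, as discussed above.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation="static",
        cache_config={"batch_size": 1, "max_cache_len": 20},
    ),
)
exported = convert_and_export_with_cache(model)

# Feed tokens one position at a time: first the prompt (prefill), then
# greedy decoding, relying on the StaticCache baked into the export.
tokens = tokenizer("My name is", return_tensors="pt").input_ids[0].tolist()
for pos in range(20):
    logits = exported.module()(
        torch.tensor([[tokens[pos]]], dtype=torch.long),  # input_ids
        torch.tensor([pos], dtype=torch.long),            # cache_position
    )
    next_token = int(torch.argmax(logits[:, -1, :], dim=-1))
    if pos + 1 >= len(tokens):  # past the prompt: keep the generated token
        tokens.append(next_token)
print(tokenizer.decode(tokens))
```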
The examples_tensorflow failure doesn't seem to be relevant.
Thanks for adding this!
Just gave an overall review: @ArthurZucker and @gante are the cache kings, so I'll leave it to them to cover anything on the cache design patterns. Overall looks good to me - just a few small comments and nits
Thank you for iterating, looks mostly good to me 💛
I've added a few comments to make the PR fit with a few `transformers` nuances, like our import structure or documentation.
haha I see @amyeroberts was reviewing this at the same time as I was; there are a few things we both suggested/mentioned (@amyeroberts's comments take precedence, as she's in charge of overall library maintenance)
Added a few suggestions to fix the import structure according to our expectations, otherwise LGTM 🤗
Thanks for iterating on this!
+1 on @gante's comments about handling the imports such that most of the code isn't indented under `is_torch_available`, and a small nit on the docs.
Otherwise all looks good to me! ❤️
@@ -0,0 +1,32 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
We should add the following here:
"Copyright (c) Meta Platforms, Inc. and affiliates."
Also, should "HuggingFace Team" be here, since you technically haven't contributed to this file yet?
self.skipTest(reason="This test requires torch >= 2.3 to run.")
set_seed(0)
device = "cpu"
dtype = torch.float32
Add a copyright header in this file too? What's your general guideline for a file with mixed credits?
Found one with mixed credits: `# Copyright 2024 The ggml.ai team and The HuggingFace Inc. team. and pygguf author (github.com/99991)`
Just updated the copyright headers and rebased onto the latest main to resolve conflicts. I'm starting to see new failures on this PR. @amyeroberts @gante Any idea why, or is it irrelevant and safe to merge?
Hi @guangy10 👋 I wasn't sure what was missing at first, so I went into the code, found the issue, and pushed the fix (tl;dr: it was missing the lazy import structure for the new integration module). EDIT: it was also missing an entry for the new doc page in our docs' ToC.
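For readers unfamiliar with it, this is roughly the lazy import pattern used across `transformers` `__init__.py` files; the specific entries shown for the executorch integration are illustrative of the fix, not copied from it:

```python
# Sketch of transformers' lazy import structure; the exact entries for the
# executorch integration are assumptions for illustration.
from typing import TYPE_CHECKING

from ..utils import _LazyModule

_import_structure = {
    "executorch": [
        "TorchExportableModuleWithStaticCache",
        "convert_and_export_with_cache",
    ],
}

if TYPE_CHECKING:
    # Static type checkers and IDEs see the real imports...
    from .executorch import TorchExportableModuleWithStaticCache, convert_and_export_with_cache
else:
    # ...while at runtime the module is replaced by a lazy proxy that only
    # imports a submodule when one of its attributes is first accessed.
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
```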
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@gante Thanks! If there are no more comments, can we merge this PR?
@guangy10 done! Thank you for iterating with us 🤗
Summary: bypass-github-export-checks

[Done] ~~Require PR [Make StaticCache configurable at model construct time](huggingface/transformers#32830) in order to export, lower and run the 🤗 model OOTB.~~
[Done] ~~Require huggingface/transformers#33303 or huggingface/transformers#33287 to be merged to 🤗 `transformers` to resolve the export issue introduced by huggingface/transformers#32543.~~

-----------

Now we can take the integration point from 🤗 `transformers` to lower compatible models to ExecuTorch OOTB.
- This PR creates a simple script with a recipe for XNNPACK.
- This PR also creates a secret `EXECUTORCH_HT_TOKEN` to allow downloading checkpoints in the CI.
- This PR connects the 🤗 "Export to ExecuTorch" e2e workflow to ExecuTorch CI.

### Instructions to run the demo:

1. Run export_hf_model.py to lower gemma-2b to ExecuTorch:
```
python -m extension.export_util.export_hf_model -hfm "google/gemma-2b" # The model is exported with static dims and static KV cache
```
2. Run tokenizer.py to generate the binary format for the ExecuTorch runtime:
```
python -m extension.llm.tokenizer.tokenizer -t <path_to_downloaded_gemma_checkpoint_dir>/tokenizer.model -o tokenizer.bin
```
3. Build the llm runner by following [step 4](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#step-4-run-on-your-computer-to-validate) of this guide.
4. Run the lowered model:
```
cmake-out/examples/models/llama2/llama_main --model_path=gemma.pte --tokenizer_path=tokenizer.bin --prompt="My name is"
```

OOTB output and perf:
```
I 00:00:00.003110 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003360 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003380 executorch:cpuinfo_utils.cpp:158] Number of efficient cores 4
I 00:00:00.003384 executorch:main.cpp:65] Resetting threadpool with num threads = 6
I 00:00:00.014716 executorch:runner.cpp:51] Creating LLaMa runner: model_path=gemma.pte, tokenizer_path=tokenizer_gemma.bin
I 00:00:03.065359 executorch:runner.cpp:66] Reading metadata from model
I 00:00:03.065391 executorch:metadata_util.h:43] get_n_bos: 1
I 00:00:03.065396 executorch:metadata_util.h:43] get_n_eos: 1
I 00:00:03.065399 executorch:metadata_util.h:43] get_max_seq_len: 123
I 00:00:03.065402 executorch:metadata_util.h:43] use_kv_cache: 1
I 00:00:03.065404 executorch:metadata_util.h:41] The model does not contain use_sdpa_with_kv_cache method, using default value 0
I 00:00:03.065405 executorch:metadata_util.h:43] use_sdpa_with_kv_cache: 0
I 00:00:03.065407 executorch:metadata_util.h:41] The model does not contain append_eos_to_prompt method, using default value 0
I 00:00:03.065409 executorch:metadata_util.h:43] append_eos_to_prompt: 0
I 00:00:03.065411 executorch:metadata_util.h:41] The model does not contain enable_dynamic_shape method, using default value 0
I 00:00:03.065412 executorch:metadata_util.h:43] enable_dynamic_shape: 0
I 00:00:03.130388 executorch:metadata_util.h:43] get_vocab_size: 256000
I 00:00:03.130405 executorch:metadata_util.h:43] get_bos_id: 2
I 00:00:03.130408 executorch:metadata_util.h:43] get_eos_id: 1
My name is Melle. I am a 20 year old girl from Belgium. I am living in the southern part of Belgium. I am 165 cm tall and I weigh 45kg. I like to play sports like swimming, running and playing tennis. I am very interested in music and I like to listen to classical music. I like to sing and I can play the piano. I would like to go to the USA because I like to travel a lot. I am looking for a boy from the USA who is between 18 and 25 years old. I
PyTorchObserver {"prompt_tokens":4,"generated_tokens":118,"model_load_start_ms":1723685715497,"model_load_end_ms":1723685718612,"inference_start_ms":1723685718612,"inference_end_ms":1723685732965,"prompt_eval_end_ms":1723685719087,"first_token_ms":1723685719087,"aggregate_sampling_time_ms":182,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:17.482472 executorch:stats.h:70] Prompt Tokens: 4 Generated Tokens: 118
I 00:00:17.482475 executorch:stats.h:76] Model Load Time: 3.115000 (seconds)
I 00:00:17.482481 executorch:stats.h:86] Total inference time: 14.353000 (seconds) Rate: 8.221278 (tokens/second)
I 00:00:17.482483 executorch:stats.h:94] Prompt evaluation: 0.475000 (seconds) Rate: 8.421053 (tokens/second)
I 00:00:17.482485 executorch:stats.h:105] Generated 118 tokens: 13.878000 (seconds) Rate: 8.502666 (tokens/second)
I 00:00:17.482486 executorch:stats.h:113] Time to first generated token: 0.475000 (seconds)
I 00:00:17.482488 executorch:stats.h:120] Sampling time over 122 tokens: 0.182000 (seconds)
```

Pull Request resolved: #4723
Reviewed By: huydhn, kirklandsign
Differential Revision: D62543933
Pulled By: guangy10
fbshipit-source-id: 00401a39ba03d7383e4b284d25c8fc62a6695b34
* Make StaticCache configurable at model construct time
* integrations import structure
* add new doc file to toc

---------

Co-authored-by: Guang Yang <[email protected]>
Co-authored-by: Joao Gante <[email protected]>
What does this PR do?

This PR is to address #32500 for "Export to ExecuTorch":
- Make `StaticCache` configurable at model construct time.
- Add the integration point for `ExecuTorch` at `transformers/integrations/executorch.py`, and host the wrapper module class and the util `convert_and_export_with_cache` there.

A real-world example/demo:
- The test model `gemma-2b` is naturally exportable via `convert_and_export_with_cache` (see the sketch after this list).
- The test model `gemma-2b` is also lowerable and runnable via `ExecuTorch` w/ 15 tokens/s on the `XNNPACK` backend. Check out pytorch/executorch#4723 for details in the `ExecuTorch` repo.
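A minimal sketch of the advertised entry point; the exact kwargs and defaults are assumptions, the function may also accept example inputs:

```python
import torch
from transformers import AutoModelForCausalLM, GenerationConfig
from transformers.integrations.executorch import convert_and_export_with_cache

# Assumes the static cache settings travel via generation_config,
# as discussed in the review thread above.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation="static",
        cache_config={"batch_size": 1, "max_cache_len": 20},
    ),
)
model.eval()

# Returns a torch.export ExportedProgram with the StaticCache baked in,
# ready to be lowered to ExecuTorch (e.g. via the XNNPACK recipe).
exported_program = convert_and_export_with_cache(model)
```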
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Did you update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker
@amyeroberts
@gante