Make StaticCache configurable at model construct time #32830
Conversation
Is the test failure relevant? I don't get what this test does and what the error means.
Hey! Allowing configuration of the `StaticCache` is most welcome! The other changes IMO do not belong in transformers and are way too heavy; in optimum we could simply wrap around transformers models, as that is something we are more likely to push than changing each modeling file.
As we've discussed in other places, the cache config should live inside `GenerationConfig`, not in `Config` 🤗 There should be no changes in the model architecture files (e.g. `modeling_gemma.py`), otherwise we expose ourselves to a very lengthy feature propagation cycle -- we would have to edit hundreds of files.
In a nutshell, to enable this feature:
1. `generation_config` can hold and serialize the desired cache config
2. `PreTrainedModel.from_pretrained` should accept a `generation_config` of type `GenerationConfig`, and overwrite the default `generation_config`
3. `generate` uses that field when it is set
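For illustration, a minimal sketch of how this could look from the user's side, assuming `GenerationConfig` grows cache-related fields -- the field names `cache_implementation` and `cache_config` here are illustrative, not a settled API:

```python
import torch
from transformers import AutoModelForCausalLM, GenerationConfig

# Sketch only: assumes GenerationConfig can hold and serialize cache settings.
generation_config = GenerationConfig(
    use_cache=True,
    cache_implementation="static",
    cache_config={"batch_size": 1, "max_cache_len": 32},
)

# from_pretrained overwrites the default generation_config with this one,
# so generate() and export code can pick up the cache settings later.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    torch_dtype=torch.float32,
    generation_config=generation_config,
)
```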
I have no preference where the cache config lives. I explained why some modeling code change is necessary in order to support export.
It makes sense to have the cache config live in `GenerationConfig`.
Totally understand the motivations!
@guangy10 my apologies for the acute review above -- the PR already had a positive review, I wanted to make sure it didn't go in at least without exploring alternatives 🤗 As Arthur wrote, let's try to explore options that do not require modeling changes. We've learned from recent experience that seemingly harmless changes can quickly escalate into difficult issues -- for us and for our users.
Thank you for reviewing this PR. All feedback is appreciated!
@ArthurZucker @gante This PR is updated according to the comments about the generation config, plus what we discussed in the meeting last week. I was planning to split this PR into two parts, then I realized it's easier to keep it in one so that you can see the full picture of how the model is configured and exported.
cc: @amyeroberts @qubvel for additional eyes on reviewing.
Okay, all CIs are green 🚀 🚀
Looks good! Would be nice to have a real-world example.
Yeah, there is an e2e demo in the ExecuTorch repo with PR pytorch/executorch#4723; you can see the details there. After this PR is merged, I can start adding an integration test for gemma-2b utilizing the exported program (not lowered to ExecuTorch, since there is no ExecuTorch integration in the transformers repo) to generate sequences, roughly as sketched below.
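For context, a rough sketch of what such an integration test could do with the exported program alone, no ExecuTorch lowering. The greedy decoding loop and the `(input_ids, cache_position)` calling convention are assumptions for illustration, not the final test:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
# Integration point added by this PR; import path per the PR description.
from transformers.integrations.executorch import convert_and_export_with_cache

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
# Assumes the static cache settings travel via generation_config, as discussed above.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation="static",
        cache_config={"batch_size": 1, "max_cache_len": 20},
    ),
)
exported = convert_and_export_with_cache(model)

# Feed tokens one position at a time: first the prompt (prefill), then
# greedy decoding, relying on the StaticCache baked into the export.
tokens = tokenizer("My name is", return_tensors="pt").input_ids[0].tolist()
for pos in range(20):
    logits = exported.module()(
        torch.tensor([[tokens[pos]]], dtype=torch.long),  # input_ids
        torch.tensor([pos], dtype=torch.long),            # cache_position
    )
    next_token = int(torch.argmax(logits[:, -1, :], dim=-1))
    if pos + 1 >= len(tokens):  # past the prompt: keep the generated token
        tokens.append(next_token)
print(tokenizer.decode(tokens))
```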
The examples_tensorflow failure doesn't seem to be relevant.
Thanks for adding this!
Just gave an overall review: @ArthurZucker and @gante are the cache kings, so I'll leave it to them to cover anything on the cache design patterns. Overall looks good to me - just a few small comments and nits
Thank you for iterating, looks mostly good to me 💛
I've added a few comments to make the PR fit with a few `transformers` nuances, like our import structure or documentation.
haha I see @amyeroberts was reviewing this at the same time as I was; there are a few things we both suggested/mentioned (@amyeroberts's comments take precedence, as she's in charge of overall library maintenance)
Added a few suggestions to fix the import structure according to our expectations, otherwise LGTM 🤗
Thanks for iterating on this!
+1 on @gante's comments about handling the imports such that most of the code isn't indented under `is_torch_available`, and a small nit on the docs.
Otherwise all looks good to me! ❤️
@@ -0,0 +1,32 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
We should add the following here:
"Copyright (c) Meta Platforms, Inc. and affiliates."
Also, should "HuggingFace Team" be here, since you technically haven't contributed to this file yet?
self.skipTest(reason="This test requires torch >= 2.3 to run.")
set_seed(0)
device = "cpu"
dtype = torch.float32
Add a copyright header in this file too? What's your general guideline for a file with mixed credits?
Found one with mixed credits: `# Copyright 2024 The ggml.ai team and The HuggingFace Inc. team. and pygguf author (github.com/99991)`
Just updated the copyright headers and rebased onto the latest main to resolve conflicts. I'm starting to see new failures on this PR. @amyeroberts @gante Any idea why, or is it irrelevant and safe to merge?
Hi @guangy10 👋 I wasn't sure what was missing at first, so I went into the code, found the issue, and pushed the fix (tl;dr: it was missing the lazy import structure for the new integration module). EDIT: it was also missing an entry for the new doc page in our docs' ToC.
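For readers unfamiliar with it, this is roughly the lazy import pattern used across `transformers` `__init__.py` files; the specific entries shown for the executorch integration are illustrative of the fix, not copied from it:

```python
# Sketch of transformers' lazy import structure; the exact entries for the
# executorch integration are assumptions for illustration.
from typing import TYPE_CHECKING

from ..utils import _LazyModule

_import_structure = {
    "executorch": [
        "TorchExportableModuleWithStaticCache",
        "convert_and_export_with_cache",
    ],
}

if TYPE_CHECKING:
    # Static type checkers and IDEs see the real imports...
    from .executorch import TorchExportableModuleWithStaticCache, convert_and_export_with_cache
else:
    # ...while at runtime the module is replaced by a lazy proxy that only
    # imports a submodule when one of its attributes is first accessed.
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
```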
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@gante Thanks! If there are no more comments, can we merge this PR?
@guangy10 done! Thank you for iterating with us 🤗
Summary: bypass-github-export-checks

[Done] ~~Require PR [Make StaticCache configurable at model construct time](huggingface/transformers#32830) in order to export, lower and run the 🤗 model OOTB.~~
[Done] ~~Require huggingface/transformers#33303 or huggingface/transformers#33287 to be merged to 🤗 `transformers` to resolve the export issue introduced by huggingface/transformers#32543.~~

-----------

Now we can take the integration point from 🤗 `transformers` to lower compatible models to ExecuTorch OOTB.
- This PR creates a simple script with a recipe for XNNPACK.
- This PR also creates a secret `EXECUTORCH_HT_TOKEN` to allow downloading checkpoints in the CI.
- This PR connects the 🤗 "Export to ExecuTorch" e2e workflow to ExecuTorch CI.

### Instructions to run the demo:

1. Run export_hf_model.py to lower gemma-2b to ExecuTorch:
```
python -m extension.export_util.export_hf_model -hfm "google/gemma-2b" # The model is exported with static dims and static KV cache
```
2. Run tokenizer.py to generate the binary format for the ExecuTorch runtime:
```
python -m extension.llm.tokenizer.tokenizer -t <path_to_downloaded_gemma_checkpoint_dir>/tokenizer.model -o tokenizer.bin
```
3. Build the llm runner by following [step 4](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#step-4-run-on-your-computer-to-validate) of this guide.
4. Run the lowered model:
```
cmake-out/examples/models/llama2/llama_main --model_path=gemma.pte --tokenizer_path=tokenizer.bin --prompt="My name is"
```

OOTB output and perf:
```
I 00:00:00.003110 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003360 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003380 executorch:cpuinfo_utils.cpp:158] Number of efficient cores 4
I 00:00:00.003384 executorch:main.cpp:65] Resetting threadpool with num threads = 6
I 00:00:00.014716 executorch:runner.cpp:51] Creating LLaMa runner: model_path=gemma.pte, tokenizer_path=tokenizer_gemma.bin
I 00:00:03.065359 executorch:runner.cpp:66] Reading metadata from model
I 00:00:03.065391 executorch:metadata_util.h:43] get_n_bos: 1
I 00:00:03.065396 executorch:metadata_util.h:43] get_n_eos: 1
I 00:00:03.065399 executorch:metadata_util.h:43] get_max_seq_len: 123
I 00:00:03.065402 executorch:metadata_util.h:43] use_kv_cache: 1
I 00:00:03.065404 executorch:metadata_util.h:41] The model does not contain use_sdpa_with_kv_cache method, using default value 0
I 00:00:03.065405 executorch:metadata_util.h:43] use_sdpa_with_kv_cache: 0
I 00:00:03.065407 executorch:metadata_util.h:41] The model does not contain append_eos_to_prompt method, using default value 0
I 00:00:03.065409 executorch:metadata_util.h:43] append_eos_to_prompt: 0
I 00:00:03.065411 executorch:metadata_util.h:41] The model does not contain enable_dynamic_shape method, using default value 0
I 00:00:03.065412 executorch:metadata_util.h:43] enable_dynamic_shape: 0
I 00:00:03.130388 executorch:metadata_util.h:43] get_vocab_size: 256000
I 00:00:03.130405 executorch:metadata_util.h:43] get_bos_id: 2
I 00:00:03.130408 executorch:metadata_util.h:43] get_eos_id: 1
My name is Melle. I am a 20 year old girl from Belgium. I am living in the southern part of Belgium. I am 165 cm tall and I weigh 45kg. I like to play sports like swimming, running and playing tennis. I am very interested in music and I like to listen to classical music. I like to sing and I can play the piano. I would like to go to the USA because I like to travel a lot. I am looking for a boy from the USA who is between 18 and 25 years old. I
PyTorchObserver {"prompt_tokens":4,"generated_tokens":118,"model_load_start_ms":1723685715497,"model_load_end_ms":1723685718612,"inference_start_ms":1723685718612,"inference_end_ms":1723685732965,"prompt_eval_end_ms":1723685719087,"first_token_ms":1723685719087,"aggregate_sampling_time_ms":182,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:17.482472 executorch:stats.h:70] Prompt Tokens: 4 Generated Tokens: 118
I 00:00:17.482475 executorch:stats.h:76] Model Load Time: 3.115000 (seconds)
I 00:00:17.482481 executorch:stats.h:86] Total inference time: 14.353000 (seconds) Rate: 8.221278 (tokens/second)
I 00:00:17.482483 executorch:stats.h:94] Prompt evaluation: 0.475000 (seconds) Rate: 8.421053 (tokens/second)
I 00:00:17.482485 executorch:stats.h:105] Generated 118 tokens: 13.878000 (seconds) Rate: 8.502666 (tokens/second)
I 00:00:17.482486 executorch:stats.h:113] Time to first generated token: 0.475000 (seconds)
I 00:00:17.482488 executorch:stats.h:120] Sampling time over 122 tokens: 0.182000 (seconds)
```

Pull Request resolved: #4723
Reviewed By: huydhn, kirklandsign
Differential Revision: D62543933
Pulled By: guangy10
fbshipit-source-id: 00401a39ba03d7383e4b284d25c8fc62a6695b34
* Make StaticCache configurable at model construct time
* integrations import structure
* add new doc file to toc

---------

Co-authored-by: Guang Yang <[email protected]>
Co-authored-by: Joao Gante <[email protected]>
What does this PR do?

This PR is to address #32500 for "Export to ExecuTorch":
- Make `StaticCache` configurable at model construct time.
- Add the integration point for `ExecuTorch` at `transformers/integrations/executorch.py`, and host the wrapper module class and the util `convert_and_export_with_cache` there.

A real-world example/demo:
- The test model `gemma-2b` is naturally exportable via `convert_and_export_with_cache` (see the sketch after this list).
- The test model `gemma-2b` is also lowerable and runnable via `ExecuTorch` w/ 15 tokens/s on the `XNNPACK` backend. Check out pytorch/executorch#4723 for details in the `ExecuTorch` repo.
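A minimal sketch of the advertised entry point; the exact kwargs and defaults are assumptions, the function may also accept example inputs:

```python
import torch
from transformers import AutoModelForCausalLM, GenerationConfig
from transformers.integrations.executorch import convert_and_export_with_cache

# Assumes the static cache settings travel via generation_config,
# as discussed in the review thread above.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation="static",
        cache_config={"batch_size": 1, "max_cache_len": 20},
    ),
)
model.eval()

# Returns a torch.export ExportedProgram with the StaticCache baked in,
# ready to be lowered to ExecuTorch (e.g. via the XNNPACK recipe).
exported_program = convert_and_export_with_cache(model)
```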
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Did you update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker
@amyeroberts
@gante