
Make StaticCache configurable at model construct time #32830

Merged: 3 commits into huggingface:main on Sep 10, 2024

Conversation

@guangy10 (Contributor) commented Aug 15, 2024

What does this PR do?

This PR is to address #32500 for "Export to ExecuTorch"

  • Enable loading a model with an option to statically configure it using StaticCache
  • Create a new integration point for ExecuTorch at transformers/integrations/executorch.py, which hosts the wrapper module class and the convert_and_export_with_cache utility
  • Expose the module so that it can be used in other repos
  • Improve tests
Example usage:

```python
from transformers import AutoModelForCausalLM, GenerationConfig, convert_and_export_with_cache

# Example values; substitute your own model repo and cache settings.
hf_model_repo = "google/gemma-2b"
batch_size = 1
max_cache_len = 1024
cache_implementation = "static"

model = AutoModelForCausalLM.from_pretrained(
    hf_model_repo,
    attn_implementation="sdpa",
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation=cache_implementation,
        max_length=max_cache_len,
        cache_config={
            "batch_size": batch_size,
            "max_cache_len": max_cache_len,
        },
    ),
)

exported_prog = convert_and_export_with_cache(model)

# Further lower the exported program to ExecuTorch with delegates
```
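For illustration, the further lowering step might look roughly like the sketch below, using the ExecuTorch edge/XNNPACK APIs; module paths and API details may differ across ExecuTorch versions, and this is not part of this PR (see pytorch/executorch#4723 for the real recipe).

```python
# Hypothetical lowering sketch; ExecuTorch APIs may differ by version.
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge

edge_prog = to_edge(exported_prog)                      # torch.export program -> Edge dialect
edge_prog = edge_prog.to_backend(XnnpackPartitioner())  # delegate supported subgraphs to XNNPACK
et_prog = edge_prog.to_executorch()                     # serialize to an ExecuTorch program

with open("gemma.pte", "wb") as f:
    f.write(et_prog.buffer)                             # .pte file loadable by the ExecuTorch runtime
```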

A real-world example/demo:
The test model gemma-2b is naturally exportable via convert_and_export_with_cache, and it is also lowerable and runnable via ExecuTorch at ~15 tokens/s on the XNNPACK backend. Check out pytorch/executorch#4723 for details in the ExecuTorch repo.

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker
@amyeroberts
@gante

@guangy10 (Contributor, Author) commented Aug 15, 2024

Is the test failure relevant?

```
FAILED tests/generation/test_beam_search.py::ConstrainedBeamSearchTest::test_constrained_beam_scorer_finalize -
ValueError: Each list in `nested_token_ids` can't be a complete subset of another list, but is [[19, 40], [19, 40]].
```

I don't understand what this test does or what the error means.

@ArthurZucker self-requested a review on August 16, 2024 08:10
@ArthurZucker (Collaborator) left a comment

Hey! Allowing configuration of the StaticCache is most welcome! The other changes IMO do not belong in transformers and are way too heavy; in optimum we could simply wrap around transformers models, as that's something we are more likely to push than changing each modeling file.

Inline review comments on src/transformers/cache_utils.py and src/transformers/models/gemma/modeling_gemma.py (resolved).
@gante (Member) left a comment

As we've discussed in other places, the cache config should live inside GenerationConfig, not in Config 🤗 There should be no changes in the model architecture files (e.g. modeling_gemma.py), otherwise we expose ourselves to a very lengthy feature propagation cycle -- we would have to edit hundreds of files.

In a nutshell, to enable this feature:

  1. generation_config can hold and serialize the desired cache config
  2. PreTrainedModel.from_pretrained should accept generation_config of type GenerationConfig, and overwrite the default generation_config
  3. generate uses that field when it is set
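A rough sketch of how those three steps could fit together (the directory name and values below are illustrative; the cache_implementation/cache_config fields mirror the snippet in the PR description rather than a finalized API):

```python
from transformers import AutoModelForCausalLM, GenerationConfig

# 1. The cache configuration lives on GenerationConfig and serializes with it.
gen_config = GenerationConfig(
    use_cache=True,
    cache_implementation="static",
    cache_config={"batch_size": 1, "max_cache_len": 128},
)
gen_config.save_pretrained("my-model-dir")            # writes generation_config.json (illustrative path)
gen_config = GenerationConfig.from_pretrained("my-model-dir")

# 2. from_pretrained accepts a GenerationConfig and overrides the model's default one.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", generation_config=gen_config)

# 3. generate() then reads the cache settings from model.generation_config when they are set.
# output = model.generate(**inputs)
```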

@guangy10 (Contributor, Author) commented:

> As we've discussed in other places, the cache config should live inside GenerationConfig, not in Config 🤗 There should be no changes in the model architecture files (e.g. modeling_gemma.py), otherwise we expose ourselves to a very lengthy feature propagation cycle -- we would have to edit hundreds of files.
>
> In a nutshell, to enable this feature:
>
> 1. generation_config can hold and serialize the desired cache config
> 2. PreTrainedModel.from_pretrained should accept generation_config of type GenerationConfig, and overwrite the default generation_config
> 3. generate uses that field when it is set

I have no preference for where the cache_config lives, so I can move it to GenerationConfig per your feedback.

I explained above why some modeling code changes are necessary to support torch.export(), and the work should be similar to what was done to support torch.compile. Because torch.export is more restricted, due to its broader use cases in non-Python environments, there are limitations on the graph inputs/outputs. The idea is that if we could make such changes in fundamental models like llama, bert, clip, etc., many other models built on them could leverage the work. @amyeroberts can share more context just in case.
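To make that torch.export constraint concrete, here is a minimal sketch of the kind of forward() adapter being discussed. The actual wrapper added in transformers/integrations/executorch.py differs in its details; the class name, constructor arguments, and signature below are illustrative only.

```python
import torch
from transformers import StaticCache


class ExportableModuleWithStaticCache(torch.nn.Module):
    """Illustrative wrapper: plain tensor inputs/outputs and a pre-allocated StaticCache,
    so the exported graph has no dynamic Python structures at its boundary."""

    def __init__(self, model):
        super().__init__()
        self.model = model
        cache_config = model.generation_config.cache_config
        # Pre-allocate the KV cache with fixed shapes; it becomes module state
        # instead of a dynamic graph input.
        self.static_cache = StaticCache(
            config=model.config,
            batch_size=cache_config["batch_size"],
            max_cache_len=cache_config["max_cache_len"],
            device=model.device,
            dtype=model.dtype,
        )

    def forward(self, input_ids: torch.Tensor, cache_position: torch.Tensor) -> torch.Tensor:
        outputs = self.model(
            input_ids=input_ids,
            cache_position=cache_position,
            past_key_values=self.static_cache,
            use_cache=True,
        )
        return outputs.logits


# torch.export then only sees two tensor inputs and one tensor output, e.g.:
# exported = torch.export.export(
#     ExportableModuleWithStaticCache(model),
#     args=(example_input_ids, example_cache_position),
# )
```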

@guangy10 (Contributor, Author) commented:

> Hey! Allowing configuration of the StaticCache is most welcome! The other changes IMO do not belong in transformers and are way too heavy; in optimum we could simply wrap around transformers models, as that's something we are more likely to push than changing each modeling file.

It makes sense to have optimum be the place where the actual lowering to on-device happens. That would align well with other on-device workflows like ONNX and TFLite. What I'm trying to enable in this PR is purely for torch.export, similar to the work done for torch.compile. One difference between torch.compile and torch.export is that export is the entry point for the PyTorch 2.x on-device workflow, i.e. ExecuTorch. So making a transformers model compatible with torch.export, in a way that can be further lowered to ExecuTorch later via optimum, is what I'm trying to accomplish in the transformers repo.

@ArthurZucker (Collaborator) commented:

Totally understand the motivations!
For now the changes are too "drastic" and constraining for general use cases.
The changes done for torch.compile were actually also beneficial for the non-compile case and fairly "minimal".

@gante (Member) commented Aug 23, 2024

@guangy10 my apologies for the acute review above -- the PR already had a positive review, I wanted to make sure it didn't go in at least without exploring alternatives 🤗

As Arthur wrote, let's try to explore options that do not require modeling changes. We've learned from recent experience that seemingly harmless changes can quickly escalate into difficult issues -- for us and for our users

@guangy10 force-pushed the config_model_with_static_cache branch from 57aecb3 to 1e61048 on August 28, 2024 03:50
@guangy10 (Contributor, Author) commented:

> @guangy10 my apologies for the acute review above -- the PR already had a positive review, I wanted to make sure it didn't go in at least without exploring alternatives 🤗
>
> As Arthur wrote, let's try to explore options that do not require modeling changes. We've learned from recent experience that seemingly harmless changes can quickly escalate into difficult issues -- for us and for our users

Thank you for reviewing this PR. All feedback is appreciated!

@guangy10 (Contributor, Author) commented:

@ArthurZucker @gante This PR is updated according to the comments about the generation config, plus what we discussed in the meeting last week:

  1. Pass generation_config via from_pretrained and override the default generation_config.
  2. Implement the preferred option we discussed last week: continue using the forward() adapter, define it in transformers (which is closest to the source of truth for all transformers models), and use it in all places where "Export to ExecuTorch" is applicable.

I was planning to split this PR into two parts, but then I realized it's easier to keep it in one so that you can see the full picture of how the model is configured and exported.

@guangy10 (Contributor, Author) commented:

cc: @amyeroberts @qubvel for additional eyes on reviewing.

@guangy10 (Contributor, Author) commented:

Okay, all CIs are green 🚀 🚀

@guangy10 force-pushed the config_model_with_static_cache branch from 805c61d to b93152e on August 29, 2024 21:16
@ArthurZucker (Collaborator) left a comment

Looks good! Would be nice to have a real world example

Inline review comments on src/transformers/integrations/executorch.py and tests/utils/test_cache_utils.py (resolved).
@guangy10 (Contributor, Author) commented Sep 3, 2024

> Looks good! Would be nice to have a real world example

Yeah, there is an e2e demo in PR pytorch/executorch#4723 in the ExecuTorch repo. You can see gemma-2b working out-of-the-box at ~15 tokens/s using convert_and_export from huggingface/transformers/integrations/executorch.py that we added in this PR.

After this PR is merged, I can start adding an integration test for gemma-2b that uses the exported program (not lowered to ExecuTorch, since there is no ExecuTorch integration in the transformers repo) to generate sequences.
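For reference, a rough sketch of the generation loop such a test could use, assuming the exported wrapper takes (input_ids, cache_position) and returns logits as in the earlier sketch, and ignoring any fixed input-shape constraints the exported graph may impose (the eventual test may look quite different):

```python
import torch


def greedy_generate_with_exported_program(exported_prog, prompt_ids, max_new_tokens=20):
    """Greedy decoding against the torch.export-ed module, one token at a time."""
    module = exported_prog.module()
    generated = prompt_ids.clone()

    # Prefill: run the whole prompt with its cache positions.
    cache_position = torch.arange(prompt_ids.shape[1], dtype=torch.long)
    logits = module(input_ids=prompt_ids, cache_position=cache_position)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)

    # Decode: feed one token at a time at its position in the static cache.
    for _ in range(max_new_tokens):
        generated = torch.cat([generated, next_token], dim=-1)
        cache_position = torch.tensor([generated.shape[1] - 1], dtype=torch.long)
        logits = module(input_ids=next_token, cache_position=cache_position)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)

    return generated
```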

@guangy10 force-pushed the config_model_with_static_cache branch from b93152e to 7d4092b on September 3, 2024 21:36
@guangy10 (Contributor, Author) commented Sep 3, 2024

The examples_tensorflow failure doesn't seem to be relevant

@amyeroberts (Collaborator) left a comment

Thanks for adding this!

Just gave an overall review: @ArthurZucker and @gante are the cache kings, so I'll leave it to them to cover anything on the cache design patterns. Overall looks good to me - just a few small comments and nits

Inline review comments on src/transformers/integrations/executorch.py and src/transformers/modeling_utils.py (resolved).
@gante (Member) left a comment
Thank you for iterating, looks mostly good to me 💛

I've added a few comments to make the PR fit with a few transformers nuances, like our import structure or documentation.

Inline review comments on src/transformers/cache_utils.py, src/transformers/integrations/executorch.py, and src/transformers/modeling_utils.py (resolved).
@gante (Member) commented Sep 4, 2024

haha I see @amyeroberts was reviewing this at the same time as I was, there are a few things we both suggested/mentioned

(@amyeroberts's comments take precedence, as she's in charge of overall library maintenance)

@gante (Member) left a comment

Added a few suggestions to fix the import structure according to our expectations, otherwise LGTM 🤗

Inline review comments on src/transformers/__init__.py and src/transformers/integrations/executorch.py (resolved).
@amyeroberts (Collaborator) left a comment

Thanks for iterating on this!

+1 on @gante's comments about handling the imports so that most of the code isn't indented under is_torch_available, and a small nit on the docs.

Otherwise all looks good to me! ❤️

@guangy10 force-pushed the config_model_with_static_cache branch from a573552 to 03ac195 on September 6, 2024 00:36
@guangy10 force-pushed the config_model_with_static_cache branch 3 times, most recently from 5199230 to 5ae5243, on September 6, 2024 01:06
Inline review thread on docs/source/en/main_classes/executorch.md:

```
@@ -0,0 +1,32 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
```

> @guangy10 We should add the following here:
>
> "Copyright (c) Meta Platforms, Inc. and affiliates."
>
> cc @amyeroberts @ArthurZucker

> Also, should "HuggingFace Team" be here, since you technically haven't contributed to this file yet?

Inline review comment on src/transformers/integrations/executorch.py (resolved).

Inline review thread on a test snippet:

```python
self.skipTest(reason="This test requires torch >= 2.3 to run.")

set_seed(0)
device = "cpu"
dtype = torch.float32
```

> Add a copyright header in this file too? @ArthurZucker, @amyeroberts, what's your general guideline for a file with mixed credits?

@guangy10 (Contributor, Author) replied:

> Found one with mixed credits (`# Copyright 2024 The ggml.ai team and The HuggingFace Inc. team. and pygguf author (github.com/99991)`). Not sure if it's the standard way.

@guangy10 force-pushed the config_model_with_static_cache branch 2 times, most recently from fa07acb to f213da1, on September 6, 2024 16:40
@guangy10 force-pushed the config_model_with_static_cache branch from 0700b51 to 0ef678d on September 6, 2024 16:51
@guangy10 (Contributor, Author) commented Sep 6, 2024

Just updated the copyright headers and rebased onto the latest main to resolve conflicts. I'm starting to see new failures on this PR. @amyeroberts @gante any idea why, or are they unrelated and safe to merge?

@gante (Member) commented Sep 7, 2024

Hi @guangy10 👋

I wasn't sure what was missing at first, so I went into the code, found the issue, and pushed the fix (tl;dr it was missing the lazy import structure for transformers.integrations). I hope you don't mind 🤗

EDIT: it was also missing an entry for the new doc page in our docs' table of contents.
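For context, a minimal sketch of the lazy-import pattern transformers uses for such submodules; the exported symbol names listed here are illustrative, not the exact diff that was pushed.

```python
# src/transformers/integrations/__init__.py (illustrative excerpt)
import sys

from ..utils import _LazyModule

_import_structure = {
    "executorch": [
        "TorchExportableModuleWithStaticCache",
        "convert_and_export_with_cache",
    ],
}

# Symbols are resolved only on first access, keeping `import transformers` cheap
# and avoiding a hard torch dependency at import time.
sys.modules[__name__] = _LazyModule(
    __name__, globals()["__file__"], _import_structure, module_spec=__spec__
)
```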

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@guangy10 (Contributor, Author) commented Sep 9, 2024

> Hi @guangy10 👋
>
> I wasn't sure what was missing at first, so I went into the code, found the issue, and pushed the fix (tl;dr it was missing the lazy import structure for transformers.integrations). I hope you don't mind 🤗
>
> EDIT: it was also missing an entry for the new doc page in our docs' table of contents.

@gante Thanks! If no more comments, can we merge this PR?

@gante merged commit f38590d into huggingface:main on Sep 10, 2024
23 checks passed
@gante (Member) commented Sep 10, 2024

@guangy10 done! Thank you for iterating with us 🤗

@guangy10 mentioned this pull request on Sep 12, 2024
facebook-github-bot pushed a commit to pytorch/executorch that referenced this pull request Sep 14, 2024
Summary:
bypass-github-export-checks

[Done] ~~Require PR [Make StaticCache configurable at model construct time](huggingface/transformers#32830) in order to export, lower and run the 🤗 model OOTB.~~
[Done] ~~Require huggingface/transformers#33303 or huggingface/transformers#33287 to be merged to 🤗 `transformers` to resolve the export issue introduced by huggingface/transformers#32543.~~

-----------

Now we can take the integration point from 🤗 `transformers` to lower compatible models to ExecuTorch OOTB.
  - This PR creates a simple script with an XNNPACK recipe.
  - This PR also creates a secret `EXECUTORCH_HT_TOKEN` to allow downloading checkpoints in CI.
  - This PR connects the 🤗 "Export to ExecuTorch" e2e workflow to ExecuTorch CI

### Instructions to run the demo:

1. Run export_hf_model.py to lower gemma-2b to ExecuTorch:
```
python -m extension.export_util.export_hf_model -hfm "google/gemma-2b" # The model is exported with static dims and a static KV cache
```
2. Run tokenizer.py to generate the binary format for the ExecuTorch runtime:
```
python -m extension.llm.tokenizer.tokenizer -t <path_to_downloaded_gemma_checkpoint_dir>/tokenizer.model -o tokenizer.bin
```
3. Build the LLM runner by following [step 4](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#step-4-run-on-your-computer-to-validate) of this guide

4. Run the lowered model
```
cmake-out/examples/models/llama2/llama_main --model_path=gemma.pte --tokenizer_path=tokenizer.bin --prompt="My name is"
```
OOTB output and perf
```
I 00:00:00.003110 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003360 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003380 executorch:cpuinfo_utils.cpp:158] Number of efficient cores 4
I 00:00:00.003384 executorch:main.cpp:65] Resetting threadpool with num threads = 6
I 00:00:00.014716 executorch:runner.cpp:51] Creating LLaMa runner: model_path=gemma.pte, tokenizer_path=tokenizer_gemma.bin
I 00:00:03.065359 executorch:runner.cpp:66] Reading metadata from model
I 00:00:03.065391 executorch:metadata_util.h:43] get_n_bos: 1
I 00:00:03.065396 executorch:metadata_util.h:43] get_n_eos: 1
I 00:00:03.065399 executorch:metadata_util.h:43] get_max_seq_len: 123
I 00:00:03.065402 executorch:metadata_util.h:43] use_kv_cache: 1
I 00:00:03.065404 executorch:metadata_util.h:41] The model does not contain use_sdpa_with_kv_cache method, using default value 0
I 00:00:03.065405 executorch:metadata_util.h:43] use_sdpa_with_kv_cache: 0
I 00:00:03.065407 executorch:metadata_util.h:41] The model does not contain append_eos_to_prompt method, using default value 0
I 00:00:03.065409 executorch:metadata_util.h:43] append_eos_to_prompt: 0
I 00:00:03.065411 executorch:metadata_util.h:41] The model does not contain enable_dynamic_shape method, using default value 0
I 00:00:03.065412 executorch:metadata_util.h:43] enable_dynamic_shape: 0
I 00:00:03.130388 executorch:metadata_util.h:43] get_vocab_size: 256000
I 00:00:03.130405 executorch:metadata_util.h:43] get_bos_id: 2
I 00:00:03.130408 executorch:metadata_util.h:43] get_eos_id: 1
My name is Melle. I am a 20 year old girl from Belgium. I am living in the southern part of Belgium. I am 165 cm tall and I weigh 45kg. I like to play sports like swimming, running and playing tennis. I am very interested in music and I like to listen to classical music. I like to sing and I can play the piano. I would like to go to the USA because I like to travel a lot. I am looking for a boy from the USA who is between 18 and 25 years old. I
PyTorchObserver {"prompt_tokens":4,"generated_tokens":118,"model_load_start_ms":1723685715497,"model_load_end_ms":1723685718612,"inference_start_ms":1723685718612,"inference_end_ms":1723685732965,"prompt_eval_end_ms":1723685719087,"first_token_ms":1723685719087,"aggregate_sampling_time_ms":182,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:17.482472 executorch:stats.h:70] 	Prompt Tokens: 4    Generated Tokens: 118
I 00:00:17.482475 executorch:stats.h:76] 	Model Load Time:		3.115000 (seconds)
I 00:00:17.482481 executorch:stats.h:86] 	Total inference time:		14.353000 (seconds)		 Rate: 	8.221278 (tokens/second)
I 00:00:17.482483 executorch:stats.h:94] 		Prompt evaluation:	0.475000 (seconds)		 Rate: 	8.421053 (tokens/second)
I 00:00:17.482485 executorch:stats.h:105] 		Generated 118 tokens:	13.878000 (seconds)		 Rate: 	8.502666 (tokens/second)
I 00:00:17.482486 executorch:stats.h:113] 	Time to first generated token:	0.475000 (seconds)
I 00:00:17.482488 executorch:stats.h:120] 	Sampling time over 122 tokens:	0.182000 (seconds)
```

Pull Request resolved: #4723

Reviewed By: huydhn, kirklandsign

Differential Revision: D62543933

Pulled By: guangy10

fbshipit-source-id: 00401a39ba03d7383e4b284d25c8fc62a6695b34
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024

* Make StaticCache configurable at model construct time

* integrations import structure

* add new doc file to toc

---------

Co-authored-by: Guang Yang <[email protected]>
Co-authored-by: Joao Gante <[email protected]>
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024

* Make StaticCache configurable at model construct time

* integrations import structure

* add new doc file to toc

---------

Co-authored-by: Guang Yang <[email protected]>
Co-authored-by: Joao Gante <[email protected]>