Export to ExecuTorch #32253
Comments
Thank you for detailing ExecuTorch's goals 🤗 Two follow-up questions:
@gante Thanks for the great follow-up questions: For #1, yes, if we can pass/override the config while loading the pretrained model, e.g. For #2, yes, I understand there are use cases where making the cache config closer to auto-regressive generation is cleaner. The KV cache config can still be passed through
Hey, saw your comments from another PR and wanted to share that I was thinking of making the cache config savable/loadable the same way as the generation config. It will hold all the needed args for all cache types, and loading a model
I'm quite biased towards keeping the cache config inside the generation config. But happy to reconsider if there are strong arguments to keep them separate :)
Wait, I just realized that we will save the cache config even if it's inside the generation config, so it will be loadable from the Hub. Okay, that makes sense, thanks!
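To make that concrete, here is a small sketch of saving and re-loading cache settings through the generation config; the cache_implementation and cache_config fields are assumed to carry the cache setup, and the model id and values are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Illustrative: attach cache settings to the generation config and save them,
# so they travel with the model and can be re-loaded from the Hub.
generation_config = GenerationConfig(
    max_new_tokens=64,
    cache_implementation="static",         # assumed field carrying the cache type
    cache_config={"max_cache_len": 1024},  # assumed field carrying cache arguments
)
generation_config.save_pretrained("my-model-dir")

# Later (or elsewhere), the cache setup is restored with the generation config.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
reloaded = GenerationConfig.from_pretrained("my-model-dir")

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=reloaded)
```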
@gante, I see there are two orthogonal things in your and @zucchini-nlp's comments. Let's get more clarity on it:
It would take the cache config to construct the model. This is a new feature needed in order to support
So this is about whether
Yes :)
We do indeed have support for quantized caches! Their quantization configuration is set at initialization time, so it will belong in the cache config as well :) (we can have, e.g., an FP16 model and a quantized cache, to support very long generation)
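As an illustration of the call site, a sketch assuming the quantized-cache options exposed through generate (model id and quantization values are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype="float16")

inputs = tokenizer("A very long prompt ...", return_tensors="pt")

# FP16 weights + quantized KV cache: the cache's quantization settings are
# fixed at initialization time, which is why they belong in the cache config.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},  # illustrative values
)
```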
This could eventually also enable AOTI compilation:
Yeah, we have a proof-of-concept that makes AOTI a backend of ExecuTorch, enabling users to utilize desktop GPU, HTP, and CPU altogether on a desktop where all these accelerators are available.
@guangy10 Or am I missing something with how encoder/decoder architectures should best be implemented in ExecuTorch?
Yeah, you can export the model to multiple artifacts. Here is an example of how another encoder-decoder model (T5) is supported: https://github.com/huggingface/transformers/blob/d9e6f307e71b5108a7882ec00ffcc0d0eb316cb7/tests/models/t5/test_modeling_t5.py#L1650-L1706. The example uses torch.compile; the idea would be the same for using torch.export to ExecuTorch.
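For illustration, a rough sketch of the multiple-artifacts idea with torch.export on T5; the split into separate encoder and decoder programs is the point, while the exact inputs and whether each submodule exports cleanly as written are assumptions:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small").eval()

inputs = tokenizer("translate English to German: Hello there", return_tensors="pt")
decoder_start = torch.tensor([[model.config.decoder_start_token_id]])

# Artifact 1: the encoder as its own exported program.
encoder_ep = torch.export.export(model.get_encoder(), args=(inputs["input_ids"],))

# Artifact 2: the decoder, traced with the encoder's hidden states as an input.
encoder_hidden = model.get_encoder()(inputs["input_ids"]).last_hidden_state
decoder_ep = torch.export.export(
    model.get_decoder(),
    args=(),
    kwargs={
        "input_ids": decoder_start,
        "encoder_hidden_states": encoder_hidden,
    },
)
```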
Feature request
Unlock a new workflow for on-device use-cases via torch.export and ExecuTorch.
So ideally users can have an e2e experience by loading a pretrained transformer model from HuggingFace, exporting and lowering it to ExecuTorch, and getting reasonable performance out of the box. For example:
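A minimal sketch of this step, assuming the model is torch.export-compatible (e.g., via the StaticCache work tracked below); the checkpoint and example inputs are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any export-compatible decoder-only model works similarly.
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Example inputs, used only to trace the forward pass for export.
example_inputs = tokenizer("Hello, on-device world!", return_tensors="pt")

exported_program = torch.export.export(
    model,
    args=(),
    kwargs={"input_ids": example_inputs["input_ids"]},
)
```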
and then further lower the exported program to ExecuTorch with delegates for performance:
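A sketch of the lowering step, assuming ExecuTorch's to_edge API and the XNNPACK partitioner (module paths may differ across ExecuTorch releases); it picks up the exported_program from the sketch above:

```python
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Convert the torch.export program from the previous step to ExecuTorch's edge
# dialect, delegate supported subgraphs to the XNNPACK backend, and serialize.
edge_program = to_edge(exported_program)
edge_program = edge_program.to_backend(XnnpackPartitioner())
executorch_program = edge_program.to_executorch()

with open("model_xnnpack.pte", "wb") as f:
    f.write(executorch_program.buffer)
```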
With that you may get an on-device model with reasonable performance to start with.
From there, and still within the ExecuTorch stack, you can easily tailor the experience for your use-cases, of course with better performance! Note that ExecuTorch supports delegation to the XNNPACK backend, Apple Core ML and MPS, Qualcomm QNN, ARM Ethos-U, Vulkan GPU and more. You can learn more by reading our tutorial.
The example workflow above shows direct integration between ExecuTorch and HF transformers models. Eventually this workflow could be accessible via optimum exporters-et, Transformers.js, or in ExecuTorch and torchchat.
Motivation
Unlock a whole new on-device experience of using HuggingFace models w/o leaving the PyTorch ecosystem (ExecuTorch is native PyTorch!)
Issues Tracker
Cache
- StaticCache compatible with torch.export: PR Make static cache compatible with torch.export #32168
- StaticCache: PR [WIP] Dynamic length in static cache #30862
- generate (inference) for torch exported text-generation models #32504: PR Generate using exported model and enable gemma2-2b in ExecuTorch #33707
E2E workflow
- Optimum: Export-to-ExecuTorch via optimum integration #32511: Export to ExecuTorch: Initial Integration optimum#2090
- Transformers.js: Export-to-ExecuTorch via transformers.js integration transformers.js#1039
Optimization
Models
And more! We're ambitious about expanding the model coverage massively. Please comment below if you are interested in a particular model for an on-device use-case!
Your contribution
generate for exported model and the integration in Optimum
Here is how ExecuTorch implements the generate() for llama2/3 in eager Python and C++.
cc: @amyeroberts @gante @ArthurZucker @michaelbenayoun
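For reference, a minimal eager-Python sketch of greedy decoding over an exported program; it assumes the exported forward takes (input_ids, cache_position) and returns logits with a static KV cache baked in, which is an assumed interface rather than ExecuTorch's actual implementation:

```python
import torch

def greedy_generate(exported_program, prompt_ids: torch.Tensor, max_new_tokens: int = 32):
    """Greedy decoding over a torch.export-ed model with a static KV cache.

    Assumes the exported forward signature is (input_ids, cache_position) -> logits.
    """
    module = exported_program.module()
    tokens = prompt_ids[0].tolist()

    # Prefill: run the whole prompt once to populate the KV cache.
    cache_position = torch.arange(prompt_ids.shape[-1], dtype=torch.long)
    logits = module(prompt_ids, cache_position)
    next_token = int(torch.argmax(logits[:, -1, :], dim=-1))
    tokens.append(next_token)

    # Decode: feed one token at a time, advancing the cache position.
    for _ in range(max_new_tokens - 1):
        pos = torch.tensor([len(tokens) - 1], dtype=torch.long)
        logits = module(torch.tensor([[next_token]]), pos)
        next_token = int(torch.argmax(logits[:, -1, :], dim=-1))
        tokens.append(next_token)

    return tokens
```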