TGI: export model if configuration is cached #445
Conversation
Force-pushed from a685b71 to 7cb9a5a
This allows identifying cached configurations that can be applied to models that differ only by their weights, such as meta-llama/Llama-2-7b-hf and meta-llama/Llama-2-7b-chat-hf. It also allows looking up cached configurations for local model folders containing a model config.
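To make the idea concrete, here is a minimal sketch of keying the cache on the model configuration rather than on the checkpoint itself (the helper name and field filtering are hypothetical, not the PR's actual code):

```python
# Hypothetical sketch: two checkpoints that share the same architecture
# configuration map to the same cache key, regardless of their weights or of
# whether they live in a local folder.
import hashlib
import json

from transformers import AutoConfig


def config_cache_key(model_name_or_path: str) -> str:
    """Hypothetical helper: derive a cache key from the model config alone."""
    config = AutoConfig.from_pretrained(model_name_or_path)
    config_dict = config.to_diff_dict()
    # Drop checkpoint-specific metadata so only architecture parameters remain.
    config_dict.pop("_name_or_path", None)
    serialized = json.dumps(config_dict, sort_keys=True, default=str)
    return hashlib.sha256(serialized.encode()).hexdigest()


# meta-llama/Llama-2-7b-hf and meta-llama/Llama-2-7b-chat-hf share a config,
# so they would resolve to the same cached compilation artifacts.
```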
Force-pushed from 7cb9a5a to 0ed7b77
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
docs/source/guides/cache_system.mdx
Outdated
This means that every time you train or export a model on a new host, you need to recompile it, which takes a lot of time.
Before loading a Transformers or Diffusion model on Neuron platforms, it needs to be exported to neuron format with [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx).
When exporting a model, [`torch-neuronx`] will:
- convert it to a set of [TorchScript](https://pytorch.org/docs/stable/jit.html) subgraphs, stored as `model.hlo.pb` files,
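For context, a minimal sketch of the export step the quoted docs describe, using standard torch-neuronx tracing (illustrative only, assuming a simple torchvision model; this is not code from the PR):

```python
import torch
import torch_neuronx
from torchvision import models

model = models.resnet50(weights=None).eval()
example_inputs = torch.rand(1, 3, 224, 224)

# torch_neuronx.trace compiles the model for Neuron and returns a module that
# can be saved and reloaded like a regular TorchScript module.
neuron_model = torch_neuronx.trace(model, example_inputs)
torch.jit.save(neuron_model, "model_neuron.pt")
```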
Is it true? Since the file extension is HLO, are they really TorchScript graphs?
I will say XLA JIT subgraphs, as in their documentation, and remove the mention of model.hlo.pb, which might be transformers-neuronx specific.
docs/source/guides/cache_system.mdx
Outdated
It is important to understand that the cache operates on NEFF binaries, and not on the model itself.
As explained previously, each model is first converted to a set of [TorchScript](https://pytorch.org/docs/stable/jit.html) subgraphs.
It is not true for training.
For training we use lazy tensors to accumulate subgraphs as HLO IR. For inference, I think the model graph is accumulated via tracing, but it ends up as an HLO IR as well. The only difference is how the program is first recorded.
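For readers following the discussion, a minimal sketch of the lazy-tensor recording path mentioned here, using plain torch_xla (illustrative, not code from this repository):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(4, 4, device=device)
y = torch.randn(4, 4, device=device)

# Operations on an XLA device are recorded lazily into a pending HLO graph
# rather than executed eagerly.
z = x @ y + 1.0

# mark_step() cuts the graph: the accumulated HLO is compiled and executed.
xm.mark_step()
```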
For inference you mean when calling trace: this is different for transformers-neuronx, which explicitly creates an HLO IR.
Yes, but my point is that the path to get to an HLO IR is different depending on training / inference.
My new formulation should be vague enough to cover both cases: what do you think?
Yes, perfect!
optimum/neuron/modeling_decoder.py
Outdated
raise SystemError("Decoder models can only be exported on a neuron platform.")
auto_cast_type: Optional[str] = None,
) -> Dict[str, Any]:
It seems you return a config, not a dictionary?
Args:
    model_id (`str`):
        The model id, used as a key for the cache entry.
    config (`~transformers.PretrainedConfig`):
Does this format for type annotation work with the doc builder?
LGTM, just left some nits!
This pull request is two-fold.
First, it modifies the cache registry to use the `model_type` (i.e. `llama`, `mistral`, ...) as the primary key for lookups. This allows detecting compatible configurations for models with different weights, or even for local models.
Second, it modifies the NeuronX TGI server to export models under specific conditions.
Instead of raising an error if the model passed to NeuronX TGI is not a neuron model, we now try to look up cached artifacts in the hub cache.
If a compatible cached configuration is detected in the cache registry, the model is exported by the TGI server using cached artifacts (and thus in a much shorter time than a vanilla compilation).
The TGI server is extended with new environment variables to define the configuration for the export:
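As a rough illustration of the model_type-keyed lookup described in this PR (the helper, registry layout, and matching rule below are hypothetical, not the actual implementation; the environment variable list itself is omitted from this excerpt):

```python
# Hypothetical sketch: look up compatible cached compilation configs keyed by
# model_type, matching on architecture parameters rather than on weights.
from typing import Any, Dict, List, Optional

from transformers import AutoConfig, PretrainedConfig


def find_cached_entry(
    registry: Dict[str, List[Dict[str, Any]]], config: PretrainedConfig
) -> Optional[Dict[str, Any]]:
    for entry in registry.get(config.model_type, []):
        # Compatible if the registered architecture parameters match the model
        # config, whichever checkpoint (hub or local) provides the weights.
        if all(getattr(config, k, None) == v for k, v in entry["arch"].items()):
            return entry
    return None


# Gated repo: loading this config may require Hub authentication.
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
if (entry := find_cached_entry(registry={}, config=config)) is not None:
    # Here the TGI server would export the model, reusing cached NEFF
    # artifacts instead of compiling from scratch.
    ...
```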