TGI: export model if configuration is cached #445
Conversation
Force-pushed from a685b71 to 7cb9a5a
This allows identifying cached configurations that can be applied to models that differ only by their weights, such as meta-llama/Llama-2-7b-hf and meta-llama/Llama-2-7b-chat-hf. It also allows looking up cached configurations for local model folders containing a model config.
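To make the idea concrete, here is a minimal sketch of keying the cache on the model configuration rather than on the checkpoint itself (the helper name and field filtering are hypothetical, not the PR's actual code):

```python
# Hypothetical sketch: two checkpoints that share the same architecture
# configuration map to the same cache key, regardless of their weights or of
# whether they live in a local folder.
import hashlib
import json

from transformers import AutoConfig


def config_cache_key(model_name_or_path: str) -> str:
    """Hypothetical helper: derive a cache key from the model config alone."""
    config = AutoConfig.from_pretrained(model_name_or_path)
    config_dict = config.to_diff_dict()
    # Drop checkpoint-specific metadata so only architecture parameters remain.
    config_dict.pop("_name_or_path", None)
    serialized = json.dumps(config_dict, sort_keys=True, default=str)
    return hashlib.sha256(serialized.encode()).hexdigest()


# meta-llama/Llama-2-7b-hf and meta-llama/Llama-2-7b-chat-hf share a config,
# so they would resolve to the same cached compilation artifacts.
```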
Force-pushed from 7cb9a5a to 0ed7b77
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
docs/source/guides/cache_system.mdx
Outdated
This means that every time you train or export a model on a new host, you need to recompile it, which takes a lot of time.
Before loading a Transformers or Diffusion model on Neuron platforms, it needs to be exported to neuron format with [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx).
When exporting a model, [`torch-neuronx`] will:
- convert it to a set of [TorchScript](https://pytorch.org/docs/stable/jit.html) subgraphs, stored as `model.hlo.pb` files,
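For context, a minimal sketch of the export step the quoted docs describe, using standard torch-neuronx tracing (illustrative only, assuming a simple torchvision model; this is not code from the PR):

```python
import torch
import torch_neuronx
from torchvision import models

model = models.resnet50(weights=None).eval()
example_inputs = torch.rand(1, 3, 224, 224)

# torch_neuronx.trace compiles the model for Neuron and returns a module that
# can be saved and reloaded like a regular TorchScript module.
neuron_model = torch_neuronx.trace(model, example_inputs)
torch.jit.save(neuron_model, "model_neuron.pt")
```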
Is it true? Since the file extension is HLO, are they really TorchScript graphs?
I will say XLA JIT subgraphs, as in their documentation, and remove the mention of model.hlo.pb, which might be transformers-neuronx specific.
docs/source/guides/cache_system.mdx
Outdated
It is important to understand that the cache operates on NEFF binaries, and not on the model itself.
As explained previously, each model is first converted to a set of [TorchScript](https://pytorch.org/docs/stable/jit.html) subgraphs.
It is not true for training.
For training we use lazy tensors to accumulate subgraphs as HLO IR. For inference, I think the model graph is accumulated via tracing, but it ends up as an HLO IR as well. The only difference is how the program is first recorded.
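For readers following the discussion, a minimal sketch of the lazy-tensor recording path mentioned here, using plain torch_xla (illustrative, not code from this repository):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(4, 4, device=device)
y = torch.randn(4, 4, device=device)

# Operations on an XLA device are recorded lazily into a pending HLO graph
# rather than executed eagerly.
z = x @ y + 1.0

# mark_step() cuts the graph: the accumulated HLO is compiled and executed.
xm.mark_step()
```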
For inference you mean when calling trace: this is different for transformers-neuronx, which explicitly creates an HLO IR.
Yes, but my point is that the path to get to an HLO IR is different depending on training / inference.
My new formulation should be vague enough to cover both cases: what do you think?
Yes, perfect!
optimum/neuron/modeling_decoder.py
Outdated
raise SystemError("Decoder models can only be exported on a neuron platform.")
auto_cast_type: Optional[str] = None,
) -> Dict[str, Any]:
It seems you return a config, not a dictionary?
Args:
    model_id (`str`):
        The model id, used as a key for the cache entry.
    config (`~transformers.PretrainedConfig`):
Does this format for type annotation work with the doc builder?
LGTM, just left some nits!
This pull request is two-fold.
First, it modifies the cache registry to use the `model_type` (i.e. `llama`, `mistral`, ...) as the primary key for lookups. This allows detecting compatible configurations for models with different weights, or even for local models.
Second, it modifies the NeuronX TGI server to export models under specific conditions.
Instead of raising an error if the model passed to NeuronX TGI is not a neuron model, we now try to look up cached artifacts in the hub cache.
If a compatible cached configuration is detected in the cache registry, the model is exported by the TGI server using cached artifacts (and thus in a much shorter time than a vanilla compilation).
The TGI server is extended with new environment variables to define the configuration for the export:
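As a rough illustration of the model_type-keyed lookup described in this PR (the helper, registry layout, and matching rule below are hypothetical, not the actual implementation; the environment variable list itself is omitted from this excerpt):

```python
# Hypothetical sketch: look up compatible cached compilation configs keyed by
# model_type, matching on architecture parameters rather than on weights.
from typing import Any, Dict, List, Optional

from transformers import AutoConfig, PretrainedConfig


def find_cached_entry(
    registry: Dict[str, List[Dict[str, Any]]], config: PretrainedConfig
) -> Optional[Dict[str, Any]]:
    for entry in registry.get(config.model_type, []):
        # Compatible if the registered architecture parameters match the model
        # config, whichever checkpoint (hub or local) provides the weights.
        if all(getattr(config, k, None) == v for k, v in entry["arch"].items()):
            return entry
    return None


# Gated repo: loading this config may require Hub authentication.
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
if (entry := find_cached_entry(registry={}, config=config)) is not None:
    # Here the TGI server would export the model, reusing cached NEFF
    # artifacts instead of compiling from scratch.
    ...
```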