
TGI: export model if configuration is cached #445

Merged: 9 commits merged into main from tgi_export_model on Jan 30, 2024

Conversation

dacorvo (Collaborator) commented on Jan 26, 2024

This pull-request is twofold.

First, it modifies the cache registry to use the model_type (i.e. 'llama', 'mistral', ...) as the primary key for lookups.
This makes it possible to detect compatible configurations for models with different weights, or even for local models.
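
A minimal sketch of the idea, with purely illustrative names and entries (the actual registry structure in optimum-neuron may differ):

```python
from typing import Any, Dict, List

# Illustrative registry: compilation configurations are keyed on model_type,
# not on a specific checkpoint name.
CACHED_CONFIGURATIONS: Dict[str, List[Dict[str, Any]]] = {
    "llama": [
        {"batch_size": 1, "sequence_length": 4096, "num_cores": 2, "auto_cast_type": "fp16"},
    ],
    "mistral": [],
}


def lookup_compatible_configs(model_type: str) -> List[Dict[str, Any]]:
    """Return the cached compilation configurations recorded for this architecture."""
    return CACHED_CONFIGURATIONS.get(model_type, [])


# meta-llama/Llama-2-7b-hf and meta-llama/Llama-2-7b-chat-hf both report
# config.model_type == "llama", so they resolve to the same cached entries.
print(lookup_compatible_configs("llama"))
```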

Second, it modifies the NeuronX TGI server to export models under specific conditions.

Instead of raising an error if the model passed to NeuronX TGI is not a neuron model, we now try to look up cached artifacts in the hub cache.

If a compatible cached configuration is detected in the cache registry, the model is exported by the TGI server using cached artifacts (and thus in a much shorter time than a vanilla compilation).
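
A rough sketch of that decision path; every name below is hypothetical and only stands in for the actual NeuronX TGI server code:

```python
import os
from typing import Any, Dict, Optional


def find_cached_config(model_type: str) -> Optional[Dict[str, Any]]:
    # Stand-in for a lookup in the hub cache registry, keyed on model_type
    # (see the registry sketch above).
    registry = {"llama": [{"batch_size": 1, "sequence_length": 4096, "num_cores": 2}]}
    entries = registry.get(model_type, [])
    return entries[0] if entries else None


def prepare_model(model_path: str, model_type: str) -> str:
    # Hypothetical marker file: assume a pre-compiled model ships its neuron
    # artifacts alongside the config.
    if os.path.exists(os.path.join(model_path, "model.neuron")):
        return "load the neuron model directly"
    cached = find_cached_config(model_type)
    if cached is not None:
        # A compatible cached configuration exists: export using the cached
        # artifacts, which is much faster than a vanilla compilation.
        return f"export using cached artifacts: {cached}"
    raise SystemError(f"{model_path} is not a neuron model and no cached configuration was found.")
```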

The TGI server is extended with new environment variables to define the export configuration (see the sketch after this list):

  • HF_BATCH_SIZE, defaults to 1,
  • HF_SEQUENCE_LENGTH, defaults to the model maximum,
  • HF_NUM_CORES, defaults to all cores,
  • HF_AUTO_CAST_TYPE, defaults to config.torch_dtype.
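
A minimal sketch of how these variables could be read with the defaults listed above; the attribute fallbacks (max_position_embeddings, torch_dtype) and the helper itself are assumptions, not the actual server code:

```python
import os
from typing import Any, Dict


def export_config_from_env(model_config) -> Dict[str, Any]:
    """Build the export configuration from the environment, applying the documented defaults."""
    return {
        "batch_size": int(os.environ.get("HF_BATCH_SIZE", "1")),
        # Defaults to the model maximum sequence length.
        "sequence_length": int(
            os.environ.get("HF_SEQUENCE_LENGTH", getattr(model_config, "max_position_embeddings", 2048))
        ),
        # None is interpreted here as "use all available Neuron cores".
        "num_cores": int(os.environ["HF_NUM_CORES"]) if "HF_NUM_CORES" in os.environ else None,
        # Defaults to the dtype declared in the model config.
        "auto_cast_type": os.environ.get("HF_AUTO_CAST_TYPE", str(getattr(model_config, "torch_dtype", None))),
    }
```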

This makes it possible to identify cached configurations that can be applied to models
that differ only by their weights, such as meta-llama/Llama-2-7b-hf and
meta-llama/Llama-2-7b-chat-hf.
It also makes it possible to look up cached configurations for local model folders
containing a model config.
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

dacorvo marked this pull request as ready for review on January 29, 2024 at 17:02
This means that every time you train or export a model on a new host, you need to recompile it, which takes a lot of time.
Before loading a Transformers or Diffusion model on Neuron platforms, it needs to be exported to neuron format with [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx).
When exporting a model, [`torch-neuronx`] will:
- convert it to a set of [TorchScript](https://pytorch.org/docs/stable/jit.html) subgraphs, stored as `model.hlo.pb` files,
Member:

Is this true? Since the file extension is HLO, are these really TorchScript graphs?

Collaborator Author (dacorvo):

I will say "XLA JIT subgraphs", as in their documentation, and remove the mention of model.hlo.pb, which might be transformers-neuronx specific.


It is important to understand that the cache operates on NEFF binaries, and not on the model itself.

As explained previously, each model is first converted to a set of [TorchScript](https://pytorch.org/docs/stable/jit.html) subgraphs.
Member:

It is not true for training.

For training we use lazy tensors to accumulate subgraphs as HLO IR. For inference, I think we accumulate the model graph via tracing, but end up with an HLO IR as well. The only difference is how the program is first recorded.

Collaborator Author (dacorvo):

For inference you mean when calling trace: this is different for transformers-neuronx, which explicitly creates an HLO IR.

Member:

Yes, but my point is that the path to get to an HLO IR differs between training and inference.

Collaborator Author (dacorvo):

My new formulation should be vague enough to cover both cases: what do you think?

Member:

Yes perfect!

raise SystemError("Decoder models can only be exported on a neuron platform.")

auto_cast_type: Optional[str] = None,
) -> Dict[str, Any]:
Member:

It seems you return a config, not a dictionary?

Args:
    model_id (`str`):
        The model id, used as a key for the cache entry.
    config (`~transformers.PretrainedConfig`):
Member:

Does this format for type annotation work with the doc builder?

JingyaHuang (Collaborator) left a comment:

LGTM, just left some nits!

dacorvo merged commit c114fc8 into main on Jan 30, 2024
6 of 7 checks passed
dacorvo deleted the tgi_export_model branch on January 30, 2024 at 16:03