Add guide to deploy TGI to SM #1409

Merged · 10 commits · Sep 9, 2024
93 changes: 93 additions & 0 deletions docs/sagemaker/inference.md
@@ -320,6 +320,99 @@ The `input.jsonl` looks like this:

📓 Open the [sagemaker-notebook.ipynb notebook](https://github.com/huggingface/notebooks/blob/main/sagemaker/12_batch_transform_inference/sagemaker-notebook.ipynb) for an example of how to run a batch transform job for inference.

## Deploy an LLM to SageMaker using TGI

If you are interested in using a high-performance serving container for LLMs, you can use the Hugging Face TGI container, which utilizes the [Text Generation Inference](https://github.com/huggingface/text-generation-inference) library. A list of compatible models can be found [here](https://huggingface.co/docs/text-generation-inference/supported_models#supported-models).

First, make sure that the latest version of the SageMaker SDK is installed:

```bash
pip install "sagemaker>=2.161.0"
```

Then, we import the SageMaker Python SDK and instantiate a `sagemaker_session` to determine the current region and the execution role.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
```

Next, we retrieve the LLM image URI. We use the helper function `get_huggingface_llm_image_uri()` to generate the appropriate image URI for Hugging Face Large Language Model (LLM) inference. The function takes a required `backend` parameter and several optional parameters. The `backend` specifies the type of backend to use for the model: `"huggingface"` refers to using the Hugging Face TGI backend.

```python
image_uri = get_huggingface_llm_image_uri(
    backend="huggingface",
    region=region
)
```

Now that we have the image URI, the next step is to configure the model object. We specify a unique name, the `image_uri` for the managed TGI container, and the execution role for the endpoint. Additionally, we specify a number of environment variables, including `HF_MODEL_ID`, which corresponds to the model from the Hugging Face Hub that will be deployed, and `HF_TASK`, which configures the inference task to be performed by the model.

You should also define `SM_NUM_GPUS`, which specifies the tensor parallelism degree of the model. Tensor parallelism can be used to split the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. To learn more about tensor parallelism with inference, see our previous blog post. Here, you should set `SM_NUM_GPUS` to the number of available GPUs on your selected instance type. For example, in this tutorial, we set `SM_NUM_GPUS` to 4 because our selected instance type ml.g4dn.12xlarge has 4 available GPUs.

Note that you can optionally reduce the memory and computational footprint of the model by setting the `HF_MODEL_QUANTIZE` environment variable to `true`, but this lower weight precision could affect the quality of the output for some models.

```python
model_name = "gpt-neox-20b-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID': 'EleutherAI/gpt-neox-20b',
    'HF_TASK': 'text-generation',
    'SM_NUM_GPUS': '4',
    'HF_MODEL_QUANTIZE': 'true'
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)
```

Next, we invoke the deploy method to deploy the model.

```python
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.12xlarge",
    endpoint_name=model_name
)
```
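
Deploying a 20B-parameter model can take several minutes. As a supplementary, illustrative sketch (not part of the original walkthrough), you can re-attach a predictor to an already-running endpoint by name, for example after a notebook kernel restart, using the SDK's `HuggingFacePredictor` class:

```python
from sagemaker.huggingface.model import HuggingFacePredictor

# Re-attach to the running endpoint by name (e.g. after a kernel restart).
predictor = HuggingFacePredictor(
    endpoint_name=model_name,
    sagemaker_session=sagemaker_session
)
```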

Once the model is deployed, we can invoke it to generate text. We pass an input prompt and run the predict method to generate a text response from the LLM running in the TGI container.

```python
input_data = {
"inputs": "The diamondback terrapin was the first reptile to",
"parameters": {
"do_sample": True,
"max_new_tokens": 100,
"temperature": 0.7,
"watermark": True
}
}

predictor.predict(input_data)
```

We receive the following auto-generated text response:

```python
[{'generated_text': 'The diamondback terrapin was the first reptile to make the list, followed by the American alligator, the American crocodile, and the American box turtle. The polecat, a ferret-like animal, and the skunk rounded out the list, both having gained their slots because they have proven to be particularly dangerous to humans.\n\nCalifornians also seemed to appreciate the new list, judging by the comments left after the election.\n\n“This is fantastic,” one commenter declared.\n\n“California is a very'}]
```

Once we are done experimenting, we delete the endpoint and the model resources.

```python
predictor.delete_model()
predictor.delete_endpoint()
```

## User defined code and modules

The Hugging Face Inference Toolkit allows the user to override the default methods of the `HuggingFaceHandlerService`. You will need to create a folder named `code/` with an `inference.py` file in it. See [here](#create-a-model-artifact-for-deployment) for more details on how to archive your model artifacts. For example:
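
The snippet below is a minimal, illustrative sketch of what such an `inference.py` might look like, assuming a simple `text-classification` pipeline; it only overrides `model_fn` and `predict_fn`, and the task and loading logic are assumptions rather than the exact example from the repository:

```python
# code/inference.py -- illustrative sketch only; the task and pipeline usage are assumptions.
from transformers import pipeline

def model_fn(model_dir):
    # Called once at container start-up to load the model from the unpacked artifact.
    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)

def predict_fn(data, model):
    # Called for every request; `data` is the deserialized request body.
    inputs = data.pop("inputs", data)
    return model(inputs)
```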