From a0aa057f2a7941006fbc1bc17e51b07765e44920 Mon Sep 17 00:00:00 2001 From: peterschmidt85 Date: Thu, 31 Aug 2023 16:31:31 +0200 Subject: [PATCH] - [Docs] Updated docs and examples to reflect the changes in `0.11.1` (part 2) --- README.md | 16 +++---- docs/blog/posts/multiple-clouds.md | 4 +- docs/examples/text-generation-inference.md | 53 ++++++++++++++++---- docs/examples/vllm.md | 24 ++++++---- docs/index.md | 2 +- docs/overrides/examples.html | 42 ++++++++-------- docs/overrides/home.html | 56 +++++++++++----------- mkdocs.yml | 4 +- setup.py | 3 +- 9 files changed, 122 insertions(+), 82 deletions(-) diff --git a/README.md b/README.md index 413177d89..0ee7b3f78 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@

-Train and deploy LLM models in multiple clouds
+Run LLM workloads across any clouds

@@ -23,18 +23,16 @@ Train and deploy LLM models in multiple clouds [![PyPI - License](https://img.shields.io/pypi/l/dstack?style=flat-square&color=blue)](https://github.com/dstackai/dstack/blob/master/LICENSE.md) -`dstack` is an open-source tool that enables the execution of LLM workloads -across multiple cloud providers – ensuring the best GPU price and availability. +`dstack` is an open-source toolkit for running LLM workloads across any clouds, offering a +cost-efficient and user-friendly interface for training, inference, and development. -Deploy services, run tasks, and provision dev environments -in a cost-effective manner across multiple cloud GPU providers. - -## Latest news +## Latest news ✨ - [2023/08] [Fine-tuning with Llama 2](https://dstack.ai/examples/finetuning-llama-2) (Example) - [2023/08] [An early preview of services](https://dstack.ai/blog/2023/08/07/services-preview) (Release) -- [2023/07] [Port mapping, max duration, and more](https://dstack.ai/blog/2023/07/25/port-mapping-max-duration-and-more) (Release) -- [2023/07] [Serving with vLLM](https://dstack.ai/examples/vllm) (Example) +- [2023/08] [Serving SDXL with FastAPI](https://dstack.ai/examples/stable-diffusion-xl) (Example) +- [2023/07] [Serving LLMS with TGI](https://dstack.ai/examples/text-generation-inference) (Example) +- [2023/07] [Serving LLMS with vLLM](https://dstack.ai/examples/vllm) (Example) ## Installation diff --git a/docs/blog/posts/multiple-clouds.md b/docs/blog/posts/multiple-clouds.md index ffda14189..adb436b75 100644 --- a/docs/blog/posts/multiple-clouds.md +++ b/docs/blog/posts/multiple-clouds.md @@ -7,7 +7,7 @@ categories: - Releases --- -# Discover GPU across multiple clouds +# Automatic GPU discovery across clouds __The 0.11 update significantly cuts GPU costs and boosts their availability.__ @@ -16,7 +16,7 @@ configured cloud providers and regions. -## Multiple clouds per project +## Multiple backends per project Now, `dstack` leverages price data from multiple configured cloud providers and regions to automatically suggest the most cost-effective options. diff --git a/docs/examples/text-generation-inference.md b/docs/examples/text-generation-inference.md index 8b8b207e6..589461762 100644 --- a/docs/examples/text-generation-inference.md +++ b/docs/examples/text-generation-inference.md @@ -31,13 +31,11 @@ Here's the configuration that uses services: ```yaml type: service -# This configuration deploys a given LLM model as an API image: ghcr.io/huggingface/text-generation-inference:latest env: - # (Required) Specify the name of the model - - MODEL_ID=tiiuae/falcon-7b + - MODEL_ID=NousResearch/Llama-2-7b-hf port: 8000 @@ -84,11 +82,50 @@ $ curl -X POST --location https://yellow-cat-1.mydomain.com \ -!!! info "Gated models" - To use a model with gated access, ensure configuring either the `HUGGING_FACE_HUB_TOKEN` secret - (using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)), - or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or - using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file). +### Gated models + +To use a model with gated access, ensure configuring either the `HUGGING_FACE_HUB_TOKEN` secret +(using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)), +or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or +using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file). + +

+ +```shell +$ dstack run . -f text-generation-inference/serve.dstack.yml --env HUGGING_FACE_HUB_TOKEN=<token> --gpu 24GB +``` +
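+
+Alternatively, you can store the token once as a secret so it doesn't have to be passed with every run. Here is a
+minimal sketch, assuming the `dstack secrets add NAME VALUE` form from the CLI reference linked above (the secret is
+then passed to the run as an environment variable):
+
+```shell
+# <token> is your Hugging Face access token; the exact CLI syntax may differ between versions
+$ dstack secrets add HUGGING_FACE_HUB_TOKEN <token>
+
+$ dstack run . -f text-generation-inference/serve.dstack.yml --gpu 24GB
+```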
+ +### Memory usage and quantization + +An LLM typically requires twice the GPU memory compared to its parameter count. For instance, a model with `13B` parameters +needs around `26GB` of GPU memory. To decrease memory usage and fit the model on a smaller GPU, consider using +quantization, which TGI offers as `bitsandbytes` and `gptq` methods. + +Here's an example of the Llama 2 13B model tailored for a `24GB` GPU (A10 or L4): + +
+ +```yaml +type: service + +image: ghcr.io/huggingface/text-generation-inference:latest + +env: + - MODEL_ID=TheBloke/Llama-2-13B-GPTQ + +port: 8000 + +commands: + - text-generation-launcher --hostname 0.0.0.0 --port 8000 --trust-remote-code --quantize gptq +``` + +
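+
+As a rough back-of-the-envelope check (assuming about 2 bytes per parameter in `fp16`, about 0.5 bytes per parameter
+for 4-bit GPTQ weights, and ignoring activation and KV-cache overhead):
+
+```shell
+$ python3 -c "print(13e9 * 2 / 1e9, 'GB')"    # 13B parameters in fp16
+26.0 GB
+
+$ python3 -c "print(13e9 * 0.5 / 1e9, 'GB')"  # 13B parameters with 4-bit GPTQ
+6.5 GB
+```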
+ +A similar approach allows running the Llama 2 70B model on an `80GB` GPU (A100). + +To calculate the exact GPU memory required for a specific model with different quantization methods, you can use the +[hf-accelerate/memory-model-usage](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) Space. ??? info "Dev environments" diff --git a/docs/examples/vllm.md b/docs/examples/vllm.md index 7a2177abf..fe4f4d5d8 100644 --- a/docs/examples/vllm.md +++ b/docs/examples/vllm.md @@ -31,12 +31,10 @@ Here's the configuration that uses services to run an LLM as an OpenAI-compatibl ```yaml type: service -# (Optional) If not specified, it will use your local version python: "3.11" env: - # (Required) Specify the name of the model - - MODEL=facebook/opt-125m + - MODEL=NousResearch/Llama-2-7b-hf port: 8000 @@ -75,7 +73,7 @@ Once the service is up, you can query the endpoint: $ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "facebook/opt-125m", + "model": "NousResearch/Llama-2-7b-hf", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0 @@ -84,10 +82,18 @@ $ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/completions \ -!!! info "Gated models" - To use a model with gated access, ensure configuring either the `HUGGING_FACE_HUB_TOKEN` secret - (using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)), - or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or - using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file). +### Gated models + +To use a gated-access model from Hugging Face Hub, make sure to set up either the `HUGGING_FACE_HUB_TOKEN` secret +(using [`dstack secrets`](../docs/reference/cli/secrets.md#dstack-secrets-add)), +or environment variable (with [`--env`](../docs/reference/cli/run.md#ENV) in `dstack run` or +using [`env`](../docs/reference/dstack.yml/service.md#env) in the configuration file). + +
+ +```shell +$ dstack run . -f vllm/serve.dstack.yml --env HUGGING_FACE_HUB_TOKEN=<token> --gpu 24GB +``` +
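+
+Alternatively, the token can be stored once as a secret (assuming the `dstack secrets add NAME VALUE` form from the
+CLI reference linked above; the secret is then passed to the run as an environment variable), so it doesn't need to
+be passed with `--env` on each run:
+
+```shell
+# <token> is your Hugging Face access token; the exact CLI syntax may differ between versions
+$ dstack secrets add HUGGING_FACE_HUB_TOKEN <token>
+
+$ dstack run . -f vllm/serve.dstack.yml --gpu 24GB
+```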
[Source code](https://github.com/dstackai/dstack-examples){ .md-button .md-button--github } \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 133072ed8..725191658 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,6 +1,6 @@ --- template: home.html -title: Train and deploy LLM models in multiple clouds +title: Run LLM workloads across any clouds hide: - navigation - toc diff --git a/docs/overrides/examples.html b/docs/overrides/examples.html index 389e8c4ef..a739101f0 100644 --- a/docs/overrides/examples.html +++ b/docs/overrides/examples.html @@ -9,7 +9,7 @@

Examples

- +