Document Sync by Tina
Chivier committed Dec 9, 2024
1 parent 61e7efa commit 001cc0d
Showing 2 changed files with 25 additions and 2 deletions.
25 changes: 23 additions & 2 deletions docs/stable/cli/cli_api.md
@@ -106,16 +106,37 @@ This file can be incomplete, and missing sections will be filled in by the default configuration.
     "metric": "concurrency",
     "target": 1,
     "min_instances": 0,
-    "max_instances": 10
+    "max_instances": 10,
+    "keep_alive": 0
   },
   "backend_config": {
     "pretrained_model_name_or_path": "facebook/opt-1.3b",
     "device_map": "auto",
-    "torch_dtype": "float16"
+    "torch_dtype": "float16",
+    "hf_model_class": "AutoModelForCausalLM"
   }
 }
 ```

Below is a description of all the fields in `config.json`.

| Field | Description |
| ----- | ----------- |
| model | A HuggingFace model name, used to identify the model instance. |
| backend | Inference engine; `transformers` and `vllm` are currently supported. |
| num_gpus | Number of GPUs used to deploy a model instance. |
| auto_scaling_config | Configuration for auto-scaling. |
| auto_scaling_config.metric | Metric used to decide whether to scale up or down. |
| auto_scaling_config.target | Target value of the metric. |
| auto_scaling_config.min_instances | The minimum number of model instances. |
| auto_scaling_config.max_instances | The maximum number of model instances. |
| auto_scaling_config.keep_alive | How long a model instance stays alive after an inference ends. For example, if `keep_alive` is set to 30, the instance waits 30 seconds after the inference ends for another request before it is released. |
| backend_config | Configuration for the inference backend. |
| backend_config.pretrained_model_name_or_path | The path to load the model from; this can be a HuggingFace model name or a local path. |
| backend_config.device_map | Device map used to load the model; `auto` is suitable for most scenarios. |
| backend_config.torch_dtype | Torch dtype of the model. |
| backend_config.hf_model_class | HuggingFace model class. |
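
As an illustration, here is one way a complete `config.json` for a locally stored model might look. This is a sketch: the local path `/models/opt-1.3b` is a placeholder, and the field values simply echo the table above rather than recommended settings.

```bash
# Sketch: write a config.json that loads a model from a local path.
# "/models/opt-1.3b" is a placeholder; point it at your own model directory.
cat > config.json <<'EOF'
{
  "model": "facebook/opt-1.3b",
  "backend": "transformers",
  "num_gpus": 1,
  "auto_scaling_config": {
    "metric": "concurrency",
    "target": 1,
    "min_instances": 0,
    "max_instances": 10,
    "keep_alive": 30
  },
  "backend_config": {
    "pretrained_model_name_or_path": "/models/opt-1.3b",
    "device_map": "auto",
    "torch_dtype": "float16",
    "hf_model_class": "AutoModelForCausalLM"
  }
}
EOF
```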

### sllm-cli delete
Delete deployed models by name.
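
A typical invocation might look like the following (a sketch, assuming `delete` takes the model name used at deployment as an argument):

```bash
# Sketch: remove a previously deployed model by its name.
sllm-cli delete facebook/opt-1.3b
```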

2 changes: 2 additions & 0 deletions docs/stable/getting_started/quickstart.md
@@ -67,6 +67,8 @@ conda activate sllm
sllm-cli deploy --model facebook/opt-1.3b
```

This will download the model from HuggingFace. If you want to load the model from a local path, you can use `config.json`; see [here](../cli/cli_api.md#example-configuration-file-configjson) for details.
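
For instance, a deployment that reads the configuration file might look like this (a sketch; it assumes `deploy` accepts a `--config` option pointing at the file described in the CLI API documentation):

```bash
# Sketch: deploy using a local config.json instead of only a model name.
sllm-cli deploy --config ./config.json
```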

Now, you can query the model with any OpenAI API client. For example, you can use the following `curl` command to query the model:
```bash
curl http://127.0.0.1:8343/v1/chat/completions \
