Document Sync by Tina
Chivier committed Dec 9, 2024
1 parent 61e7efa commit 001cc0d
Showing 2 changed files with 25 additions and 2 deletions.
25 changes: 23 additions & 2 deletions docs/stable/cli/cli_api.md
@@ -106,16 +106,37 @@ This file can be incomplete, and missing sections will be filled in by the default configuration.
     "metric": "concurrency",
     "target": 1,
     "min_instances": 0,
-    "max_instances": 10
+    "max_instances": 10,
+    "keep_alive": 0
   },
   "backend_config": {
     "pretrained_model_name_or_path": "facebook/opt-1.3b",
     "device_map": "auto",
-    "torch_dtype": "float16"
+    "torch_dtype": "float16",
+    "hf_model_class": "AutoModelForCausalLM"
   }
 }
 ```

Below is a description of all the fields in `config.json`.

| Field | Description |
| ----- | ----------- |
| model | A HuggingFace model name, used to identify the model instance. |
| backend | Inference engine; `transformers` and `vllm` are currently supported. |
| num_gpus | Number of GPUs used to deploy a model instance. |
| auto_scaling_config | Configuration for auto-scaling. |
| auto_scaling_config.metric | Metric used to decide whether to scale up or down. |
| auto_scaling_config.target | Target value of the metric. |
| auto_scaling_config.min_instances | The minimum number of model instances. |
| auto_scaling_config.max_instances | The maximum number of model instances. |
| auto_scaling_config.keep_alive | How long a model instance stays alive after an inference ends. For example, if `keep_alive` is set to 30, the instance waits 30 seconds after the inference ends for another request before it is released. |
| backend_config | Configuration for the inference backend. |
| backend_config.pretrained_model_name_or_path | The path to load the model from; this can be a HuggingFace model name or a local path. |
| backend_config.device_map | Device map used to load the model; `auto` is suitable for most scenarios. |
| backend_config.torch_dtype | Torch dtype of the model. |
| backend_config.hf_model_class | HuggingFace model class. |
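
As an illustration, here is one way a complete `config.json` for a locally stored model might look. This is a sketch: the local path `/models/opt-1.3b` is a placeholder, and the field values simply echo the table above rather than recommended settings.

```bash
# Sketch: write a config.json that loads a model from a local path.
# "/models/opt-1.3b" is a placeholder; point it at your own model directory.
cat > config.json <<'EOF'
{
  "model": "facebook/opt-1.3b",
  "backend": "transformers",
  "num_gpus": 1,
  "auto_scaling_config": {
    "metric": "concurrency",
    "target": 1,
    "min_instances": 0,
    "max_instances": 10,
    "keep_alive": 30
  },
  "backend_config": {
    "pretrained_model_name_or_path": "/models/opt-1.3b",
    "device_map": "auto",
    "torch_dtype": "float16",
    "hf_model_class": "AutoModelForCausalLM"
  }
}
EOF
```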

### sllm-cli delete
Delete deployed models by name.
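
A typical invocation might look like the following (a sketch, assuming `delete` takes the model name used at deployment as an argument):

```bash
# Sketch: remove a previously deployed model by its name.
sllm-cli delete facebook/opt-1.3b
```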

2 changes: 2 additions & 0 deletions docs/stable/getting_started/quickstart.md
@@ -67,6 +67,8 @@ conda activate sllm
sllm-cli deploy --model facebook/opt-1.3b
```

This will download the model from HuggingFace. If you want to load the model from a local path, you can use `config.json`; see [here](../cli/cli_api.md#example-configuration-file-configjson) for details.
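
For instance, a deployment that reads the configuration file might look like this (a sketch; it assumes `deploy` accepts a `--config` option pointing at the file described in the CLI API documentation):

```bash
# Sketch: deploy using a local config.json instead of only a model name.
sllm-cli deploy --config ./config.json
```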

Now, you can query the model with any OpenAI API client. For example, you can use the following `curl` command to query the model:
```bash
curl http://127.0.0.1:8343/v1/chat/completions \
