diff --git a/docs/stable/cli/cli_api.md b/docs/stable/cli/cli_api.md
index 2cc7eb0..a68fd7c 100644
--- a/docs/stable/cli/cli_api.md
+++ b/docs/stable/cli/cli_api.md
@@ -106,16 +106,37 @@ This file can be incomplete, and missing sections will be filled in by the defau
         "metric": "concurrency",
         "target": 1,
         "min_instances": 0,
-        "max_instances": 10
+        "max_instances": 10,
+        "keep_alive": 0
     },
     "backend_config": {
         "pretrained_model_name_or_path": "facebook/opt-1.3b",
         "device_map": "auto",
-        "torch_dtype": "float16"
+        "torch_dtype": "float16",
+        "hf_model_class": "AutoModelForCausalLM"
     }
 }
 ```
 
+Below is a description of all the fields in `config.json`.
+
+| Field | Description |
+| ----- | ----------- |
+| model | This should be a HuggingFace model name, used to identify the model instance. |
+| backend | Inference engine; currently `transformers` and `vllm` are supported. |
+| num_gpus | Number of GPUs used to deploy a model instance. |
+| auto_scaling_config | Auto-scaling configuration. |
+| auto_scaling_config.metric | Metric used to decide whether to scale up or down. |
+| auto_scaling_config.target | Target value of the metric. |
+| auto_scaling_config.min_instances | The minimum number of model instances. |
+| auto_scaling_config.max_instances | The maximum number of model instances. |
+| auto_scaling_config.keep_alive | How long (in seconds) a model instance stays alive after inference ends. For example, if `keep_alive` is set to 30, the instance waits 30 seconds after inference ends for another request before shutting down. |
+| backend_config | Inference backend configuration. |
+| backend_config.pretrained_model_name_or_path | The path to load the model from; this can be a HuggingFace model name or a local path. |
+| backend_config.device_map | Device map used to load the model; `auto` is suitable for most scenarios. |
+| backend_config.torch_dtype | Torch dtype of the model, e.g. `float16`. |
+| backend_config.hf_model_class | HuggingFace model class, e.g. `AutoModelForCausalLM`. |
+
 ### sllm-cli delete
 
 Delete deployed models by name.
diff --git a/docs/stable/getting_started/quickstart.md b/docs/stable/getting_started/quickstart.md
index af84d7a..8c034d9 100644
--- a/docs/stable/getting_started/quickstart.md
+++ b/docs/stable/getting_started/quickstart.md
@@ -67,6 +67,8 @@ conda activate sllm
 sllm-cli deploy --model facebook/opt-1.3b
 ```
 
+This will download the model from HuggingFace. If you want to load the model from a local path instead, you can use a `config.json`; see [here](../cli/cli_api.md#example-configuration-file-configjson) for details.
+
 Now, you can query the model by any OpenAI API client. For example, you can use the following Python code to query the model:
 ```bash
 curl http://127.0.0.1:8343/v1/chat/completions \
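
For quick reference, a complete `config.json` assembled from the fields documented above might look like the following. This is a sketch based only on the values shown in this diff: the nested sections mirror the example in `cli_api.md`, while the top-level `model`, `backend`, and `num_gpus` values are illustrative assumptions rather than confirmed project defaults.

```json
{
    "model": "facebook/opt-1.3b",
    "backend": "transformers",
    "num_gpus": 1,
    "auto_scaling_config": {
        "metric": "concurrency",
        "target": 1,
        "min_instances": 0,
        "max_instances": 10,
        "keep_alive": 0
    },
    "backend_config": {
        "pretrained_model_name_or_path": "facebook/opt-1.3b",
        "device_map": "auto",
        "torch_dtype": "float16",
        "hf_model_class": "AutoModelForCausalLM"
    }
}
```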
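
For the local-path case referenced by the quickstart note, the field table suggests that only `backend_config.pretrained_model_name_or_path` needs to point at the local directory, while `model` stays a HuggingFace name used as the instance identifier. The path below is hypothetical, and since the docs state that an incomplete file is filled in from the default configuration, the auto-scaling section is omitted here:

```json
{
    "model": "facebook/opt-1.3b",
    "backend": "transformers",
    "backend_config": {
        "pretrained_model_name_or_path": "/path/to/local/opt-1.3b",
        "device_map": "auto",
        "torch_dtype": "float16",
        "hf_model_class": "AutoModelForCausalLM"
    }
}
```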