Update user docs for running llm server,
Add back `sentencepiece` as requirement for `sharktank` to enable `export_paged_llm_v1`
stbaione committed Dec 11, 2024
1 parent 6c62ed1 commit 8655451
Showing 2 changed files with 28 additions and 55 deletions.
81 changes: 27 additions & 54 deletions docs/shortfin/llm/user/e2e_llama8b_mi300x.md
@@ -22,30 +22,37 @@ python -m venv --prompt shark-ai .venv
source .venv/bin/activate
```

### Install `shark-ai`
### Install `shortfin`

You can install either the `latest stable` version of `shark-ai`
or the `nightly` version:
You can install the `latest stable` version of shortfin by installing
`shark-ai`, or install the `nightly` version directly:

#### Stable

```bash
pip install shark-ai
pip install shark-ai[apps]
```

#### Nightly

```bash
pip install sharktank -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
pip install shortfin -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
pip install shortfin[apps] --pre -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
```
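
If the nightly install succeeded, the resolved wheel should carry a
development version tag. An optional way to confirm, using standard `pip`
tooling (nothing shortfin-specific):

```bash
# The reported version should include a dev/nightly suffix rather
# than a plain stable release number.
pip show shortfin
```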

#### Install dataclasses-json
<!-- TODO: Remove when sharktank added to `shark-ai[apps]` -->
### Install `sharktank`

<!-- TODO: This should be included in release: -->
Install the `nightly` version of sharktank:

```bash
pip install dataclasses-json
pip install sharktank --pre -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
```

<!-- TODO: Remove once `sentencepiece` added to nightly `sharktank` -->
### Install `sentencepiece`

```bash
pip install sentencepiece
```
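
Before exporting anything, it can be worth confirming that the pieces used
later in this guide are importable. This optional sanity check is not part of
the official flow and assumes the wheels installed cleanly:

```bash
# sentencepiece should import and report a version in the pinned range.
python -c "import sentencepiece; print(sentencepiece.__version__)"

# The sharktank export entry point used below should print its usage text.
python -m sharktank.examples.export_paged_llm_v1 --help
```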

### Define a directory for export files
@@ -78,8 +85,8 @@ This example uses the `llama8b_f16.gguf` and `tokenizer.json` files
that were downloaded in the previous step.

```bash
export MODEL_PARAMS_PATH=$EXPORT_DIR/llama3.1-8b/llama8b_f16.gguf
export TOKENIZER_PATH=$EXPORT_DIR/llama3.1-8b/tokenizer.json
export MODEL_PARAMS_PATH=$EXPORT_DIR/meta-llama-3.1-8b-instruct.f16.gguf
export TOKENIZER_PATH=$EXPORT_DIR/tokenizer.json
```
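
A quick existence check on both paths can save a failed export later
(optional):

```bash
# Both files should exist; the .gguf weights are several GB for an 8B model.
ls -lh "$MODEL_PARAMS_PATH" "$TOKENIZER_PATH"
```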

#### General env vars
@@ -91,8 +98,6 @@ The following env vars can be copy + pasted directly:
export MLIR_PATH=$EXPORT_DIR/model.mlir
# Path to export config.json file
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
# Path to export edited_config.json file
export EDITED_CONFIG_PATH=$EXPORT_DIR/edited_config.json
# Path to export model.vmfb file
export VMFB_PATH=$EXPORT_DIR/model.vmfb
# Batch size for kvcache
@@ -108,7 +113,7 @@ to export our model to `.mlir` format.

```bash
python -m sharktank.examples.export_paged_llm_v1 \
--irpa-file=$MODEL_PARAMS_PATH \
--gguf-file=$MODEL_PARAMS_PATH \
--output-mlir=$MLIR_PATH \
--output-config=$OUTPUT_CONFIG_PATH \
--bs=$BS
@@ -137,37 +142,6 @@ iree-compile $MLIR_PATH \
-o $VMFB_PATH
```
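
Compiling an 8B model can take several minutes. Once it finishes, a quick
look at the output confirms an artifact was actually produced; the device
listing is optional and assumes the IREE tools are on your `PATH`:

```bash
# The compiled module should be non-empty (typically tens of MB).
ls -lh "$VMFB_PATH"

# Optionally list the devices IREE can see on this machine.
iree-run-module --list_devices
```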

## Write an edited config

We need to write a config for our model with a slightly edited structure
to run with shortfin. This will work for the example in our docs.
You may need to modify some of the parameters for a specific model.

### Write edited config

```bash
cat > $EDITED_CONFIG_PATH << EOF
{
"module_name": "module",
"module_abi_version": 1,
"max_seq_len": 131072,
"attn_head_count": 8,
"attn_head_dim": 128,
"prefill_batch_sizes": [
$BS
],
"decode_batch_sizes": [
$BS
],
"transformer_block_count": 32,
"paged_kv_cache": {
"block_seq_stride": 16,
"device_block_count": 256
}
}
EOF
```

## Running the `shortfin` LLM server

We should now have all of the files that we need to run the shortfin LLM server.
Expand All @@ -178,15 +152,14 @@ Verify that you have the following in your specified directory ($EXPORT_DIR):
ls $EXPORT_DIR
```

- edited_config.json
- config.json
- meta-llama-3.1-8b-instruct.f16.gguf
- model.mlir
- model.vmfb
- tokenizer_config.json
- tokenizer.json

### Launch server:

<!-- #### Set the target device
TODO: Add instructions on targeting different devices,
when `--device=hip://$DEVICE` is supported -->
### Launch server

#### Run the shortfin server

@@ -209,7 +182,7 @@ Run the following command to launch the Shortfin LLM Server in the background:
```bash
python -m shortfin_apps.llm.server \
--tokenizer_json=$TOKENIZER_PATH \
--model_config=$EDITED_CONFIG_PATH \
--model_config=$OUTPUT_CONFIG_PATH \
--vmfb=$VMFB_PATH \
--parameters=$MODEL_PARAMS_PATH \
--device=hip > shortfin_llm_server.log 2>&1 &
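
# Optional addition, not in the original doc: give the server a few
# seconds to start, then poll it. The /health route is an assumption
# carried over from other shortfin examples; if the request fails,
# check shortfin_llm_server.log for startup errors.
sleep 10
curl -i http://localhost:8000/health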
@@ -252,7 +225,7 @@ port = 8000 # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"

def generation_request():
payload = {"text": "What is the capital of the United States?", "sampling_params": {"max_completion_tokens": 50}}
payload = {"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}
try:
resp = requests.post(generate_url, json=payload)
resp.raise_for_status() # Raises an HTTPError for bad responses
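The snippet above is cut off by the diff view. For reference, here is a
complete, self-contained client in the same shape. The endpoint and payload
follow the visible lines, while the response handling and the printed output
are illustrative assumptions:

```python
import requests

port = 8000  # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"

def generation_request():
    payload = {
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50},
    }
    try:
        resp = requests.post(generate_url, json=payload)
        resp.raise_for_status()  # Raises an HTTPError for bad responses
        print(resp.text)  # Assumption: the generated text is in the response body
    except requests.exceptions.RequestException as err:
        print(f"Request failed: {err}")

if __name__ == "__main__":
    generation_request()
```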
2 changes: 1 addition & 1 deletion sharktank/requirements.txt
@@ -5,7 +5,7 @@ gguf==0.10.0
numpy<2.0

# Needed for newer gguf versions (TODO: remove when gguf package includes this)
# sentencepiece>=0.1.98,<=0.2.0
sentencepiece>=0.1.98,<=0.2.0

# Model deps.
huggingface-hub==0.22.2
