# Update user docs for running llm server + upgrade gguf to 0.11.0 #676

**Status:** Open. Wants to merge 4 commits into base `main`.
## `docs/shortfin/llm/user/e2e_llama8b_mi300x.md` (27 additions, 54 deletions)
@@ -22,30 +22,37 @@

```bash
python -m venv --prompt shark-ai .venv
source .venv/bin/activate
```

- ### Install `shark-ai`
+ ### Install `shortfin`

- You can install either the `latest stable` version of `shark-ai`
- or the `nightly` version:
+ You can install the `latest stable` version of shortfin by installing
+ `shark-ai`, or install the `nightly` version directly:
**Member** commented on lines +27 to +28:

> You can just install `shortfin` or `shortfin[apps]` too. All the meta `shark-ai` package does is pin to matching versions of all packages.

**Contributor (Author)** replied:

> Yeah, I felt like it was better branding to use `shark-ai` for that part. I can switch it to `shortfin[apps]` though. What do you think?
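For reference, a minimal sketch of the two install paths discussed in this thread:

```bash
# Path 1: the shark-ai meta package, which pins matching versions of all packages
pip install shark-ai[apps]

# Path 2: shortfin directly, which the reviewer notes is also sufficient
pip install shortfin[apps]
```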


#### Stable

```diff
- pip install shark-ai
+ pip install shark-ai[apps]
```

#### Nightly

```diff
- pip install sharktank -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
- pip install shortfin -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
+ pip install shortfin[apps] --pre -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
```

- #### Install dataclasses-json
- <!-- TODO: Remove when sharktank added to `shark-ai[apps]` -->
+ ### Install `sharktank`
+
+ <!-- TODO: This should be included in release: -->
+ Install the `nightly` version of sharktank:

```diff
- pip install dataclasses-json
+ pip install sharktank --pre -f https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
```
**Member** commented on lines +43 to +49:

> User guides should point to stable releases by default. I'd structure this as
>
> ## Install stable shark-ai packages
>
> <!-- TODO: Add `sharktank` to `shark-ai` meta package -->
>
> ```bash
> pip install shark-ai[apps] sharktank
> ```
>
> ### Nightly packages
>
> To install nightly packages:
>
> <!-- TODO: Add `sharktank` to `shark-ai` meta package -->
>
> ```bash
> pip install shark-ai[apps] sharktank \
>     --pre --find-links https://github.com/nod-ai/shark-ai/releases/expanded_assets/dev-wheels
> ```
>
> See also the
> [instructions here](https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md).
>
> (Should check that those install commands all work)
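Picking up the reviewer's last point, a minimal way to check what actually resolved after either install path (a sketch; package names as used above):

```bash
# Show the resolved versions of the relevant packages
pip list | grep -E "shark-ai|shortfin|sharktank"
```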


<!-- TODO: Remove once `sentencepiece` added to nightly `sharktank` -->
### Install `sentencepiece`

```bash
pip install sentencepiece
```
**Member** commented on lines +51 to 56:

> Remove now

### Define a directory for export files
@@ -78,8 +85,8 @@

This example uses the `llama8b_f16.gguf` and `tokenizer.json` files
that were downloaded in the previous step.

```diff
- export MODEL_PARAMS_PATH=$EXPORT_DIR/llama3.1-8b/llama8b_f16.gguf
- export TOKENIZER_PATH=$EXPORT_DIR/llama3.1-8b/tokenizer.json
+ export MODEL_PARAMS_PATH=$EXPORT_DIR/meta-llama-3.1-8b-instruct.f16.gguf
+ export TOKENIZER_PATH=$EXPORT_DIR/tokenizer.json
```

#### General env vars
@@ -91,8 +98,6 @@

The following env vars can be copied and pasted directly:
export MLIR_PATH=$EXPORT_DIR/model.mlir
# Path to export config.json file
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
- # Path to export edited_config.json file
- export EDITED_CONFIG_PATH=$EXPORT_DIR/edited_config.json
# Path to export model.vmfb file
export VMFB_PATH=$EXPORT_DIR/model.vmfb
# Batch size for kvcache
@@ -108,7 +113,7 @@

to export our model to `.mlir` format.

```diff
  python -m sharktank.examples.export_paged_llm_v1 \
-   --irpa-file=$MODEL_PARAMS_PATH \
+   --gguf-file=$MODEL_PARAMS_PATH \
    --output-mlir=$MLIR_PATH \
    --output-config=$OUTPUT_CONFIG_PATH \
    --bs=$BS
```
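Before compiling, a quick sanity check that the export produced its artifacts (a sketch using the env vars defined earlier):

```bash
# Both files are written by the export command above
test -f "$MLIR_PATH" && echo "MLIR OK"
test -f "$OUTPUT_CONFIG_PATH" && echo "config OK"
```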
@@ -137,37 +142,6 @@

```bash
iree-compile $MLIR_PATH \
  -o $VMFB_PATH
```
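Likewise, after compilation, one can confirm the `.vmfb` was written (same assumption about the env vars):

```bash
ls -lh "$VMFB_PATH"
```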

```diff
- ## Write an edited config
-
- We need to write a config for our model with a slightly edited structure
- to run with shortfin. This will work for the example in our docs.
- You may need to modify some of the parameters for a specific model.
-
- ### Write edited config
-
- ```bash
- cat > $EDITED_CONFIG_PATH << EOF
- {
-   "module_name": "module",
-   "module_abi_version": 1,
-   "max_seq_len": 131072,
-   "attn_head_count": 8,
-   "attn_head_dim": 128,
-   "prefill_batch_sizes": [
-     $BS
-   ],
-   "decode_batch_sizes": [
-     $BS
-   ],
-   "transformer_block_count": 32,
-   "paged_kv_cache": {
-     "block_seq_stride": 16,
-     "device_block_count": 256
-   }
- }
- EOF
- ```
```

## Running the `shortfin` LLM server

We should now have all of the files that we need to run the shortfin LLM server.
@@ -178,15 +152,14 @@

Verify that you have the following in your specified directory ($EXPORT_DIR):

```bash
ls $EXPORT_DIR
```

- edited_config.json
- config.json
- meta-llama-3.1-8b-instruct.f16.gguf
- model.mlir
- model.vmfb
- tokenizer_config.json
- tokenizer.json

- ### Launch server:
-
- <!-- #### Set the target device
-
- TODO: Add instructions on targeting different devices,
- when `--device=hip://$DEVICE` is supported -->
+ ### Launch server

#### Run the shortfin server

@@ -209,7 +182,7 @@

Run the following command to launch the Shortfin LLM Server in the background:
```diff
  python -m shortfin_apps.llm.server \
    --tokenizer_json=$TOKENIZER_PATH \
-   --model_config=$EDITED_CONFIG_PATH \
+   --model_config=$OUTPUT_CONFIG_PATH \
    --vmfb=$VMFB_PATH \
    --parameters=$MODEL_PARAMS_PATH \
    --device=hip > shortfin_llm_server.log 2>&1 &
```
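Since the command backgrounds the server and redirects its output, one simple way to confirm startup is to follow that log (the filename matches the redirect above):

```bash
# Watch the server log; interrupt once it reports it is serving requests
tail -f shortfin_llm_server.log
```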
@@ -252,7 +225,7 @@

port = 8000 # Change if running on a different port
generate_url = f"http://localhost:{port}/generate"

def generation_request():
payload = {"text": "What is the capital of the United States?", "sampling_params": {"max_completion_tokens": 50}}
payload = {"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}
try:
resp = requests.post(generate_url, json=payload)
resp.raise_for_status() # Raises an HTTPError for bad responses
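The Python snippet above is truncated in the diff view. For a quick manual test of the same endpoint, a curl equivalent of that request (an editorial sketch; assumes the default port from the snippet):

```bash
# POST the same payload the Python client sends
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}'
```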
## `sharktank/requirements.txt` (1 addition, 4 deletions)
@@ -1,12 +1,9 @@
iree-turbine

# Runtime deps.
- gguf==0.10.0
+ gguf==0.11.0
**Member** commented:

> could also leave this unpinned, so users can update if they want
>
> Suggested change: `gguf==0.11.0` → `gguf>=0.11.0`

numpy<2.0

# Needed for newer gguf versions (TODO: remove when gguf package includes this)
# sentencepiece>=0.1.98,<=0.2.0

# Model deps.
huggingface-hub==0.22.2
transformers==4.40.0