The export script in sharktank was built specifically for llama 3.1 models and has some rough edges. It also requires users to chain together CLI commands: `python -m sharktank.examples.export_paged_llm_v1 [--options]`, then `iree-compile [--options]`.
This flow is cumbersome from a user perspective and forces CI runs to invoke the CLI commands via `subprocess`, instead of having a programmatic in-memory alternative.
We should find a more general and easier-to-use solution for generating MLIR for LLM models and compiling those models to `.vmfb` for the shortfin server.
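For reference, the closest thing to a programmatic flow today is wrapping the two commands ourselves. A minimal sketch, assuming the `iree-compiler` Python package is installed for the compile step; the export flags and the `/tmp/model.mlir` / `/tmp/model.vmfb` paths are placeholders, not real defaults:

```python
import subprocess
import sys

import iree.compiler as ireec  # from the iree-compiler pip package

# Placeholder: the model-specific flags normally passed on the CLI
# (parameter file, output .mlir path, and so on).
export_args: list[str] = []

# Step 1: export MLIR by shelling out to the sharktank script, which is
# effectively what CI has to do today via subprocess.
subprocess.run(
    [sys.executable, "-m", "sharktank.examples.export_paged_llm_v1", *export_args],
    check=True,
)

# Step 2: compile the exported MLIR to a .vmfb with the IREE compiler API.
ireec.compile_file(
    "/tmp/model.mlir",             # wherever the export step wrote its MLIR
    target_backends=["llvm-cpu"],  # compilation target; adjust per deployment
    output_file="/tmp/model.vmfb", # artifact consumed by the shortfin server
)
```

Only the compile step has an in-process API here; the export step is still a subprocess, which is exactly the gap this issue is about.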
Below is a starting point recommendation provided by @ScottTodd:

"Users shouldn't need to chain together `python -m sharktank.examples...` and `iree-compile ...` commands. We can aim for something like https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference (that's as minimal as it gets - we'll want to pass options like the compilation target though)"
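For context, the linked vLLM quickstart boils down to roughly the snippet below (paraphrased from that page; a shortfin/sharktank equivalent would presumably also need to accept compilation options such as the target backend):

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# One object owns model loading and execution; no separate export/compile
# steps are exposed to the user.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```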
Ah yes, I was just going to connect those dots too.
For SDXL there are multiple submodels (VAE + UNet + CLIP), so having the build system manage all of them is especially helpful. Ideally we can standardize on a similar set of APIs for llama, SDXL, and future supported models.
Discussion for this in #373 and #284.
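As a purely hypothetical illustration of what a shared export/compile API across llama and SDXL could look like (none of these names exist in sharktank or shortfin today; the SDXL submodel breakdown follows the VAE + UNet + CLIP split mentioned above):

```python
# Hypothetical sketch only; nothing below is an existing API.
from dataclasses import dataclass, field


@dataclass
class CompileOptions:
    target: str = "llvm-cpu"  # e.g. "rocm", "cuda"
    extra_flags: list[str] = field(default_factory=list)


class ModelBuilder:
    """Exports MLIR for each submodel and compiles it to a .vmfb, in process."""

    def __init__(self, submodels: dict[str, str], options: CompileOptions):
        # submodels maps a name ("unet", "clip", ...) to its parameter file.
        self.submodels = submodels
        self.options = options

    def build(self, output_dir: str) -> dict[str, str]:
        # Would export and compile each submodel, returning name -> .vmfb path.
        raise NotImplementedError


# Usage sketch: llama has a single submodel, SDXL has several.
llama = ModelBuilder({"llm": "llama3.1_8b.irpa"}, CompileOptions(target="rocm"))
sdxl = ModelBuilder(
    {"clip": "clip.irpa", "unet": "unet.irpa", "vae": "vae.irpa"},
    CompileOptions(target="rocm"),
)
```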