Use a simple model topology to configure ISQ and device mapping per layer with a single YAML file (examples here).
To support a per-layer mix of ISQ types, Mistral.rs can load a model topology YAML file. This YAML file is formatted as follows:
- Top-level keys are either:
  - A range of layers (`start-end`) where `start < end`. `start` is inclusive and `end` is exclusive
  - A single layer number
- The topology for the range or layer:
  - An optional key (`isq`) which maps to a single value, which can be any ISQ type. If not specified, no ISQ is applied to this range of layers.
  - An optional key (`device`) which maps to a single value, which is one of the below. If not specified, the default loading device will be used.
    - `cpu`
    - `cuda[ORDINAL]`
    - `metal[ORDINAL]`
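The key and device formats described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not how Mistral.rs parses the file; the helper names `parse_layer_key` and `parse_device` are hypothetical.

```python
import re

# Hypothetical helpers for illustration; not part of Mistral.rs.
_DEVICE_RE = re.compile(r"^(cuda|metal)\[(\d+)\]$")


def parse_layer_key(key):
    """Return the layer indices named by a top-level topology key."""
    text = str(key).strip()  # a YAML loader may hand back an int for a single layer
    if "-" in text:
        start_text, end_text = text.split("-", 1)
        start, end = int(start_text), int(end_text)
        if start >= end:
            raise ValueError(f"range {text!r} must satisfy start < end")
        # start is inclusive, end is exclusive
        return range(start, end)
    layer = int(text)
    return range(layer, layer + 1)


def parse_device(value):
    """Split a device string into (kind, ordinal); ordinal is None for cpu."""
    if value == "cpu":
        return ("cpu", None)
    match = _DEVICE_RE.match(value)
    if match is None:
        raise ValueError(f"unknown device {value!r}")
    return (match.group(1), int(match.group(2)))
```

Under these assumptions, `parse_layer_key("0-8")` covers layers 0 through 7, and `parse_device("cuda[0]")` yields `("cuda", 0)`.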
Note that:
- The topology for the range is expanded to fill the range
- If ranges overlap, the range with the higher end layer takes precedence and overwrites the other on the overlapping layers
- Any layers which are not covered will have no topology mapping. They will inherit any other ISQ (e.g. with `--isq`/`in_situ_quant`) set.
- Unless the layer is not covered by the topology, the topology value will override any other ISQ (e.g. with `--isq`/`in_situ_quant`).
- The topology device mapping will override any other device mapping.
- When using UQFF, only the device mapping is relevant.
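One way to read the rules above is as a resolution step that produces one `(isq, device)` pair per layer. The sketch below is a model of that reading, not the Mistral.rs implementation: `expand_topology`, `global_isq` and `default_device` are illustrative names, and it assumes that a covered layer whose entry has no `isq` key gets no ISQ rather than the global one.

```python
def expand_topology(topology, num_layers, global_isq=None, default_device=None):
    """Resolve a topology mapping into one (isq, device) pair per layer.

    `topology` maps (start, end) tuples, with end exclusive, to dicts that
    may contain "isq" and/or "device". All names here are illustrative.
    """
    # Layers not covered by the topology inherit the global ISQ setting
    # and the default loading device.
    resolved = [(global_isq, default_device) for _ in range(num_layers)]

    # Apply ranges in order of increasing end layer, so that where ranges
    # overlap, the range with the higher end is written last and wins.
    for (start, end), entry in sorted(topology.items(), key=lambda kv: kv[0][1]):
        for layer in range(start, min(end, num_layers)):
            # Assumption: a covered layer takes its ISQ from the topology
            # only, so a missing "isq" key means no ISQ for that layer.
            resolved[layer] = (entry.get("isq"), entry.get("device", default_device))
    return resolved
```

For example, with ranges `(0, 4)` and `(2, 6)` on an 8-layer model, layers 2 and 3 take their values from `(2, 6)`, while layers 6 and 7 fall back to `global_isq`.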
0-8:
  isq: Q3K
  device: cuda[0]
8-16:
  isq: Q4K
  device: cpu
16-24:
  isq: Q6K
# Skip 24-28
28-32:
  isq: Q8_0
  device: cuda[0]
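The example above deliberately leaves a gap. A small check like the following can list which layers a topology does not cover, which is useful for confirming that a gap is intentional. It is a sketch only: `uncovered_layers` is a hypothetical helper, and the total of 32 layers is an assumption taken from the end of the last range.

```python
def uncovered_layers(ranges, num_layers):
    """List layer indices that no (start, end) range covers; end is exclusive."""
    covered = set()
    for start, end in ranges:
        covered.update(range(start, end))
    return [layer for layer in range(num_layers) if layer not in covered]


# Ranges from the example topology; 32 total layers is an assumption.
example_ranges = [(0, 8), (8, 16), (16, 24), (28, 32)]
print(uncovered_layers(example_ranges, 32))  # [24, 25, 26, 27]
```

Per the notes above, these uncovered layers would have no topology mapping and would inherit any ISQ set with `--isq`/`in_situ_quant`.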
Model topologies may be applied to all model types.
Note: You should replace `--features ...` with one of the features specified here, or remove it for pure CPU inference.
cargo run --features ... -- -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3 --topology topologies/isq.yml
cargo run --features ... -- --port 1234 plain -m microsoft/Phi-3-mini-128k-instruct -a phi3 --topology topologies/isq.yml
Example here.