# Quantization in mistral.rs

Mistral.rs supports the following quantization methods:

- GGUF/GGML
  - Q and K type quants
  - Supported in GGUF/GGML and GGUF/GGML adapter models
  - I quants coming!
  - CPU, CUDA, Metal (all supported devices)
- GPTQ
  - Supported in all plain and adapter models
  - CUDA only
- ISQ (in-situ quantization)
  - Q and K type GGUF quants
  - Supported in all plain and adapter models
  - I quants coming!
  - GPTQ quants coming!
  - CPU, CUDA, Metal (all supported devices)

## Using a GGUF quantized model

- Use the `gguf` (CLI) / `GGUF` (Python) model selector
- Provide the GGUF file

```bash
cargo run --features cuda -- -i gguf -f my-gguf-file.gguf
```
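The `GGUF` selector works the same way from Python. The following is a minimal sketch based on the `mistralrs` Python package's documented API; the specific model IDs and filename are illustrative placeholders, and field names may differ between package versions.

```python
from mistralrs import Runner, Which, ChatCompletionRequest

# Load a GGUF-quantized model: the tokenizer comes from the source repo,
# the quantized weights from the provided GGUF file.
runner = Runner(
    which=Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    )
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="mistral",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=64,
    )
)
print(res.choices[0].message.content)
```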

## Using ISQ

See the docs.

```bash
cargo run --features cuda -- -i --isq Q4K plain -m microsoft/Phi-3-mini-4k-instruct -a phi3
```
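ISQ is also exposed through the Python API. A sketch assuming the `mistralrs` package's `Which.Plain` selector, `Architecture` enum, and an `in_situ_quant` argument on `Runner`; exact names and accepted quantization strings may vary by version.

```python
from mistralrs import Runner, Which, Architecture

# Load the unquantized weights, then quantize them in place to Q4K at load time.
runner = Runner(
    which=Which.Plain(
        model_id="microsoft/Phi-3-mini-4k-instruct",
        arch=Architecture.Phi3,
    ),
    in_situ_quant="Q4K",
)
```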

## Using a GPTQ quantized model

- Use the `plain` (CLI) / `Plain` (Python) model selector
- Provide the model ID for the GPTQ model
- Mistral.rs will automatically detect and use GPTQ quantization

```bash
cargo run --features cuda -- -i plain -m kaitchup/Phi-3-mini-4k-instruct-gptq-4bit -a phi3
```
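The same applies from Python: pass the GPTQ model ID to the `Plain` selector and the quantization is detected automatically, with no extra flag. A sketch under the same API assumptions as above:

```python
from mistralrs import Runner, Which, Architecture

# GPTQ quantization is detected from the model's config; no ISQ setting needed.
runner = Runner(
    which=Which.Plain(
        model_id="kaitchup/Phi-3-mini-4k-instruct-gptq-4bit",
        arch=Architecture.Phi3,
    )
)
```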