Simple app providing a CLI command to run DeepSeek R1 (4-bit quant of the Qwen 32B distilled model, via unsloth and vLLM)
I recommend using uv and running compileall on the .venv directory to get 40% faster import times:
uv venv
source .venv/bin/activate
uv pip install -e .
$(uv python find) -m compileall .
You can then symlink the entrypoint at .venv/bin/r1 into a directory on your PATH, to use it as a command without activating the venv:
~/opt/bin $ ln -s /home/louis/lab/r1/.venv/bin/r1 r1
or else add it to your PATH in your .bashrc.
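For example (assuming the repo lives at the path shown above), you could add this line to your .bashrc:
export PATH="$PATH:/home/louis/lab/r1/.venv/bin"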
Pass prompt messages (which will be given the role of user) via the CLI, as well as flags such as --temperature (the default of 1.0 is good, up to 1.8 can work; above that it loses coherence):
usage: r1 [-h] [-m [MAX_NEW_TOKENS]] [-d] [--temperature [TEMPERATURE]]
[--top-p [TOP_P]]
[messages ...]
positional arguments:
messages -
options:
-h, --help show this help message and exit
-m [MAX_NEW_TOKENS], --max-new-tokens [MAX_NEW_TOKENS]
-
-d, --deterministic False
--temperature [TEMPERATURE]
-
--top-p [TOP_P] -
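For example, to cap generation at 512 new tokens with a slightly higher temperature (the prompt and values here are purely illustrative):
r1 "Why is the sky blue?" -m 512 --temperature 1.2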
This project integrates hftorchcache to optimise Hugging Face model loading, improving initialisation speed by reusing pre-converted PyTorch serialised checkpoints. Loading this way is roughly 10x faster.
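The idea is roughly as follows (a minimal sketch of the concept only: the cache path and loading flow here are made up, not hftorchcache's actual API):

```python
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit"
CACHE = Path.home() / ".cache" / "r1" / "model.pt"  # hypothetical cache location

if CACHE.exists():
    # Re-use the already-converted, pickled model: skips the slow from_pretrained conversion
    model = torch.load(CACHE, weights_only=False)
else:
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    torch.save(model, CACHE)  # serialise once so later startups reload directly
```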
You can serve the model with:
vllm serve "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit" --quantization bitsandbytes --load-format bitsandbytes
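vLLM's server exposes an OpenAI-compatible API (on port 8000 by default), so you can query it with the openai client; this snippet is illustrative and not part of the repo:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; no real key is needed for a local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    temperature=1.0,
)
print(response.choices[0].message.content)
```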
There are two quantisation formats on Hugging Face for R1: bitsandbytes and AWQ.
See vLLM notes.
See TGI notes.
Aider bench, Python subset, 4-bit AWQ on vLLM: https://gist.github.com/lmmx/ab6563e681d936fd9c3c864447fbf19f
- 1.5B ⇢ 0% pass@1, 0% pass@2, 35% well-formed
- 7B ⇢ 0% pass rate, 42% well-formed
Aider bench, Python subset, 4-bit AWQ on TGI:
- 14B ⇢ 0% pass rate, 55% well-formed
- 32B ⇢ ?% pass rate, ?% well-formed
This repo also contains two entrypoints, silencio and vllmsilencio, which demonstrate use of logits processors:
- src/r1/silent_thought.py shows use with Transformers, post-processing the generation result
- src/r1/silent_thought_streamer.py shows use with Transformers, modifying the TextStreamer output - this is exposed via the silencio entrypoint with argh
- src/r1/silent_thought_vllm.py shows use with vLLM, post-processing the generation result - this is exposed via the vllmsilencio entrypoint with argh
Since vLLM is async, you're more likely to be using it for the entire result, but if you wanted to stream results then a similar approach to the one taken with the transformers TextStreamer would work.
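For illustration, here is a minimal sketch of that streamer idea (not the repo's actual implementation; it assumes the chain of thought is delimited by <think>...</think> tags, as R1 models emit, and that each tag arrives within a single streamed chunk):

```python
from transformers import TextStreamer

class SilentThoughtStreamer(TextStreamer):
    """Drop streamed text between <think> and </think> (illustrative sketch only)."""

    def __init__(self, tokenizer, **kwargs):
        super().__init__(tokenizer, **kwargs)
        self.in_thought = False

    def on_finalized_text(self, text: str, stream_end: bool = False):
        # Suppress chunks while inside the chain-of-thought block
        if "<think>" in text:
            self.in_thought = True
        if not self.in_thought:
            super().on_finalized_text(text, stream_end=stream_end)
        if "</think>" in text:
            self.in_thought = False
```

You would pass an instance of this as the streamer= argument to model.generate().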
Each of these works by switching to structured JSON output using the Outlines JSONLogitsProcessor.
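As a rough sketch of that pattern with vLLM (illustrative only, not the repo's exact code: the JSONLogitsProcessor import path has moved between outlines releases, and the Answer model here is made up):

```python
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from outlines.integrations.vllm import JSONLogitsProcessor  # import path varies by outlines version

class Answer(BaseModel):
    answer: str

llm = LLM(
    "unsloth/DeepSeek-R1-Distill-Qwen-1.5B-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
params = SamplingParams(
    max_tokens=512,
    logits_processors=[JSONLogitsProcessor(Answer, llm)],  # constrain decoding to the Answer schema
)
outputs = llm.generate(["Why is the sky blue?"], params)
print(outputs[0].outputs[0].text)  # a JSON string matching the Answer schema
```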
Note that the Pydantic model used to guide this JSON does not include a field for the "reasoning" key that the CoT gets put into! It would be a simple extension to take the code as is and modify it to require a field in the guide Pydantic model to house the CoT (the JSON schema could be modified to hardcode the field as a constr of length 0, or a constant of type typing.Literal[""]).
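A hypothetical guide model along those lines, pinning the CoT field to the empty string, might look like:

```python
from typing import Literal

from pydantic import BaseModel

class GuidedAnswer(BaseModel):
    reasoning: Literal[""] = ""  # CoT field hardcoded to the empty string in the JSON schema
    answer: str
```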