-
YASSSSS! vllm does this easily. Would be awesome to see support in llamafile and llama-cpp!!!
-
I have been playing with tabbyAPI and its support for draft models, and in short, the performance benefit is very obvious. It makes me wonder what it could mean for inference on the CPU, or even a mixed CPU/NPU/GPU setup.
Intuitively (and I may be very wrong, of course), I think this could make 14B models practically usable for the GPU poor.
See some ballpark numbers without speculative decoding here:
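For anyone curious what the draft-model trick actually does, here is a minimal, self-contained Python sketch of greedy speculative decoding. The `speculative_step` function and the toy next-token callables are purely illustrative and assume nothing about tabbyAPI's, vLLM's, or llama.cpp's real APIs: the cheap draft model proposes a few tokens, the big target model verifies them, and every accepted proposal is a target-model token you effectively get at draft-model cost.

```python
# Minimal sketch of greedy speculative decoding with a draft model.
# The "models" are placeholder callables mapping a token sequence to the
# next token id; no real inference library is used here.

from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int = 4) -> List[int]:
    """Run one speculative step and return the newly accepted tokens."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model checks each proposed position. In a real engine this
    #    is a single batched forward pass over all k positions, which is
    #    where the speedup comes from; here we loop for clarity.
    accepted: List[int] = []
    ctx = list(tokens)
    for t in proposed:
        expected = target(ctx)
        if expected == t:
            accepted.append(t)         # draft guessed right: keep it
            ctx.append(t)
        else:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
    # All k proposals accepted; the target still yields one bonus token.
    accepted.append(target(ctx))
    return accepted


if __name__ == "__main__":
    # Toy example: the target repeats the pattern 1,2,3,4,5 and the draft is
    # right most of the time, so most steps accept several tokens at once.
    def target_fn(seq: List[int]) -> int:
        return (seq[-1] % 5) + 1

    def draft_fn(seq: List[int]) -> int:
        # Occasionally wrong on purpose, to exercise the rejection path.
        return (seq[-1] % 5) + 1 if len(seq) % 7 else 0

    out: List[int] = [1]
    while len(out) < 20:
        out.extend(speculative_step(target_fn, draft_fn, out, k=4))
    print(out)
```

The wall-clock win comes from the verification being batched: the target model still "checks" every token, but several tokens per step only cost it one forward pass, which is exactly why a small, fast draft model could help a CPU-bound or mixed CPU/NPU/GPU setup.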