-
YASSSSS! vllm does this easily. Would be awesome to see support in llamafile and llama-cpp!!!
-
I have been playing with tabbyAPI and its support for draft models, and in short, the performance benefit is very obvious. It makes me wonder what it could mean for inference on the CPU, or even a mixed CPU/NPU/GPU setup.
Intuitively (and I may be very wrong, of course), I think this could make 14B models practically usable for the GPU poor.
See some ballpark numbers without speculative decoding here:
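For anyone curious what the draft-model trick actually does, here is a minimal, self-contained Python sketch of greedy speculative decoding. The `speculative_step` function and the toy next-token callables are purely illustrative and assume nothing about tabbyAPI's, vLLM's, or llama.cpp's real APIs: the cheap draft model proposes a few tokens, the big target model verifies them, and every accepted proposal is a target-model token you effectively get at draft-model cost.

```python
# Minimal sketch of greedy speculative decoding with a draft model.
# The "models" are placeholder callables mapping a token sequence to the
# next token id; no real inference library is used here.

from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int = 4) -> List[int]:
    """Run one speculative step and return the newly accepted tokens."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model checks each proposed position. In a real engine this
    #    is a single batched forward pass over all k positions, which is
    #    where the speedup comes from; here we loop for clarity.
    accepted: List[int] = []
    ctx = list(tokens)
    for t in proposed:
        expected = target(ctx)
        if expected == t:
            accepted.append(t)         # draft guessed right: keep it
            ctx.append(t)
        else:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
    # All k proposals accepted; the target still yields one bonus token.
    accepted.append(target(ctx))
    return accepted


if __name__ == "__main__":
    # Toy example: the target repeats the pattern 1,2,3,4,5 and the draft is
    # right most of the time, so most steps accept several tokens at once.
    def target_fn(seq: List[int]) -> int:
        return (seq[-1] % 5) + 1

    def draft_fn(seq: List[int]) -> int:
        # Occasionally wrong on purpose, to exercise the rejection path.
        return (seq[-1] % 5) + 1 if len(seq) % 7 else 0

    out: List[int] = [1]
    while len(out) < 20:
        out.extend(speculative_step(target_fn, draft_fn, out, k=4))
    print(out)
```

The wall-clock win comes from the verification being batched: the target model still "checks" every token, but several tokens per step only cost it one forward pass, which is exactly why a small, fast draft model could help a CPU-bound or mixed CPU/NPU/GPU setup.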