Faster than llama.cpp on CUDA #612
EricLBuehler announced in Announcements
Replies: 1 comment
- Have any benchmarks been run for hybrid inference (CPU + partial offload to GPU)?
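For context, a hybrid run keeps part of the model in system RAM and offloads only some transformer layers to the GPU. A minimal sketch of how such a sweep is usually set up with llama.cpp's llama-bench tool, where -ngl controls the number of offloaded layers; the model path and layer counts below are placeholders, not figures from this discussion:

```bash
# Partial-offload (hybrid CPU + GPU) sweep with llama.cpp's llama-bench.
# -ngl sets how many layers are offloaded to the GPU; the rest run on the
# CPU. llama-bench accepts comma-separated values and benchmarks each one.
# The model path is a placeholder.
./llama-bench -m ./mistral-7b.Q4_K_M.gguf -p 512 -n 128 -ngl 0,8,16,24,32
```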
To reproduce with mistral.rs:
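The exact command from the original post is not preserved in this extract; below is a hedged sketch of benchmarking a GGUF Mistral 7B model with the repository's mistralrs-bench harness. The flag names and model identifiers are assumptions that may differ between versions, so confirm them with `mistralrs-bench --help` and the README:

```bash
# Build mistral.rs with CUDA support, then run the benchmark harness.
# Assumed flags: -p prompt tokens, -g generated tokens, -r repetitions,
# -c concurrent requests; the GGUF repo and file names are placeholders.
cargo build --release --features cuda
./target/release/mistralrs-bench -p 512 -g 128 -r 5 -c 1 \
  gguf -m TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
       -f mistral-7b-instruct-v0.2.Q4_K_M.gguf
```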
To reproduce with llama.cpp:
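Likewise, a sketch of the corresponding llama.cpp run, assuming a CUDA-enabled build and full GPU offload; the model path and token counts are placeholders rather than the author's exact invocation:

```bash
# Benchmark the same Q4_K_M model with llama-bench on a CUDA build.
# -ngl 99 offloads all layers to the GPU; -p/-n set prompt and
# generation token counts. Point -m at your local GGUF file.
./llama-bench -m ./mistral-7b.Q4_K_M.gguf -p 512 -n 128 -ngl 99
```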
Configuration benchmarked: Mistral 7B, Q4_K_M (medium) quantization, batch size 1.