Optimization plans #46
-
I'm not sure how you conducted these benchmarks, but for quantized (GGUF) models on CUDA devices, much of the performance difference likely stems from the distinct quantized CUDA implementations in … To my knowledge, …
-
Update: Using the optimized … When benchmarking …
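For reference, the T/s numbers quoted in this thread boil down to a simple throughput measurement. A minimal sketch, with a hypothetical `generate_tokens` stand-in for the model's decode loop (not a mistral.rs API):

```rust
use std::time::Instant;

// Hypothetical stand-in for the model's decode loop; in practice this would run
// the model and return the number of tokens actually generated.
fn generate_tokens(_prompt: &str, max_tokens: usize) -> usize {
    max_tokens
}

fn main() {
    let start = Instant::now();
    let n_tokens = generate_tokens("Hello, world!", 256);
    let elapsed = start.elapsed().as_secs_f64();
    println!("decoded {} tokens at {:.1} T/s", n_tokens, n_tokens as f64 / elapsed);
}
```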
-
Without the experimental sampler, we are now at 92% of llama.cpp's speed on an A10 (73 vs. 79 T/s) as of #96! The experimental GPU sampler (#67) still requires some work to implement the various decoding methods, specifically the top-k/top-p implementations as well as the sampling itself, for a start. Before #96, we were already matching … If anyone is able to contribute a working sampler on the GPU, that would be great and very much appreciated.
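For anyone considering picking up #67: here is a minimal CPU-side sketch of the top-k / top-p filtering and sampling logic that a GPU sampler would need to reproduce on-device. This is only an illustration of the math, not the mistral.rs API; the function name and the use of the `rand` crate are my own assumptions.

```rust
use rand::distributions::{Distribution, WeightedIndex};

/// Sketch of combined top-k / top-p (nucleus) sampling over raw logits.
/// A GPU implementation would perform the same filtering with on-device kernels.
fn sample_top_k_top_p(logits: &[f32], top_k: usize, top_p: f32, temperature: f32) -> usize {
    // Temperature-scaled softmax to turn logits into probabilities.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|&e| e / sum).collect();

    // Sort token indices by probability, descending.
    let mut indices: Vec<usize> = (0..probs.len()).collect();
    indices.sort_unstable_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    // Keep at most top_k tokens, truncated further to the smallest prefix whose
    // cumulative probability reaches top_p.
    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for &i in indices.iter().take(top_k.max(1)) {
        kept.push(i);
        cumulative += probs[i];
        if cumulative >= top_p {
            break;
        }
    }

    // Sample from the renormalized, filtered distribution.
    let weights: Vec<f32> = kept.iter().map(|&i| probs[i]).collect();
    let dist = WeightedIndex::new(&weights).unwrap();
    kept[dist.sample(&mut rand::thread_rng())]
}
```

The filtering itself is cheap; the hard part on GPU is doing the sort/scan over the vocabulary without a round trip to the host, which is presumably where most of the implementation effort for #67 would go.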
-
Eager to help!
-
To increase performance, implement distributed inference!
-
mistral.rs is currently at about 95% of llama.cpp's speed when comparing GGUF on an A10. To increase performance, we plan on optimizing the following areas: … If you have any other ideas, please let us know!