Optimization plans #46
-
I'm not sure how you conducted these benchmarks, but for quantized (GGUF) models on CUDA devices, much of the performance difference likely stems from the distinct quantized CUDA implementations in … To my knowledge, …
-
Update: Using the optimized … When benchmarking …
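For reference, the T/s numbers quoted in this thread boil down to a simple throughput measurement. A minimal sketch, with a hypothetical `generate_tokens` stand-in for the model's decode loop (not a mistral.rs API):

```rust
use std::time::Instant;

// Hypothetical stand-in for the model's decode loop; in practice this would run
// the model and return the number of tokens actually generated.
fn generate_tokens(_prompt: &str, max_tokens: usize) -> usize {
    max_tokens
}

fn main() {
    let start = Instant::now();
    let n_tokens = generate_tokens("Hello, world!", 256);
    let elapsed = start.elapsed().as_secs_f64();
    println!("decoded {} tokens at {:.1} T/s", n_tokens, n_tokens as f64 / elapsed);
}
```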
-
Without the experimental sampler, we are now at 92% of llama.cpp's speed on an A10 (73 vs. 79 T/s) as of #96! The experimental GPU sampler (#67) still requires some work to implement the various decoding methods, specifically the top-k/top-p implementations as well as the sampling itself, for a start. Before #96, we were already matching … If anyone is able to contribute a working sampler on the GPU, that would be great and very much appreciated.
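For anyone considering picking up #67: here is a minimal CPU-side sketch of the top-k / top-p filtering and sampling logic that a GPU sampler would need to reproduce on-device. This is only an illustration of the math, not the mistral.rs API; the function name and the use of the `rand` crate are my own assumptions.

```rust
use rand::distributions::{Distribution, WeightedIndex};

/// Sketch of combined top-k / top-p (nucleus) sampling over raw logits.
/// A GPU implementation would perform the same filtering with on-device kernels.
fn sample_top_k_top_p(logits: &[f32], top_k: usize, top_p: f32, temperature: f32) -> usize {
    // Temperature-scaled softmax to turn logits into probabilities.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|&e| e / sum).collect();

    // Sort token indices by probability, descending.
    let mut indices: Vec<usize> = (0..probs.len()).collect();
    indices.sort_unstable_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    // Keep at most top_k tokens, truncated further to the smallest prefix whose
    // cumulative probability reaches top_p.
    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for &i in indices.iter().take(top_k.max(1)) {
        kept.push(i);
        cumulative += probs[i];
        if cumulative >= top_p {
            break;
        }
    }

    // Sample from the renormalized, filtered distribution.
    let weights: Vec<f32> = kept.iter().map(|&i| probs[i]).collect();
    let dist = WeightedIndex::new(&weights).unwrap();
    kept[dist.sample(&mut rand::thread_rng())]
}
```

The filtering itself is cheap; the hard part on GPU is doing the sort/scan over the vocabulary without a round trip to the host, which is presumably where most of the implementation effort for #67 would go.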
-
Eager to help!
-
To increase performance, implement distributed inference!
-
mistral.rs is currently at about 95% of llama.cpp's speed when comparing GGUF on an A10. To increase performance, we plan on optimizing the following areas: … If you have any other ideas, please let us know!