Faster than llama.cpp on CUDA #612

EricLBuehler · 2024-07-23T16:00:00Z

EricLBuehler
Jul 23, 2024
Maintainer

To reproduce with mistral.rs:

cargo run --release --features cuda -- --log output.txt --port 1234 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

python3 example/server/chat.py

To reproduce with llama.cpp

make GGML_CUDA=1
./llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -ngl 33 -n 1024 -p "Graphene is the best"

Mistral 7b Q4K medium, BS=1

Hardware	mistral.rs T/s (decode)	llama.cpp (decode)
A10	86	82
A100	127	137
A6000	113	100
RTX6000	103	96

Skarian · 2024-08-05T13:00:39Z

Skarian
Aug 5, 2024

Have any benchmarks been ran for hybrid inference (CPU + partial offload to GPU)?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster than llama.cpp on CUDA #612

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Faster than llama.cpp on CUDA #612

EricLBuehler Jul 23, 2024 Maintainer

Mistral 7b Q4K medium, BS=1

Replies: 1 comment

Skarian Aug 5, 2024

EricLBuehler
Jul 23, 2024
Maintainer

Skarian
Aug 5, 2024