Cuda vs Triton on an RTX 3060 12GB #5
Thank you for the detailed bug report. I've got some optimizations in the works that should help. I'll reply again once those are done and hopefully you can test again.
I just pushed a new commit which should resolve this issue. Please let me know if things are working faster for you now. I've included a new
I ran the benchmark with the command:
Additionally I ran:
I tried CUDA as well (using ooba's webui, this time with --no-stream enabled, and again with the same parameters as in the duck prompt command above):
It still seems to be a bit slower than CUDA.
I re-converted the model; the results are the same:
Thank you for the report, it's helpful. I've been digging into this. Two things. First, the numbers reported by text-generation-webui are higher than the CUDA kernel achieves in my benchmarks even on my 3090, so I suspect there's some apples-to-oranges comparison here: either text-generation-webui is applying optimizations of its own, or it's measuring speed oddly. Second, regardless of that, on my generate benchmark the CUDA kernel does indeed perform slightly faster than Triton when the prompt is small. Quite odd, since in isolation the Triton kernel is faster; it's only slower in situ. I instrumented the model, and the results consistently show the Triton kernel being the slower one in practice. Very odd. I'll update once I've cracked the problem.
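To make speed numbers comparable, what matters is timing only the generation call, after a warm-up, and counting only newly generated tokens. Here is a minimal sketch of that kind of measurement, assuming a Hugging Face-style model.generate API and a hypothetical tokens_per_second helper (this is not the project's actual benchmark.py):

```python
import time
import torch

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=128):
    # Hypothetical helper: times only the generate() call and counts
    # only the newly produced tokens, not the prompt tokens.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warm-up so one-time costs (CUDA context setup, Triton autotuning,
    # allocator growth) are not billed to the measured run.
    model.generate(**inputs, max_new_tokens=8)
    torch.cuda.synchronize()

    start = time.time()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed
```

A harness that folds tokenization, prompt processing, or one-time setup into the elapsed time will report very different tokens/s for the same kernels, which could explain part of the discrepancy.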
As of my latest commit with some more optimizations, I've gotten the Triton kernel to beat CUDA in all cases on my generate benchmark. I'll take a closer look at text-generation-webui next.
For some reason it appears to be much slower than last time: the benchmark reports high tokens/s, and yet it takes a few minutes per prompt:
It's as if something is happening before it starts generating.
It seems like the generate.py command isn't affected by this:
Quite a bit faster than last time. Maybe benchmark.py warms up the autotune cache every time it generates a prompt?
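For background on that guess: Triton's @triton.autotune decorator benchmarks every candidate config the first time a kernel is launched with a new value of its key arguments, so any run that includes that first launch also pays the tuning cost. A generic toy sketch of the mechanism (a vector-add kernel, not this project's actual quantized matmul kernels):

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 256}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 1024}, num_warps=8),
    ],
    key=['n_elements'],  # re-tunes whenever n_elements changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    # The first launch for a given n_elements benchmarks every config
    # listed above; later launches reuse the cached winner.
    add_kernel[grid](x, y, out, n_elements)
    return out
```

The first add() call for a given tensor size is slow because each config gets benchmarked; later calls with the same size reuse the cached choice. If a benchmark keeps hitting that first-launch path (for example, because the key values change from prompt to prompt), wall-clock time can look far worse than the steady-state tokens/s it reports.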
Another test with generate.py:
Never mind, my fault for not noticing the --average option.
Ooba's webui results, for comparison with generate.py:
I got gptq-triton running with text-generation-webui and was able to benchmark it on my machine. Below are the numbers I'm seeing. The GPTQ-for-LLaMa numbers on my 3090 are slower than the ones you're seeing. Are you running with GPTQ-for-LLaMa
GPTQ-triton
Setup:
I don't use --xformers; I run the webui with:
CUDA: 35 tokens/s
Triton: 5 tokens/s
I used ooba's webui only for CUDA, because I've been unable to get Triton to work with ooba's webui. I made sure I used the same parameters as in the command for Triton.
For Triton I used this command:
python3.10 generate.py --model ./ --quant --prompt "Write a story about a duck: Once upon a time there was a duck" --temperature 1.99 --top-p 0.18 --repetition-penalty 1.15 --max-length 128
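(Assuming those flags map onto the standard sampling parameters of the same names, the CUDA-side equivalent expressed directly against Hugging Face transformers would look roughly like the sketch below. The model path is a placeholder; the real 4-bit checkpoint is loaded through GPTQ-for-LLaMa / the webui rather than plain from_pretrained.)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path for illustration only.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b")
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b").cuda()

prompt = "Write a story about a duck: Once upon a time there was a duck"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.99,
    top_p=0.18,
    repetition_penalty=1.15,
    max_length=128,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```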
I used the 7B-4bit model (I quantized it for Triton using:
python3.10 convert_weights.py --quant ~/AI/2oobabooga/text-generation-webui/models/llama-7b-4bit/llama-7b-4bit.safetensors --model ~/AI/oobabooga/text-generation-webui/models/LLaMA-7B/ --output ./
)
GPU: RTX 3060 12GB
OS: Debian