cusparseLtMatmul example is much slower than cublasGemmEx #228
@SimonSongg Could you double-check that the data types and layouts are the same in cuSPARSELt and cuBLAS?
Hi @j4yan, thanks for the reply. This is the code I used to test cuBLAS:
I tried to use FP16 to align with the example code provided for cuSPARSELt, and reached the same conclusion. I am wondering whether the behavior that tons of kernels are launched during the execution of the cusparseLt example code (as I showed previously) is expected. It looks weird. I just copy-pasted the example code from this repository. Is there a bug in the example code that leads to this behavior? Thanks!
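For reference, a minimal FP16 `cublasGemmEx` benchmark of the kind being compared in this thread could look like the sketch below. This is illustrative only, not the poster's original code; the problem size, column-major layout, and FP32 accumulation are assumptions.

```c
// Minimal sketch of an FP16 GEMM benchmark with cublasGemmEx
// (not the poster's original code; sizes and layout are assumptions).
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cuda_fp16.h>

int main() {
    const int m = 320, n = 320, k = 640;          // problem size from the thread
    __half *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(__half) * m * k);
    cudaMalloc(&dB, sizeof(__half) * k * n);
    cudaMalloc(&dC, sizeof(__half) * m * n);
    cudaMemset(dA, 0, sizeof(__half) * m * k);    // contents irrelevant for timing
    cudaMemset(dB, 0, sizeof(__half) * k * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;        // FP16 inputs, FP32 accumulate
    for (int i = 0; i < 10; ++i) {                // repeat, as in the thread
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     &alpha,
                     dA, CUDA_R_16F, m,           // column-major, lda = m
                     dB, CUDA_R_16F, k,           // ldb = k
                     &beta,
                     dC, CUDA_R_16F, m,           // ldc = m
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    }
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```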
Hi @j4yan, I found that if I set matmul_search=false, only one kernel is launched (as below), and the calculation result is correct.
Many kernels are launched by cusparseLtMatmulSearch(); setting matmul_search=false disables this routine. For small problem sizes like 320 x 320 x 640 you probably observe much speedup against dense GEMM.
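For context, in the sample the matmul_search flag roughly controls whether the auto-tuning call runs before the actual multiplication. A sketch of that structure (variable names taken from or assumed after the sample, not verbatim):

```c
// Sketch of the relevant part of the cuSPARSELt matmul sample (names assumed).
if (matmul_search) {
    // Auto-tuning: runs many candidate kernels to pick the fastest algorithm.
    // These candidate launches are what show up as "many kernels" in Nsight Systems.
    CHECK_CUSPARSE( cusparseLtMatmulSearch(&handle, &plan, &alpha,
                                           dA_compressed, dB, &beta,
                                           dC, dD, d_workspace,
                                           streams, num_streams) )
}
// The actual multiplication uses the selected algorithm and launches a single kernel.
CHECK_CUSPARSE( cusparseLtMatmul(&handle, &plan, &alpha,
                                 dA_compressed, dB, &beta,
                                 dC, dD, d_workspace,
                                 streams, num_streams) )
```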
Thanks for the reply. Why does matmul_search make the GEMM launch so many kernels? I did use the small problem size 320 x 320 x 640 and ran it 10 times in a for loop; the latency seems similar to dense GEMM. It might be due to the layout? I will check it soon. Thanks!
@SimonSongg cusparseLtMatmulSearch() is the auto-tuning API. Sorry, I meant that for very small sizes you won't observe much speedup.
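Since the search is a one-time tuning cost, one way to make the comparison fair is to run it once outside the timed region and time only the repeated cusparseLtMatmul() calls, e.g. with CUDA events. The fragment below is a sketch that assumes the handle, plan, and buffers are already set up as in the sample (names assumed):

```c
// Time only the steady-state matmul; exclude the one-time auto-tuning search.
// Assumes the sample's prior setup (handle, plan, dA_compressed, dB, dC, dD,
// d_workspace, streams, num_streams).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// One-time tuning cost, outside the timed region.
cusparseLtMatmulSearch(&handle, &plan, &alpha, dA_compressed, dB, &beta,
                       dC, dD, d_workspace, streams, num_streams);

cudaEventRecord(start);
for (int i = 0; i < 10; ++i) {
    cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB, &beta,
                     dC, dD, d_workspace, streams, num_streams);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("average cusparseLtMatmul latency: %f ms\n", ms / 10);
```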
Hi, guys,
I compiled the cuSPARSELt example code here: https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul, using the default problem size, and profiled the execution with Nsight Systems. I found it launched many kernels, which makes the process slow:
cusparseLt: [Nsight Systems timeline screenshot]

cublas: [Nsight Systems timeline screenshot]
I then tried increasing the problem size m, n, k to 320, 320, 640; cusparseLt is much slower:
cusparseLt: [Nsight Systems timeline screenshot]

cublas: [Nsight Systems timeline screenshot]
I used libcusparseLt.so.0.6.3.2, which was installed with apt-get following the official guide. CUDA version: 12.2; hardware: NVIDIA A100.
I am also wondering whether it is expected that the libs are installed in /usr/lib/x86_64-linux-gnu rather than in the CUDA directory. Any advice is appreciated! Thanks.