cusparseLtMatmul example is much slower than cublasGemmEx #228

Open
SimonSongg opened this issue Oct 24, 2024 · 6 comments

Comments

@SimonSongg

SimonSongg commented Oct 24, 2024

Hi, guys,

I compiled the example code for cuSPARSELt here: https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul, using the default problem size, and profiled the execution with Nsight Systems. I found that it launches many kernels, which makes the process slow:

cusparseLt: [Nsight Systems timeline screenshot]
cublas: [Nsight Systems timeline screenshot]

I then tried increasing the problem size m, n, k to 320, 320, 640; cuSPARSELt is still much slower:

cusparseLt: [Nsight Systems timeline screenshot]
cublas: [Nsight Systems timeline screenshot]

I used libcusparseLt.so.0.6.3.2, installed via apt-get following the official guide. CUDA version: 12.2; hardware: NVIDIA A100.
I am also wondering whether it is expected that the libraries are installed in /usr/lib/x86_64-linux-gnu rather than in the CUDA directory.

Any advice is appreciated! Thanks.
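
(For reference, the timelines above come from Nsight Systems; a capture command along these lines should reproduce them, though the exact flags and the binary name, e.g. ./matmul_example, are illustrative:

nsys profile --stats=true -o cusparselt_matmul ./matmul_example
)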

@j4yan
Contributor

j4yan commented Oct 24, 2024

@SimonSongg Could you double check the data types and layouts are the same in cuSPARSELt and cuBLAS?

@SimonSongg
Author

SimonSongg commented Oct 25, 2024

@SimonSongg Could you double check the data types and layouts are the same in cuSPARSELt and cuBLAS?

Hi @j4yan, thanks for the reply. This is the code I used to test cuBLAS:

#include <iostream>
#include <cstdlib>
#include <cuda_runtime_api.h>
#include <cublas_v2.h>
using AB_t         = __half;
using C_t          = __half;
using COMPUTE_t    = float;
#define CHECK_CUDA(call)                                                     \
    {                                                                        \
        cudaError_t err = (call);                                            \
        if (err != cudaSuccess) {                                            \
            std::cerr << "CUDA error: " << cudaGetErrorString(err)           \
                      << " (code " << err << ") at " << __LINE__ << std::endl; \
            return EXIT_FAILURE;                                             \
        }                                                                    \
    }

#define CHECK_CUBLAS(call)                                                   \
    {                                                                        \
        cublasStatus_t err = (call);                                         \
        if (err != CUBLAS_STATUS_SUCCESS) {                                  \
            std::cerr << "cuBLAS error: " << err                             \
                      << " at " << __LINE__ << std::endl;                    \
            return EXIT_FAILURE;                                             \
        }                                                                    \
    }

int main() {
    constexpr int m = 320;
    constexpr int n = 320;
    constexpr int k = 640;
    float alpha = 1.0f;
    float beta = 1.0f;

    size_t sizeA = m * k * sizeof(__half);
    size_t sizeB = k * n * sizeof(__half);
    size_t sizeC = m * n * sizeof(__half);

    auto     hA             = new AB_t[sizeA / sizeof(AB_t)];
    auto     hB             = new AB_t[sizeB / sizeof(AB_t)];
    auto     hC             = new C_t[sizeC / sizeof(C_t)];

    // Initialize matrices
    for (int i = 0; i < m * k; i++)
        hA[i] = static_cast<AB_t>(static_cast<float>(std::rand() % 5 - 2)); // values in -2 ... 2

    for (int i = 0; i < k * n; i++)
        hB[i] = static_cast<AB_t>(static_cast<float>(std::rand() % 5 - 2));

    for (int i = 0; i < m * n; i++)
        hC[i] = static_cast<C_t>(static_cast<float>(std::rand() % 5 - 2));

    AB_t *dA, *dB;  // device buffers hold __half data, not float
    C_t  *dC;
    CHECK_CUDA(cudaMalloc((void**)&dA, sizeA));
    CHECK_CUDA(cudaMalloc((void**)&dB, sizeB));
    CHECK_CUDA(cudaMalloc((void**)&dC, sizeC));

    CHECK_CUDA(cudaMemcpy(dA, hA, sizeA, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dB, hB, sizeB, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dC, hC, sizeC, cudaMemcpyHostToDevice));
    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));

    // CUDA Timer setup
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
    
    CHECK_CUDA(cudaEventRecord(start, nullptr));
    
    CHECK_CUBLAS(cublasGemmEx(handle,
                              CUBLAS_OP_N, CUBLAS_OP_N,
                              m, n, k,
                              &alpha,
                              dA, CUDA_R_16F, m,
                              dB, CUDA_R_16F, k,
                              &beta,
                              dC, CUDA_R_16F, m,
                              CUDA_R_32F,
                              CUBLAS_GEMM_DEFAULT));

    CHECK_CUDA(cudaEventRecord(stop, nullptr));
    CHECK_CUDA(cudaEventSynchronize(stop));

    float milliseconds = 0;
    CHECK_CUDA(cudaEventElapsedTime(&milliseconds, start, stop));
    std::cout << "Matrix multiplication took " << milliseconds << " ms" << std::endl;

    CHECK_CUDA(cudaMemcpy(hC, dC, sizeC, cudaMemcpyDeviceToHost));

    // Cleanup
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    delete[] hA;  // allocated with new[], so delete[] rather than free()
    delete[] hB;
    delete[] hC;
    cublasDestroy(handle);

    return 0;
}
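
(Compiled with something like nvcc -O2 test_gemm.cu -o test_gemm -lcublas; the file name here is illustrative.)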

I used FP16 to align with the example code provided for cuSPARSELt; the conclusion is still the same.

I am wondering whether it is expected that so many kernels are launched during the execution of the cuSPARSELt example code (as shown in the timelines above). It looks odd; I just copy-pasted the example code from this repository. Is there a bug in the example code that leads to this behavior?

Thanks!

@SimonSongg
Author

@SimonSongg Could you double check the data types and layouts are the same in cuSPARSELt and cuBLAS?

Hi @j4yan, I found that if I set matmul_search=false, only one kernel is launched (see below), and the calculation result is correct.
[Nsight Systems timeline screenshot: single kernel launch]
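
For context, a minimal sketch of what the search toggle in the sample boils down to; the identifiers follow the sample's style but are not verified against it. cusparseLtMatmulSearch() is the auto-tuning entry point: it benchmarks many candidate kernels and records the best one in the plan, while cusparseLtMatmul() performs a single launch with the algorithm currently selected in the plan.

// Sketch only: auto-tuning search vs. a single matmul launch
if (matmul_search) {
    // Benchmarks many candidate kernels -> the burst of launches in the timeline
    CHECK_CUSPARSE( cusparseLtMatmulSearch(&handle, &plan, &alpha,
                                           dA_compressed, dB, &beta,
                                           dC, dD, d_workspace,
                                           streams, num_streams) )
} else {
    // One launch with the plan's currently selected algorithm
    CHECK_CUSPARSE( cusparseLtMatmul(&handle, &plan, &alpha,
                                     dA_compressed, dB, &beta,
                                     dC, dD, d_workspace,
                                     streams, num_streams) )
}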

@j4yan
Contributor

j4yan commented Oct 28, 2024

Many kernels are launched by cusparseLtMatmulSearch(); setting matmul_search=false disables that routine. For small problem sizes like 320 x 320 x 640 you probably observe much speedup against dense GEMM.
I'd suggest timing many matmul calls using cudaEvent.
Also, the layouts of the dense GEMM (A and B are both column-major) are different from the layouts of the sparse GEMM (A is row-major and B is column-major).
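
A minimal sketch of both suggestions, reusing the handle, device buffers, and events from the cuBLAS snippet above (the iteration count and the CUBLAS_OP_T trick for consuming a row-major A are illustrative choices, not taken from the sample):

// Time many launches between one pair of events and report the average.
// CUBLAS_OP_T with lda = k lets the column-major cuBLAS call consume an A
// that is stored row-major, matching the cuSPARSELt sample's layout for A.
constexpr int iters = 100;
CHECK_CUDA(cudaEventRecord(start, nullptr));
for (int it = 0; it < iters; ++it) {
    CHECK_CUBLAS(cublasGemmEx(handle,
                              CUBLAS_OP_T, CUBLAS_OP_N,
                              m, n, k,
                              &alpha,
                              dA, CUDA_R_16F, k,   // row-major m x k A, lda = k
                              dB, CUDA_R_16F, k,   // column-major k x n B
                              &beta,
                              dC, CUDA_R_16F, m,   // column-major m x n C
                              CUDA_R_32F,
                              CUBLAS_GEMM_DEFAULT));
}
CHECK_CUDA(cudaEventRecord(stop, nullptr));
CHECK_CUDA(cudaEventSynchronize(stop));
float total_ms = 0;
CHECK_CUDA(cudaEventElapsedTime(&total_ms, start, stop));
std::cout << "average GEMM time: " << total_ms / iters << " ms" << std::endl;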

@SimonSongg
Author

Many kernels are launched by cusparseLtMatmulSearch(); setting matmul_search=false disables that routine. For small problem sizes like 320 x 320 x 640 you probably observe much speedup against dense GEMM. I'd suggest timing many matmul calls using cudaEvent. Also, the layouts of the dense GEMM (A and B are both column-major) are different from the layouts of the sparse GEMM (A is row-major and B is column-major).

Thanks for the reply.

Why does matmul_search make the GEMM launch so many kernels? Also, I did use the small problem size 320 x 320 x 640 and ran it 10 times in a for loop; the latency seems similar to the dense GEMM. It might be due to the layout? I will check that soon.

Thanks!

@j4yan
Contributor

j4yan commented Oct 29, 2024

@SimonSongg cusparseLtMatmulSearch() is the auto-tuning API. Sorry, I meant that for very small sizes you won't observe much speedup.
