cusparseLtMatmul example is much slower than cublasGemmEx #228

Open
SimonSongg opened this issue Oct 24, 2024 · 6 comments

Comments

@SimonSongg

SimonSongg commented Oct 24, 2024

Hi, guys,

I compiled the example code for cuSPARSELt here: https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul, using the default problem size, and profiled the execution with Nsight Systems. I found that it launches many kernels, which makes the process slow:

cusparseLt: [Nsight Systems timeline screenshot]
cublas: [Nsight Systems timeline screenshot]

I then tried increasing the problem size m, n, k to 320, 320, 640; cuSPARSELt is still much slower:

cusparseLt: [Nsight Systems timeline screenshot]
cublas: [Nsight Systems timeline screenshot]

I used libcusparseLt.so.0.6.3.2, installed via apt-get following the official guide. CUDA version: 12.2; hardware: NVIDIA A100.
I am also wondering whether it is expected that the libraries are installed in /usr/lib/x86_64-linux-gnu rather than in the CUDA directory.

Any advice is appreciated! Thanks.
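
(For reference, the timelines above come from Nsight Systems; a capture command along these lines should reproduce them, though the exact flags and the binary name, e.g. ./matmul_example, are illustrative:

nsys profile --stats=true -o cusparselt_matmul ./matmul_example
)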

@j4yan
Contributor

j4yan commented Oct 24, 2024

@SimonSongg Could you double check the data types and layouts are the same in cuSPARSELt and cuBLAS?

@SimonSongg
Author

SimonSongg commented Oct 25, 2024

@SimonSongg Could you double check the data types and layouts are the same in cuSPARSELt and cuBLAS?

Hi @j4yan, thanks for the reply. This is the code I used to test cuBLAS:

#include <iostream>
#include <cstdlib>
#include <cuda_runtime_api.h>
#include <cublas_v2.h>
using AB_t         = __half;
using C_t          = __half;
using COMPUTE_t    = float;
#define CHECK_CUDA(call)                                                     \
    {                                                                        \
        cudaError_t err = (call);                                            \
        if (err != cudaSuccess) {                                            \
            std::cerr << "CUDA error: " << cudaGetErrorString(err)           \
                      << " (code " << err << ") at " << __LINE__ << std::endl; \
            return EXIT_FAILURE;                                             \
        }                                                                    \
    }

#define CHECK_CUBLAS(call)                                                   \
    {                                                                        \
        cublasStatus_t err = (call);                                         \
        if (err != CUBLAS_STATUS_SUCCESS) {                                  \
            std::cerr << "cuBLAS error: " << err                             \
                      << " at " << __LINE__ << std::endl;                    \
            return EXIT_FAILURE;                                             \
        }                                                                    \
    }

int main() {
    constexpr int m = 320;
    constexpr int n = 320;
    constexpr int k = 640;
    float alpha = 1.0f;
    float beta = 1.0f;

    size_t sizeA = m * k * sizeof(__half);
    size_t sizeB = k * n * sizeof(__half);
    size_t sizeC = m * n * sizeof(__half);

    auto     hA             = new AB_t[sizeA / sizeof(AB_t)];
    auto     hB             = new AB_t[sizeB / sizeof(AB_t)];
    auto     hC             = new C_t[sizeC / sizeof(C_t)];

    // Initialize matrices
    for (int i = 0; i < m * k; i++)
        hA[i] = static_cast<AB_t>(static_cast<float>(std::rand() % 5 - 2)); // values in -2 ... 2

    for (int i = 0; i < k * n; i++)
        hB[i] = static_cast<AB_t>(static_cast<float>(std::rand() % 5 - 2));

    for (int i = 0; i < m * n; i++)
        hC[i] = static_cast<C_t>(static_cast<float>(std::rand() % 5 - 2));

    AB_t *dA, *dB;  // device buffers hold __half data, not float
    C_t  *dC;
    CHECK_CUDA(cudaMalloc((void**)&dA, sizeA));
    CHECK_CUDA(cudaMalloc((void**)&dB, sizeB));
    CHECK_CUDA(cudaMalloc((void**)&dC, sizeC));

    CHECK_CUDA(cudaMemcpy(dA, hA, sizeA, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dB, hB, sizeB, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dC, hC, sizeC, cudaMemcpyHostToDevice));
    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));

    // CUDA Timer setup
    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
    
    CHECK_CUDA(cudaEventRecord(start, nullptr));
    
    CHECK_CUBLAS(cublasGemmEx(handle,
                              CUBLAS_OP_N, CUBLAS_OP_N,
                              m, n, k,
                              &alpha,
                              dA, CUDA_R_16F, m,
                              dB, CUDA_R_16F, k,
                              &beta,
                              dC, CUDA_R_16F, m,
                              CUDA_R_32F,
                              CUBLAS_GEMM_DEFAULT));

    CHECK_CUDA(cudaEventRecord(stop, nullptr));
    CHECK_CUDA(cudaEventSynchronize(stop));

    float milliseconds = 0;
    CHECK_CUDA(cudaEventElapsedTime(&milliseconds, start, stop));
    std::cout << "Matrix multiplication took " << milliseconds << " ms" << std::endl;

    CHECK_CUDA(cudaMemcpy(hC, dC, sizeC, cudaMemcpyDeviceToHost));

    // Cleanup
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    delete[] hA;  // allocated with new[], so delete[] rather than free()
    delete[] hB;
    delete[] hC;
    cublasDestroy(handle);

    return 0;
}
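
(Compiled with something like nvcc -O2 test_gemm.cu -o test_gemm -lcublas; the file name here is illustrative.)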

I used FP16 to align with the example code provided for cuSPARSELt; the conclusion is still the same.

I am wondering whether it is expected that so many kernels are launched during the execution of the cuSPARSELt example code (as shown in the timelines above). It looks odd; I just copy-pasted the example code from this repository. Is there a bug in the example code that leads to this behavior?

Thanks!

@SimonSongg
Author

@SimonSongg Could you double check the data types and layouts are the same in cuSPARSELt and cuBLAS?

Hi @j4yan, I found that if I set matmul_search=false, only one kernel is launched (see below), and the calculation result is correct.
[Nsight Systems timeline screenshot: single kernel launch]
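
For context, a minimal sketch of what the search toggle in the sample boils down to; the identifiers follow the sample's style but are not verified against it. cusparseLtMatmulSearch() is the auto-tuning entry point: it benchmarks many candidate kernels and records the best one in the plan, while cusparseLtMatmul() performs a single launch with the algorithm currently selected in the plan.

// Sketch only: auto-tuning search vs. a single matmul launch
if (matmul_search) {
    // Benchmarks many candidate kernels -> the burst of launches in the timeline
    CHECK_CUSPARSE( cusparseLtMatmulSearch(&handle, &plan, &alpha,
                                           dA_compressed, dB, &beta,
                                           dC, dD, d_workspace,
                                           streams, num_streams) )
} else {
    // One launch with the plan's currently selected algorithm
    CHECK_CUSPARSE( cusparseLtMatmul(&handle, &plan, &alpha,
                                     dA_compressed, dB, &beta,
                                     dC, dD, d_workspace,
                                     streams, num_streams) )
}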

@j4yan
Contributor

j4yan commented Oct 28, 2024

Many kernels are launched by cusparseLtMatmulSearch(); setting matmul_search=false disables that routine. For small problem sizes like 320 x 320 x 640 you probably observe much speedup against dense GEMM.
I'd suggest timing many matmul calls using cudaEvent.
Also, the layouts of the dense GEMM (A and B are both column-major) are different from the layouts of the sparse GEMM (A is row-major and B is column-major).
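
A minimal sketch of both suggestions, reusing the handle, device buffers, and events from the cuBLAS snippet above (the iteration count and the CUBLAS_OP_T trick for consuming a row-major A are illustrative choices, not taken from the sample):

// Time many launches between one pair of events and report the average.
// CUBLAS_OP_T with lda = k lets the column-major cuBLAS call consume an A
// that is stored row-major, matching the cuSPARSELt sample's layout for A.
constexpr int iters = 100;
CHECK_CUDA(cudaEventRecord(start, nullptr));
for (int it = 0; it < iters; ++it) {
    CHECK_CUBLAS(cublasGemmEx(handle,
                              CUBLAS_OP_T, CUBLAS_OP_N,
                              m, n, k,
                              &alpha,
                              dA, CUDA_R_16F, k,   // row-major m x k A, lda = k
                              dB, CUDA_R_16F, k,   // column-major k x n B
                              &beta,
                              dC, CUDA_R_16F, m,   // column-major m x n C
                              CUDA_R_32F,
                              CUBLAS_GEMM_DEFAULT));
}
CHECK_CUDA(cudaEventRecord(stop, nullptr));
CHECK_CUDA(cudaEventSynchronize(stop));
float total_ms = 0;
CHECK_CUDA(cudaEventElapsedTime(&total_ms, start, stop));
std::cout << "average GEMM time: " << total_ms / iters << " ms" << std::endl;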

@SimonSongg
Author

Many kernels are launched by cusparseLtMatmulSearch(); setting matmul_search=false disables that routine. For small problem sizes like 320 x 320 x 640 you probably observe much speedup against dense GEMM. I'd suggest timing many matmul calls using cudaEvent. Also, the layouts of the dense GEMM (A and B are both column-major) are different from the layouts of the sparse GEMM (A is row-major and B is column-major).

Thanks for the reply.

Why does matmul_search make the GEMM launch so many kernels? Also, I did use the small problem size 320 x 320 x 640 and ran it 10 times in a for loop; the latency seems similar to the dense GEMM. It might be due to the layout? I will check that soon.

Thanks!

@j4yan
Contributor

j4yan commented Oct 29, 2024

@SimonSongg cusparseLtMatmulSearch() is the auto-tuning API. Sorry, I meant that for very small sizes you won't observe much speedup.
