
[BUG] Error Internal with large batch size in gemm::device::GemmBatched #1369

Closed

RaulPPelaez opened this issue Feb 29, 2024 · 3 comments

Labels: ? - Needs Triage, bug

@RaulPPelaez (Contributor)

Describe the bug
I am trying to run the example https://github.com/NVIDIA/cutlass/blob/main/examples/05_batched_gemm/batched_gemm.cu, which I am compiling to run on an RTX 4090 with:

nvcc -arch=sm_89 batched_gemm.cu -I${CUTLASS_ROOT}/include -run

The code runs well until I change the problem size:

// Arbitrary problem size
int const m = 520;
int const n = 219;
int const k = 129;
int const batch_count = 17;

to:

int const m = 16;
int const n = 16;
int const k = 8;
int const batch_count = 65536;

Anything below that, such as batch_count = 65536 - 1, works.

Steps/Code to reproduce bug
I managed to reduce the failing code to this:

#include <iostream>
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm_batched.h"
#include "cutlass/layout/matrix.h"

cudaError_t cutlass_strided_batched_sgemm(int m, int n, int k, float alpha, float const* A, int lda,
                                          long long int batch_stride_A, float const* B, int ldb,
                                          long long int batch_stride_B, float* C, int ldc,
                                          long long int batch_stride_C, float beta,
                                          int batch_count) {

    using Gemm = cutlass::gemm::device::GemmBatched<float, cutlass::layout::ColumnMajor, float,
                                                    cutlass::layout::ColumnMajor, float,
                                                    cutlass::layout::ColumnMajor>;
    Gemm gemm_op;
    cutlass::Status status = gemm_op({{m, n, k},
                                      {A, lda},
                                      batch_stride_A,
                                      {B, ldb},
                                      batch_stride_B,
                                      {C, ldc},
                                      batch_stride_C,
                                      {C, ldc},
                                      batch_stride_C,
                                      {alpha, beta},
                                      batch_count});
    if (status != cutlass::Status::kSuccess) {
        std::cerr << "Cutlass failed with error string " << cutlass::cutlassGetStatusString(status)
                  << std::endl;
        return cudaErrorUnknown;
    }
    return cudaSuccess;
}

cudaError_t run_batched_gemm() {
    // Arbitrary problem size
    int const m = 16;
    int const n = 16;
    int const k = 8;
    int const batch_count = 65536;
    // A, B are non-transpose, column major
    int const lda = m;
    int const ldb = k * batch_count;
    int const ldc = m;
    int const count_A = batch_count * lda * k;
    int const count_B = ldb * n;
    int const count_C = batch_count * ldc * n;
    // the memory is batched along K dimension
    long long int batch_stride_A = static_cast<long long int>(lda) * static_cast<long long int>(k);
    long long int batch_stride_B = static_cast<long long int>(k);
    long long int batch_stride_C = static_cast<long long int>(ldc) * static_cast<long long int>(n);
    // alpha and beta
    float alpha = 1.0f;
    float beta = 2.0f;
    cudaError_t result = cudaSuccess;
    // allocate the device memory
    float* A;
    float* B;
    float* C;
    result = cudaMalloc(&A, count_A * sizeof(float));
    result = cudaMalloc(&B, count_B * sizeof(float));
    result = cudaMalloc(&C, count_C * sizeof(float));
    if (result != cudaSuccess) {
        std::cerr << "cudaMalloc result = " << result << std::endl;
        return result;
    }
    result =
        cutlass_strided_batched_sgemm(m, n, k, alpha, A, lda, batch_stride_A, B, ldb,
                                      batch_stride_B, C, ldc, batch_stride_C, beta, batch_count);
    return result;
}

int main() {
    cudaError_t result = run_batched_gemm();
    if (result == cudaSuccess) {
        std::cout << "Passed." << std::endl;
    } else {
        std::cout << "There was an error." << std::endl;
    }

    return result == cudaSuccess ? 0 : -1;
}

Expected behavior

I would not expect this API call to have such a low limit on the batch count, or at least I would expect a more informative error status if that is indeed the limit.

Environment details (please complete the following information):

  • Environment location: Bare-metal, CUDA 12.3, building against the current master (a8f2c80)

Additional context

The error also arises in the v3.3.0 git tag.

@hwu36 (Collaborator) commented Feb 29, 2024

The batch index is assigned to blockIdx.z, which is limited to 16 bits (gridDim.z can be at most 65535). You could split the batch across two kernel launches; that should not impact performance with a batch count this large.
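
A minimal sketch of that workaround (not from the thread), assuming the cutlass_strided_batched_sgemm wrapper from the reproduction above; the chunk size 65535 is the CUDA gridDim.z maximum, and the helper name is hypothetical:

#include <algorithm>

// Hypothetical helper: split a large batch into chunks that fit within the
// gridDim.z limit (65535) and launch one batched GEMM per chunk, offsetting
// the base pointers by the batch strides.
cudaError_t strided_batched_sgemm_chunked(int m, int n, int k, float alpha, float const* A, int lda,
                                          long long int batch_stride_A, float const* B, int ldb,
                                          long long int batch_stride_B, float* C, int ldc,
                                          long long int batch_stride_C, float beta,
                                          int batch_count) {
    int const max_batch = 65535;  // maximum gridDim.z
    for (int start = 0; start < batch_count; start += max_batch) {
        int chunk = std::min(max_batch, batch_count - start);
        cudaError_t result = cutlass_strided_batched_sgemm(
            m, n, k, alpha,
            A + start * batch_stride_A, lda, batch_stride_A,
            B + start * batch_stride_B, ldb, batch_stride_B,
            C + start * batch_stride_C, ldc, batch_stride_C,
            beta, chunk);
        if (result != cudaSuccess) {
            return result;
        }
    }
    return cudaSuccess;
}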

@hwu36 (Collaborator) commented Feb 29, 2024

BTW, I think cuBLAS has the same restriction.

@RaulPPelaez (Contributor, Author)

Thanks for the quick answer!
I suspected it had to do with block sizes; I was surprised by the error type.
Did I miss something in the documentation about it? Sorry, I am still learning to navigate it.
Thanks for the good work, you are awesome.
