Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: getrs serial internal implementations #2488

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

yasahi-hpc
Copy link
Contributor

@yasahi-hpc yasahi-hpc commented Jan 27, 2025

Fixes #2485

  • unit-test passes with H100 and Cuda 12.0.0. There seems to be a compiler bug in Cuda 12.0.0 which applies an aggressive loop unroll that crashes the SerialLaswpVectorBackwardInternal. This can be avoided by disallowing the loop unrolling inside this function. I did not observe failures for other Cuda versions.
  • Remove using namespace KokkosBatched from getrs unit-test

@cwpearson
It seems fine on my env, but could you please test on your side?

Copy link
Contributor

@lucbv lucbv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's give it a try

@lucbv lucbv added the AT2-SPECIAL-APPROVAL Mark .github changes as approved. label Jan 27, 2025
@cwpearson
Copy link
Contributor

Any theories as to why removing the wrapping struct fixes this?

@lucbv lucbv removed the AT2-SPECIAL-APPROVAL Mark .github changes as approved. label Jan 27, 2025
@lucbv
Copy link
Contributor

lucbv commented Jan 27, 2025

I guess instead of overloading the whole struct you only get the call to invoke to be instantiated, I am not sure how that affects the other members of the struct?

@yasahi-hpc
Copy link
Contributor Author

Any theories as to why removing the wrapping struct fixes this?

Not very clear to me. Even if it there is a bug in previous implementation, the failure for Transpose only with BlkSize >= 4 is difficult to understand.
@cwpearson Does this change fix the issue on your env?

I guess instead of overloading the whole struct you only get the call to invoke to be instantiated, I am not sure how that affects the other members of the struct?

I agree. At least, the current implementation is straightforward and is typically used in batched functions.

@cwpearson
Copy link
Contributor

cwpearson commented Jan 28, 2025

Fortunately (for my sanity) but unfortunately for #2485, the issue is not quite resolved. Slight difference to how it manifested in 2485:

__remote_shared__ read rather than __global__

[ RUN      ] Cuda.test_batched_getrs_nt_double
[       OK ] Cuda.test_batched_getrs_nt_double (212 ms)
[ RUN      ] Cuda.test_batched_getrs_t_double
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.472168 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.632412 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.484741 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1.11126 vs 2.22045e-13
========= Invalid __remote_shared__ read of size 4 bytes
=========     at 0x2610 in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Kokkos::Cuda>>(T1)
=========     by thread (0,1,0) in block (0,0,0)
=========     Address 0x16ec9c in CTA 253 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e950]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x1490c]
=========                in /projects/x86-64-icelake-rocky8/tpls/cuda/12.0.0/gcc/11.3.0/base/ksndyya/lib64/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x6bb4b]
=========                in /projects/x86-64-icelake-rocky8/tpls/cuda/12.0.0/gcc/11.3.0/base/ksndyya/lib64/libcudart.so.12
=========     Host Frame:cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0x2d3a1]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda
=========     Host Frame:__device_stub__ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_22CombinedFunctorReducerIN4Test5Getrs26Functor_BatchedSerialGetrsINS_6DeviceINS_4CudaENS_9CudaSpaceEEENS_4ViewIPPPdJNS_10LayoutLeftESA_EEENSB_IPPiJSF_SA_EEENSB_ISD_JSF_SA_EEENS5_8ParamTagIN10KokkosBlas5Trans9TransposeEEENSM_4Algo6Level39UnblockedEEENS0_15FunctorAnalysisINS0_23FunctorPatternInterface6REDUCEENS_11RangePolicyIJS8_SP_EEEST_iE7ReducerEvEESY_S8_EEEEvT_(Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> const&) [0x25acc]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda
=========     Host Frame:void Kokkos::Impl::__wrapper__device_stub_cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> >(Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> const&) [0x25aea]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda

Yuuichi Asahi added 3 commits January 29, 2025 06:54
@yasahi-hpc
Copy link
Contributor Author

Fortunately (for my sanity) but unfortunately for #2485, the issue is not quite resolved. Slight difference to how it manifested in 2485:

__remote_shared__ read rather than __global__

[ RUN      ] Cuda.test_batched_getrs_nt_double
[       OK ] Cuda.test_batched_getrs_nt_double (212 ms)
[ RUN      ] Cuda.test_batched_getrs_t_double
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.472168 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.632412 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.484741 vs 2.22045e-13
.../kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1.11126 vs 2.22045e-13
========= Invalid __remote_shared__ read of size 4 bytes
=========     at 0x2610 in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Kokkos::Cuda>>(T1)
=========     by thread (0,1,0) in block (0,0,0)
=========     Address 0x16ec9c in CTA 253 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e950]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x1490c]
=========                in /projects/x86-64-icelake-rocky8/tpls/cuda/12.0.0/gcc/11.3.0/base/ksndyya/lib64/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x6bb4b]
=========                in /projects/x86-64-icelake-rocky8/tpls/cuda/12.0.0/gcc/11.3.0/base/ksndyya/lib64/libcudart.so.12
=========     Host Frame:cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0x2d3a1]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda
=========     Host Frame:__device_stub__ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceINS0_22CombinedFunctorReducerIN4Test5Getrs26Functor_BatchedSerialGetrsINS_6DeviceINS_4CudaENS_9CudaSpaceEEENS_4ViewIPPPdJNS_10LayoutLeftESA_EEENSB_IPPiJSF_SA_EEENSB_ISD_JSF_SA_EEENS5_8ParamTagIN10KokkosBlas5Trans9TransposeEEENSM_4Algo6Level39UnblockedEEENS0_15FunctorAnalysisINS0_23FunctorPatternInterface6REDUCEENS_11RangePolicyIJS8_SP_EEEST_iE7ReducerEvEESY_S8_EEEEvT_(Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> const&) [0x25acc]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda
=========     Host Frame:void Kokkos::Impl::__wrapper__device_stub_cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> >(Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<int**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose> >, Kokkos::Cuda> const&) [0x25aea]
=========                in .../build-kernels/batched/dense/unit_test/KokkosKernels_batched_dla_cuda

After further investigation, I found an issue in Laswp.
Can you give it a try again with the new fix.

My conclusion is that there is a compiler bug in Cuda 12.0.0 which crashes Laswp with an aggressive loop unrolling. As far as I am concerned, the error happens specifically with Cuda 12.0.0 and H100, so I added a directive to disallow loop unrolling for Cuda 12.0.0 and H100.

@yasahi-hpc yasahi-hpc requested a review from lucbv January 29, 2025 15:18
Copy link
Contributor

@lucbv lucbv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, unfortunate if that's the only solution but let's test at least and see how it goes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Out of bounds memory access in Cuda.test_batched_getrs_t_double unit test (H100, Cuda 12.0.0)
3 participants