Out of bounds error when running multi-GPU/partitioned HISQ MG with long links dropped #1512

Open
weinbe2 opened this issue Nov 6, 2024 · 1 comment

weinbe2 commented Nov 6, 2024

In brief, there is an out-of-bounds (OOB) error when running HISQ MG with long links dropped, though it can be triggered without ever dropping to a true coarser level. It only appears with non-zero partitioning; I haven't tested whether it also occurs with true multi-GPU. There are no issues when "normal" HISQ MG is run (improved staggered on the pseudo-fine level as well), which suggests that something is going awry when switching between the improved staggered (outer level) and unimproved staggered (inner level) operators.

The error does not hit until the first solve, i.e. after setup has completed. More specifically, it triggers when returning to the fine level from the pseudo-fine level, i.e. when going from applying the unimproved operator to applying the improved operator. Whether and when it hits depends on the local volume: there is no error at ~16^4, but it hits on the first iteration at ~24^4 and larger. It does at least seem to be deterministic for a fixed command, including the volume.

The error occurs regardless of whether tuning is enabled.

A command that triggers it is as follows:

mpirun -np 1 ./staggered_invert_test \
  --mass 0.1 \
  --dim 24 24 24 24 --gridsize 1 1 1 1 --partition 8 \
  --dslash-type asqtad --tol 1e-5 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 2 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized-drop-long \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --nsrc 1 --niter 25 \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct --mg-nu-pre 0 0 --mg-nu-post 0 8 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 8 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose

This is trimmed down roughly as far as possible; the various combinations of mat and direct are non-default but required for HISQ MG as currently implemented. As noted above, you never actually need to enter a true coarse solve to trigger the error, but you do still need to compile with Nc = 24 for the KD-operator construction.
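
A note on `--partition 8` in the command above: the value appears to be a bitmask over the four lattice dimensions (X=1, Y=2, Z=4, T=8), which is consistent with the commDim=0001 and comm=0001 fields in the kernel aux strings reported with the error. The sketch below decodes such a mask under that assumption. The function names are hypothetical and not part of QUDA.

```python
# Assumption: bit i of the mask selects dimension i, in X, Y, Z, T order.
DIMS = ("X", "Y", "Z", "T")


def decode_partition(mask: int) -> list[str]:
    """Return the names of the dimensions selected by a partition bitmask."""
    return [name for bit, name in enumerate(DIMS) if mask & (1 << bit)]


def comm_string(mask: int) -> str:
    """Render the mask as a per-dimension 0/1 string in X, Y, Z, T order."""
    return "".join("1" if mask & (1 << bit) else "0" for bit in range(len(DIMS)))


if __name__ == "__main__":
    print(decode_partition(8))  # only the T direction is partitioned
    print(comm_string(8))       # same information, in the aux-string layout
```

Under this reading, the reproducer exercises halo exchange in the T direction only, even though it runs on a single rank.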

A representative error message is:

QMP m0,[email protected] error: abort: 1
MG level 0 (GPU): ERROR: qudaEventQuery_ returned CUDA_ERROR_ILLEGAL_ADDRESS
 (dslash_policy.hpp:398 in operator()())
 (rank 0, host viking-prod-259.nvidia.com, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
MG level 0 (GPU):        last kernel called was (name=N4quda9StaggeredINS_12StaggeredArgIfLi3ELi4EL21QudaReconstructType_s18ELS2_18ELb1EL20QudaStaggeredPhase_s1EEEEE,volume=24x24x24x24,aux=policy_kernel=interior,GPU-offline,vol=331776,parity=2,precision=4,order=2,Ns=1,Nc=3,commDim=0001,xpay,n_rhs=1,comm=0001)

My cmake command was:

cmake -DQUDA_DIRAC_DEFAULT_OFF=ON \
      -DQUDA_DIRAC_STAGGERED=ON \
      -DCMAKE_BUILD_TYPE=DEVEL \
      -DQUDA_BACKWARDS=ON \
      -D CMAKE_INSTALL_PREFIX=/scratch/local/install \
      -DQUDA_PRECISION=12 \
      -DQUDA_RECONSTRUCT=4 \
      -DQUDA_MPI=ON \
      -DQUDA_FAST_COMPILE_DSLASH=ON \
      -DQUDA_FAST_COMPILE_REDUCE=ON \
      -DQUDA_GPU_ARCH=sm_80 \
      -DQUDA_MULTIGRID=ON \
      -DQUDA_MULTIGRID_NVEC_LIST="24" \
     [quda]

weinbe2 commented Nov 6, 2024

When running with export CUDA_LAUNCH_BLOCKING=1, I can see it's being triggered by the ghost packing kernel:

MG level 0 (GPU): ERROR: qudaLaunchKernel returned an illegal memory access was encountered
 (/home/scratch.eweinberg_sw/2024-08-29QudaMilc/quda/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel())
 (rank 0, host ipp1-1776.nvidia.com, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char*, const char*, const char*, const char*, bool)())
MG level 0 (GPU):        last kernel called was (name=N4quda4PackIfLi3ELb0EEE,volume=24x24x24x24,aux=policy_kernel,vol=331776,parity=2,precision=4,order=2,Ns=1,Nc=3,n_rhs=1,comm=0001,topo=1111,nFace=3,device-device,striped)
Stack trace (most recent call last):
#28   Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
#27   Object "./staggered_invert_test", at 0x55dae0b93034, in
#26   Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x153a75b4fe3f, in __libc_start_main
#25   Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x153a75b4fd8f, in
#24   Object "./staggered_invert_test", at 0x55dae0b92bf9, in
#23   Object "./staggered_invert_test", at 0x55dae0b9863f, in
#22   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7748233a, in invertQuda
#21   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a773278d9, in quda::solve(std::vector<void*, std::allocator<void*> > const&, std::vector<void*, std::allocator<void*> > const&, QudaInvertParam_s&, quda::GaugeField const&)
#20   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77324f18, in quda::solve(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField> const&, quda::Dirac&, quda::Dirac&, quda::Dirac&, quda::Dirac&, QudaInvertParam_s&)
#19   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77440931, in quda::GCR::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&)
#18   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a773b5b8c, in quda::MG::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&)
#17   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7741ce00, in quda::CAGCR::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&)
#16   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77329c1c, in quda::DiracM::operator()(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&) const
#15   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a774ea226, in quda::DiracImprovedStaggered::M(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&) const
#14   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772a72da, in quda::ApplyImprovedStaggered(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::GaugeField const&, quda::GaugeField const&, double, quda::vector_ref<quda::ColorSpinorField const> const&, int, bool, int const*, quda::TimeProfile&)
#13   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772b38db, in void quda::instantiate<quda::ImprovedStaggeredApply, quda::ReconstructStaggered, float, 3, quda::GaugeField const&, double&, int&, bool&, int const*&, quda::TimeProfile&>(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::GaugeField const&, quda::GaugeField const&, double&, int&, bool&, int const*&, quda::TimeProfile&)
#12   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772b3336, in quda::ImprovedStaggeredApply<float, 3, (QudaReconstructType_s)18>::ImprovedStaggeredApply(quda::vector_ref<quda::ColorSpinorField> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::GaugeField const&, quda::GaugeField const&, double, int, bool, int const*, quda::TimeProfile&)
#11   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772af731, in quda::dslash::DslashPolicyTune<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >::DslashPolicyTune(quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> >&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::ColorSpinorField const&, quda::TimeProfile&)
#10   Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772aee07, in quda::dslash::DslashPolicyTune<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >::apply(quda::qudaStream_t const&)
#9    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772be19c, in quda::dslash::DslashBasic<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >::operator()(quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> >&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::ColorSpinorField const&, quda::TimeProfile&)
#8    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772b66f0, in void quda::dslash::issuePack<quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > >(quda::ColorSpinorField const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::Staggered<quda::StaggeredArg<float, 3, 4, (QudaReconstructType_s)18, (QudaReconstructType_s)18, true, (QudaStaggeredPhase_s)1> > const&, int, quda::MemoryLocation, int, int)
#7    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a774bb3fe, in quda::ColorSpinorField::packGhost(int, QudaParity_s, int, quda::qudaStream_t const&, quda::MemoryLocation*, quda::MemoryLocation, bool, double, double, double, int, quda::vector_ref<quda::ColorSpinorField const> const&) const
#6    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772d7f4e, in quda::PackGhost(void**, quda::ColorSpinorField const&, quda::vector_ref<quda::ColorSpinorField const> const&, quda::MemoryLocation, int, bool, int, bool, double, double, double, int, quda::qudaStream_t const&)
#5    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772f6792, in quda::GhostPack<float, 3>::GhostPack(quda::ColorSpinorField const&, quda::vector_ref<quda::ColorSpinorField const> const&, void**, quda::MemoryLocation, int, bool, int, bool, double, double, double, int, quda::qudaStream_t const&)
#4    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a772f5a3a, in quda::Pack<float, 3, false>::apply(quda::qudaStream_t const&)
#3    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a77591756, in quda::qudaLaunchKernel(void const*, quda::TuneParam const&, quda::qudaStream_t const&, void const*)
#2    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7758ff71, in quda::target::cuda::set_runtime_error(cudaError, char const*, char const*, char const*, char const*, bool)
#1    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a774b061b, in errorQuda_(char const*, char const*, int, ...)
#0    Object "/scratch/local/build-mg/lib/libquda.so", at 0x153a7750c63d, in quda::comm_abort(int)
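
The Pack kernel line above reports nFace=3 with comm=0001, i.e. a three-site-deep ghost zone in the partitioned T direction. That depth is what an operator with long (three-hop) links needs, whereas an unimproved staggered operator needs only depth 1. As a rough illustration of how different the two packing footprints are at this volume, the following sketch counts the sites in one T face for each depth. It is plain arithmetic under those assumptions, not QUDA's actual buffer-sizing logic, and `ghost_sites` is a hypothetical helper.

```python
from math import prod


def ghost_sites(local_dims, dim, n_face):
    """Sites (both parities) in one ghost face of depth n_face orthogonal to dim."""
    face = prod(d for i, d in enumerate(local_dims) if i != dim)
    return n_face * face


if __name__ == "__main__":
    local = (24, 24, 24, 24)  # local volume from the reproducer
    T = 3                     # index of the partitioned direction

    print("local volume:", prod(local))             # matches vol=331776 in the log
    print("nFace=1 sites:", ghost_sites(local, T, 1))
    print("nFace=3 sites:", ghost_sites(local, T, 3))
```

If a ghost buffer or index range were ever sized for depth 1 while the pack ran at depth 3, the kernel would touch three times as many face sites as were provisioned. Whether that is what happens here is not established by the logs above; it is only one thing that may be worth checking.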
