
Feature/mrhs misc #1515

Open · wants to merge 27 commits into base: develop
Conversation

maddyscientist (Member) commented Nov 8, 2024

This PR is a bit of a catch-all:

  • Adds register tiling for the MRHS staggered dslash
    • The level of register tiling is controlled by the CMake parameter QUDA_MAX_MULTI_RHS_TILE, with the default left at size 1 for now.
    • This feature will be further developed in subsequent PRs
    • (Although not included in this PR, it is straightforward to extend this support to other stencils)
  • Fixes performance regressions of the MMA dslash when the memory pool is switched off
    • The FieldTmp class now supports creating temporaries from a set of parameters, as opposed to another field instance
    • We use this to create the temporary used for the reordered quark fields
  • Adds WAR for performance regressions with ROCm
    • This improves performance on ROCm 5.3 by 30% for the Laplace 3-d operator, though it's still off by integer factors
  • Various fixes for nvc++ compilation
  • Adds an alternative sentinel for heterogeneous reductions, for the case where the compiler optimizes away non-finite math (enabled with QUDA_HETEROGENEOUS_ATOMIC_INF_INIT=OFF). This is not a problem by default, but it is with the latest clang with -Ofast.
  • Fixes various compiler warnings with more recent compilers, e.g., gcc-15
  • Fixes a hang caused by process divergence when calling printGenericMatrix
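For context, register tiling over right-hand sides amortizes each gauge-link load across a small tile of vectors held in registers, rather than reloading the link once per RHS. The following is a minimal host-side sketch of that idea under stated assumptions; all names here (`apply_tiled`, the scalar stand-ins for links and vectors) are illustrative and are not QUDA's actual API:

```cpp
#include <array>
#include <vector>

// Hypothetical sketch of register tiling over right-hand sides (RHS):
// each loaded link value U[d] is applied to a tile of RHS inputs that
// live in a small fixed-size (register-resident) accumulator array,
// instead of re-reading U[d] once per RHS.
template <int tile>
std::vector<double> apply_tiled(const std::vector<double> &U, const std::vector<double> &in)
{
  const int n_src = static_cast<int>(in.size());
  std::vector<double> out(n_src, 0.0);
  for (int src0 = 0; src0 < n_src; src0 += tile) { // loop over RHS tiles
    for (std::size_t d = 0; d < U.size(); d++) {   // one pass over the stencil links
      std::array<double, tile> acc {};             // per-tile accumulators
      for (int s = 0; s < tile && src0 + s < n_src; s++) acc[s] = U[d] * in[src0 + s];
      for (int s = 0; s < tile && src0 + s < n_src; s++) out[src0 + s] += acc[s];
    }
  }
  return out;
}
```

In this toy model a larger `tile` trades register pressure for fewer link loads, which mirrors the trade-off a CMake-time maximum such as QUDA_MAX_MULTI_RHS_TILE would control.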

@maddyscientist maddyscientist requested review from a team as code owners November 20, 2024 19:01
include/kernels/laplace.cuh (review thread resolved)
lib/inv_mr_quda.cpp (review thread resolved)
@@ -104,13 +103,18 @@ namespace quda
if (doHalo<kernel_type>(d) && ghost) {
const int ghost_idx = ghostFaceIndexStaggered<1>(coord, arg.dim, d, 1);
const Link U = arg.improved ? arg.U(d, coord.x_cb, parity) : arg.U(d, coord.x_cb, parity, StaggeredPhase(coord, d, +1, arg));
Vector in = arg.halo.Ghost(d, 1, ghost_idx + src_idx * arg.nFace * arg.dc.ghostFaceCB[d], their_spinor_parity);
out = mv_add(U, in, out);
for (auto s = 0; s < n_src_tile; s++) {
Member commented:

Should #pragma unroll be added here? Although I would assume the compiler does this by itself already.

Member Author replied:

Probably a good idea. I'll add that on next push.
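The suggested change would annotate the per-source loop with an unroll directive so the compiler fully unrolls it when the trip count is a compile-time constant (CUDA honors #pragma unroll; host compilers may ignore or warn on it). A self-contained sketch, with `accumulate_tile` as a hypothetical stand-in for the kernel loop body:

```cpp
// Hypothetical sketch of the reviewer's suggestion: the per-RHS loop bound
// n_src_tile is a compile-time constant, so #pragma unroll lets the compiler
// fully unroll it and keep the per-source accumulators in registers.
template <int n_src_tile>
double accumulate_tile(const double (&in)[n_src_tile])
{
  double out = 0.0;
#pragma unroll
  for (int s = 0; s < n_src_tile; s++) out += in[s]; // unrolled when n_src_tile is known
  return out;
}
```

As the reviewer notes, the compiler may already unroll such a loop on its own; the pragma simply makes the intent explicit.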

3 participants