
Feature/mrhs misc #1515

Open · wants to merge 27 commits into base: develop
Conversation

maddyscientist (Member) commented Nov 8, 2024

This PR is a bit of a catch-all:

  • Adds register tiling for the MRHS staggered dslash
    • The level of register tiling is controlled by the CMake parameter QUDA_MAX_MULTI_RHS_TILE, with the default left at size 1 for now.
    • This feature will be further developed in subsequent PRs
    • (Although not included in this PR, it is straightforward to extend this support to other stencils)
  • Fixes performance regressions of the MMA dslash when the memory pool is switched off
    • The FieldTmp class now supports creating temporaries from a set of parameters, as opposed to another field instance
    • We use this to create the temporary used for the reordered quark fields
  • Adds WAR for performance regressions with ROCm
    • This improves performance on ROCm 5.3 by 30% for the Laplace 3-d operator, though it's still off by integer factors
  • Various fixes for nvc++ compilation
  • Adds an alternative sentinel for heterogeneous reductions, for the case where the compiler optimizes away non-finite math (enabled with QUDA_HETEROGENEOUS_ATOMIC_INF_INIT=OFF). This is not a problem by default, but it is with the latest clang with -Ofast.
  • Fixes various compiler warnings with more recent compilers, e.g., gcc-15
  • Fixes a hang caused by process divergence when calling printGenericMatrix
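For context, register tiling over right-hand sides amortizes each gauge-link load across a small tile of vectors held in registers, rather than reloading the link once per RHS. The following is a minimal host-side sketch of that idea under stated assumptions; all names here (`apply_tiled`, the scalar stand-ins for links and vectors) are illustrative and are not QUDA's actual API:

```cpp
#include <array>
#include <vector>

// Hypothetical sketch of register tiling over right-hand sides (RHS):
// each loaded link value U[d] is applied to a tile of RHS inputs that
// live in a small fixed-size (register-resident) accumulator array,
// instead of re-reading U[d] once per RHS.
template <int tile>
std::vector<double> apply_tiled(const std::vector<double> &U, const std::vector<double> &in)
{
  const int n_src = static_cast<int>(in.size());
  std::vector<double> out(n_src, 0.0);
  for (int src0 = 0; src0 < n_src; src0 += tile) { // loop over RHS tiles
    for (std::size_t d = 0; d < U.size(); d++) {   // one pass over the stencil links
      std::array<double, tile> acc {};             // per-tile accumulators
      for (int s = 0; s < tile && src0 + s < n_src; s++) acc[s] = U[d] * in[src0 + s];
      for (int s = 0; s < tile && src0 + s < n_src; s++) out[src0 + s] += acc[s];
    }
  }
  return out;
}
```

In this toy model a larger `tile` trades register pressure for fewer link loads, which mirrors the trade-off a CMake-time maximum such as QUDA_MAX_MULTI_RHS_TILE would control.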

@maddyscientist maddyscientist requested review from a team as code owners November 20, 2024 19:01
include/kernels/laplace.cuh (review thread resolved)
lib/inv_mr_quda.cpp (review thread resolved)
@@ -104,13 +103,18 @@ namespace quda
if (doHalo<kernel_type>(d) && ghost) {
const int ghost_idx = ghostFaceIndexStaggered<1>(coord, arg.dim, d, 1);
const Link U = arg.improved ? arg.U(d, coord.x_cb, parity) : arg.U(d, coord.x_cb, parity, StaggeredPhase(coord, d, +1, arg));
Vector in = arg.halo.Ghost(d, 1, ghost_idx + src_idx * arg.nFace * arg.dc.ghostFaceCB[d], their_spinor_parity);
out = mv_add(U, in, out);
for (auto s = 0; s < n_src_tile; s++) {
Member commented:

Should #pragma unroll be added here? Although I would assume the compiler does this by itself already.

Member Author replied:

Probably a good idea. I'll add that on next push.
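The suggested change would annotate the per-source loop with an unroll directive so the compiler fully unrolls it when the trip count is a compile-time constant (CUDA honors #pragma unroll; host compilers may ignore or warn on it). A self-contained sketch, with `accumulate_tile` as a hypothetical stand-in for the kernel loop body:

```cpp
// Hypothetical sketch of the reviewer's suggestion: the per-RHS loop bound
// n_src_tile is a compile-time constant, so #pragma unroll lets the compiler
// fully unroll it and keep the per-source accumulators in registers.
template <int n_src_tile>
double accumulate_tile(const double (&in)[n_src_tile])
{
  double out = 0.0;
#pragma unroll
  for (int s = 0; s < n_src_tile; s++) out += in[s]; // unrolled when n_src_tile is known
  return out;
}
```

As the reviewer notes, the compiler may already unroll such a loop on its own; the pragma simply makes the intent explicit.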

3 participants