Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make use of HYPRE for more linear algebra on GPU #193

Merged
merged 10 commits into from
Mar 4, 2024

Conversation

sebastiangrimberg
Copy link
Contributor

NOTE: Builds on #184

  • Use HYPRE's inner-product routine (wraps cuBLAS/equivalent) instead of MFEM's custom kernel
  • Use HYPRE's sequential hypre_CSRMatrix data structure instead of mfem::SparseMatrix to speed up CPU-based CSR assembly with OpenMP

@sebastiangrimberg sebastiangrimberg added enhancement New feature or request performance Related to performance labels Feb 9, 2024
@sebastiangrimberg sebastiangrimberg force-pushed the sjg/gpu-build-system-dev branch 3 times, most recently from bf79e54 to dd42790 Compare February 16, 2024 21:27
@sebastiangrimberg sebastiangrimberg force-pushed the sjg/hypre-interface-dev branch 2 times, most recently from 4798f29 to de2db6f Compare February 17, 2024 01:04
@sebastiangrimberg sebastiangrimberg force-pushed the sjg/hypre-interface-dev branch 2 times, most recently from 41491c7 to d70f023 Compare February 17, 2024 02:04
@sebastiangrimberg sebastiangrimberg marked this pull request as ready for review February 19, 2024 16:44
@sebastiangrimberg sebastiangrimberg force-pushed the sjg/hypre-interface-dev branch 2 times, most recently from 74e35d9 to 83f60dd Compare February 22, 2024 02:13
@sebastiangrimberg sebastiangrimberg force-pushed the sjg/hypre-interface-dev branch 3 times, most recently from 4962970 to 1ed1123 Compare February 29, 2024 01:00
Copy link
Collaborator

@hughcars hughcars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modulo the mac m1 and threading issue with SuperLU, which I agree should just be a bug report for SuperLU, lgtm.

Comment on lines +1028 to +1020
pr.UseDevice(false);
pi.UseDevice(false);
pr.Assemble();
pi.Assemble();
pr.UseDevice(true);
pi.UseDevice(true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I am understanding correctly, this is performing the assembly on host and then transferring the assembled operators to device?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag set by UseDevice is just an execution flag, rather than a memory location flag. So this is just a minor optimization so that operations like pr = 0.0 inside of LinearForm::Assemble happen on the host since we know assembly is going to happen on the host anyway. No need to move the Vector to the device only to pull it back after setting it equal to a value.

namespace
{

static HypreVector X, Y;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure you've considered this, but for posterity, using static like this is going to be thread safe because these are basically just pointers to hypre variables, and hypre is handling the threads itself yes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's expected that you would not call any of the HypreCSRMatrix routines which use these variables from multiple threads simultaneously. So I think the answer to your question is no this isn't thread-safe but it doesn't have to be based on the expected use case.

@@ -49,15 +49,9 @@ BilinearForm::PartialAssemble(const FiniteElementSpace &trial_fespace,
// This should work fine if some threads create an empty operator (no elements or boundary
// elements).
const std::size_t nt = ceed::internal::GetCeedObjects().size();
PalacePragmaOmp(parallel for schedule(static))
for (std::size_t i = 0; i < nt; i++)
PalacePragmaOmp(parallel if (nt > 1))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new optional omp looping, this is to remove some overhead of starting an omp loop if there's only 1 thread correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overhead yes but also in the case where nt = 1 but OMP_NUM_THREADS > 1 (running on GPU), we need to make sure this is serial.

Comment on lines 59 to 61
void ComplexVector::MakeRef(Vector &y, int offset, int size)
{
Set(py, n, on_dev);
MFEM_ASSERT(y.Size() <= 2 * size,
"Insufficient storage for ComplexVector alias reference of the given size!");
data.MakeRef(y, offset, 2 * size);
xr.MakeRef(data, offset, size);
xi.MakeRef(data, offset + size, size);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm getting this correctly, this is MakeRef is to treat a vector as a ComplexVector, stacked [xr;xi], starting at offset, with size 2*size. Then I think the assertion should be offset + 2 *size <= y.Size(), to check the accessed block isn't going off the end of y. As is, offset greater than y.Size(), wouldn't be triggered, and I think if size == 10 and y.size() == 1, this would also go off the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, this was actually fixed in ff7e163.

{
nnz++;
if (k < j - 1 && k > j + 1 && el_trans_j(k) != 0.0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't that be (k < j - 1 || k > j + 1) && el_trans_j(k) != 0.0? k < j - 1 && k > j + 1 should return false for any k, j.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good catch. This is an old bug in a debug check.

@sebastiangrimberg sebastiangrimberg force-pushed the sjg/hypre-interface-dev branch 2 times, most recently from 19f9618 to 36b3cfb Compare March 1, 2024 19:31
Copy link
Collaborator

@hughcars hughcars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved as part of testing on #204

@sebastiangrimberg sebastiangrimberg force-pushed the sjg/gpu-build-system-dev branch 2 times, most recently from 8132f54 to b802a98 Compare March 4, 2024 19:51
@sebastiangrimberg sebastiangrimberg force-pushed the sjg/hypre-interface-dev branch from 36b3cfb to 4932816 Compare March 4, 2024 19:51
Base automatically changed from sjg/gpu-build-system-dev to main March 4, 2024 21:01
@sebastiangrimberg sebastiangrimberg force-pushed the sjg/hypre-interface-dev branch from 4932816 to c2c5ae9 Compare March 4, 2024 21:04
@sebastiangrimberg sebastiangrimberg merged commit e9670fc into main Mar 4, 2024
17 checks passed
@sebastiangrimberg sebastiangrimberg deleted the sjg/hypre-interface-dev branch March 4, 2024 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Related to performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants