Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix BlockVector init bug #75

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

Heinrich-BR
Copy link
Collaborator

@Heinrich-BR Heinrich-BR commented Jan 6, 2025

This PR fixes a small bug which can crash runs unpredictably.

In ProblemOperatorInterface::Init, the BlockVector of a given variable has its size and offsets updated, with the relevant amount of memory allocated for the object. This is done by MFEM's Vector::SetSize which calls upon Memory<T>::New. The latter function allocates memory without initialising it, meaning any memory remains already existing in that space allocated for it will stay there. The BlockVector will then interpret those values as elements of its array. In principle this shouldn't be an issue because before the run begins properly these values will get initialised to their proper values or they will host the result of some calculation step. However, MFEM's NewtonSolver::Mult has a check that the BlockVector is finite before it gets initialised. Thus, if some memory remains in those spaces and any element gets interpreted by BlockVector as an inf or nan, this will fail MFEM's finiteness check and crash the run. Because the content of newly allocated memory spaces is effectively unpredictable, this crash may happen at any time. All of this can be prevented by simply initialising new BlockVectors with finite values. Here we choose zero.

Copy link

codecov bot commented Jan 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines Coverage Δ
src/problem_operators/problem_operator_interface.C 65.00% <100.00%> (+1.84%) ⬆️

@Heinrich-BR Heinrich-BR self-assigned this Jan 6, 2025
@Heinrich-BR Heinrich-BR added the bug Something isn't working label Jan 6, 2025
@k-collie
Copy link
Collaborator

k-collie commented Jan 8, 2025

However, MFEM's NewtonSolver::Mult has a check that the BlockVector is finite before it gets initialised.

Assuming leaving memory uninitialised is a choice to improve performance wouldn't this be a bug in MFEM?

@Heinrich-BR
Copy link
Collaborator Author

Assuming leaving memory uninitialised is a choice to improve performance wouldn't this be a bug in MFEM?

Potentially, yeah. Maybe they deemed that the circumstances under which this crash happens are rare enough that it doesn't compensate the performance cost of initialising vector elements. Indeed, this is the first time I've come across this issue and in my case it only happens in the AMD GPU build, while running on the host, for the specific case of the MOOSE variable transfer test, when we have more than one variable in the config file.

I'll check what happens with those memory spaces in other builds, but given no one seems to have raised this issue in the MFEM repo (as far as I know), I doubt they'd consider it a worthwhile bug to sacrifice performance for. Let's see what @alexanderianblair thinks about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants