
Lazy compilation fails for isolator driver #640

Open
anderson2981 opened this issue Apr 15, 2022 · 7 comments

@anderson2981 (Contributor)

Lazy compilation fails on a clean install of mirgecom (production) when using the isolator driver. The error message is linked below.

error message

The error can be reproduced by running smoke_test_lazy from isolator_driver with a clean build of mirgecom.

Here are links to the passing and failing CI runs:

first failing CI

CI fails

last passing CI

CI passes

Note that this CI run also fails, but the lazy smoke test passes; the other failure is under further investigation.

@majosm (Collaborator) commented Apr 15, 2022

This seems to be caused by inducer/pytato@acfd5d3; reverting that commit allowed the driver to step again for me.

@majosm (Collaborator) commented Apr 15, 2022

Some more information: whatever is triggering this seems to be specific to artificial viscosity (AV). I don't get the error when using just NS or diffusion (or both), but I do get it with AV enabled, even with the other operators turned off.

@MTCam (Member) commented Apr 15, 2022

No surprise: it fails on Lassen with the same error. I added a lazy test for the AV test case, doublemach-mpi-lazy.py, to mirgecom@production; it currently fails out of the box with the same error.

@inducer (Contributor) commented Apr 15, 2022

NotImplementedError: find_distributed_partition cannot partition DictOfNamedArrays
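For context, this is the shape of a partitioner corner case: a graph walker that handles the node types it was written for and raises NotImplementedError on one it was never taught about. The following is a hedged, purely illustrative Python sketch (the stand-in classes and `partition_node` function are hypothetical, not pytato's implementation):

```python
# Illustrative stand-ins for DAG node types; not pytato's actual classes.
class Array:
    """A plain array node the partitioner knows how to handle."""

class DictOfNamedArrays:
    """The node type named in the error above, unhandled here."""

def partition_node(node):
    # Handle the known case; anything else falls through to the error,
    # mirroring the message quoted from the traceback.
    if isinstance(node, Array):
        return "partitioned"
    raise NotImplementedError(
        f"find_distributed_partition cannot partition {type(node).__name__}")
```

The fix discussed below is to teach the partitioner about the missing case rather than to avoid the node type entirely (though both are pursued).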

I think that's a corner case that got left out of inducer/pytato#277 when @kaushikcfd put that together. (For context, that PR is what significantly decreased the number of graph parts needed in distributed execution, so reverting it would take us out of this frying pan and back into the previous fire.)

The error comes from

https://github.com/inducer/pytato/blob/aab7400935c8734ebb6fbe7d8abfde2bcf70c63e/pytato/distributed.py#L920-L922

by which point most of the interesting things in the partitioner have already happened. The problem is that there's a call_loopy in AV:

https://github.com/illinois-ceesd/mirgecom/pull/582/files#diff-3b6c52208008a2c6e430f17eb23d73feb8076ecb7f497baf40de319a51d39675R297

  • It shouldn't be hard to handle this additional case in pytato, including a test that the partitioner can handle call_loopy. @matthiasdiener, could you take a look?
  • In general, we've been moving away from using call_loopy nodes in the DAG, because they present an obstacle to the transform pipeline. We should probably rewrite the AV operator in terms of pure numpy. @MTCam, could you give this a go? (Maybe as a stacked PR on top of Av operator #582 to give me a chance to take a look?)

Note that, while either one of those fixes on its own would remove the failure, I am suggesting that we tackle both aspects.
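To illustrate the second suggestion, here is a hedged sketch of what a "pure numpy" rewrite looks like in general: an elementwise kernel that would otherwise live in a call_loopy node is expressed as array-level operations that the transform pipeline and partitioner understand natively. The function name and the masking operation are illustrative, not mirgecom's actual AV code:

```python
import numpy as np

# A hand-written loopy kernel of the form
#     result[i, j] = vec[i, j] if mode_mask[j] else 0
# can instead be written as a single broadcasted array expression.

def filter_modes_numpy(vec, mode_mask):
    # Broadcasting the per-mode mask (shape (nmodes,)) across the element
    # axis (shape (nelements, nmodes)) replaces the explicit loop nest.
    return np.where(mode_mask[np.newaxis, :], vec, 0.0)

vec = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
mask = np.array([True, False, True])
filtered = filter_modes_numpy(vec, mask)
```

The payoff is that the resulting DAG contains only ordinary array operations, so no special partitioner handling is needed.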

@MTCam (Member) commented Apr 15, 2022

  • In general, we've been moving away from using call_loopy nodes in the DAG, because they present an obstacle to the transform pipeline. We should probably rewrite the AV operator in terms of pure numpy. @MTCam, could you give this a go? (Maybe as a stacked PR on top of Av operator #582 to give me a chance to take a look?)

Is this issue resolved by the following change by @kaushikcfd ?

if actx.supports_nonscalar_broadcasting:
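That change gates the code path on an array-context capability flag. A hedged sketch of the pattern (the attribute name comes from the quote above; `FakeArrayContext`, `scale_rows`, and both branch bodies are illustrative, not mirgecom/arraycontext code):

```python
import numpy as np

class FakeArrayContext:
    """Minimal stand-in for an array context carrying the capability flag."""
    def __init__(self, supports_nonscalar_broadcasting):
        self.supports_nonscalar_broadcasting = supports_nonscalar_broadcasting

def scale_rows(actx, mat, row_scale):
    if actx.supports_nonscalar_broadcasting:
        # Fast path: one broadcasted multiply, expressible as plain array
        # operations in the DAG.
        return mat * row_scale[:, np.newaxis]
    else:
        # Fallback: an explicit per-row loop, the kind of computation that
        # might otherwise be pushed into a call_loopy kernel.
        out = np.empty_like(mat)
        for i in range(mat.shape[0]):
            out[i] = mat[i] * row_scale[i]
        return out
```

Both branches compute the same result; the flag only selects which formulation the backend can consume.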

@MTCam (Member) commented Apr 15, 2022

Is this issue resolved by the following change by @kaushikcfd ?

if actx.supports_nonscalar_broadcasting:

Just to add a data point, the failure is resolved by using the above change and:
--editable git+https://github.com/kaushikcfd/meshmode.git#egg=meshmode
--editable git+https://github.com/kaushikcfd/pytato.git#egg=pytato

Shouldn't we just switch production to the fusion array context and move forward?

@anderson2981 (Contributor, Author)

I'm not opposed to this if it passes all the tests of the current mirgecom suite and works with the new capabilities coming online, i.e. the wall model.
