Sample CI failure:
https://gitlab.tiker.net/inducer/meshmode/-/jobs/533461

Similar failure in grudge:
https://gitlab.tiker.net/inducer/grudge/-/jobs/533485

Sample traceback:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 208, in <module>
    main()
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 198, in main
    run_command_line(args)
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/test/test_partition.py", line 609, in <module>
    _test_mpi_boundary_swap(dim, order, num_groups)
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/test/test_partition.py", line 426, in _test_mpi_boundary_swap
    conns = bdry_setup_helper.complete_some()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/meshmode/distributed.py", line 332, in complete_some
    data = [self._internal_mpi_comm.recv(status=status)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mpi4py/MPI/Comm.pyx", line 1438, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 341, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv_match
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
Downgrading libfabric (see here) appears to resolve this.
This is the code in mpi4py that ultimately fails; it's a matched receive (mrecv).
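For context, a minimal sketch of that matched-probe receive path (generic mpi4py calls, not the meshmode code); run it with at least two ranks:

from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.rank == 0:
    comm.send({"payload": 42}, dest=1, tag=7)
elif comm.rank == 1:
    status = MPI.Status()
    # MPI_Mprobe: match the incoming message without receiving it yet
    msg = comm.mprobe(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
    # MPI_Mrecv via Message.recv(); this is the step that raises
    # MPI_ERR_OTHER in the traceback above
    data = msg.recv()
    print(data, status.Get_source(), status.Get_tag())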
@majosm Got any ideas? (Pinging you since the two of us last touched this code.)
Maybe this could be a workaround: we disable mpi4py's mprobe in mirgecom due to a similar crash (observed with Spectrum MPI, illinois-ceesd/mirgecom#132).
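A minimal sketch of that approach, assuming the same rc-flag mechanism mirgecom uses (the flag must be set before MPI is first imported):

import mpi4py
# Disable matched probes (MPI_Mprobe/MPI_Mrecv) for pickle-based recv();
# mpi4py then falls back to plain MPI_Probe/MPI_Recv. This must happen
# before the first "from mpi4py import MPI" anywhere in the process.
mpi4py.rc.recv_mprobe = False

from mpi4py import MPI  # noqa: E402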
Exciting news: while I don't know what exactly the issue is, OpenMPI 4.1.5-1 seems to include a fix that makes it work properly with the previously-offending version of libfabric1.
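To confirm which MPI build a runner actually picked up, a quick check (standard mpi4py calls):

from mpi4py import MPI

# Prints the implementation banner, e.g. "Open MPI v4.1.5, ...",
# which is enough to tell whether the fixed OpenMPI is in use.
print(MPI.Get_library_version())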