libfabric=1.17.0-3 on Debian causes MPI tests to fail with MPI_ERR_OTHER #370

Open · inducer opened this issue Mar 16, 2023 · 3 comments

@inducer (Owner) commented Mar 16, 2023

Sample CI failure:
https://gitlab.tiker.net/inducer/meshmode/-/jobs/533461

Similar failure in grudge:
https://gitlab.tiker.net/inducer/grudge/-/jobs/533485

Sample traceback:

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 208, in <module>
    main()
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 198, in main
    run_command_line(args)
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/.env/lib/python3.11/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/test/test_partition.py", line 609, in <module>
    _test_mpi_boundary_swap(dim, order, num_groups)
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/test/test_partition.py", line 426, in _test_mpi_boundary_swap
    conns = bdry_setup_helper.complete_some()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/gitlab-runner/builds/VFPjm48d/0/inducer/meshmode/meshmode/distributed.py", line 332, in complete_some
    data = [self._internal_mpi_comm.recv(status=status)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mpi4py/MPI/Comm.pyx", line 1438, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 341, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 303, in mpi4py.MPI.PyMPI_recv_match
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
```

Downgrading libfabric (see here) appears to resolve this.

This is the code in mpi4py that ultimately fails; it's a matched receive (mrecv).
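
(Not the meshmode code itself, just a minimal sketch of the path the traceback goes through: with recv_mprobe enabled, mpi4py's pickle-based recv() does a matched probe and then completes it with a matched receive, which is where the MPI_ERR_OTHER surfaces.)

```python
# Illustrative two-rank sketch of mpi4py's mprobe/mrecv receive path.
# Run with: mpiexec -n 2 python mrecv_sketch.py  (hypothetical file name)
from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.Get_rank() == 0:
    comm.send({"payload": 42}, dest=1, tag=7)
else:
    status = MPI.Status()
    # Matched probe: returns a Message handle bound to the incoming message,
    # so no other receive (e.g. from another thread) can match it.
    msg = comm.mprobe(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
    # Matched receive (MPI_Mrecv underneath) -- this is the step that raises
    # MPI_ERR_OTHER with libfabric 1.17.0-3 on Debian.
    data = msg.recv()
    print(f"rank 1 received {data} from rank {status.Get_source()}")
```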

@majosm Got any ideas? (Pinging you since the two of us last touched this code.)

@matthiasdiener (Collaborator) commented Mar 16, 2023

Maybe this could be a workaround: we disable mpi4py's mprobe in mirgecom due to a similar crash (observed with Spectrum MPI, illinois-ceesd/mirgecom#132):

https://github.com/illinois-ceesd/mirgecom/blob/babc6d2b9859719a3ba4a45dc91a6915583f175d/mirgecom/mpi.py#L183-L186
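
For reference, a minimal sketch of that workaround: the flag is mpi4py's mpi4py.rc.recv_mprobe option, and it has to be set before MPI is imported for the first time.

```python
# Sketch of disabling mpi4py's matched-probe receive path (mirgecom-style).
# Must run before the first "from mpi4py import MPI".
import mpi4py
mpi4py.rc.recv_mprobe = False  # fall back to ordinary probe/receive

from mpi4py import MPI  # noqa: E402

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()}: recv_mprobe disabled")
```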

@inducer (Owner, Author) commented Mar 16, 2023

Thanks for the tip! It seems, though, that setting `recv_mprobe = False` does not avoid this particular issue.

@inducer (Owner, Author) commented May 28, 2023

Exciting news: while I don't know exactly what the issue is, OpenMPI 4.1.5-1 seems to include a fix that makes it work properly with the previously offending version of libfabric1.
