
HDF5: infinite loop error on Setonix (using singularity/3.8.6-mpi) #668

Open
gduclaux opened this issue Jul 27, 2023 · 5 comments

gduclaux commented Jul 27, 2023

Hello guys,

I've installed the latest UW2 container on Setonix (Pawsey Centre) using Singularity and it went quite smoothly 👍

There are two versions of Singularity available on Setonix: 1) singularity/3.8.6-nompi and 2) singularity/3.8.6-mpi

I first ran a test job in serial using the singularity/3.8.6-nompi module and all went well.

But when I try to run the same test job in parallel using the singularity/3.8.6-mpi module, I get an error (related to HDF5 AFAICT) when the code tries to write the step 0 outputs, whether I run on one rank or on multiple ranks.
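
For context, the parallel run is launched with a Slurm script along these lines (a sketch only; the image name, model script, and resource numbers are placeholders, not the exact job):

```bash
#!/bin/bash -l
#SBATCH --job-name=uw2-test
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
# --account and --partition omitted here; set them per your Pawsey project

module load singularity/3.8.6-mpi

# underworld2.sif and model.py are placeholders for the actual image and model script
srun -n ${SLURM_NTASKS} singularity exec underworld2.sif python3 model.py
```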

Below is the stdout returned when running the singularity/3.8.6-mpi version on a single core:

loaded rc file /opt/venv/lib/python3.10/site-packages/underworld/UWGeodynamics/uwgeo-data/uwgeodynamicsrc
	Global element size: 256x256
	Local offset of rank 0: 0x0
	Local range of rank 0: 256x256
In func WeightsCalculator_CalculateAll(): for swarm "UTTBHS5P__swarm"
	done 33% (21846 cells)...
	done 67% (43691 cells)...
	done 100% (65536 cells)...
WeightsCalculator_CalculateAll(): finished update of weights for swarm "UTTBHS5P__swarm"
/opt/venv/lib/python3.10/site-packages/underworld/UWGeodynamics/_model.py:1582: UserWarning: Skipping the steady state calculation: No diffusivity variable defined on Model
  warnings.warn("Skipping the steady state calculation: No diffusivity variable defined on Model")
Assertion failed in file ../../../../src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 520: liblustreapi != NULL
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPL_backtrace_show+0x26) [0x14dcd6441c4b]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x1ff3684) [0x14dcd5df3684]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x2672775) [0x14dcd6472775]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x26ae1c1) [0x14dcd64ae1c1]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPI_File_open+0x205) [0x14dcd6453625]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(+0x30c6bd) [0x14dcce6646bd]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(H5FD_open+0x13c) [0x14dcce457f1c]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(H5F_open+0x494) [0x14dcce449b94]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(H5VL__native_file_create+0x1a) [0x14dcce62e2ba]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(H5VL_file_create+0xcd) [0x14dcce6192cd]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(H5Fcreate+0x12c) [0x14dcce43d5bc]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(+0x66e02) [0x14dcce3bee02]
/opt/venv/lib/python3.10/site-packages/h5py/h5f.cpython-310-x86_64-linux-gnu.so(+0x4c7bf) [0x14dcccf377bf]
/opt/venv/bin/python3(+0x15c8de) [0x5628ca12a8de]
/opt/venv/lib/python3.10/site-packages/h5py/_objects.cpython-310-x86_64-linux-gnu.so(+0xc13b) [0x14dcdb6ce13b]
/opt/venv/bin/python3(_PyObject_MakeTpCall+0x25b) [0x5628ca1213bb]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x73b3) [0x5628ca11a583]
/opt/venv/bin/python3(_PyFunction_Vectorcall+0x7c) [0x5628ca12b12c]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x1a31) [0x5628ca114c01]
/opt/venv/bin/python3(_PyFunction_Vectorcall+0x7c) [0x5628ca12b12c]
/opt/venv/bin/python3(_PyObject_FastCallDictTstate+0x16d) [0x5628ca1205fd]
/opt/venv/bin/python3(+0x166d74) [0x5628ca134d74]
/opt/venv/bin/python3(+0x15376b) [0x5628ca12176b]
/opt/venv/bin/python3(PyObject_Call+0xbb) [0x5628ca13975b]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x2955) [0x5628ca115b25]
/opt/venv/bin/python3(+0x16ad71) [0x5628ca138d71]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x26c5) [0x5628ca115895]
/opt/venv/bin/python3(+0x16ab11) [0x5628ca138b11]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x1a31) [0x5628ca114c01]
/opt/venv/bin/python3(_PyFunction_Vectorcall+0x7c) [0x5628ca12b12c]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x816) [0x5628ca1139e6]
/opt/venv/bin/python3(_PyFunction_Vectorcall+0x7c) [0x5628ca12b12c]
MPICH ERROR [Rank 0] [job id 3488434.0] [Thu Jul 27 06:23:51 2023] [nid002309] - Abort(1): Internal error

HDF5: infinite loop closing library
      L,T_top,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL
srun: error: nid002309: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=3488434.0

I suspect this is a Singularity problem and not a UW2 problem... are you familiar with this type of error?
I can report it to the Pawsey Centre helpdesk if you confirm this is a Singularity problem.
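
One way to narrow this down (a sketch, assuming mpi4py and parallel h5py are importable inside the container, as the traceback suggests) would be to exercise the same MPI-IO open path directly, without UW2:

```bash
module load singularity/3.8.6-mpi
# underworld2.sif is a placeholder for the actual image
srun -n 2 singularity exec underworld2.sif python3 -c "
from mpi4py import MPI
import h5py
# open a file with the MPI-IO driver: the same call path h5py follows in the traceback above
with h5py.File('mpiio_test.h5', 'w', driver='mpio', comm=MPI.COMM_WORLD) as f:
    f.create_dataset('x', (4,), dtype='i')
print('MPI-IO file create OK on rank', MPI.COMM_WORLD.rank)
"
```

If that fails with the same liblustreapi assertion, the problem sits in the container/Cray MPI-IO stack rather than in UW2 itself.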

Cheers

Guillaume

gduclaux (Author) commented:

After digging further into the Pawsey documentation I found this: https://pawsey.org.au/technical-newsletter/ (see the 13 March 2023 entry):

Parallel IO within Containers
Currently there are issues running MPI-enabled software that makes use of parallel IO from within a container being run by the Singularity container engine. The error message seen will be similar to:

Example of error message

Assertion failed in file ../../../../src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 520: liblustreapi != NULL
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPL_backtrace_show+0x26) [0x14ac6c37cc4b]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x1ff3684) [0x14ac6bd2e684]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x2672775) [0x14ac6c3ad775]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x26ae1c1) [0x14ac6c3e91c1]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPI_File_open+0x205) [0x14ac6c38e625]

Currently it is unclear exactly what is causing this issue. Investigations are ongoing.

Workaround:

There is no workaround that does not require a change in the workflow. Either the container needs to be rebuilt to not make use of parallel IO libraries (e.g. the container was built using parallel HDF5) or if that is not possible, the software stack must be built “bare-metal” on Setonix itself (see How to Install Software).
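
As a quick check of the first option, whether the h5py inside a container image was built against parallel HDF5 can be queried directly (a one-liner sketch; the image name is a placeholder):

```bash
# prints True if h5py was built with MPI (parallel HDF5) support, False otherwise
singularity exec underworld2.sif python3 -c "import h5py; print(h5py.get_config().mpi)"
```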

I guess I'm about to install UW2 from source on Setonix... Would you have any step-by-step recipe at hand for this specific Cray machine? I found the one you put together for Magnus a few years back.

julesghub (Member) commented:

Hey Gilly,

Yeah, this is an ongoing issue we have raised with Setonix on several occasions. For now we are stuck with bare-metal builds on Setonix. I will upload some instructions for it later today.
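
In the meantime, the rough shape of a bare-metal build is sketched below; the module names and the pip-based UW2 build step are assumptions to verify against `module spider` on Setonix, not the final recipe:

```bash
# load a GNU compiler environment with parallel HDF5 and Python
# (exact module names/versions on Setonix may differ)
module load PrgEnv-gnu
module load cray-hdf5-parallel
module load cray-python

# isolated virtual environment for the build
python3 -m venv ~/uw2-venv && source ~/uw2-venv/bin/activate

# build h5py against the Cray parallel HDF5 using the documented h5py build variables
CC=cc HDF5_MPI=ON pip install --no-binary=h5py h5py

# build UW2 from source (assumes a recent release that supports a pip-based build)
git clone https://github.com/underworldcode/underworld2.git
cd underworld2 && CC=cc pip install .
```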

julesghub self-assigned this Aug 3, 2023
julesghub (Member) commented:

Hey Gilly,
To update you on this.
Setonix's permission setup means I can't install things for a project I'm not a user in, so I'm trying to put together bare-metal instructions for you that make things as smooth as possible from your end.
I'm testing some instructions I put together this afternoon, and if things work out I'll send them through later.

gduclaux (Author) commented:

Hi Jules,

I have been off-grid for the past couple of weeks and am back in the office now. If you have a recipe at hand for the install, I would love to give it a go!
Cheers
Gilly


julesghub commented Sep 5, 2023

Hi Gilly,

Pawsey have posted an update on container changes: https://support.pawsey.org.au/documentation/display/US/Containers+changes
I'm going to rebuild the Docker image and try Singularity again on Setonix. I'll keep you posted.
cheers,
J
