Reading backend while it is being written sometimes throws an error #389
Hi, I am running into the same error when using MPIPool. I am assuming this is because each worker is trying to access the same file and the workers are conflicting with each other? This error does NOT appear when I simply call
I see that the original issue does not use any parallelization, so I'm not sure if I'm helping or should create a new issue. My Python code is simply:
EDIT:
@Thalos12: Thanks for the detailed code! I'm not sure I have much to suggest here because this isn't really a supported use case for this backend, and it looks like a deeper h5py issue rather than something specific to the emcee implementation, but I could be wrong. I'm happy to leave this open if someone wants to try to build support for this workflow.

@axiezai: I think that your issue is not related. Instead, it looks like you've forgotten to include:

    if not pool.is_master():
        pool.wait()
        sys.exit(0)

which is required for use of the MPIPool.
Hi @dfm, I understand that it is not a supported use case, but it would be useful to me because I have long-running chains and I would like to check occasionally how they are performing. After reading a bit about how HDF5 works, I found that it has a Single Writer Multiple Reader (SWMR) mode (https://docs.h5py.org/en/stable/swmr.html) that might be the solution I am looking for.
If you are not against it, I would like to try to add support for this; I think it might be helpful.
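To make the idea concrete, here is a rough sketch (not from the original thread) of the h5py SWMR pattern, assuming a hypothetical file chain.h5 with a single 1-D dataset named "chain":

```python
import h5py
import numpy as np

# --- writer side (hypothetical) ---
# SWMR requires the latest file format; all objects must be created before
# switching swmr_mode on, and readers must open the file only after that.
f = h5py.File("chain.h5", "w", libver="latest")
dset = f.create_dataset("chain", shape=(0,), maxshape=(None,), dtype="f8")
f.swmr_mode = True
for step in range(1000):
    dset.resize((step + 1,))
    dset[step] = np.random.randn()
    dset.flush()  # make the newly written data visible to SWMR readers
f.close()

# --- reader side (hypothetical) ---
f = h5py.File("chain.h5", "r", libver="latest", swmr=True)
dset = f["chain"]
dset.refresh()  # pick up whatever the writer has flushed so far
print("samples available:", dset.shape[0])
f.close()
```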
@dfm thank you for pointing this out, I totally missed it... I edited my code accordingly, and it turns out it was just not waiting for the master process to finish. I also have to define the backends and a few other things inside the pool block.
Just documenting this in case other new users run into the same problem with MPIPool.
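For reference (not part of the original comment), a minimal sketch of the schwimmbad MPIPool pattern being described, with a hypothetical log-probability and backend filename:

```python
import sys

import numpy as np
import emcee
from schwimmbad import MPIPool


def log_prob(theta):
    # hypothetical log-probability: a simple Gaussian
    return -0.5 * np.sum(theta ** 2)


with MPIPool() as pool:
    # Worker processes wait here for tasks from the master and exit afterwards;
    # without this guard, every MPI process would try to run the sampler and
    # open the HDF5 backend itself.
    if not pool.is_master():
        pool.wait()
        sys.exit(0)

    # Define the backend and sampler only on the master process.
    ndim, nwalkers, nsteps = 5, 32, 1000
    backend = emcee.backends.HDFBackend("chain.h5")  # hypothetical filename
    backend.reset(nwalkers, ndim)

    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, pool=pool, backend=backend)
    p0 = np.random.randn(nwalkers, ndim)
    sampler.run_mcmc(p0, nsteps, progress=True)
```

This would be launched with something like `mpiexec -n 4 python script.py`.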
@Thalos12: great! I'd be happy to review such a PR!
Hi @dfm, I played a bit with the SWMR mode and below is what I found. To use SWMR the following has to happen:
What's important is that the reader must open the file after the writer. Nonetheless, I might have found another solution in this thread: in short, the HDF5_USE_FILE_LOCKING environment variable can be set to FALSE to disable HDF5 file locking on the reader side. Unfortunately, while the reader has HDF5 locking disabled it can still crash sometimes, but the writer survives, and this, in my opinion, is a reasonable trade-off. I could make a PR in a few days if you are still interested in this feature.
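A rough sketch (not from the original comment) of what the reader-side workaround might look like, assuming the standard HDF5_USE_FILE_LOCKING environment variable and a hypothetical chain.h5 file written by emcee's HDFBackend:

```python
import os

# File locking is configured when the HDF5 library is initialized, so the
# variable has to be set before h5py is imported (here indirectly via emcee).
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import emcee  # noqa: E402

# Open the backend read-only and inspect the chain while the writer is running.
reader = emcee.backends.HDFBackend("chain.h5", read_only=True)  # hypothetical filename
print("iterations stored so far:", reader.iteration)
print("chain shape:", reader.get_chain().shape)
```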
@Thalos12: Thanks for looking into this in such detail! Yes - I would be very happy to have this implemented. Can you also include a tutorial in the PR? I'm happy to fill out the details and help with formatting if you can at least get the example implemented. Thanks again! |
Thank you! I will add a Jupyter notebook with the example, and I will gladly take you up on your help with formatting it properly. I should be able to submit it in the next few days.
General information:
Problem description:
Expected behavior:
The backend (HDF5 file) can be read with no errors while the chain is running and the backend is being written.
Actual behavior:
The process writing to the backend sometimes raises an error when another process is trying to read the HDF5 file.
The error, copied from the shell, is this one:
What have you tried so far?:
I tried setting read_only=True when instantiating the HDFBackend in the script that tries to read the backend, but the problem was not solved.
Minimal example:
Run a chain using writer.py and read multiple times with reader.py. After a few tries the error should appear.
writer.py
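A rough sketch of what writer.py and reader.py might look like for this setup (not the author's actual scripts), assuming a simple Gaussian log-probability and the hypothetical filename chain.h5:

```python
# writer.py (sketch)
import numpy as np
import emcee


def log_prob(theta):
    # hypothetical target: a standard Gaussian
    return -0.5 * np.sum(theta ** 2)


ndim, nwalkers, nsteps = 5, 32, 10000
backend = emcee.backends.HDFBackend("chain.h5")
backend.reset(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, backend=backend)
p0 = np.random.randn(nwalkers, ndim)
sampler.run_mcmc(p0, nsteps, progress=True)
```

```python
# reader.py (sketch) -- run repeatedly while writer.py is still sampling
import emcee

reader = emcee.backends.HDFBackend("chain.h5", read_only=True)
print("iterations stored:", reader.iteration)
print("chain shape:", reader.get_chain().shape)
```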
Edit for the sake of completeness: while the example above does not use multiprocessing, in my actual code I do use it. I see the error both with and without multiprocessing.