Filename error (Docker) #41

Closed
ncbss opened this issue Oct 11, 2024 · 11 comments · Fixed by #47
Labels: bug, docker

Comments

@ncbss

ncbss commented Oct 11, 2024

Hi there,
Thanks for all your work on developing DLMUSE! I am trying to use it for some analysis in the lab and, during initial testing of the Docker container, an error occurred. Please see below:

Here's the command I used on Docker (version: 4.34.2 (167172)) on my MacBook Pro (Apple M1 Pro, Sequoia 15.0)

docker run -it --name DLMUSE_inference --rm \
    --mount type=bind,source=/Users/narlonsilva/Desktop/test-nichart/input,target=/input,readonly \
    --mount type=bind,source=/Users/narlonsilva/Desktop/test-nichart/output,target=/output \
    --platform linux/amd64 cbica/nichart_dlmuse:1.0.1-cuda11.8 \
    -d cpu

Here's the error:

Arguments:
Namespace(in_data='/input', out_dir='/output', device='cpu', clear_cache=False, help=False)

Detected 1 images ...
Number of valid images is 1 ...
------------------------
   Reorient images
Out file exists, skip reorientation ...
------------------------
   Apply DLICV
Running DLICV
Renaming dic is saved to /output/temp_working_dir/s2_dlicv/renamed_image/renaming.json
Loading the model...
perform_everything_on_device=True is only supported for cuda devices! Setting this to False
There are 1 cases in the source folder
I am process 0 out of 1 (max process ID is 0, we start counting with 0!)
There are 1 cases that I would like to predict

Predicting case_ 000:
perform_everything_on_device: False
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27/27 [41:57<00:00, 93.22s/it]
sending off prediction to background worker for resampling and export
done with case_ 000
Bus error
Rename dlicv out file
------------------------
   Apply DLICV mask
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/nibabel/loadsave.py", line 100, in load
    stat_result = os.stat(filename)
                  ^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/output/temp_working_dir/s2_dlicv/mni_t1w_DLICV.nii.gz'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/NiChart_DLMUSE", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/NiChart_DLMUSE/__main__.py", line 113, in main
    run_pipeline(in_data, out_dir, device)
  File "/opt/conda/lib/python3.11/site-packages/NiChart_DLMUSE/dlmuse_pipeline.py", line 97, in run_pipeline
    apply_mask_img(df_img, in_dir, in_suff, mask_dir, mask_suff, out_dir, out_suff)
  File "/opt/conda/lib/python3.11/site-packages/NiChart_DLMUSE/MaskImage.py", line 159, in apply_mask_img
    mask_img(in_img, in_mask, out_img)
  File "/opt/conda/lib/python3.11/site-packages/NiChart_DLMUSE/MaskImage.py", line 77, in mask_img
    nii_mask = nib.load(mask_img)
               ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nibabel/loadsave.py", line 102, in load
    raise FileNotFoundError(f"No such file or no access: '{filename}'")
FileNotFoundError: No such file or no access: '/output/temp_working_dir/s2_dlicv/mni_t1w_DLICV.nii.gz'

Thanks for your help!

@AlexanderGetka-cbica
Contributor

Hi @ncbss , thanks for your interest in DLMUSE!

Admittedly, the images currently up on Docker Hub are not in a thoroughly tested state yet, and don't necessarily reflect the latest version. Nevertheless, we haven't seen this bus error before -- at a glance everything else is a downstream error of that.

We will do some research on this issue and get back to you, probably with an updated image for you to pull.

Tagging @spirosmaggioros , our local Mac expert. Could you try to replicate this using Docker on your Mac?

@spirosmaggioros
Member

spirosmaggioros commented Oct 11, 2024

Hi @ncbss, thank you for pointing out this issue. We have the same chip, so I'm fairly sure I know what is going on. nnUNetv2 uses 3D operations, and the M1 chip is not very good at performing them. I also believe that M1 Macs do not emulate the x86 environment very well. I don't work on the Docker versions, so I had never tested this, but now that I have, I see the same problem. What is happening is that the workers are failing in the background (notice the Bus error), so the output files aren't present in the next step.

Also, the Docker image has not been updated with the improvements from my last PRs; inference is now parallelized, which might help with this. Give us, and specifically @AlexanderGetka-cbica, a moment to figure out whether we can fix that.
Thanks once again for your feedback! Hope we find a solution soon.
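
For readers hitting the same traceback: the FileNotFoundError is a symptom of the earlier worker crash, not the cause. A minimal sketch of a guard that makes this explicit is below. The names `require_output` and `MissingStepOutput` are hypothetical and are not part of NiChart_DLMUSE; this only illustrates checking a step's output before the next step loads it.

```python
from pathlib import Path


class MissingStepOutput(RuntimeError):
    """Raised when a pipeline step did not produce the file the next step needs."""


def require_output(path, step_name):
    """Return `path` as a Path if the file exists, otherwise raise a clear error.

    Hypothetical helper: it only checks for the file and names the step
    that was supposed to write it.
    """
    p = Path(path)
    if not p.is_file():
        raise MissingStepOutput(
            f"{step_name} did not produce {p}; check the log above for a "
            "worker crash (for example 'Bus error') before this point."
        )
    return p
```

Called right after the DLICV step with the expected mask path, this would stop the run with a message pointing at the crashed step rather than at the mask-loading code.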

@AlexanderGetka-cbica
Contributor

Hi @ncbss , I just pushed a version to Docker Hub under the following tag:
cbica/nichart:1.0.4-default

Can you give this a try and see if it works for you? Of note, there are multiple options that might be useful. You can append them after the -d cpu part (by the way, since you are on Mac, maybe you can try -d mps for improved performance.)

-c sets the number of cores used for parallelization of the whole pipeline (default: 4). Setting this higher might result in faster inference, setting this to 1 will minimize the amount of resources consumed.

--dlmuse_args "-nps 1 -npp 1" --dlicv_args "-nps 1 -npp 1" will cause all the various resampling/export steps to use only one worker thread, minimizing resource consumption and the risk of out-of-memory or similar errors, which we have previously observed to cause failures in this step. It will be slightly slower though.

To get inference working at a bare minimum, I suggest setting all of the above values to 1 just to see, then gradually increasing them until you find what works best for your system. If you could report back on this, that would be very helpful for us, too.
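
As a rough sketch of the "everything set to 1" starting point described above, the snippet below assembles the `docker run` argument list in Python. The function name `build_dlmuse_command` and the placeholder paths are assumptions for illustration; the flags (`-d`, `-c`, `--dlmuse_args`, `--dlicv_args`) are the ones named in this comment, and the mounts mirror the original report. The input mount is left writable because, as reported later in this thread, this image writes split directories under /input.

```python
import shlex


def build_dlmuse_command(image, input_dir, output_dir, device="cpu", cores=1, workers=1):
    """Build a `docker run` argument list with conservative resource settings.

    Hypothetical helper: `cores` maps to -c and `workers` to the
    -nps / -npp values passed through --dlmuse_args and --dlicv_args.
    """
    worker_args = f"-nps {workers} -npp {workers}"
    return [
        "docker", "run", "-it", "--rm",
        "--mount", f"type=bind,source={input_dir},target=/input",
        "--mount", f"type=bind,source={output_dir},target=/output",
        "--platform", "linux/amd64",
        image,
        # Everything after the image name is passed to the pipeline itself.
        "-d", device,
        "-c", str(cores),
        "--dlmuse_args", worker_args,
        "--dlicv_args", worker_args,
    ]


if __name__ == "__main__":
    # Placeholder paths; replace with your own input and output folders.
    cmd = build_dlmuse_command(
        "cbica/nichart:1.0.4-default", "/path/to/input", "/path/to/output"
    )
    print(shlex.join(cmd))
```

Raising `cores` or `workers` one step at a time and re-running gives the gradual tuning suggested above.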

@spirosmaggioros
Member

spirosmaggioros commented Oct 11, 2024

Just to save you time following Alex's response: only the new M3 chip supports 3D convolution (which nnUNetv2 performs), so you can't use MPS to run NiChart DLMUSE. Unfortunately, only CUDA offers a faster option. You can take a look at nnUNet's documentation.

I use a VM with A100 GPUs to run it.

@spirosmaggioros
Member

Is this resolved?

@AlexanderGetka-cbica
Contributor

Hi @ncbss, I just found another potential solution since I encountered this in my own environment. Try passing --ipc=host to the docker run command (to be clear, this should go before the image name).
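
Some background that may explain why this helps, offered as an assumption rather than a confirmed diagnosis for this issue: Docker containers get a 64 MB /dev/shm by default, and multi-process PyTorch workloads that exchange data through shared memory can crash with a bus error when it fills up. `--ipc=host` (or a larger `--shm-size`) lifts that limit. The sketch below, run inside the container, reports how much shared memory is available; the 1024 MiB threshold is an arbitrary illustrative value.

```python
import shutil


def shared_memory_mib(path="/dev/shm"):
    """Return the total size, in MiB, of the filesystem mounted at `path`."""
    return shutil.disk_usage(path).total / (1024 * 1024)


def shm_looks_small(total_mib, threshold_mib=1024):
    """Return True if the shared-memory size is below an (assumed) comfortable threshold."""
    return total_mib < threshold_mib


if __name__ == "__main__":
    total = shared_memory_mib()
    print(f"/dev/shm total: {total:.0f} MiB")
    if shm_looks_small(total):
        print("Shared memory is small; consider --ipc=host or a larger --shm-size.")
```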

@ncbss
Author

ncbss commented Oct 28, 2024

Thank you all for your help! I am testing this out tonight and will report back ASAP.

@ncbss
Author

ncbss commented Oct 28, 2024

Hi again!

So, testing the new container cbica/nichart_dlmuse:1.0.4-default, when I run the command below:

docker run -it --name DLMUSE_inference --rm \
    --mount type=bind,source=/Users/narlonsilva/Desktop/test-nichart/input/,target=/input,readonly \
    --mount type=bind,source=/Users/narlonsilva/Desktop/test-nichart/output,target=/output \
    --platform linux/amd64 \
    --ipc=host cbica/nichart_dlmuse:1.0.4-default \
    -d cpu

I get this error:

Arguments:
Namespace(in_data='/input', out_dir='/output', device='cpu', cores=4, clear_cache=False, dlmuse_args='', dlicv_args='', help=False)

mkdir: cannot create directory ‘/input/split_1’: Read-only file system
cp: cannot create regular file '/input/split_1': Read-only file system
rm: cannot remove '/output/split_*': No such file or directory
rm: cannot remove '/input/split_*': No such file or directory

If I remove the readonly flag, I get this error:

Arguments:
Namespace(in_data='/input', out_dir='/output', device='cpu', cores=4, clear_cache=False, dlmuse_args='', dlicv_args='', help=False)

rm: cannot remove '/output/split_*': No such file or directory

This is what my working directory looks like prior to running the code above:

input      output     runmuse.sh

./input:
T1w.nii.gz

./output:

@AlexanderGetka-cbica
Contributor

Thanks for the detailed reporting!

@spirosmaggioros it seems this is relevant to your parallelization code, but in retrospect we should probably avoid writing anything to the input dir. Let's discuss the approach tomorrow.

@ncbss You could retry the previous version, this time with the --ipc=host fix mentioned above. I should have noted earlier that this does alter the security profile of Docker, so just be aware of that (some discussion on this here: https://stackoverflow.com/questions/38907708/docker-ipc-host-and-security ). We'll be in touch about an updated container, thanks for your patience.
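
On the point about not writing to the input dir: one possible approach, sketched here under assumptions and not taken from the actual fix, is to stage the per-worker splits under the output working directory, so that a read-only /input mount keeps working. The function names `stage_splits` and `remove_splits` are hypothetical; the `split_<n>` directory naming follows the error log above.

```python
import shutil
from pathlib import Path


def stage_splits(chunks, work_dir):
    """Copy each chunk of input files into work_dir/split_<n>.

    `chunks` is a list of lists of file paths. Nothing is written next to
    the original files, so the input directory can stay read-only.
    Returns the list of created split directories.
    """
    work_dir = Path(work_dir)
    split_dirs = []
    for n, chunk in enumerate(chunks, start=1):
        split_dir = work_dir / f"split_{n}"
        split_dir.mkdir(parents=True, exist_ok=True)
        for src in chunk:
            shutil.copy2(src, split_dir / Path(src).name)
        split_dirs.append(split_dir)
    return split_dirs


def remove_splits(split_dirs):
    """Remove staged split directories; directories that are already gone are ignored."""
    for split_dir in split_dirs:
        shutil.rmtree(split_dir, ignore_errors=True)
```

Cleaning up from the returned list, instead of a `split_*` glob, also avoids the "cannot remove" messages when no split directory was created.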

@spirosmaggioros
Member

spirosmaggioros commented Oct 28, 2024

@ncbss The issue here is that you have only one file, while the default number of cores used for data splitting is 4. To process a single file, you have to set "--cores 1", since parallelization can't use more than one core for one file. It was my mistake not to take this into consideration; I didn't expect single NIfTI files. I will update this for the newer PyPI version. Thanks for noticing.

Quick update: I fixed the issue, and the fix will be merged soon. Until then, if you have fewer than 4 files, please set --cores to 1.
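
For anyone curious how this kind of failure can be avoided in general, here is a minimal sketch, assuming the splitter simply divides the file list into one chunk per core. It is not the actual fix that was merged; `split_evenly` is a hypothetical name. The key step is capping the number of chunks at the number of files.

```python
def split_evenly(files, cores):
    """Split `files` into at most `cores` non-empty, near-equal chunks.

    The number of chunks is capped at len(files), so a single input file
    yields exactly one chunk even when cores is 4.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    files = list(files)
    n_chunks = min(cores, len(files))
    if n_chunks == 0:
        return []
    base, extra = divmod(len(files), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional file each.
        size = base + (1 if i < extra else 0)
        chunks.append(files[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    print(split_evenly(["T1w.nii.gz"], cores=4))
```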

@spirosmaggioros
Member

We will close this for now, as the latest commit fixes this issue. If any other issues appear, we will reopen it.
