
label-free predictions: CUDA out of memory in prediction #90

Closed
LPUoO opened this issue Apr 26, 2021 · 13 comments

@LPUoO

LPUoO commented Apr 26, 2021

Hi,
I am trying to make label-free predictions with a model I previously trained with the fnet notebook.

However, I am getting the following message:
RuntimeError: CUDA out of memory. Tried to allocate 4.75 GiB (GPU 0; 15.90 GiB total capacity; 11.32 GiB already allocated; 1.63 GiB free; 13.15 GiB reserved in total by PyTorch)

This happens even if I restart the runtime. Is there anything I can do about that?
Thank you very much!

@krentzd
Collaborator

krentzd commented Apr 26, 2021

Hi,

I'd try to reduce the size of the image that you're trying to predict on by tiling it into smaller images.
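
Just to illustrate what I mean (a rough sketch, not the notebook's own code; it assumes a single-channel 3D TIFF with axes (z, y, x) and uses tifffile, with hypothetical file names), you could split a big stack into smaller XY tiles and save each one as its own file before predicting on them:

```python
# Rough sketch: split a large z-stack into smaller XY tiles for prediction.
# Assumes a single-channel 3D TIFF with axes (z, y, x); adjust tile_size as needed.
import os
import tifffile

def tile_stack(in_path, out_dir, tile_size=256):
    os.makedirs(out_dir, exist_ok=True)
    stack = tifffile.imread(in_path)                      # shape (z, y, x)
    _, y, x = stack.shape
    for i, y0 in enumerate(range(0, y, tile_size)):
        for j, x0 in enumerate(range(0, x, tile_size)):
            tile = stack[:, y0:y0 + tile_size, x0:x0 + tile_size]
            tifffile.imwrite(os.path.join(out_dir, f"tile_{i}_{j}.tif"), tile)

tile_stack("large_image.tif", "tiles")
```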

Cheers,
Daniel

@LPUoO
Author

LPUoO commented Apr 26, 2021

Thank you @krentzd
Does the number of z-slices for prediction need to be the same as for the training?

PS: Section 6.2 says "Select the slice of the **slice** you want to visualize."

@krentzd
Collaborator

krentzd commented Apr 26, 2021

No, the number of z-slices for prediction shouldn't depend on the number of z-slices used for training.
Thanks for spotting that typo!

@LPUoO
Author

LPUoO commented Apr 26, 2021

Thanks @krentzd,

I'd try to reduce the size of the image that you're trying to predict on by tiling it into smaller images.

I tried multiple times and no matter what I am getting:
RuntimeError: CUDA out of memory. Tried to allocate 4.75 GiB (GPU 0; 15.90 GiB total capacity; 11.32 GiB already allocated; 1.63 GiB free; 13.15 GiB reserved in total by PyTorch).

My training was with 36 MB images (924x624 pixels, 32 slices) and it worked fine. For my prediction, even a 4-slice, 200 x 200 pixel image causes that error.
Also, it always tries to allocate 4.75 GiB, so I am not sure whether the size of the image being predicted is the issue?

Thanks a lot

@krentzd
Collaborator

krentzd commented Apr 27, 2021

Hi,

Could you check which GPU you're assigned when you do the prediction? And do you run the entire notebook, or do you skip sections in between before loading your pre-trained model in Section 6?
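
(For reference, in Colab you can check the assigned GPU with `!nvidia-smi`, or from Python with a short snippet like this:)

```python
# Quick check of which GPU the runtime was assigned and how much memory it has.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name)
    print(f"Total memory: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No GPU assigned to this runtime")
```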

Cheers,
Daniel

@LPUoO
Author

LPUoO commented Apr 27, 2021

Thank you @krentzd,

The first time it happened was immediately after training, having run sections 1, 2, 3, 4.1, 5.1 and 6.1.

Yesterday and today I have been doing sections 1.1, 1.2, 2 and 6.1.
I have been assigned a Tesla V100-SXM2...

The full error message I get in 6.1 is:
Requirement already up-to-date: scipy==1.2.0 in /usr/local/lib/python3.7/dist-packages (1.2.0)
Requirement already satisfied, skipping upgrade: numpy>=1.8.2 in /usr/local/lib/python3.7/dist-packages (from scipy==1.2.0) (1.19.5)
Collecting tifffile==2019.7.26
Downloading https://files.pythonhosted.org/packages/ca/96/2fcac22c806145b34e682e03874b490ae09bc3e48013a0c77e590cd6be29/tifffile-2019.7.26-py2.py3-none-any.whl (131kB)
|████████████████████████████████| 133kB 14.8MB/s
Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.7/dist-packages (from tifffile==2019.7.26) (1.19.5)
Installing collected packages: tifffile
Found existing installation: tifffile 2021.4.8
Uninstalling tifffile-2021.4.8:
Successfully uninstalled tifffile-2021.4.8
Successfully installed tifffile-2019.7.26
The DICtoFIBfnet network will be used.
--class_dataset TiffDataset \

  • DATASET=TempPredictionFolder
  • MODEL_DIR=saved_models/TempPredictionFolder
  • N_IMAGES=1000
  • GPU_IDS=0
  • for TEST_OR_TRAIN in test
  • python predict.py --path_model_dir saved_models/TempPredictionFolder --class_dataset TiffDataset --path_dataset_csv data/csvs/TempPredictionFolder/test.csv --n_images 1000 --no_prediction_unpropped --path_save_dir results/3d/TempPredictionFolder/test --gpu_ids 0
    Propper(-) => transformer: Cropper('-', 16, 'mid', 20000000)
    <fnet.data.tiffdataset.TiffDataset object at 0x7f5736a54b90>
    DEBUG: cropper shape change [64, 624, 912] becomes (64, 432, 720)
    /content/gdrive/My Drive/pytorch_fnet/fnet/transforms.py:172: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result.
    x_out = x_in[slices].copy()
    saved: results/3d/TempPredictionFolder/test/00/signal.tiff
    saved: results/3d/TempPredictionFolder/test/00/target.tiff
    fnet_nn_3d | {} | iter: 100000
    /content/gdrive/My Drive/pytorch_fnet/fnet/fnet_model.py:102: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
    signal = torch.tensor(signal, dtype=torch.float32, device=self.device)
    Traceback (most recent call last):
    File "predict.py", line 118, in
    main()
    File "predict.py", line 104, in main
    prediction = model.predict(signal) if model is not None else None
    File "/content/gdrive/My Drive/pytorch_fnet/fnet/fnet_model.py", line 112, in predict
    prediction = module(signal).cpu()
    File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
    File "/content/gdrive/My Drive/pytorch_fnet/fnet/nn_modules/fnet_nn_3d_params.py", line 21, in forward
    x_rec = self.net_recurse(x)
    File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
    File "/content/gdrive/My Drive/pytorch_fnet/fnet/nn_modules/fnet_nn_3d_params.py", line 62, in forward
    x_cat = torch.cat((x_2conv_more, x_relu1), 1) # concatenate
    RuntimeError: CUDA out of memory. Tried to allocate 4.75 GiB (GPU 0; 15.78 GiB total capacity; 11.32 GiB already allocated; 453.75 MiB free; 13.74 GiB reserved in total by PyTorch)
    Time elapsed: 0.0 hour(s) 1.0 min(s) 10 sec(s)

Cheers

@lucpaul
Collaborator

lucpaul commented Apr 27, 2021

Hi,
First thing I'd try would be to reduce the number of images you're loading. The crucial bit of the error message is at the bottom: 'CUDA out of memory'. It's basically telling you that the dataset is too big or (which I've also seen) that the number of images per batch is too big. Maybe start with a really low batch size and see if you still hit this error.
If this doesn't work, then it might be the dataset itself that's too big to handle in the runtime. In fnet, the dataset is loaded into memory before training and can crash the notebook too. So if reducing the batch size doesn't help, see if you can make the model train with fewer images.
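
If it helps to see where the memory goes, a quick way (just a sketch, assuming the runtime has a GPU and PyTorch available; the function name is only illustrative) is to print the GPU allocation before and after the step that fails:

```python
# Sketch: print GPU memory usage to see where the allocation grows.
import torch

def report_gpu_memory(tag=""):
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"{tag}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")

report_gpu_memory("before")
# ... run the training or prediction step here ...
report_gpu_memory("after")
```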

Hope this helps.

@LPUoO
Author

LPUoO commented Apr 27, 2021

Hi @lucpaul,

Thank you for your feedback.

The issue I am having is with the prediction in 6.1, not the training.
I had no issues making the model with eighty 36 MB images (924x624 pixels, 32 slices; 2.7 GB in total).

I am getting the CUDA out of memory error in the prediction in 6.1 even with just a single image (200 x 200 pixels, 4 z-slices, 1.7 MB).

Are you saying that because I used too many images for the model creation I am having trouble making predictions?

Thanks a lot

@lucpaul
Collaborator

lucpaul commented Apr 27, 2021

Oh, I see, I misunderstood. Apologies.
No, it should not matter how many images you trained with when you use the model for prediction. We might need a few more checks then to see what the issue is.
First thing to rule out: how big is the model file? It's unlikely to be the problem, but worth checking that it isn't unreasonably huge.
Second: do you get this error if you run the QC section on a test dataset? If not, are your test images (the ones you use in QC) different from the one you're trying to run prediction on? If they are actually the same and QC works, then it might be a bug in the prediction cell (6.1) that we need to find. If they are not the same, then I would give your idea a shot and test prediction on data the same size as the training data. Does this work? If yes, we need to find out why it doesn't work on images of different sizes.
Third: you said you get this error even when you restart the runtime and go basically straight to 6.1? Can you observe anything about the RAM or disk space before and after you run the prediction cell? Does it fill up after you run the 6.1 cell, or even before, or do you see nothing until it crashes? This might tell us where the memory gets overloaded. In the recent version of Colab you can even check which line of code is running if you look at the code of the cells (there is a green arrow beside the line of code that's running). Maybe that gives us some more clues.

Sorry again for the misunderstanding earlier.

@LPUoO
Author

LPUoO commented Apr 27, 2021

Oh, I see, I misunderstood. Apologies.

No problem, thanks for your help

How big is the model file?

The model.p file is 266MB

Do you get this error if you run the QC section on a test dataset?

Actually, I can't run this cell right now as I am getting a FileNotFoundError for the predicted file in the QC folder.

You said you get this error even when you restart the run time and go basically straight to 6.1.? Can you observe anything about the RAM or disk space before and after you run the prediction cell?

I saw no issues with the RAM or disk space before I got the CUDA out of memory error in 6.1.

I noticed that I never purged the pytorch_fnet folder from previous model training attempts (section 6.4). Maybe I should purge it and start again from scratch. I will probably do that and report back tomorrow or the day after.

Thanks a lot!

@LPUoO
Author

LPUoO commented Apr 30, 2021

Hi @lucpaul,
So I tried again from scratch and I am having the same CUDA out of memory in the prediction step.

How big is the model file?

The model.p file is now 280MB.

Do you get this error if you run the QC section on a test dataset?

I have no problem making the plot of training errors in 5.1, but I can't run 5.2 as it tries to find the predicted images in the QC folder and there are none in there (unless I misunderstand and need to predict them first, then go back to the QC?).

If not, are your test images (the ones you use in QC) different from the one you're trying to run prediction on? If they are actually the same, and QC works, then it might be a bug in the prediction cell (6.1.)

They are the same.

You said you get this error [CUDA out of memory in prediction] even when you restart the run time and go basically straight to 6.1.?

Yes, I did everything again from scratch, and this happened both straight after training and with a new runtime (I did install the fnet dependencies).

Can you observe anything about the RAM or disk space before and after you run the prediction cell? Does it fill up after you run the 6.1. cell or even before, or do you see nothing until it crashes.

Everything seemed normal.

In the recent version of colab you can even check which line of code is running if you look into the code of the cells (there is a green arrow beside the line of the code that's running). Maybe that gives us some more clues.

The last thing I see being executed is:
Cell > system() > _system_compat() > _run_command() > _monitor_process() > _poll_process()

The whole output I get for cell 6.1 is:

Requirement already up-to-date: scipy==1.2.0 in /usr/local/lib/python3.7/dist-packages (1.2.0)
Requirement already satisfied, skipping upgrade: numpy>=1.8.2 in /usr/local/lib/python3.7/dist-packages (from scipy==1.2.0) (1.19.5)
Requirement already satisfied: tifffile==2019.7.26 in /usr/local/lib/python3.7/dist-packages (2019.7.26)
Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.7/dist-packages (from tifffile==2019.7.26) (1.19.5)
The DICtoFIBv2 network will be used.
--class_dataset TiffDataset \

  • DATASET=TempPredictionFolder
  • MODEL_DIR=saved_models/TempPredictionFolder
  • N_IMAGES=1000
  • GPU_IDS=0
  • for TEST_OR_TRAIN in test
  • python predict.py --path_model_dir saved_models/TempPredictionFolder --class_dataset TiffDataset --path_dataset_csv data/csvs/TempPredictionFolder/test.csv --n_images 1000 --no_prediction_unpropped --path_save_dir results/3d/TempPredictionFolder/test --gpu_ids 0
    Propper(-) => transformer: Cropper('-', 16, 'mid', 20000000)
    <fnet.data.tiffdataset.TiffDataset object at 0x7f6c670f4f90>
    DEBUG: cropper shape change [32, 624, 912] becomes (32, 624, 912)
    /content/gdrive/My Drive/pytorch_fnet/fnet/transforms.py:172: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result.
    x_out = x_in[slices].copy()
    saved: results/3d/TempPredictionFolder/test/00/signal.tiff
    saved: results/3d/TempPredictionFolder/test/00/target.tiff
    fnet_nn_3d | {} | iter: 100000
    /content/gdrive/My Drive/pytorch_fnet/fnet/fnet_model.py:102: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
    signal = torch.tensor(signal, dtype=torch.float32, device=self.device)
    Traceback (most recent call last):
    File "predict.py", line 118, in
    main()
    File "predict.py", line 104, in main
    prediction = model.predict(signal) if model is not None else None
    File "/content/gdrive/My Drive/pytorch_fnet/fnet/fnet_model.py", line 112, in predict
    prediction = module(signal).cpu()
    File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
    File "/content/gdrive/My Drive/pytorch_fnet/fnet/nn_modules/fnet_nn_3d_params.py", line 21, in forward
    x_rec = self.net_recurse(x)
    File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
    File "/content/gdrive/My Drive/pytorch_fnet/fnet/nn_modules/fnet_nn_3d_params.py", line 62, in forward
    x_cat = torch.cat((x_2conv_more, x_relu1), 1) # concatenate
    RuntimeError: CUDA out of memory. Tried to allocate 4.34 GiB (GPU 0; 15.90 GiB total capacity; 10.38 GiB already allocated; 2.72 GiB free; 12.05 GiB reserved in total by PyTorch)
    Time elapsed: 0.0 hour(s) 0.0 min(s) 26 sec(s)

@lucpaul
Collaborator

lucpaul commented May 11, 2021

Hello, I apologize for not coming back sooner. I have been working on a proper update of the fnet notebook, which hopefully can be released soon, given some of the comments here and some other things I noticed. Interestingly, I encountered this error too, and it appears to exist in the original code as well; see, for example, AllenCellModeling/pytorch_fnet#153.
I can't say I have found a solution yet that I am happy with, but I think it has to do with the type of GPU I was allocated, as this error occurred when using a P4 but did not occur when using a T4. So, although I am still not certain, it might be worth checking whether a different GPU in the runtime makes a difference. I hope this helps a little. I will keep this issue in mind and hopefully we can fix it.

@lucpaul
Collaborator

lucpaul commented May 13, 2021

Hi again, I have been looking at this error now and can reproduce it by using larger images (1024x1024x32 in my case) on a K80 GPU. So it has to do with the size of the image being loaded into the network. However, I am not sure it can be fixed easily within the notebook: it comes down to memory allocation on the GPU and how CUDA hands memory over to PyTorch. It seems odd that such an error would occur during prediction and not during training, but it's not something I have found a solution for yet, and I'm not sure I have the capacity either. I have tried clearing out the GPU cache using torch.cuda.empty_cache() and gc.collect(), both within the cells themselves and inside fnet_model.py, neither of which fixed things.
I also played around with the with torch.no_grad() option in fnet_model.py, and that did not do anything either. All of these are suggested fixes I found when searching for similar problems on GitHub and Stack Overflow.
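
For the record, the kind of change I tried inside the predict step looked roughly like this (a sketch of the attempted fix only, not the actual fnet_model.py code):

```python
# Sketch of the attempted fix: run inference without autograd bookkeeping
# and release cached GPU memory afterwards. Neither resolved the OOM here.
import gc
import torch

def predict(module, signal, device):
    signal = torch.as_tensor(signal, dtype=torch.float32, device=device)
    with torch.no_grad():          # don't keep activations for backprop
        prediction = module(signal).cpu()
    gc.collect()                   # drop dangling Python references
    torch.cuda.empty_cache()       # hand cached blocks back to the driver
    return prediction
```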

The only fix I could find was reducing the dimensions of the individual images I wanted to predict on. So I would suggest you reduce your image dimensions, for example by making smaller patches, if you want to use the notebook on your data. Otherwise, you could search for the CUDA out of memory issue in the prediction context and see if you find a solution, maybe along the lines I tried above.
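
To illustrate what I mean by making smaller patches (again only a rough sketch with a hypothetical helper, assuming a trained model that accepts (1, 1, z, y, x) tensors and tile sizes the network can handle; it ignores edge artefacts at tile borders):

```python
# Sketch: run the prediction tile by tile and stitch the results back together.
import numpy as np
import torch

def predict_in_tiles(model, stack, tile=256, device="cuda"):
    z, y, x = stack.shape                     # stack is a (z, y, x) numpy array
    out = np.zeros((z, y, x), dtype=np.float32)
    with torch.no_grad():
        for y0 in range(0, y, tile):
            for x0 in range(0, x, tile):
                patch = stack[:, y0:y0 + tile, x0:x0 + tile].astype(np.float32)
                t = torch.from_numpy(patch[None, None]).to(device)
                out[:, y0:y0 + tile, x0:x0 + tile] = model(t).cpu().numpy()[0, 0]
    return out
```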

Having looked into the issue now, I don't believe it is related to our implementation of label-free prediction in this project; it is related either to the original source code or to a PyTorch-specific problem, so I will close this issue here. Feel free to reopen it if you find a more satisfactory solution.

@lucpaul lucpaul closed this as completed May 13, 2021