Errors during refine #38
I will add to this: I've tried the suggested fix for the OOM error (reducing the batch size from the default, which is 8 for 4 GPUs, down to 4).
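For reference, the re-run looked roughly like this; it is only a sketch, where the --batch_size flag name and the subtomo.star input are assumed from the defaults quoted in this thread rather than copied from my exact command:

```bash
# Hypothetical re-run with the batch size halved from the 4-GPU default of 8
isonet.py refine subtomo.star --gpuID 0,1,2,3 --batch_size 4
```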
When I run with --log_level debug I get the following:

######Isonet starts refining######
2023-01-07 10:45:03.212576: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 5 root error(s) found.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[2,32,32,32,128] and type bool on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
(2) Resource exhausted: OOM when allocating tensor with shape[2,32,32,32,128] and type bool on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(3) Resource exhausted: OOM when allocating tensor with shape[2,32,32,32,128] and type bool on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(4) Resource exhausted: OOM when allocating tensor with shape[2,32,32,32,128] and type bool on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
Function call stack:
2023-01-07 12:18:06.622056: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
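One way I can keep an eye on how close GPU 1 gets to its memory limit is to poll nvidia-smi on the node while refine is running; this is a generic diagnostic sketch, not something from the IsoNet documentation:

```bash
# Poll per-GPU memory use every 5 seconds while isonet.py refine is running,
# to see whether the GPU named in the OOM message is actually near its limit.
watch -n 5 nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```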
Hi, I have not encountered this type of problem before, and you are using A100/A40 cards, both of which should have sufficient VRAM. It seems that the data cannot be loaded onto the GPU, so I think this could be related to an Nvidia CUDA/driver problem. How did you install the environment? I typically use "conda install" to install the CUDA toolkit and cuDNN in an anaconda environment.
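A minimal sketch of that kind of setup is below; the Python and cudatoolkit versions are assumptions chosen to match the libcudart.so.10.1 message in your log, not a verified recipe:

```bash
# Create a fresh environment and pull the CUDA runtime and cuDNN from conda,
# so TensorFlow does not depend on the system-wide CUDA installation.
conda create -n isonet python=3.8
conda activate isonet
conda install cudatoolkit=10.1 cudnn
# then install IsoNet and its remaining Python dependencies into this environment
```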
Hello, thank you for your help. I've attached how the program was installed (as a .png), as well as the .yml file (saved as a .pdf), which contains all the dependencies.
Hello everyone, I've been having difficulties running IsoNet's refinement and was hoping to find some assistance.
To give some background details, I'm trying to correct for the missing wedge on 5 tomograms. After following the tutorial online, I was able to generate a star file, correct the CTF, generate masks, and extract the subtomograms. However, when running the refine program, the job fails.
I ran the following command: isonet.py refine subtomo.star --gpuID 0,1,2,3,4,5 --iterations 30 --noise_start_iter 10,15,20,25 --noise_level 0.05,0.1,0.15,0.2
I submitted the job on a node of our university cluster, requesting 6 A100 GPUs and 600 GB of memory.
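Roughly, the submission looked like the sketch below (assuming our SLURM setup; the GRES and memory directives here are placeholders rather than my exact script):

```bash
#!/bin/bash
#SBATCH --job-name=isonet_refine
#SBATCH --gres=gpu:a100:6        # 6 A100 GPUs (GRES name is site-specific)
#SBATCH --mem=600G               # 600 GB of system memory
#SBATCH --time=24:00:00

isonet.py refine subtomo.star --gpuID 0,1,2,3,4,5 --iterations 30 \
    --noise_start_iter 10,15,20,25 --noise_level 0.05,0.1,0.15,0.2
```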
Later in the evening the job failed after stalling out at Epoch 1/10.
I've been in contact with our CHPC department, who looked further into this and found that we have three intertwined issues, which makes it hard to pinpoint exactly what the problem is. So, let's break it down into its constituent parts:
a. IsoNet was originally installed on Rocky 8. Our CHPC department tested TensorFlow and checked whether cuDNN worked correctly (a sketch of that kind of check is included near the end of this post). It did.
The corresponding module was written for Rocky 8; however, a few days ago we realized we need the software to run on a CentOS 7 node (the server I have access to).
When attempting this on CentOS 7 (or Rocky 8), I didn't see any tangible progress (still stuck at Epoch 1/10). In the end, both jobs were stopped and threw the error shown in the screenshot below:
Do you all have any suggestions for getting the refinement to work? Please let me know if you need any additional details.
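For reference, the TensorFlow/cuDNN check mentioned in point (a) was along these lines; this is a sketch rather than the exact commands CHPC ran:

```bash
# Confirm that this TensorFlow build can see the GPUs and load the CUDA/cuDNN libraries.
python -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"
```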
Best,
Ben