training with just the preprocessed folder in det_models folder crashes #278

DSRajesh opened this issue Oct 10, 2024 · 5 comments

DSRajesh commented Oct 10, 2024

Recently I transferred the 'preprocessed' folder (along with its unpacked preprocessed dataset, etc.) on its own into the "det_data/Task16_Luna/" folder, and the 'plan.pkl' file on its own as "det_models/Task116_Luna/RetinaUNetV001_D3V001_3d/fold0/plan.pkl". All of these folders were placed on an SSD. Then I started training. I made several attempts to run it, but the training crashed each time without leaving any error logs. Could anyone please tell me what may be happening?

Rajesh

@DSRajesh (Author)

The preprocessed folder is 1.2 terabytes in size. I reduced det_num_threads to 2 and tried again, but the training still crashed.

partha-ghosh commented Oct 11, 2024

Hi @DSRajesh,
Please provide more information, such as the output you got when you ran the training, the environment you used to run the commands, etc. Please check whether you can run the training with the toy dataset in that environment. Also, please check that you transferred the dataset.json or dataset.yaml into the Task16_Luna folder along with the preprocessed folder.

Best,
Partha

DSRajesh commented Oct 22, 2024

Thanks, Partha.
As mentioned above, I recently transferred the 'preprocessed' folder (and related files) of an nnDetection object detection training run that had stopped midway on machine 'A'. The probable reason quoted by the training code was "RAM becoming full", as shown below.

[screenshot: training output on machine 'A' reporting that RAM became full]

I got the following output while resuming the training on another machine 'B', using the command line "python scripts/train.py 116 -o train.mode=resume":

[screenshot: output of the resume command on machine 'B']

The run then stopped abruptly without any error logs.

When I tried resuming the training with the command line "python scripts/train.py 116 --sweep -o train.mode=resume", I got the following extra lines in the output compared to the output shown above:

[screenshot: additional output from the --sweep resume run, showing validation inference in progress]

In this output we can see that the training code was running inference on the validation cases one by one (one of these runs is shown on validation case 127 out of a total of 131).

"dataset.json" was present in the det_data/Task016_Luna folder. GPU/CUDA details are as follows.

[screenshots: GPU/CUDA details]

I hope these details are sufficient. Was changing the training location from machine 'A' to machine 'B' the reason for this error? Also, why did the training stop in the first place? Was it overloading the RAM? The preprocessed folder was kept on an SSD.
Thanks again
Rajesh

@partha-ghosh (Contributor)

Hi @DSRajesh,

It’s difficult to pinpoint the exact issue from the logs since the initial error suggests that the RAM being full may not be the root cause. Please monitor the RAM consumption to confirm whether it’s actually a memory-related issue.
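
For reference, something like the sketch below could be run alongside training to watch RAM usage (assuming `psutil` is installed; the one-second interval and output format are arbitrary choices):

```python
# Minimal sketch: print system RAM usage once per second while training runs.
# Assumes psutil is installed; the interval and output format are arbitrary.
import time
import psutil

while True:
    mem = psutil.virtual_memory()
    print(f"RAM used: {mem.used / 1e9:.1f} GB / {mem.total / 1e9:.1f} GB ({mem.percent:.0f}%)")
    time.sleep(1)
```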

The patch size might also be a factor. Since nnDetection adjusts patch size based on available GPU resources, you could try reducing it. For instance, setting [80, 192, 160] in the plan.json has worked well on our clusters.
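
If you want to inspect or change the patch size directly in the pickled plan, a rough sketch is below. Note that the key name `patch_size` and where it sits in the dict are assumptions, so print the loaded plan first and adapt the code to the structure you actually see:

```python
# Rough sketch: inspect (and optionally change) the patch size stored in plan.pkl.
# The "patch_size" key and its location in the dict are assumptions; print the
# plan first and adapt this to the structure you actually see.
import pickle
from pathlib import Path

plan_path = Path("det_models/Task116_Luna/RetinaUNetV001_D3V001_3d/fold0/plan.pkl")
with open(plan_path, "rb") as f:
    plan = pickle.load(f)

print(plan.keys())  # locate where the patch size actually lives

# Hypothetical edit, only valid if the key exists at the top level:
if "patch_size" in plan:
    plan["patch_size"] = [80, 192, 160]
    with open(plan_path, "wb") as f:
        pickle.dump(plan, f)
```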

It also looks like you’re resuming the training process at the validation phase, which might indicate that the training has already completed. Without the full log, it's hard to be sure. You could try manually loading the checkpoint to verify the current epoch.
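
As a rough way to check this, you could load the checkpoint and look at the stored epoch. The checkpoint filename below is hypothetical and the `epoch` key is an assumption (nnDetection trains with PyTorch Lightning, whose checkpoints usually carry an `epoch` entry), so print the keys first:

```python
# Sketch: load a training checkpoint and print the epoch it was saved at.
# The checkpoint path is hypothetical and the "epoch" key is an assumption;
# print ckpt.keys() first and adapt accordingly.
import torch

ckpt_path = "det_models/Task116_Luna/RetinaUNetV001_D3V001_3d/fold0/model_last.ckpt"  # hypothetical path
ckpt = torch.load(ckpt_path, map_location="cpu")
print(list(ckpt.keys()))
print("epoch:", ckpt.get("epoch"))
```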

If the issue persists, you might consider restarting the training from scratch on the preferred machine. Additionally, please share the complete log if the problem continues, and we’ll take another look.

Best,
Partha

@Rajesh-ParaxialTech

Thanks a lot, Partha.
The issue is solved. As you mentioned, the number of training epochs I had set had already elapsed, so the run was in the validation stage. I increased the number of epochs and resumed training, and it works fine now. Thanks again.
