training with just the preprocessed folder in det_models folder crashes #278

DSRajesh opened this issue Oct 10, 2024 · 5 comments

DSRajesh commented Oct 10, 2024

Recently I transferred the 'preprocessed' folder (along with its unpacked preprocessed dataset, etc.) on its own into the "det_data/Task16_Luna/" folder, and the 'plan.pkl' file on its own as "det_models/Task116_Luna/RetinaUNetV001_D3V001_3d/fold0/plan.pkl". All of these folders were placed on an SSD. Then I started training. I made several attempts to run it, but the training crashed each time without leaving any error logs. Could anyone please tell me what may be happening?

Rajesh

@DSRajesh (Author)

The preprocessed folder is 1.2 terabytes in size. I reduced det_num_threads to 2 and tried again, but the training still crashed.

partha-ghosh commented Oct 11, 2024

Hi @DSRajesh,
Please provide more information, such as the output you got when you ran the training, the environment you used to run the commands, etc. Please check whether you can run the training with the toy dataset in that environment. Also, please check that you transferred the dataset.json or dataset.yaml into the Task16_Luna folder along with the preprocessed folder.

Best,
Partha

DSRajesh commented Oct 22, 2024

Thanks, Partha.
As mentioned above, I recently transferred the 'preprocessed' folder (and related files) of an nnDetection object detection training run that had stopped midway on machine 'A'. The probable reason quoted by the training code was "RAM becoming full", as shown below.

[screenshot: training output on machine 'A' reporting that RAM became full]

I got the following output while resuming the training on another machine 'B', using the command line "python scripts/train.py 116 -o train.mode=resume":

[screenshot: output of the resume command on machine 'B']

The run then stopped abruptly without any error logs.

When I tried resuming the training with the command line "python scripts/train.py 116 --sweep -o train.mode=resume", I got the following extra lines in the output compared to the output shown above:

[screenshot: additional output from the --sweep resume run, showing validation inference in progress]

In this output we can see that the training code was running inference on the validation cases one by one (one of these runs is shown on validation case 127 out of a total of 131).

"dataset.json" was present in the det_data/Task016_Luna folder. GPU/CUDA details are as follows.

[screenshots: GPU/CUDA details]

I hope these details are sufficient. Was changing the training location from machine 'A' to machine 'B' the reason for this error? Also, why did the training stop in the first place? Was it overloading the RAM? The preprocessed folder was kept on an SSD.
Thanks again
Rajesh

@partha-ghosh (Contributor)

Hi @DSRajesh,

It’s difficult to pinpoint the exact issue from the logs since the initial error suggests that the RAM being full may not be the root cause. Please monitor the RAM consumption to confirm whether it’s actually a memory-related issue.
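
For reference, something like the sketch below could be run alongside training to watch RAM usage (assuming `psutil` is installed; the one-second interval and output format are arbitrary choices):

```python
# Minimal sketch: print system RAM usage once per second while training runs.
# Assumes psutil is installed; the interval and output format are arbitrary.
import time
import psutil

while True:
    mem = psutil.virtual_memory()
    print(f"RAM used: {mem.used / 1e9:.1f} GB / {mem.total / 1e9:.1f} GB ({mem.percent:.0f}%)")
    time.sleep(1)
```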

The patch size might also be a factor. Since nnDetection adjusts patch size based on available GPU resources, you could try reducing it. For instance, setting [80, 192, 160] in the plan.json has worked well on our clusters.
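
If you want to inspect or change the patch size directly in the pickled plan, a rough sketch is below. Note that the key name `patch_size` and where it sits in the dict are assumptions, so print the loaded plan first and adapt the code to the structure you actually see:

```python
# Rough sketch: inspect (and optionally change) the patch size stored in plan.pkl.
# The "patch_size" key and its location in the dict are assumptions; print the
# plan first and adapt this to the structure you actually see.
import pickle
from pathlib import Path

plan_path = Path("det_models/Task116_Luna/RetinaUNetV001_D3V001_3d/fold0/plan.pkl")
with open(plan_path, "rb") as f:
    plan = pickle.load(f)

print(plan.keys())  # locate where the patch size actually lives

# Hypothetical edit, only valid if the key exists at the top level:
if "patch_size" in plan:
    plan["patch_size"] = [80, 192, 160]
    with open(plan_path, "wb") as f:
        pickle.dump(plan, f)
```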

It also looks like you’re resuming the training process at the validation phase, which might indicate that the training has already completed. Without the full log, it's hard to be sure. You could try manually loading the checkpoint to verify the current epoch.
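
As a rough way to check this, you could load the checkpoint and look at the stored epoch. The checkpoint filename below is hypothetical and the `epoch` key is an assumption (nnDetection trains with PyTorch Lightning, whose checkpoints usually carry an `epoch` entry), so print the keys first:

```python
# Sketch: load a training checkpoint and print the epoch it was saved at.
# The checkpoint path is hypothetical and the "epoch" key is an assumption;
# print ckpt.keys() first and adapt accordingly.
import torch

ckpt_path = "det_models/Task116_Luna/RetinaUNetV001_D3V001_3d/fold0/model_last.ckpt"  # hypothetical path
ckpt = torch.load(ckpt_path, map_location="cpu")
print(list(ckpt.keys()))
print("epoch:", ckpt.get("epoch"))
```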

If the issue persists, you might consider restarting the training from scratch on the preferred machine. Additionally, please share the complete log if the problem continues, and we’ll take another look.

Best,
Partha

@Rajesh-ParaxialTech

Thanks a lot, Partha.
The issue is solved. As you mentioned, the number of training epochs I had set had already elapsed, so the run was in the validation stage. I increased the number of epochs and resumed training, and it works fine now. Thanks again.
