training with just the preprocessed folder in det_models folder crashes #278
The preprocessed folder size is 1.2 TB. I reduced det_num_threads=2 and tried again, but the training still crashed.
Hi @DSRajesh,
Best,
Thanks Partha-Ghosh.

I got the following output while resuming the training on another machine 'B', using the command line "python scripts/train.py 116 -o train.mode=resume"; the training then stopped abruptly without any error logs. When I tried resuming the training with the command line "python scripts/train.py 116 --sweep -o train.mode=resume", I got the following extra lines in the output compared to the output shown above. In that output we can see the training code running inference on all validation datasets one by one (for example, on validation dataset 127 out of the total of 131). "dataset.json" was present in the det_data/Task016_Luna folder. The GPU/CUDA details are as follows. I hope these details are sufficient.

Was changing the training location from machine 'A' to 'B' the reason for this error? Also, why did the training stop in the first place, was it overloading the RAM? The preprocessed folder was stored on an SSD.
Hi @DSRajesh,

It’s difficult to pinpoint the exact issue from the logs, since the initial error suggests that the RAM being full may not be the root cause. Please monitor the RAM consumption to confirm whether it’s actually a memory-related issue. The patch size might also be a factor: since nnDetection adjusts the patch size based on available GPU resources, you could try reducing it, for instance by explicitly setting a smaller value.

It also looks like you’re resuming the training process at the validation phase, which might indicate that the training has already completed. Without the full log, it's hard to be sure. You could try manually loading the checkpoint to verify the current epoch. If the issue persists, you might consider restarting the training from scratch on the preferred machine. Additionally, please share the complete log if the problem continues, and we’ll take another look.

Best,
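For the RAM-monitoring suggestion above, a minimal, hypothetical sketch (not part of nnDetection) is to poll system memory with psutil in a second terminal while the training command runs; the polling interval below is an arbitrary choice:

```python
# Hypothetical helper, not part of nnDetection: poll system RAM while
# training runs in another process, to check whether memory exhaustion
# precedes the silent crash. Requires `pip install psutil`.
import time

import psutil


def monitor_ram(interval_s: float = 10.0) -> None:
    """Print overall RAM usage every `interval_s` seconds until interrupted."""
    try:
        while True:
            mem = psutil.virtual_memory()
            used_gb = (mem.total - mem.available) / 1e9
            total_gb = mem.total / 1e9
            print(f"RAM used: {mem.percent:.1f}% ({used_gb:.1f} / {total_gb:.1f} GB)")
            time.sleep(interval_s)
    except KeyboardInterrupt:
        pass


if __name__ == "__main__":
    monitor_ram()
```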
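For manually checking the checkpoint, the sketch below assumes a PyTorch Lightning style checkpoint, which stores the epoch alongside the model weights; the checkpoint filename and path are placeholders and should be replaced with the actual file under det_models:

```python
# Hypothetical sketch: inspect a training checkpoint to see at which epoch
# it was saved. The path below is a placeholder; point it at the actual
# checkpoint file inside your det_models/.../fold0 directory.
import torch

ckpt_path = "det_models/Task116_Luna/RetinaUNetV001_D3V001_3d/fold0/model_last.ckpt"  # placeholder
ckpt = torch.load(ckpt_path, map_location="cpu")

# PyTorch Lightning checkpoints usually carry these bookkeeping keys.
print("epoch:", ckpt.get("epoch"))
print("global_step:", ckpt.get("global_step"))
print("available keys:", sorted(ckpt.keys()))
```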
Thanks a lot, Partha.
Recently I transferred only the 'preprocessed' folder (along with its unpacked preprocessed dataset, etc.) into the "det_data/Task16_Luna/" folder, and only the 'plan.pkl' file, placed as "det_models/Task116_Luna/RetinaUNetV001_D3V001_3d/fold0/plan.pkl". All of these folders were on an SSD. Then I started training. I made several attempts to run, but the training crashed each time, leaving no error logs. Could anyone please tell me what may be happening?
Rajesh