The training process was killed when loading data batch #45

Open
RyanPham19092002 opened this issue Aug 5, 2024 · 5 comments

@RyanPham19092002 commented Aug 5, 2024

Hi! Thanks for your great work!
When I tried to train on PandaSet following your guide, the model couldn't load the full data batch and the process was killed, as shown in the screenshot below, even though I reduced the number of scenes in PandaSet (I used only one scene) and I am training on 4x V100 GPUs. (Also, can your model be trained on multiple GPUs?)
[screenshot: training process killed while loading data]
How can I fix this?
Thanks.
P.S.: I also ran the NeuRAD tiny version, but it was still killed, as shown below.
[screenshot: NeuRAD tiny run also killed]
I don't know why. Could the reason be that my GPU is a V100 with 32 GB, so the model cannot train?

@TheScientist1900

I have the same issue. Can anyone help me?

@georghess (Owner)

Hi, it's a bit hard to figure out exactly what is causing the crash. My best guess is that your system runs out of RAM. We cache the data in memory, but it shouldn't really be that heavy to run.

Are you running this in a container, or in a conda environment? Could you maybe track the memory consumption while training?
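If it helps, here is a minimal sketch for logging memory use while training. It is not part of neurad-studio; it assumes `psutil` is installed (`pip install psutil`), and the helper name is mine. Pass it the PID of the training process (e.g. from `ps`), or simply run `watch -n 5 free -h` in another terminal instead.

```python
# Minimal sketch: watch RAM usage of the training process and its dataloader workers.
# Assumes psutil is installed; not part of neurad-studio.
import sys
import time

import psutil


def log_memory(pid: int, interval_s: float = 5.0) -> None:
    """Print resident memory of `pid` (plus child worker processes) and system totals."""
    proc = psutil.Process(pid)
    while True:
        rss = proc.memory_info().rss
        for child in proc.children(recursive=True):
            try:
                rss += child.memory_info().rss
            except psutil.NoSuchProcess:
                pass  # a worker exited between listing and reading
        vm = psutil.virtual_memory()
        print(
            f"train process (+workers): {rss / 1024**3:.1f} GiB | "
            f"system: {vm.used / 1024**3:.1f}/{vm.total / 1024**3:.1f} GiB"
        )
        time.sleep(interval_s)


if __name__ == "__main__":
    log_memory(int(sys.argv[1]))
```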

@bob020416

Hi, how did you solve this issue? I have a 4060 Ti with 16 GB and I cannot load the data... it is killed every time. Is there a recommended amount of RAM?

@georghess (Owner)

Have you tried reducing the number of workers and/or queue length? https://github.com/georghess/neurad-studio/blob/main/nerfstudio/data/datamanagers/image_lidar_datamanager.py#L83-L85
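For reference, lowering those settings would look roughly like the sketch below. The class and field names are assumptions based on the usual nerfstudio parallel data manager config; the linked lines show the exact names in the current code.

```python
# Sketch: reduce dataloading parallelism to lower peak RAM usage.
# Field/class names are assumptions; verify against the linked lines in your checkout.
from nerfstudio.data.datamanagers.image_lidar_datamanager import ImageLidarDataManagerConfig

datamanager_config = ImageLidarDataManagerConfig(
    num_processes=1,  # fewer worker processes -> fewer in-memory copies of cached data
    queue_size=1,     # fewer pre-fetched batches waiting in the inter-process queue
)
```

With the nerfstudio-style CLI the same fields can typically also be overridden without editing code, e.g. via `--pipeline.datamanager.num-processes` / `--pipeline.datamanager.queue-size`, assuming the data manager sits at the usual `pipeline.datamanager` path; check `ns-train neurad --help` for the exact flag names.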

@bob020416

> Have you tried reducing the number of workers and/or queue length? https://github.com/georghess/neurad-studio/blob/main/nerfstudio/data/datamanagers/image_lidar_datamanager.py#L83-L85

Hi, I found that data loading takes approximately 50 GB of RAM. I tried lowering the number of workers, but it still didn't work, so I used another computer to solve this. Thank you.
