Training speed #6
The maximum number of iterations in the training script is probably much more than is necessary for the model to converge. How many iterations are you running it for? Does the loss decrease and begin to converge? The training script saves out model checkpoints at intermediate points, and you can try exporting meshes from these checkpoints to see how they look.
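In case it helps, here is a rough sketch of what exporting a mesh from an intermediate checkpoint could look like. This is not the repo's actual export script; the model object, the `"model"` checkpoint key, and the 0.5 occupancy threshold are all assumptions for illustration.

```python
import torch
import trimesh
from skimage import measure

def export_checkpoint_mesh(model, ckpt_path, out_path, res=256, device="cuda"):
    # Load an intermediate checkpoint (the "model" key is an assumed layout).
    state = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(state["model"])
    model.to(device).eval()

    # Evaluate occupancy on a dense grid in [-1, 1]^3, in chunks to limit GPU memory.
    lin = torch.linspace(-1.0, 1.0, res)
    grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1).reshape(-1, 3)
    occ = []
    with torch.no_grad():
        for chunk in torch.split(grid, 2 ** 16):
            occ.append(model(chunk.to(device)).squeeze(-1).cpu())
    volume = torch.cat(occ).reshape(res, res, res).numpy()

    # Extract the 0.5 iso-surface and save the mesh.
    verts, faces, _, _ = measure.marching_cubes(volume, level=0.5)
    trimesh.Trimesh(vertices=verts, faces=faces).export(out_path)
```

Running something like this on checkpoints from different iterations makes it easy to see whether the reconstruction is still improving.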
I haven't been able to replicate this issue, so closing for now. Please follow up if this is still a problem.
I have the same issue training the 3D models. I used the default number of epochs in the Thai statue config file, which is 10,000 epochs. Training on a V100 for 24 hours only completed 60,000 iterations, which is not many epochs. I exported the .dae mesh and it doesn't look converged. Below is a snapshot of that .dae mesh in MeshLab:
Training to 60,000 iterations should yield a better result than what you're showing above, so something seems off. We optimize to 48k iterations in the paper for the Thai Statue and it looks much more detailed than the above. Also, it seems strange that it takes 24 hours to get to 60k iterations. I can run around 10 it/s on my laptop GPU (GTX 1650), and at this rate it should only take a couple of hours to reach 60k. I would guess a V100 should be even faster. Are you sure you are using the default config file without any changes from the repo? How many workers are you using for the dataloader? Can you also post the tensorboard summaries for the occupancy loss?
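For reference, a quick way to measure raw training throughput is to time a fixed number of iterations around the existing training loop. This is only a sketch; `train_step` and `dataloader` stand in for whatever the training script actually defines, and it assumes the dataloader yields enough batches.

```python
import time
import torch

def measure_throughput(train_step, dataloader, warmup=20, measured=200):
    # `train_step(batch)` is assumed to run one forward/backward/optimizer step.
    it = iter(dataloader)
    for _ in range(warmup):                 # let cuDNN autotuning and caches settle
        train_step(next(it))
    torch.cuda.synchronize()                # flush queued GPU work before timing
    t0 = time.time()
    for _ in range(measured):
        train_step(next(it))
    torch.cuda.synchronize()
    print(f"{measured / (time.time() - t0):.1f} it/s")
```

If this reports well under ~10 it/s on a V100, it may be worth checking whether the dataloader (e.g., `num_workers=0`) is the bottleneck rather than the GPU.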
I'm sure I used the config cloned from the repo. Below are the loss curves for the Thai statue and the config I used. As I recall, the speed is around 3-4 it/s or even lower. I initially suspected the GPU was not being utilized during training, but `torch.cuda.is_available()` returns True. It's weird to see this behavior. Thanks in advance for the help. PS: Is the 3D mesh used for training watertight? Does this matter? We used a watertight mesh to train.
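One thing to note: `torch.cuda.is_available()` only checks that a GPU is visible; it does not guarantee the model and batches are actually placed on it. A minimal sanity check, assuming `model` and `batch` are the objects created by the training script:

```python
import torch

print(torch.cuda.get_device_name(0))            # e.g. "Tesla V100-SXM2-16GB"
print(next(model.parameters()).device)          # should be cuda:0, not cpu
print(torch.cuda.memory_allocated(0) / 1e9, "GB allocated on the GPU")
# Inside the training loop, the input batch should also live on the GPU:
# print(batch.device)   # or batch["coords"].device, depending on the dataset format
```

Watching `nvidia-smi` while training runs is another quick way to confirm the GPU is actually being used.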
Hmm, unfortunately I'm still having a hard time reproducing this. One observation is that my occupancy loss curves look very different from yours. It's almost as if the block optimization is not happening at all in your case. You should see the error spike a bit at the intervals where the block optimization is done. The loss also doesn't go down monotonically, because as the blocks subdivide there are more blocks and hence more fitting error. I followed the steps below:
I get the result below after 20k iterations (I exported the mesh and visualized it in MeshLab). This took an hour or so on an old Titan X GPU.
Thank you for your feedback. The issue still happens on our side even though I tried to replicate the process you mentioned. Either
I will dig a little more into this and update if I find the issue. Thanks again for your help!
Hmm, this is strange, and it's hard to diagnose since I can't reproduce it. A few other thoughts: maybe there is some difference in the hardware or Python packages? Otherwise, did you rename the downloaded model file to thai_statue.ply in the data directory? Does it work if you try on a different machine?
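To compare package versions across machines, a quick environment dump like the following (standard Python/PyTorch calls only) could be pasted into the thread:

```python
import sys
import torch

print("python :", sys.version.split()[0])
print("torch  :", torch.__version__)
print("cuda   :", torch.version.cuda)
print("cudnn  :", torch.backends.cudnn.version())
print("gpu    :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```

PyTorch also ships `python -m torch.utils.collect_env`, which prints a more complete report.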
Hi, I'm training a 3D model (an engine) with your code, and I completely followed the steps in the README. But the training runs far too slowly (it would take more than 1,000 hours to finish). Where is the problem?
(I used a GeForce RTX 2080 Ti.)
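One way to narrow this down is to time data loading and the optimization step separately for a handful of iterations. This is a sketch, with `dataloader` and `train_step` as placeholders for the script's actual objects:

```python
import time
import torch

data_time, step_time = 0.0, 0.0
t0 = time.time()
for i, batch in enumerate(dataloader):
    t1 = time.time()
    data_time += t1 - t0                    # time spent waiting on the dataloader
    train_step(batch)                       # one forward/backward/optimizer step
    torch.cuda.synchronize()                # wait for the GPU to finish this step
    t0 = time.time()
    step_time += t0 - t1
    if i >= 100:
        break
print(f"data: {data_time:.1f}s  compute: {step_time:.1f}s over {i + 1} iterations")
```

If the data time dominates, raising `num_workers` (and enabling `pin_memory`) on the DataLoader usually helps; if the compute time dominates, the issue is more likely in the model or config.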