CPU 100% when finetuning SE-Net with nvcaffe 0.17.3 #567
Comments
Hi @whria78, could you attach the complete log here, please?
Thank you for the reply. I attached the log and trainval.prototxt.
@whria78 thanks.
Is this your custom layer? CPU-only? If so, you might need to consider adding a GPU implementation to prevent the CPU overload.
Thank you @drnikolaev. I tried to fine-tune SE-ResNeXt-50. I changed the last classification layer, from the original (https://github.com/hujie-frank/SENet ; class = 1000) to class = 178.
To my knowledge, this modification is commonly used for fine-tuning, and I don't think it should drive all 8 CPU cores to 100%. The problem does not occur if I switch back to the old NVCaffe 0.17.2. Another fact is that I had no such problem when I fine-tuned VGG-19 with the new NVCaffe 0.17.3.
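For reference, this kind of edit usually amounts to renaming the final InnerProduct layer and changing its num_output. A minimal sketch, assuming the classifier / pool5/7x7_s1 names from the hujie-frank/SENet reference prototxt; the new name classifier_178 is a hypothetical choice so that Caffe re-initializes the layer instead of trying to copy the 1000-class weights:

```
layer {
  name: "classifier_178"      # hypothetical new name; avoids a shape mismatch with pretrained weights
  type: "InnerProduct"
  bottom: "pool5/7x7_s1"      # final global-pooling output in the reference prototxt
  top: "classifier_178"
  inner_product_param {
    num_output: 178           # was 1000 in the ImageNet model
  }
}
```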
As I have observed, the CPU overload problem occurred around "batch_transformer.cpp:51] Started BatchTransformer thread 32123". Before that line there is "Waiting for datum", so I guess the CPU overload had already started at that stage, which results in "waiting for datum". The CPU overload problem occurs before Iteration 1.
I0618 10:02:14.725428 32119 internal_thread.cpp:78] Started internal thread 32119 on device 0, rank 0
"waiting for datum" means that data reader (CPU-based) delivers data not fast enough for the solver. Therefore, it uses CPU up to 100% |
Hello, thank you for maintaining NVCaffe. I have a problem when I use NVCaffe with SE-Net models.
CPU usage hits 100% when fine-tuning SE-Net (or any SE-ResNeXt) with NVCaffe 0.17.3.
There was no problem when I fine-tuned SE-Net with the old NVCaffe 0.17.2.
However, if I train with the new NVCaffe 0.17.3, the CPU utilization of all cores jumps to 100% after the initial test iterations 0, 1, 2.
I tested on 2 systems and got the same results (Intel Skylake, Ubuntu 16.04, CUDA 10.1, cuDNN 10.1, nvidia-driver-418).
Because of this problem, I could not train SENet, SE-ResNeXt-50, or SE-ResNeXt-100. However, there was no problem training the VGG model with the new NVCaffe 0.17.3.