Evaluation runs on CPU when using multi-gpu #521
Comments
Not that I'm aware of. The thing is, Keras' multi-gpu implementation is quite bad. It mangles the model to split and merge over multiple GPUs. You don't want this splitting and merging when evaluating, so a different model is required for evaluation. That model is unlikely to fit on one GPU, so the only other option really is the CPU. My advice: disable evaluation when using multi-gpu (or don't use multi-gpu until Keras fixes the implementation). I'll leave the issue open, but I'll change the title to be more fitting.
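For reference, the pattern described here looks roughly like the sketch below. This is not the literal keras-retinanet code: `create_model` is a placeholder for whatever builds the RetinaNet model, and the only Keras API assumed is `keras.utils.multi_gpu_model`.

```python
import tensorflow as tf
from keras.utils import multi_gpu_model


def build_models(create_model, num_gpus):
    """Sketch of the split described above: a template model that holds the
    canonical weights, and a wrapped copy that trains across several GPUs."""
    if num_gpus > 1:
        # Keep the canonical weights on the CPU; multi_gpu_model then copies
        # the model onto each GPU, splits every batch, and merges the outputs.
        with tf.device('/cpu:0'):
            template_model = create_model()
        training_model = multi_gpu_model(template_model, gpus=num_gpus)
    else:
        template_model = create_model()
        training_model = template_model

    # Evaluation should use the unwrapped template model, which is why it
    # ends up on a single device (the CPU in the original script).
    return template_model, training_model
```

The two models share their weights, so evaluating the template model does evaluate the trained weights; the question is only which device those forward passes run on.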
Yeah, I've noticed that (the `with tf.device('/cpu:0'):` in train.py). However, since I realized Keras' multi-gpu implementation is bad, I tried to put the eval operation on just a single GPU, so I edited `with tf.device('/cpu:0'):` to `with tf.device('/gpu:0'):`. And it seems to work: the eval operation appears to run on GPU0. Are there any problems or potential risks with my modification? Does the multi-gpu model merge automatically when evaluating, or do I lose the other part of the weights on GPU1? Thanks.
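For anyone following along, the edit being described amounts to something like this (paraphrased; `create_model` below is only a stand-in so the snippet is self-contained, not the actual builder in train.py):

```python
import tensorflow as tf
from keras.layers import Dense, Input
from keras.models import Model


def create_model():
    # Stand-in for the real RetinaNet builder used by train.py.
    inputs = Input(shape=(4,))
    return Model(inputs=inputs, outputs=Dense(1)(inputs))


# Before (paraphrased from train.py): the model used for evaluation is
# created on the CPU when multiple GPUs are requested.
#
#     with tf.device('/cpu:0'):
#         model = create_model()
#
# After the modification: pin that model to the first GPU instead, so the
# evaluation forward passes run on GPU0 rather than the CPU.
with tf.device('/gpu:0'):
    model = create_model()
```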
If that runs for you then it's fine. The biggest risk is that it simply
doesn't fit in your GPU memory.
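If you do try this, one way to see whether everything actually fits on GPU0 is to let TensorFlow allocate GPU memory on demand instead of reserving it all at start-up, so nvidia-smi shows real usage. This is the standard TF 1.x session config that Keras used at the time; nothing here is specific to keras-retinanet:

```python
import keras.backend as K
import tensorflow as tf

# TF 1.x style: grow the GPU memory allocation as needed rather than
# grabbing it all up front, which makes memory problems easier to diagnose.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))
```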
hi @hgaiser, was your released pretrained model trained with multi-gpu or single-gpu?
Single GPU
retinanet-evaluate --convert-model ./model/resnet50_csv_100.h5 csv ./train.csv ./class.csv
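Running evaluation as a separate step like that avoids the in-training callback entirely. Roughly the Python equivalent, assuming a keras-retinanet version that has the `models.load_model` / `models.convert_model` helpers from the README (older versions loaded with `keras.models.load_model` and custom objects instead); the paths are only examples:

```python
from keras_retinanet import models
from keras_retinanet.preprocessing.csv_generator import CSVGenerator
from keras_retinanet.utils.eval import evaluate

# Load a training snapshot and turn it into an inference model; this is
# what the --convert-model flag does on the command line.
model = models.load_model('./model/resnet50_csv_100.h5', backbone_name='resnet50')
model = models.convert_model(model)

# Build the CSV generator and compute mAP, mirroring the retinanet-evaluate call.
generator = CSVGenerator('./train.csv', './class.csv')
average_precisions = evaluate(generator, model)
```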
This issue has been automatically marked as stale due to the lack of recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm trying to train the model with a resnet50 backbone on a custom CSV dataset.
Our dataset is about 7200 training images with ~11000 boxes, and 1800 validation images with ~2600 boxes.
Our GPUs are two GTX 1080s, so I train the model with --multi-gpu=2.
However, I find that the validation step with the Evaluate callback is very slow, with high CPU load and low (~0%) GPU load.
Each training epoch takes about 10 minutes, and the validation takes more than 30 minutes.
Is this caused by the evaluate operation running on the CPU?
Is there any way to solve this problem?
Thanks.
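For context, the per-epoch evaluation comes from a Keras callback that train.py attaches to fit_generator. Schematically it looks like the sketch below (paraphrased, with hypothetical generator and model names; the real script also redirects the callback to the separate prediction model, which is exactly where the CPU placement discussed above matters):

```python
from keras_retinanet.callbacks.eval import Evaluate


def attach_evaluation(training_model, train_generator, validation_generator):
    """Paraphrased wiring: Evaluate recomputes mAP over the validation set at
    the end of every epoch, which is the slow step reported here."""
    evaluation = Evaluate(validation_generator)
    training_model.fit_generator(
        generator=train_generator,
        steps_per_epoch=1000,
        epochs=50,
        callbacks=[evaluation],
    )
```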