Evaluation runs on CPU when using multi-gpu #521

Open
parap1uie-s opened this issue Jun 22, 2018 · 7 comments

@parap1uie-s

I'm trying to train the model with the resnet50 backbone on a custom CSV dataset.
Our dataset has about 7200 training images with ~11000 boxes, and 1800 validation images with ~2600 boxes.
We have two GTX 1080 GPUs, so I train the model with --multi-gpu=2.
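
(For reference, the full command looks roughly like this; the flag spelling and CSV paths are reconstructed, not verbatim:)

```
retinanet-train --backbone resnet50 --multi-gpu 2 csv annotations.csv classes.csv
```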

However, I find that the validation step with the Evaluate callback is very slow, with high CPU load and nearly zero (~0%) GPU load.

Each training epoch takes about 10 minutes, while validation takes more than 30 minutes.

Is this caused by the evaluation running on the CPU?
Is there any way to solve this problem?

Thanks.

@hgaiser
Contributor

hgaiser commented Jun 22, 2018

> Is there any way to solve this problem?

Not that I'm aware of. The thing is, Keras's multi-GPU implementation is quite bad: it mangles the model to split inputs and merge outputs across multiple GPUs. You don't want that splitting and merging when evaluating, so a different model is required for evaluation. That model is unlikely to fit on a single GPU, so the only remaining option is the CPU. My advice: disable evaluation when using multi-GPU (or don't use multi-GPU until Keras fixes the implementation).
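
To illustrate, a minimal sketch of the pattern in plain Keras (assuming keras.utils.multi_gpu_model, which is what the training script wraps the model with; this is not keras-retinanet's exact code):

```python
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# Build the template model on the CPU; multi_gpu_model places a replica on
# each GPU, splits every batch across the replicas, and merges the outputs.
with tf.device('/cpu:0'):
    model = Sequential([Dense(10, activation='softmax', input_shape=(128,))])

# Wrapper used for training only; it shares its weights with `model`.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam', loss='categorical_crossentropy')

# Saving and evaluating go through the unwrapped template, which sits on the
# device it was built under, here the CPU, hence the slow evaluation.
```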

I'll leave the issue open, but I'll change the title to be more fitting.

@hgaiser changed the title from "validation is very slow" to "Evaluation runs on CPU when using multi-gpu" on Jun 22, 2018
@parap1uie-s
Author

> Not that I'm aware of. The thing is, Keras's multi-GPU implementation is quite bad: it mangles the model to split inputs and merge outputs across multiple GPUs. You don't want that splitting and merging when evaluating, so a different model is required for evaluation. That model is unlikely to fit on a single GPU, so the only remaining option is the CPU. My advice: disable evaluation when using multi-GPU (or don't use multi-GPU until Keras fixes the implementation).

Yeah, I've noticed that `with tf.device('/cpu:0'):` at line 105 of bin/train.py.

However, having realized the problems with Keras's multi-GPU implementation, I tried to run the evaluation on just a single GPU, so I changed `with tf.device('/cpu:0'):` to `with tf.device('/gpu:0'):`.

And it seems to work: the evaluation appears to run on GPU 0.
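
Concretely, the edit is just the device context around the model construction (surrounding code sketched, not quoted from train.py; build_retinanet is a hypothetical stand-in):

```python
import tensorflow as tf

with tf.device('/gpu:0'):      # was: with tf.device('/cpu:0'):
    model = build_retinanet()  # hypothetical stand-in for the real model construction
```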

Are there any problems or potential risks with my modification?

Does the multi-GPU model merge automatically when evaluating, or do I lose the part of the weights that lives on GPU 1?

Thanks.

@hgaiser
Contributor

hgaiser commented Jun 23, 2018 via email

@DenceChen

Hi @hgaiser, was your released pretrained model trained with multi-GPU or a single GPU?

@hgaiser
Contributor

hgaiser commented Mar 4, 2019

Single GPU

@beibeiZ

beibeiZ commented Sep 5, 2019

> Single GPU

```
$ retinanet-evaluate --convert-model ./model/resnet50_csv_100.h5 csv ./train.csv ./class.csv
Using TensorFlow backend.
usage: retinanet-evaluate [-h] [--convert-model] [--backbone BACKBONE]
                          [--gpu GPU] [--score-threshold SCORE_THRESHOLD]
                          [--iou-threshold IOU_THRESHOLD]
                          [--max-detections MAX_DETECTIONS]
                          [--save-path SAVE_PATH]
                          [--image-min-side IMAGE_MIN_SIDE]
                          [--image-max-side IMAGE_MAX_SIDE] [--config CONFIG]
                          {coco,pascal,csv} ... model
retinanet-evaluate: error: argument dataset_type: invalid choice: './model/resnet50_csv_100.h5' (choose from 'coco', 'pascal', 'csv')
```
Why do I get this error?
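
From the usage text above, the positional arguments are `{coco,pascal,csv} ... model`: the dataset type and its files come first, and the model path comes last. The command above passes the model path where the dataset type is expected. A corrected invocation with the same paths would be:

```
retinanet-evaluate --convert-model csv ./train.csv ./class.csv ./model/resnet50_csv_100.h5
```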

@stale

stale bot commented Nov 8, 2019

This issue has been automatically marked as stale due to the lack of recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale bot added the "stale" label (Issues with no activity for a long time) on Nov 8, 2019
@hgaiser removed the "stale" label on Nov 11, 2019