Evaluation runs on CPU when using multi-gpu #521

Open
parap1uie-s opened this issue Jun 22, 2018 · 7 comments

@parap1uie-s

I'm trying to train the model with the resnet50 backbone on a custom CSV dataset.
Our dataset has about 7200 training images with ~11000 boxes, and 1800 validation images with ~2600 boxes.
We have two GTX 1080 GPUs, so I train the model with --multi-gpu=2.
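
(For reference, the full command looks roughly like this; the flag spelling and CSV paths are reconstructed, not verbatim:)

```
retinanet-train --backbone resnet50 --multi-gpu 2 csv annotations.csv classes.csv
```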

However, I find that the validation step with the Evaluate callback is very slow, with high CPU load and nearly zero (~0%) GPU load.

Each training epoch takes about 10 minutes, while validation takes more than 30 minutes.

Is this caused by the evaluation running on the CPU?
Is there any way to solve this problem?

Thanks.

@hgaiser
Contributor

hgaiser commented Jun 22, 2018

> Is there any way to solve this problem?

Not that I'm aware of. The thing is, Keras's multi-GPU implementation is quite bad: it mangles the model to split inputs and merge outputs across multiple GPUs. You don't want that splitting and merging when evaluating, so a different model is required for evaluation. That model is unlikely to fit on a single GPU, so the only remaining option is the CPU. My advice: disable evaluation when using multi-GPU (or don't use multi-GPU until Keras fixes the implementation).
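
To illustrate, a minimal sketch of the pattern in plain Keras (assuming keras.utils.multi_gpu_model, which is what the training script wraps the model with; this is not keras-retinanet's exact code):

```python
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# Build the template model on the CPU; multi_gpu_model places a replica on
# each GPU, splits every batch across the replicas, and merges the outputs.
with tf.device('/cpu:0'):
    model = Sequential([Dense(10, activation='softmax', input_shape=(128,))])

# Wrapper used for training only; it shares its weights with `model`.
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam', loss='categorical_crossentropy')

# Saving and evaluating go through the unwrapped template, which sits on the
# device it was built under, here the CPU, hence the slow evaluation.
```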

I'll leave the issue open, but I'll change the title to be more fitting.

@hgaiser changed the title from "validation is very slow" to "Evaluation runs on CPU when using multi-gpu" on Jun 22, 2018
@parap1uie-s
Author

> Not that I'm aware of. The thing is, Keras's multi-GPU implementation is quite bad: it mangles the model to split inputs and merge outputs across multiple GPUs. You don't want that splitting and merging when evaluating, so a different model is required for evaluation. That model is unlikely to fit on a single GPU, so the only remaining option is the CPU. My advice: disable evaluation when using multi-GPU (or don't use multi-GPU until Keras fixes the implementation).

Yeah, I've noticed that `with tf.device('/cpu:0'):` at line 105 of bin/train.py.

However, having realized the problems with Keras's multi-GPU implementation, I tried to run the evaluation on just a single GPU, so I changed `with tf.device('/cpu:0'):` to `with tf.device('/gpu:0'):`.

And it seems to work: the evaluation appears to run on GPU 0.
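
Concretely, the edit is just the device context around the model construction (surrounding code sketched, not quoted from train.py; build_retinanet is a hypothetical stand-in):

```python
import tensorflow as tf

with tf.device('/gpu:0'):      # was: with tf.device('/cpu:0'):
    model = build_retinanet()  # hypothetical stand-in for the real model construction
```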

Are there any problems or potential risks with my modification?

Does the multi-GPU model merge automatically when evaluating, or do I lose the part of the weights that lives on GPU 1?

Thanks.

@hgaiser
Contributor

hgaiser commented Jun 23, 2018 via email

@DenceChen

Hi @hgaiser, was your released pretrained model trained with multi-GPU or a single GPU?

@hgaiser
Contributor

hgaiser commented Mar 4, 2019

Single GPU

@beibeiZ

beibeiZ commented Sep 5, 2019

> Single GPU

```
$ retinanet-evaluate --convert-model ./model/resnet50_csv_100.h5 csv ./train.csv ./class.csv
Using TensorFlow backend.
usage: retinanet-evaluate [-h] [--convert-model] [--backbone BACKBONE]
                          [--gpu GPU] [--score-threshold SCORE_THRESHOLD]
                          [--iou-threshold IOU_THRESHOLD]
                          [--max-detections MAX_DETECTIONS]
                          [--save-path SAVE_PATH]
                          [--image-min-side IMAGE_MIN_SIDE]
                          [--image-max-side IMAGE_MAX_SIDE] [--config CONFIG]
                          {coco,pascal,csv} ... model
retinanet-evaluate: error: argument dataset_type: invalid choice: './model/resnet50_csv_100.h5' (choose from 'coco', 'pascal', 'csv')
```
Why do I get this error?
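
From the usage text above, the positional arguments are `{coco,pascal,csv} ... model`: the dataset type and its files come first, and the model path comes last. The command above passes the model path where the dataset type is expected. A corrected invocation with the same paths would be:

```
retinanet-evaluate --convert-model csv ./train.csv ./class.csv ./model/resnet50_csv_100.h5
```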

@stale

stale bot commented Nov 8, 2019

This issue has been automatically marked as stale due to the lack of recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale bot added the "stale" label (Issues with no activity for a long time) on Nov 8, 2019
@hgaiser removed the "stale" label on Nov 11, 2019