Multi-GPU training is not working #18
I have the same issue.
When I run the run_common_voice.py code, these messages are shown.
I think some of the tf.function retracing (?) affects the speed of training. Does the retracing warning have a connection with the OOM error? Thank you.
@Nambee Did single-GPU training work for you?
No, it does not work. To see the progress, I printed some logs in the 'run_evaluate' function, which is inside the 'run_training' function. (I attach this code at the end of this comment; I only added 'print' calls.)
@Nambee From this log, you can see that you are running out of GPU memory; reducing the batch size to 8 or lower should fix the problem. But it still looks like, due to eager execution, the memory requirement keeps growing, and only at the eval step. My system fails to allocate GPU memory after 19000 batches at epoch 0.
Oh, it's a different issue, sorry. Can you run run_common_voice.py without the OOM error?
Yes, it worked for me. Even though you use CUDA_VISIBLE_DEVICES=0 to specify one GPU, you still have to set strategy = None in run_common_voice.py.
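(A minimal sketch of what that change looks like, assuming run_common_voice.py picks the distribution strategy near the top of its main function; the surrounding names are placeholders, not the repository's exact code.)

```python
import os

# Expose only the first GPU; must be set before TensorFlow initializes the GPUs.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import tensorflow as tf

# Multi-GPU path (what the script uses by default):
# strategy = tf.distribute.MirroredStrategy()

# Single-GPU / un-distributed path: with strategy set to None, the training
# loop calls the train/eval steps directly instead of going through the strategy.
strategy = None
```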
@prajwaljpj Thank you for your advice. The retracing errors are gone when I disable the strategy. I still get the OOM error, so I should reduce some factors. Again, thank you!
@Nambee The strategy part is not implemented for eval. If you look at the training function, there is a condition that implements the strategy and experimental_run; you have to make a similar change for eval. Also try reducing the batch size to 2.
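A rough sketch of the kind of change being suggested, assuming a TF 2.1-style MirroredStrategy; model, loss_fn, and the step signature are placeholders, not the repository's actual code:

```python
import tensorflow as tf

@tf.function
def distributed_eval_step(strategy, model, loss_fn, inputs, labels):
    """Run one eval step, dispatching through the strategy when one is active."""

    def eval_step(inputs, labels):
        predictions = model(inputs, training=False)
        return loss_fn(labels, predictions)

    if strategy is not None:
        # Mirror the training branch: run per replica, then reduce the losses.
        per_replica_losses = strategy.experimental_run_v2(
            eval_step, args=(inputs, labels))
        return strategy.reduce(
            tf.distribute.ReduceOp.MEAN, per_replica_losses, axis=None)

    return eval_step(inputs, labels)
```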
@prajwaljpj Yes, I did that already, but I apply it only to small datasets, because I need a feasibility check for now.
Can someone please let me know if this is resolved in the latest commit? I do not have a multi-GPU machine to test on. Thanks.
Could this be related to #29?
It does seem so. First off, there seems to be an error at rnnt-speech-recognition/run_rnnt.py, line 570 in a0d972f. If I run the training, I get an error; for completeness, the full error log is attached (click to expand).
I can train the model on multiple GPUs by adding a @tf.function decorator; refer to this link: tensorflow/tensorflow#29911. I also added the line os.environ['CUDA_VISIBLE_DEVICES'] = "{your gpus}" in my code.
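For reference, a self-contained sketch of that approach; the step body and names are placeholders, only the decorator and the environment variable follow the comment above:

```python
import os

# Must be set before TensorFlow initializes the GPUs, e.g. "0,1".
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

@tf.function  # the decorator keeps MirroredStrategy from running eagerly
def train_step(dist_inputs):
    def step_fn(inputs):
        # forward pass, loss, and gradient update would go here
        return tf.reduce_sum(inputs)

    per_replica = strategy.experimental_run_v2(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)
```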
Maybe take a look at https://github.com/usimarit/TiramisuASR |
I have a machine with
2x Nvidia RTX 2080 Ti
8-core Intel i7 processor
32 GB of RAM
Running the training code (non-Docker version) with CUDA_VISIBLE_DEVICES=0,1 causes a memory leak in eval_step.
python run_common_voice.py --mode train --data_dir
These are the warnings I get. I am not able to pinpoint which object is causing the retracing error.
Performing evaluation. [949/1811]
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0421 00:39:38.737240 140487075895104 cross_device_ops.py:439] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0421 00:39:38.740701 140487075895104 cross_device_ops.py:439] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0421 00:39:38.743986 140487075895104 cross_device_ops.py:439] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0421 00:39:38.747186 140487075895104 cross_device_ops.py:439] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2020-04-21 00:39:43.431398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-21 00:39:44.193788: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
W0421 00:39:49.856330 140487075895104 mirrored_strategy.py:692] Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0421 00:39:49.859219 140487075895104 cross_device_ops.py:439] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0421 00:39:49.859964 140487075895104 cross_device_ops.py:439] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
W0421 00:39:49.861165 140487075895104 mirrored_strategy.py:692] Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0421 00:39:49.863494 140487075895104 cross_device_ops.py:439] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0421 00:39:49.864265 140487075895104 cross_device_ops.py:439] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
W0421 00:39:49.865403 140487075895104 mirrored_strategy.py:692] Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0421 00:39:49.867894 140487075895104 cross_device_ops.py:439] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0421 00:39:49.868691 140487075895104 cross_device_ops.py:439] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
W0421 00:39:49.869868 140487075895104 mirrored_strategy.py:692] Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
W0421 00:40:01.864544 140487075895104 mirrored_strategy.py:692] Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
WARNING:tensorflow:5 out of the last 5 calls to <function run_evaluate.<locals>.eval_step at 0x7fc4885dc598> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings is likely due to passing python objects instead of tensors. Also, tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. Please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
W0421 00:40:37.296950 140487075895104 def_function.py:586] 5 out of the last 5 calls to <function run_evaluate.<locals>.eval_step at 0x7fc4885dc598> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings is likely due to passing python objects instead of tensors. Also, tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. Please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
WARNING:tensorflow:6 out of the last 6 calls to <function run_evaluate.<locals>.eval_step at 0x7fc4885dc598> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings is likely due to passing python objects instead of tensors. Also, tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. Please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
Is this a TensorFlow issue?
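For what it's worth, this retracing warning is usually triggered when the tf.function is re-created on every call (for example, an eval_step defined inside run_evaluate and rebuilt each epoch) or when it is called with changing Python values instead of tensors; whether that is what happens here is an assumption. A generic sketch of the usual fix, with placeholder names rather than this repository's code:

```python
import tensorflow as tf

# Define the tf.function ONCE, outside the evaluation loop, so it is traced
# a bounded number of times instead of once per call.
@tf.function(experimental_relax_shapes=True)
def eval_step(model, inputs, labels):
    predictions = model(inputs, training=False)
    return tf.keras.losses.sparse_categorical_crossentropy(labels, predictions)

def run_evaluate(model, dataset):
    losses = []
    for inputs, labels in dataset:
        # Pass tensors, not Python scalars/objects, to avoid new traces.
        losses.append(eval_step(model, inputs, labels))
    return tf.reduce_mean(tf.concat(losses, axis=0))
```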