Pretraining of albert from scratch is stuck #36

008karan · 2020-02-27T09:42:31Z

I am pre-training ALBERT from scratch. Training appears to have started, since the GPUs are in use, but nothing is printed to the terminal except this:

***** Number of cores used :  4 
I0227 09:00:31.841020 140137372948224 run_pretraining.py:226] Training using customized training loop TF 2.0 with distrubutedstrategy.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
I0227 09:00:44.563593 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:44.569019 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
I0227 09:00:45.620952 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:45.625989 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:46.679141 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:46.684157 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:47.734523 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:47.739573 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:57.697876 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:57.703157 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0227 09:01:07.835676 140137372948224 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0227 09:01:28.672055 140137372948224 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
2020-02-27 09:01:50.162839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

I also tried a smaller text dataset, but got the same result.
@kamalkraj


josegchen · 2020-03-02T20:51:36Z

Same problem here. 9 GPUs are available, but no training happens at all.

008karan · 2020-03-03T07:52:29Z

I tested on a very small dataset (100 KB), and it printed results at the end of each epoch. I want to see results at every step: on a larger dataset each epoch takes a long time, so per-step output is needed. I still have not figured out how to do that.
@kamalkraj @josegchen
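A minimal sketch of per-step reporting, for anyone with the same need. It is not the repository's code: `run_with_step_logging`, `train_step`, and `log_every` are hypothetical names, and `train_step` stands in for whatever function runs one optimizer step and returns the loss. The idea is only that the loop itself prints every `log_every` steps instead of waiting for the end of an epoch.

```python
import time
from typing import Callable, Iterable, List


def run_with_step_logging(
    batches: Iterable,
    train_step: Callable[[object], float],
    log_every: int = 1,
    log: Callable[[str], None] = print,
) -> List[float]:
    """Call train_step on each batch and report the loss every `log_every` steps."""
    if log_every < 1:
        raise ValueError("log_every must be >= 1")
    losses = []
    start = time.time()
    for step, batch in enumerate(batches, start=1):
        # Convert to a plain Python float so the value can be formatted.
        loss = float(train_step(batch))
        losses.append(loss)
        if step % log_every == 0:
            elapsed = time.time() - start
            log(f"step {step}: loss={loss:.4f} ({elapsed:.1f}s elapsed)")
    return losses


if __name__ == "__main__":
    # Dummy stand-in for a real training step: the "loss" is just 1 / batch.
    run_with_step_logging(range(1, 6), lambda b: 1.0 / b, log_every=2)
```

Reading the loss back on every step forces the host to wait for the device, so on a large run a value such as `log_every=100` is probably a better trade-off than logging each step.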

josegchen · 2020-03-03T13:51:51Z

Would you mind sharing the parameters and settings in detail?

008karan · 2020-03-03T13:56:22Z

python run_pretraining.py --albert_config_file=model_configs/base/config.json --do_train --input_files=albert/* --meta_data_file_path=meta_data --output_dir=model_checkpoint/ --strategy_type=mirror --train_batch_size=8 --num_train_epochs=3
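If output only appears at the end of an epoch, the wait before the first line depends on how many steps one epoch contains. The sketch below works that out for the `--train_batch_size=8` and `--num_train_epochs=3` values in the command above. The example count of 1,000,000 is purely hypothetical (the real number depends on the generated records), and the sketch assumes the batch size is the global one; `training_steps` is an illustrative helper, not part of the repository.

```python
from typing import Tuple


def training_steps(num_examples: int, batch_size: int, num_epochs: int) -> Tuple[int, int]:
    """Return (steps per epoch, total steps), dropping any incomplete final batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    steps_per_epoch = num_examples // batch_size
    return steps_per_epoch, steps_per_epoch * num_epochs


if __name__ == "__main__":
    # Hypothetical dataset size; batch size and epochs come from the command above.
    per_epoch, total = training_steps(num_examples=1_000_000, batch_size=8, num_epochs=3)
    print(f"{per_epoch} steps per epoch, {total} steps in total")
```

With a small batch size, even a moderately sized dataset yields a very large number of steps per epoch, which would be consistent with a long silent period on the terminal.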

josegchen · 2020-03-03T16:38:44Z

I tried a 313 MB tf_record file; it works on CPU only.


008karan · 2020-03-04T07:45:57Z

Have you checked GPU usage? In my case, the GPU is being utilized.

josegchen · 2020-03-04T15:08:46Z

It does show minor GPU utilization: 1-2% on 7-8 of the GPUs and 25% on a single GPU, for a very short moment. However, the GPU memory is occupied. It eventually halted with a resource exhausted error.
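One thing that may explain occupied memory with almost no utilization: by default TensorFlow reserves most of the memory on every visible GPU at startup, so full memory alone does not show that training is running. A possible way to narrow the problem down is to launch on a single GPU with on-demand memory allocation. The sketch below builds such an environment using the `CUDA_VISIBLE_DEVICES` and `TF_FORCE_GPU_ALLOW_GROWTH` environment variables and reuses the flags from the command posted earlier in this thread. `single_gpu_env` and `pretraining_command` are hypothetical helper names, and nothing here is confirmed to fix the resource exhausted error.

```python
import os
import shlex
from typing import Dict, List, Optional


def single_gpu_env(
    gpu_index: int = 0,
    allow_growth: bool = True,
    base_env: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Copy the environment, exposing one GPU and optionally enabling memory growth."""
    env = dict(os.environ if base_env is None else base_env)
    # Only the listed device index is visible to CUDA applications.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    if allow_growth:
        # Ask TensorFlow to allocate GPU memory as needed instead of up front.
        env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


def pretraining_command(batch_size: int = 8) -> List[str]:
    """Rebuild the command from this thread with a configurable batch size."""
    return [
        "python", "run_pretraining.py",
        "--albert_config_file=model_configs/base/config.json",
        "--do_train",
        "--input_files=albert/*",
        "--meta_data_file_path=meta_data",
        "--output_dir=model_checkpoint/",
        "--strategy_type=mirror",
        f"--train_batch_size={batch_size}",
        "--num_train_epochs=3",
    ]


if __name__ == "__main__":
    env = single_gpu_env(gpu_index=0)
    cmd = pretraining_command(batch_size=8)
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(cmd))
    # To actually launch: subprocess.run(cmd, env=env, check=True)
```

If the error persists on one GPU, lowering `batch_size` is the usual next thing to try.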


ibrahimishag · 2021-01-26T04:00:25Z

> have you checked gpu usage? In my case, gpu is utilizing.

Dear Karan,
How did it go?
Were you able to pre-train using a single GPU?
Please share your experience!
