Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pretraining of albert from scratch is stuck #36

Open
008karan opened this issue Feb 27, 2020 · 8 comments
Open

Pretraining of albert from scratch is stuck #36

008karan opened this issue Feb 27, 2020 · 8 comments

Comments

@008karan
Copy link

I am doing pre-training from scratch. It seems that training is started as gpu's are being used but nothing is on terminal except this:

***** Number of cores used :  4 
I0227 09:00:31.841020 140137372948224 run_pretraining.py:226] Training using customized training loop TF 2.0 with distrubutedstrategy.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
I0227 09:00:44.563593 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:44.569019 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
I0227 09:00:45.620952 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:45.625989 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:46.679141 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:46.684157 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:47.734523 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:47.739573 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:57.697876 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:57.703157 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0227 09:01:07.835676 140137372948224 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0227 09:01:28.672055 140137372948224 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
2020-02-27 09:01:50.162839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

I tried on smaller text data also but same results.
@kamalkraj

@josegchen
Copy link

same problem here. 9 GPUs are available - no training at all

@008karan
Copy link
Author

008karan commented Mar 3, 2020

I have tested on very small data (100kb) then it was showing results after the end of each epoch. I want to see results at every step. As on bigger data set its taking time so printing out at every step is required. I am still not able to figure out how to do it.
@kamalkraj @josegchen

@josegchen
Copy link

josegchen commented Mar 3, 2020 via email

@008karan
Copy link
Author

008karan commented Mar 3, 2020

python run_pretraining.py --albert_config_file=model_configs/base/config.json --do_train --input_files=albert/* --meta_data_file_path=meta_data --output_dir=model_checkpoint/ --strategy_type=mirror --train_batch_size=8 --num_train_epochs=3

@josegchen
Copy link

josegchen commented Mar 3, 2020 via email

@008karan
Copy link
Author

008karan commented Mar 4, 2020

have you checked gpu usage? In my case, gpu is utilizing.

@josegchen
Copy link

josegchen commented Mar 4, 2020 via email

@ibrahimishag
Copy link

have you checked gpu usage? In my case, gpu is utilizing.

Dear Karan,
I would like to know how did it go?
Were you able to pre-train using a single GPU?
Please share your experience!.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants