First, I will give some context of my test.
I got the repo built, along with its dependencies. I configured Slurm, and have both Intel MPI and OpenMPI installed. I used the sample lines from the README to create train.txt and vocab.txt. The CUDA 8.0 libraries are built and installed on the system, and I have configured GPUs in the Slurm GRES. I also see messages showing the GPU libraries being loaded by TensorFlow.
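For context, this is roughly the shape of the batch script I submit (a minimal sketch; the job name, partition, and training script name are placeholders, not my exact values):

```shell
#!/bin/bash
#SBATCH --job-name=train-test   # placeholder job name
#SBATCH --ntasks=2              # I tried 2 and also 50 tasks
#SBATCH --gres=gpu:1            # request a GPU through the Slurm GRES I configured
#SBATCH --partition=gpu         # placeholder partition name

# launch the training under MPI (mpirun from either Intel MPI or OpenMPI);
# train.py and its flags are placeholders for the repo's actual entry point
mpirun python train.py --max-iterations 10
```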
I also set --max-iterations to 10 to reduce the runtime. For such a small dataset, the run should finish very quickly, but it has been running for days. I tried with 2 tasks and also with 50 tasks. I see many CPU cores running at almost 100%, but nothing running on the GPU.
First question: why is it running forever, for such a small test?
Second question: why are the GPUs not being used?
Thanks in advance,
Nitin