Skip to content

Latest commit

 

History

History
65 lines (41 loc) · 3.28 KB

File metadata and controls

65 lines (41 loc) · 3.28 KB

Using the Tensorflow Profiler

Here we demonstrate how to utilize the Tensorflow Profiler on Theta. This profiler is especially useful in quickly understanding if your workflow is limited by the model execution time or the data pipeline.

We build on the Tensorflow CIFAR10 example from the Distributed Training section of this repo.

There are only a few changes needed to utilize this profiler. The training loop, or portion targeted for profiling, must be wrapped in these calls:

import tensorflow as tf
# ...

tf.profiler.experimental.start('/path/to/log/output/')

# ... training loop ...

tf.profiler.experimental.stop()

If the stop method is not called, no output will be written to disk. In addition, keep in mind that profiling utilized memory to track all the executions. This can sometimes lead to 'out of memory' errors if the profiled portions of code run too long or have large call trees. This can be circumvented by limiting the number of training steps you profile. Typically one only needs to profile 10-100 training steps to get a sense of the behavior.

In this example we run one epoch of the training since CIFAR10 is not an intensive target.

If you have installed your own tensorflow, you need to ensure you are using Tensorflow 2.2+ and have installed the tensorboard_profiler_plugin via pip install.

Viewing output with Tensorboard

In order to view the profilers output one uses Tensorboard. This should already be installed if you have installed tensorflor or are using one of ALCF's installations. Generally, tensorboard can be run using the command:

tensorboard --port <port-number> --bind_all --logdir </path/to/log/output/>

This will start a webserver on the local machine that can be accessed via the <port-number> by opening a web browser and typing in the URL localhost:<port-number>.

Since you will typically be running this on the login node of one of our HPCs, you will need to do some ssh port forwarding to access the server. You can generally follow these steps:

  1. ssh -L$PORTA:localhost:$PORTB theta.alcf.anl.gov
  2. module load miniconda-3/2020-07
  3. cd /path/to/log/output
  4. tensorboard --port $PORTB --bind_all --logdir </path/to/log/output/> wait for message that says server has started
  5. Open browser on your laptop and go to localhost:$PORTA

Here PORTA and PORTB are set to different values, both need to be larger than 1024. For example, one could use PORTA=9000 and PORTB=9001.

After navigating to the page in your browser you should see a page similar to this.

tensorboard_overview

Sometimes the Profile tab at the top left does not show up, but can be selected via the drop down menu at the top right.

The window has Runs, Tools, and Hosts drop downs on the left. Runs is a drop down list of all the runs in the current logdir specified when Tensorboard was started. Tools is a drop down that offers different analysis pages related to the current run. Hosts will list different MPI ranks for this run.

We'll explore the Tools during our tutorial.

Running On ThetaKNL

qsub submit_theta.sh

Running On ThetaGPU

Run on ThetaGPU using

qsub submit_thetagpu.sh