Benchmarking full N-body simulation on GPU clusters #80
We will make a new bash profiling script based on the message sent by @EiffL on #cosmostat slack on Monday:
|
We created a new script to benchmark
|
@EiffL @santiagocasas I have obtained some
Also, ~2 h later, I made the same
However, I got a Then, I made the The
and
I hope to understand:
|
ok, so first comment: to understand what's going on, you probably should look at the log files, they should be called something like Don't worry about the exit codes, they are not very important for us, and they don't tell us why the job was canceled; the log file should have more info towards the end. A |
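If it helps, here is a minimal sketch for dumping the tail of those Slurm logs, assuming the default slurm-<jobid>.out naming (adjust the glob if your sbatch script redirects output elsewhere):

```python
# Print the last lines of every Slurm log in the working directory.
# Assumes the default slurm-<jobid>.out naming convention.
import glob

for path in sorted(glob.glob("slurm-*.out")):
    with open(path) as f:
        tail = f.readlines()[-20:]
    print(f"=== {path} ===")
    print("".join(tail), end="")
```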
I think I found it (
|
ah that's interesting! this means that the simulation is running out of memory and dying, because there is not enough space on the GPU. You could try with a smaller simulation. |
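As a side note, a minimal sketch (assuming the TF1-style session API used elsewhere in this thread) for letting TensorFlow allocate GPU memory on demand, so the logs show how far the run actually gets before the OOM:

```python
import tensorflow as tf

# Grow GPU memory on demand instead of pre-allocating the whole card.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True

with tf.compat.v1.Session(config=config) as sess:
    # run the (smaller) simulation graph here, e.g. with a reduced --nc
    pass
```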
ok, I heard that story once! |
Daily Report: we ran the following Slurm configurations (a small rank/GPU sanity check is sketched after the list):

Configuration 1:
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=10
with

Configuration 2:
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=10
with

Configuration 3:
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=10
with

Configuration 4:
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=10
with

Configuration 5:
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=10
with
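The sanity check mentioned above is only a sketch (it assumes horovod.tensorflow is importable inside the container); it just confirms that --ntasks and --gres=gpu:N map to the expected Horovod ranks and visible GPUs:

```python
# Per-task sanity check: report Horovod rank and visible GPUs.
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
gpus = tf.config.experimental.list_physical_devices("GPU")
print(f"rank {hvd.rank()}/{hvd.size()} "
      f"(local rank {hvd.local_rank()}) sees {len(gpus)} GPU(s)")
```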
Run time: 1:25. We noticed:
Please @santiagocasas, add some more details if I forgot something or if I got some of the configuration descriptions wrong! |
@b-remy in case you have not seen it, this is super useful I think, it shows you what configurations give weird results |
@EiffL I almost forgot, @santiagocasas and I also noticed that if we use (for instance) the following configuration:
We find 80 CPUs (shouldn't we find 40?):
|
@dlanzieri I've noticed that in the nsys logs as well. It may be that each core has two hardware threads, so it would show up as 80 cores (just as it often does when you look at htop). I'm not sure that's the explanation, but it may be the case. |
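A quick way to check the hyperthreading hypothesis (psutil is an assumption here; it may need to be installed in the container):

```python
# Compare logical CPUs (hardware threads) with physical cores.
import os
import psutil

print("logical CPUs  :", os.cpu_count())
print("physical cores:", psutil.cpu_count(logical=False))
```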
Daily Report:
--nc=128 --batch_size=1 --nx=2 --ny=2 --hsize=32
This is what the Timeline Views look like for --nsteps=1, --nsteps=2, and --nsteps=3 (screenshots attached). These profiles lead us to believe that the gaps we can see in the CUDA HW records are related to the number of steps in the N-body simulation. |
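For reference, a hedged sketch of how such a sweep over --nsteps could be scripted with nsys; the benchmark script name fpm_benchmark.py and the nsys output naming are assumptions, not the exact commands used above:

```python
# Sweep --nsteps with the other benchmark flags fixed, one nsys report per run.
import subprocess

base_flags = ["--nc=128", "--batch_size=1", "--nx=2", "--ny=2", "--hsize=32"]

for nsteps in (1, 2, 3):
    cmd = [
        "nsys", "profile", "-o", f"timeline_nsteps{nsteps}",
        "python", "fpm_benchmark.py",   # hypothetical benchmark script
        *base_flags, f"--nsteps={nsteps}",
    ]
    subprocess.run(cmd, check=True)
```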
@dlanzieri @santiagocasas I've added support for NVTX annotations in Mesh TensorFlow, which you can use to probe the different sections of the code. To add annotations you can do the following:
import nvtx.plugins.tf as nvtx_tf
from nvtx.plugins.tf.estimator import NVTXHook
from mesh_tensorflow.nvtx_ops import add_nvtx
[....]
# inside your mesh tensorflow model, where XXXX is a mesh tensor
XXXX = add_nvtx(XXXX, message='a message', domain_name='nbody')
# In the session run:
nvtx_callback = NVTXHook(skip_n_steps=0, name='Train')
with tf.compat.v1.train.MonitoredSession(hooks=[nvtx_callback]) as sess:
    ...

And you need to have installed This will add small markers at the different places of the code where the marked tensors are used. I'm attaching a full example here: https://gist.github.com/EiffL/ae6f9d58e958e87f29c5e1bc0b11193a |
@EiffL do we have to install it in the
you just need to run |
yes, I did it there.
|
I think I narrowed down where the annoying part of the code happens: the highlighted region is between these markers (which you can find in the gist I linked above):

final_state0 = add_nvtx(final_state[0], message='before_paint', domain_name='nbody')
final_field = mesh_utils.cic_paint(final_field, final_state0, halo_size)
final_field = add_nvtx(final_field, message='after_paint', domain_name='nbody')

So it looks like something goes wrong in cic_paint. @dlanzieri @santiagocasas can you check that this makes sense, based on the tests you ran yesterday? And if it does, I guess that to figure out what's happening, you can add other NVTX tags inside the |
|
Hi @dlanzieri @santiagocasas, Based on the new horovod version and on some modifications of the If you want to be able to get the same results you will need to compile the new version of
with
with
with
with
|
that's really cool @b-remy ! |
Posting here the steps to run the code inside the Singularity container (based on Meriem's slack post):

- step 0
- request a node
- run the container
- Run the python "mesh" script
- Run the python "pyramid" script
|
Attaching here the terminal outputs for these two cases above. Profiling output can be found here:
|
Here are some dlprof profiles.
def _cic_paint(mesh, neighboor_coords, kernel, shift, name=None):
    """
    Paints particles on a 3D mesh.

    Parameters:
    -----------
    mesh: tensor (batch_size, nc, nc, nc)
        Input 3D mesh tensor
    neighboor_coords: tensor
        Indices of the 8 neighbouring cells of each particle (last two dims 8 and 4)
    kernel: tensor
        CIC interpolation weights for the 8 neighbouring cells of each particle
    shift: [x, y, z] array of coordinate shifting
    """
    with tf.name_scope(name, "cic_update", [mesh, neighboor_coords, kernel]):
        shape = tf.shape(mesh)
        batch_size = shape[0]
        nx, ny, nz = shape[-3], shape[-2], shape[-1]
        # TODO: Assert shift shape
        neighboor_coords = tf.reshape(neighboor_coords, (-1, 8, 4))
        neighboor_coords = neighboor_coords + tf.reshape(
            tf.constant(shift, dtype=tf.float32), [1, 1, 4])
        neighboor_coords = tf.cast(neighboor_coords, tf.int32)

        # Scatter the kernel weights onto the mesh; duplicated indices are summed.
        update = tf.scatter_nd(neighboor_coords, tf.reshape(kernel, (-1, 8)),
                               [batch_size, nx, ny, nz])
        mesh = mesh + tf.reshape(update, mesh.shape)
        return mesh
def _cic_readout(mesh, neighboor_coords, kernel, shift, name=None):
    """
    Reads out particle values from a 3D mesh.

    Parameters:
    -----------
    mesh: tensor (batch_size, nc, nc, nc)
        Input 3D mesh tensor
    neighboor_coords: tensor
        Indices of the 8 neighbouring cells of each particle (last two dims 8 and 4)
    kernel: tensor
        CIC interpolation weights for the 8 neighbouring cells of each particle
    shift: [x, y, z] array of coordinate shifting
    """
    with tf.name_scope(name, "cic_readout", [mesh, neighboor_coords, kernel]):
        shape = tf.shape(mesh)
        batch_size = shape[0]
        nx, ny, nz = shape[-3], shape[-2], shape[-1]
        # Index away the extra leading dimensions of the mesh slice.
        mesh = mesh[:, 0, 0, 0]
        shape_part = tf.shape(neighboor_coords)
        neighboor_coords = tf.reshape(neighboor_coords, (-1, 8, 4))
        neighboor_coords = neighboor_coords + tf.reshape(
            tf.constant(shift, dtype=tf.float32), [1, 1, 4])
        neighboor_coords = tf.cast(neighboor_coords, tf.int32)

        # Gather the 8 neighbouring cell values and combine them with the CIC weights.
        meshvals = tf.gather_nd(mesh, neighboor_coords)
        weightedvals = tf.multiply(meshvals, tf.reshape(kernel, (-1, 8)))
        value = tf.reduce_sum(weightedvals, axis=-1)
        value = tf.reshape(value, shape_part[:-2])
        return value
|
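To make the data layout above concrete, here is a toy, self-contained illustration of the scatter/gather pattern these two functions rely on. None of it is FlowPM code; it just paints one particle onto a 4^3 mesh with equal CIC weights and reads it back:

```python
import tensorflow as tf

batch_size, nc = 1, 4
mesh = tf.zeros([batch_size, nc, nc, nc])

# (n_particles, 8, 4) indices: [batch, ix, iy, iz] of the 8 neighbouring cells
neighboor_coords = tf.constant([
    [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
     [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1], [0, 1, 1, 1]],
], dtype=tf.int32)
kernel = tf.fill([1, 8], 0.125)  # equal CIC weights summing to 1

# Paint: scatter the weights onto the mesh (duplicated indices are summed).
update = tf.scatter_nd(neighboor_coords, kernel, [batch_size, nc, nc, nc])
mesh = mesh + update
print(tf.reduce_sum(mesh).numpy())  # 1.0: total painted mass of one particle

# Readout: gather the same cells back and re-weight them with the kernel.
meshvals = tf.gather_nd(mesh, neighboor_coords)
value = tf.reduce_sum(meshvals * kernel, axis=-1)
print(value.numpy())  # CIC readout at the particle position
```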
This issue is to track the work on benchmarking the new Horovod backend for GPU clusters and getting profiling information for FlowPM.
We want to do the following things:
To learn how to do this profiling, keep an eye on DifferentiableUniverseInitiative/IDRIS-hackathon#2