Include DLIO benchmark #124

arcturus5340 · 2024-07-10T19:53:13Z

Include I/O access patterns in common AI applications.

Documentation can be found in the directory with the extension (/dlio/README.md) or in the corresponding readthedocs section.

# Conflicts: # dlio/utils.c

sbyna · 2024-07-18T18:57:36Z

Since this is using code from the DLIO benchmark, make sure to follow the original DLIO's license.
https://github.com/argonne-lcf/dlio_benchmark/blob/main/LICENSE

arcturus5340 · 2024-08-07T17:46:19Z

With the provided sample, in Perlmutter I am getting some warnings
Are those expected?

Yes, warnings are shown because the test involves working with a small amount of data. This is done because the same json configuration file is used in github workflow during testing. If a large number of generation and training files are specified in the configuration file, the warning will disappear, but jobs in github workflow may start taking a long time, which I believe is undesirable.

…hub workflow job

jeanbez · 2024-08-13T17:30:33Z

@arcturus5340, could you also please include an explanation in the DLIO pattern page of why this is needed (i.e., why we added a pattern that mimics DLIO instead of running DLIO directly)? If I recall, we have it in the paper, so maybe let's also add it here to make it clear that this is not the full DLIO but rather only its I/O pattern. Could we also add some sentences in the documentation about how this pattern should be updated if changes in DLIO are made, kind of an overview summary of how one might update this pattern if changes are done on the DLIO side?

arcturus5340 · 2024-08-24T18:33:50Z

@jeanbez, I have made changes to the documentation, are they sufficient? Is there anything else I need to do?

jeanbez · 2024-08-26T16:26:33Z

@jeanbez, I have made changes to the documentation, are they sufficient? Is there anything else I need to do?

We're doing some testing in the other systems as well, as soon as those are okay we should be able to merge. Thanks for updating the documentation with that additional information.

dlio/README.md

TheAssembler1 · 2024-08-26T20:13:49Z

Hi, I'm getting segmentation faults when testing on two nodes with the default configuration. I have experienced no errors when running on a single node. The gdb backtrace says that this line is the culprit:

h5bench/dlio/h5bench_dlio.c

Line 424 in 33281ad

MPI_Barrier(MPI_COMM_WORLD);

Here is the full trace:

#0  0x000014c726638448 in MPIDI_SHMI_progress ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
[Current thread is 1 (Thread 0x14c723c048c0 (LWP 550102))]
(gdb) 
(gdb) bt
#0  0x000014c726638448 in MPIDI_SHMI_progress ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#1  0x000014c7253c0fe9 in MPIR_Wait_impl ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#2  0x000014c725f62a26 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#3  0x000014c725f6e785 in MPIC_Sendrecv () from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#4  0x000014c725e9da6f in MPIR_Barrier_intra_dissemination ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#5  0x000014c7248f97a0 in MPIR_Barrier_intra_auto ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#6  0x000014c7248f9955 in MPIR_Barrier_impl ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#7  0x000014c7260e5ffd in MPIR_CRAY_Barrier ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#8  0x000014c7248f9ebe in PMPI_Barrier () from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#9  0x000014c726bed8cb in MPI_Barrier (comm=1140850688) at darshan-apmpi.c:1389
#10 0x0000000000403b43 in train_using_workers (epoch=0, 
    local_metadata_time_out=0x7fff1a971570, local_read_time_out=0x7fff1a971568)
    at /home/nlewis/src/h5bench/dlio/h5bench_dlio.c:424
#11 0x0000000000403cc5 in train (epoch=0, indices=0x17f83b0, 
    enable_multiprocessing=1 '\001')
    at /home/nlewis/src/h5bench/dlio/h5bench_dlio.c:457
--Type <RET> for more, q to quit, c to continue without paging--
#12 0x0000000000403eeb in run () at /home/nlewis/src/h5bench/dlio/h5bench_dlio.c:506
#13 0x0000000000404a02 in main (argc=57, argv=0x7fff1a971808)
    at /home/nlewis/src/h5bench/dlio/h5bench_dlio.c:725

It does not fail every single time. In addition, there are two benchmarks in the default sync-dlio.json and it is seemingly random whether the first or second will fail. Maybe the arithmetic involved with splitting the MPI communicator is buggy or the fork is causing issues?

arcturus5340 · 2024-08-28T12:47:17Z

Maybe the arithmetic involved with splitting the MPI communicator is buggy...

It is only used when the total-training-steps parameter is set (which is not set in sync-dlio.json), so this should not be the cause of the error.

...or the fork is causing issues?

This also seems unbelievable, since you mentioned that the bug occurs in both benchmarks. The first one is only responsible for generating data and does not use fork(). Am I right in understanding that the same error occurs when using both benchmarks? What does the trace stack look like when the first benchmark crashes? How often does the benchmark crash? I have run the benchmark in the configuration you mentioned 6 times, but never got an error, maybe it's the MPI libraries we use. In my case it is Intel MPI library v2021.6.0. Could you please specify which MPI library you are using? Could you also provide the output of the generated stdout files for both benchmarks?

TheAssembler1 · 2024-08-28T17:08:47Z

@arcturus5340 Disregard my statement about it failing in both benchmarks. I reran 50 times and it only failed in the training benchmark. I think I misread the log output... apologies.

Here is my MPI details. MPICH version is 8.1.28. The NVIDIA HPC SDK version used is 23.3:

> mpicc -show
nvc -I/opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/include -L/opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/lib -lmpi_nvidia

It failed in 7 of the 50 runs.

In the stderr:

x3006c0s1b1n0.hsn.cm.polaris.alcf.anl.gov: rank 2 died from signal 11 and dumped core
x3006c0s1b1n0.hsn.cm.polaris.alcf.anl.gov: rank 0 died from signal 15

In the stdout

2024-08-28 17:04:10,055 h5bench - INFO - Starting h5bench Suite
2024-08-28 17:04:10,055 h5bench - WARNING - Base directory already exists: /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio
2024-08-28 17:04:10,066 h5bench - INFO - Lustre support detected
2024-08-28 17:04:10,066 h5bench - DEBUG - LD_LIBRARY_PATH: /soft/perftools/darshan/darshan-3.4.4/lib:/opt/cray/pe/papi/7.0.1.2/lib64:/opt/cray/libfabric/1.15.2.0/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/nvshmem/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/nccl/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/math_libs/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/extras/qd/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/extras/CUPTI/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/lib64
2024-08-28 17:04:10,066 h5bench - DEBUG - DYLD_LIBRARY_PATH: 
2024-08-28 17:04:10,067 h5bench - DEBUG - LD_PRELOAD: 
2024-08-28 17:04:10,067 h5bench - INFO - JOBID: 2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
2024-08-28 17:04:10,067 h5bench - INFO - h5bench [dlio] - Starting
2024-08-28 17:04:10,067 h5bench - INFO - h5bench [dlio] - DIR: /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/1971d6fa-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov/
2024-08-28 17:04:10,067 h5bench - INFO - Parallel setup: mpirun -np 4
2024-08-28 17:04:10,068 h5bench - INFO - mpirun -np 4 /home/nlewis/src/h5bench/install/bin//h5bench_dlio --generate-data  --keep-files  --record-length 1048576  --num-files-train 8  --num-files-eval 2  --num-samples-per-file 4  --data-folder /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/data  --file-prefix img  --random-seed 42  --train-data-folder train  --valid-data-folder valid  --records-dataset-name records  --labels-dataset-name labels  --output-csv-name output  --output-ranks-data  --output-data-folder /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/1971d6fa-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov 
2024-08-28 17:04:12,747 h5bench - INFO - SUCCESS (all output files are located at /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/1971d6fa-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov)
2024-08-28 17:04:12,748 h5bench - INFO - Runtime: 2.6796727 seconds (elapsed time, includes allocation wait time)
2024-08-28 17:04:12,748 h5bench - INFO - h5bench [dlio] - Complete
2024-08-28 17:04:12,748 h5bench - INFO - JOBID: 2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
2024-08-28 17:04:12,748 h5bench - INFO - h5bench [dlio] - Starting
2024-08-28 17:04:12,748 h5bench - INFO - h5bench [dlio] - DIR: /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/380c81b6-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov/
2024-08-28 17:04:12,749 h5bench - INFO - Parallel setup: mpirun -np 4
2024-08-28 17:04:12,749 h5bench - INFO - mpirun -np 4 /home/nlewis/src/h5bench/install/bin//h5bench_dlio --train  --evaluation  --keep-files  --shuffle  --seed-change-epoch  --record-length 1048576  --num-files-train 8  --num-files-eval 2  --num-samples-per-file 4  --data-folder /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/data  --file-prefix img  --batch-size 2  --batch-size-eval 1  --read-threads 1  --preprocess-time 0.0  --preprocess-time-stdev 0.0  --epochs 1  --computation-time 0.123  --computation-time-stdev 0.0  --random-seed 42  --eval-time 0.123  --eval-time-stdev 0.0  --epochs-between-evals 1  --train-data-folder train  --valid-data-folder valid  --records-dataset-name records  --labels-dataset-name labels  --collective-meta  --collective-data  --output-csv-name output  --output-ranks-data  --output-data-folder /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/380c81b6-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov 
2024-08-28 17:04:15,641 h5bench - ERROR - Return: 143 (check /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/380c81b6-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov/stderr for detailed log)
2024-08-28 17:04:15,641 h5bench - INFO - Runtime: 2.8921127 seconds (elapsed time, includes allocation wait time)
2024-08-28 17:04:15,642 h5bench - INFO - h5bench [dlio] - Complete
2024-08-28 17:04:15,642 h5bench - INFO - Finishing h5bench Suite

raymerta · 2024-09-03T12:47:04Z

@TheAssembler1 can you send us the output of the mpichversion ? We don't have Cray machine here to try reproducing the exact problem but we can try testing it with our own MPICH installation.
Also, what's the HDF5 version that you are using?

arcturus5340 · 2024-09-03T12:57:11Z

it only failed in the training benchmark

In this case, we cannot rule out that the problem may be indirectly related to fork(). To make sure that this is not the case, could you run the benchmark with default parameters but read-threads set to 0?

sync-dlio-without-fork.json

TheAssembler1 · 2024-09-03T17:47:52Z

@raymerta

nlewis@x3002c0s13b0n0:~/src/h5bench/samples> mpichversion
MPICH Version:          3.4a2
MPICH Release date:     unreleased development copy
MPICH Device:           ch4:ofi
MPICH configure:        --prefix=/home/jenkins/install-nvidia-ofi --without-mpe --enable-shared --enable-sharedlibs=gcc --enable-debuginfo --enable-yield=sched_yield --enable-mpit-pvars=cray_mpiio_stat,cray_misc_stat --enable-g=mem --with-configfile=/etc/cray-mpich.conf --with-device=ch4:ofi --with-libfabric-include=/usr/include --with-libfabric-lib=/home/jenkins/pebuildenv/opt/libfabric/1.14.0-onlysockets/lib --with-custom-version-string=GTL-built-with,ROCM-5.0,CUDA-11.0 --with-namepublisher=file --with-shared-memory=sysv --with-ch4-shmmods=cray_xpmem --enable-fortran=f77,fc --with-pmiext=pmi_cray_ext.h --with-pmi=cray --with-weak-pmiext=cray --disable-allowport --with-pm=gforker --with-file-system=ufs+lustre+cray+gpfs+nfs --disable-cxx --enable-threads=runtime --enable-thread-cs=global --enable-fast=O2
MPICH CC:       /usr/bin/gcc-12 -fcommon  -I/opt/cray/dvs/2.12_4.3.22-2.3_154.1__g7b1d0e31/include -I/opt/cray/pe/gtl/0.1.1/include -I/opt/cray/pe/pmi/6.1.11/include -D_CRAY_OFI -D_CRAY_CH4 -DHAVE_LUSTRE_COMP_LAYOUT_SUPPORT   -O2
MPICH CXX:      CC
MPICH F77:      ftn
MPICH FC:       ftn
MPICH Custom Information:       GTL-built-with,ROCM-5.0,CUDA-11.0

@arcturus5340 reran 50 times with read threads set to zero and had zero failures.

arcturus5340 · 2024-12-05T14:47:07Z

@TheAssembler1 We tested the extension with MPICH v4.2.2 and were unable to reproduce the error. Could you please test the extension by compiling it without using the nvc compiler? We suspect the issue might be related to NVIDIA tools.

TheAssembler1 · 2024-12-23T17:35:01Z

@jeanbez @arcturus5340 not sure if I can help with this anymore I no longer have access to the machine (Polaris) I was using.

arcturus5340 · 2025-01-16T16:12:36Z

@jeanbez Unfortunately we have no way to reproduce the error. Is it possible to test the extension on Polaris with a different stack, as we suspect the problem might be related to nvcc or MPICH? This would help us a lot in finding the real source of the error.

arcturus5340 and others added 26 commits June 14, 2024 18:55

Add h5bench_dlio extension

1e02fc8

Add multithreading support

89c62c2

Committing clang-format changes

8bfec7c

Add Subfiling VFD support

5b702e1

Committing clang-format changes

b2d829b

Update the method of performance calculation

4cf969c

Add process logging and improve the display of benchmark results

40f127a

Update the documentation

c42040a

Committing clang-format changes

e3c23f7

Correct where the csv file is created

55a8b55

Merge remote-tracking branch 'origin/master'

c2ef677

# Conflicts: # dlio/utils.c

Committing clang-format changes

9d80ffc

Attempt number 1 to fix CI/CD

923af94

Fix CI/CD (Attempt number 2)

78f4e9a

Fix CI/CD (Attempt number 3)

6394c0c

Fix CI/CD (Attempt number 4)

6c9124b

Delete the Async VOL references in the code

3b2675b

Fix CI/CD (Attempt number 5)

15e697b

Fix CI/CD (Attempt number 6)

11c7930

Fix CI/CD (Attempt number 7)

3a2e7d7

Fix CI/CD (Attempt number 8)

8fc0dd2

Add the readthedocs documentation page

cb3994a

Update the way of counting the number of batches per rank

92c3c57

Add the ability to output information about all ranks

96facfc

Committing clang-format changes

61f9a80

Update the documentation

6d48da1

sbyna requested a review from jeanbez July 18, 2024 18:51

arcturus5340 and others added 2 commits August 6, 2024 19:30

Add total-training-steps parameter

63af2cd

Committing clang-format changes

ac6fd8f

jeanbez added enhancement New feature or request tests Test related issues or improvements new benchmark New benchmark or kernel labels Aug 6, 2024

jeanbez added this to the v.1.6 milestone Aug 6, 2024

Fix multiple issues identified in code review

c061bcf

arcturus5340 and others added 6 commits August 7, 2024 21:14

Update the sample configuration file to reduce the runtime of the git…

c9fbed9

…hub workflow job

Update the sample configuration by adding a section with data generation

0b1b0cc

Update h5bench-hdf5-1.10.8.yml

2cf3cd3

Update h5bench-hdf5-1.10.4.yml

c4b9534

Update h5bench-hdf5-1.10.7.yml

4e7c11a

Update h5bench-hdf5-1.10.8.yml

9fd3ae5

arcturus5340 added 2 commits August 17, 2024 13:17

Update the readthedocs page

56c6d0f

Update the README file

33281ad

jeanbez self-assigned this Aug 26, 2024

jeanbez requested a review from sbyna August 26, 2024 16:04

sbyna reviewed Aug 26, 2024

View reviewed changes

dlio/README.md Outdated Show resolved Hide resolved

Update the documentation

a69307c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Include DLIO benchmark #124

Include DLIO benchmark #124

arcturus5340 commented Jul 10, 2024

sbyna commented Jul 18, 2024

arcturus5340 commented Aug 7, 2024

jeanbez commented Aug 13, 2024

arcturus5340 commented Aug 24, 2024

jeanbez commented Aug 26, 2024

TheAssembler1 commented Aug 26, 2024 •

edited

Loading

arcturus5340 commented Aug 28, 2024

TheAssembler1 commented Aug 28, 2024 •

edited

Loading

raymerta commented Sep 3, 2024

arcturus5340 commented Sep 3, 2024

TheAssembler1 commented Sep 3, 2024

arcturus5340 commented Dec 5, 2024

TheAssembler1 commented Dec 23, 2024

arcturus5340 commented Jan 16, 2025

Include DLIO benchmark #124

Are you sure you want to change the base?

Include DLIO benchmark #124

Conversation

arcturus5340 commented Jul 10, 2024

sbyna commented Jul 18, 2024

arcturus5340 commented Aug 7, 2024

jeanbez commented Aug 13, 2024

arcturus5340 commented Aug 24, 2024

jeanbez commented Aug 26, 2024

TheAssembler1 commented Aug 26, 2024 • edited Loading

arcturus5340 commented Aug 28, 2024

TheAssembler1 commented Aug 28, 2024 • edited Loading

raymerta commented Sep 3, 2024

arcturus5340 commented Sep 3, 2024

TheAssembler1 commented Sep 3, 2024

arcturus5340 commented Dec 5, 2024

TheAssembler1 commented Dec 23, 2024

arcturus5340 commented Jan 16, 2025

TheAssembler1 commented Aug 26, 2024 •

edited

Loading

TheAssembler1 commented Aug 28, 2024 •

edited

Loading