Include DLIO benchmark #124

Open
wants to merge 39 commits into base: master

Conversation

arcturus5340

Include I/O access patterns in common AI applications.

Documentation can be found in the extension's directory (/dlio/README.md) or in the corresponding Read the Docs section.

@sbyna sbyna requested a review from jeanbez July 18, 2024 18:51
@sbyna
Member

sbyna commented Jul 18, 2024

Since this is using code from the DLIO benchmark, make sure to follow the original DLIO's license.
https://github.com/argonne-lcf/dlio_benchmark/blob/main/LICENSE

@jeanbez jeanbez added the documentation, enhancement, tests, and new benchmark labels Aug 6, 2024
@jeanbez jeanbez added this to the v.1.6 milestone Aug 6, 2024
@arcturus5340
Author

With the provided sample, on Perlmutter I am getting some warnings. Are those expected?

Yes, the warnings appear because the test works with a small amount of data. This is intentional: the same JSON configuration file is used in the GitHub workflow during testing. If a large number of generation and training files is specified in the configuration file, the warnings disappear, but the GitHub workflow jobs may start taking a long time, which I believe is undesirable.
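For illustration only (the flag names are the ones h5bench_dlio is invoked with in the runs logged later in this thread; the larger values here are arbitrary), a standalone run with more data would look something like:

mpirun -np 4 h5bench_dlio --generate-data --keep-files \
    --record-length 1048576 --num-files-train 64 --num-files-eval 8 \
    --num-samples-per-file 1024 --data-folder ./data

With enough files and samples per file the warnings no longer show up, at the cost of a much longer run.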

@jeanbez
Member

jeanbez commented Aug 13, 2024

@arcturus5340, could you also please include an explanation on the DLIO pattern page of why this is needed (i.e., why we added a pattern that mimics DLIO instead of running DLIO directly)? If I recall correctly, we cover this in the paper, so let's also add it here to make it clear that this is not the full DLIO but rather only its I/O pattern. Could we also add a few sentences in the documentation about how this pattern should be updated if DLIO changes, i.e., a short overview of how one would update the pattern when changes are made on the DLIO side?

@arcturus5340
Author

@jeanbez, I have made changes to the documentation; are they sufficient? Is there anything else I need to do?

@jeanbez jeanbez self-assigned this Aug 26, 2024
@jeanbez jeanbez requested a review from sbyna August 26, 2024 16:04
@jeanbez
Member

jeanbez commented Aug 26, 2024

@jeanbez, I have made changes to the documentation; are they sufficient? Is there anything else I need to do?

We're doing some testing on the other systems as well; as soon as those are okay, we should be able to merge. Thanks for updating the documentation with that additional information.

dlio/README.md (review comment, outdated, resolved)
@TheAssembler1

TheAssembler1 commented Aug 26, 2024

Hi, I'm getting segmentation faults when testing on two nodes with the default configuration. I have experienced no errors when running on a single node. The gdb backtrace says that this line is the culprit:

MPI_Barrier(MPI_COMM_WORLD);

Here is the full trace:

(gdb) bt
#0  0x000014c726638448 in MPIDI_SHMI_progress ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#1  0x000014c7253c0fe9 in MPIR_Wait_impl ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#2  0x000014c725f62a26 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#3  0x000014c725f6e785 in MPIC_Sendrecv () from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#4  0x000014c725e9da6f in MPIR_Barrier_intra_dissemination ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#5  0x000014c7248f97a0 in MPIR_Barrier_intra_auto ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#6  0x000014c7248f9955 in MPIR_Barrier_impl ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#7  0x000014c7260e5ffd in MPIR_CRAY_Barrier ()
   from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#8  0x000014c7248f9ebe in PMPI_Barrier () from /opt/cray/pe/lib64/libmpi_nvidia.so.12
#9  0x000014c726bed8cb in MPI_Barrier (comm=1140850688) at darshan-apmpi.c:1389
#10 0x0000000000403b43 in train_using_workers (epoch=0, 
    local_metadata_time_out=0x7fff1a971570, local_read_time_out=0x7fff1a971568)
    at /home/nlewis/src/h5bench/dlio/h5bench_dlio.c:424
#11 0x0000000000403cc5 in train (epoch=0, indices=0x17f83b0, 
    enable_multiprocessing=1 '\001')
    at /home/nlewis/src/h5bench/dlio/h5bench_dlio.c:457
#12 0x0000000000403eeb in run () at /home/nlewis/src/h5bench/dlio/h5bench_dlio.c:506
#13 0x0000000000404a02 in main (argc=57, argv=0x7fff1a971808)
    at /home/nlewis/src/h5bench/dlio/h5bench_dlio.c:725

It does not fail every single time. In addition, there are two benchmarks in the default sync-dlio.json and it is seemingly random whether the first or second will fail. Maybe the arithmetic involved with splitting the MPI communicator is buggy or the fork is causing issues?

@arcturus5340
Author

Maybe the arithmetic involved with splitting the MPI communicator is buggy...

The communicator split is only used when the total-training-steps parameter is set (which is not the case in sync-dlio.json), so this should not be the cause of the error.

...or the fork is causing issues?

This also seems unlikely, since you mentioned that the bug occurs in both benchmarks: the first one is only responsible for generating data and does not use fork(). Am I right in understanding that the same error occurs with both benchmarks? What does the stack trace look like when the first benchmark crashes? How often does the benchmark crash? I have run the benchmark in the configuration you mentioned 6 times but never got an error; maybe it is the MPI libraries we use. In my case it is the Intel MPI Library v2021.6.0. Could you please specify which MPI library you are using? Could you also provide the generated stdout files for both benchmarks?

@TheAssembler1

TheAssembler1 commented Aug 28, 2024

@arcturus5340 Disregard my statement about it failing in both benchmarks. I reran 50 times and it only failed in the training benchmark. I think I misread the log output... apologies.

Here are my MPI details. The MPICH version is 8.1.28, and the NVIDIA HPC SDK version used is 23.3:

> mpicc -show
nvc -I/opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/include -L/opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/lib -lmpi_nvidia

It failed in 7 of the 50 runs.

In the stderr:

x3006c0s1b1n0.hsn.cm.polaris.alcf.anl.gov: rank 2 died from signal 11 and dumped core
x3006c0s1b1n0.hsn.cm.polaris.alcf.anl.gov: rank 0 died from signal 15

In the stdout:

2024-08-28 17:04:10,055 h5bench - INFO - Starting h5bench Suite
2024-08-28 17:04:10,055 h5bench - WARNING - Base directory already exists: /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio
2024-08-28 17:04:10,066 h5bench - INFO - Lustre support detected
2024-08-28 17:04:10,066 h5bench - DEBUG - LD_LIBRARY_PATH: /soft/perftools/darshan/darshan-3.4.4/lib:/opt/cray/pe/papi/7.0.1.2/lib64:/opt/cray/libfabric/1.15.2.0/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/nvshmem/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/nccl/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/math_libs/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/extras/qd/lib:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/extras/CUPTI/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/cuda/lib64
2024-08-28 17:04:10,066 h5bench - DEBUG - DYLD_LIBRARY_PATH: 
2024-08-28 17:04:10,067 h5bench - DEBUG - LD_PRELOAD: 
2024-08-28 17:04:10,067 h5bench - INFO - JOBID: 2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
2024-08-28 17:04:10,067 h5bench - INFO - h5bench [dlio] - Starting
2024-08-28 17:04:10,067 h5bench - INFO - h5bench [dlio] - DIR: /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/1971d6fa-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov/
2024-08-28 17:04:10,067 h5bench - INFO - Parallel setup: mpirun -np 4
2024-08-28 17:04:10,068 h5bench - INFO - mpirun -np 4 /home/nlewis/src/h5bench/install/bin//h5bench_dlio --generate-data  --keep-files  --record-length 1048576  --num-files-train 8  --num-files-eval 2  --num-samples-per-file 4  --data-folder /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/data  --file-prefix img  --random-seed 42  --train-data-folder train  --valid-data-folder valid  --records-dataset-name records  --labels-dataset-name labels  --output-csv-name output  --output-ranks-data  --output-data-folder /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/1971d6fa-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov 
2024-08-28 17:04:12,747 h5bench - INFO - SUCCESS (all output files are located at /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/1971d6fa-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov)
2024-08-28 17:04:12,748 h5bench - INFO - Runtime: 2.6796727 seconds (elapsed time, includes allocation wait time)
2024-08-28 17:04:12,748 h5bench - INFO - h5bench [dlio] - Complete
2024-08-28 17:04:12,748 h5bench - INFO - JOBID: 2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
2024-08-28 17:04:12,748 h5bench - INFO - h5bench [dlio] - Starting
2024-08-28 17:04:12,748 h5bench - INFO - h5bench [dlio] - DIR: /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/380c81b6-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov/
2024-08-28 17:04:12,749 h5bench - INFO - Parallel setup: mpirun -np 4
2024-08-28 17:04:12,749 h5bench - INFO - mpirun -np 4 /home/nlewis/src/h5bench/install/bin//h5bench_dlio --train  --evaluation  --keep-files  --shuffle  --seed-change-epoch  --record-length 1048576  --num-files-train 8  --num-files-eval 2  --num-samples-per-file 4  --data-folder /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/data  --file-prefix img  --batch-size 2  --batch-size-eval 1  --read-threads 1  --preprocess-time 0.0  --preprocess-time-stdev 0.0  --epochs 1  --computation-time 0.123  --computation-time-stdev 0.0  --random-seed 42  --eval-time 0.123  --eval-time-stdev 0.0  --epochs-between-evals 1  --train-data-folder train  --valid-data-folder valid  --records-dataset-name records  --labels-dataset-name labels  --collective-meta  --collective-data  --output-csv-name output  --output-ranks-data  --output-data-folder /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/380c81b6-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov 
2024-08-28 17:04:15,641 h5bench - ERROR - Return: 143 (check /lus/eagle/projects/DLIO/nlewis/h5bench_results/dlio/380c81b6-2077085.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov/stderr for detailed log)
2024-08-28 17:04:15,641 h5bench - INFO - Runtime: 2.8921127 seconds (elapsed time, includes allocation wait time)
2024-08-28 17:04:15,642 h5bench - INFO - h5bench [dlio] - Complete
2024-08-28 17:04:15,642 h5bench - INFO - Finishing h5bench Suite

@raymerta

raymerta commented Sep 3, 2024

@TheAssembler1 can you send us the output of mpichversion? We don't have a Cray machine here to try reproducing the exact problem, but we can try testing it with our own MPICH installation.
Also, which HDF5 version are you using?

@arcturus5340
Author

it only failed in the training benchmark

In this case, we cannot rule out that the problem is indirectly related to fork(). To make sure this is not the case, could you run the benchmark with the default parameters but with read-threads set to 0?

sync-dlio-without-fork.json
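For context, here is a minimal sketch (not the actual h5bench_dlio code) of the fork-plus-MPI pattern under suspicion: the MPI standard gives no guarantees for processes created with fork(), so a read worker that inherits the parent's network state can disturb a later collective such as the MPI_Barrier in the backtrace above. Setting read-threads to 0 removes the fork() path entirely.

/* Illustrative sketch only -- not the benchmark's actual implementation. */
#include <mpi.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void read_samples(void)
{
    /* Sample loading/preprocessing done by the worker.
     * It must not make any MPI calls. */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int read_threads = 1;            /* --read-threads 0 would skip fork() */
    pid_t worker = -1;

    if (read_threads > 0) {
        worker = fork();
        if (worker == 0) {           /* child: read worker */
            read_samples();
            _exit(EXIT_SUCCESS);     /* child never reaches MPI_Finalize */
        }
    }

    /* ... training computation in the parent ... */

    if (worker > 0)
        waitpid(worker, NULL, 0);    /* reap the worker before the collective */

    MPI_Barrier(MPI_COMM_WORLD);     /* the call that faulted in the backtrace */

    MPI_Finalize();
    return EXIT_SUCCESS;
}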

@TheAssembler1

@raymerta

nlewis@x3002c0s13b0n0:~/src/h5bench/samples> mpichversion
MPICH Version:          3.4a2
MPICH Release date:     unreleased development copy
MPICH Device:           ch4:ofi
MPICH configure:        --prefix=/home/jenkins/install-nvidia-ofi --without-mpe --enable-shared --enable-sharedlibs=gcc --enable-debuginfo --enable-yield=sched_yield --enable-mpit-pvars=cray_mpiio_stat,cray_misc_stat --enable-g=mem --with-configfile=/etc/cray-mpich.conf --with-device=ch4:ofi --with-libfabric-include=/usr/include --with-libfabric-lib=/home/jenkins/pebuildenv/opt/libfabric/1.14.0-onlysockets/lib --with-custom-version-string=GTL-built-with,ROCM-5.0,CUDA-11.0 --with-namepublisher=file --with-shared-memory=sysv --with-ch4-shmmods=cray_xpmem --enable-fortran=f77,fc --with-pmiext=pmi_cray_ext.h --with-pmi=cray --with-weak-pmiext=cray --disable-allowport --with-pm=gforker --with-file-system=ufs+lustre+cray+gpfs+nfs --disable-cxx --enable-threads=runtime --enable-thread-cs=global --enable-fast=O2
MPICH CC:       /usr/bin/gcc-12 -fcommon  -I/opt/cray/dvs/2.12_4.3.22-2.3_154.1__g7b1d0e31/include -I/opt/cray/pe/gtl/0.1.1/include -I/opt/cray/pe/pmi/6.1.11/include -D_CRAY_OFI -D_CRAY_CH4 -DHAVE_LUSTRE_COMP_LAYOUT_SUPPORT   -O2
MPICH CXX:      CC
MPICH F77:      ftn
MPICH FC:       ftn
MPICH Custom Information:       GTL-built-with,ROCM-5.0,CUDA-11.0

@arcturus5340 I reran 50 times with read-threads set to zero and had zero failures.

@arcturus5340
Author

@TheAssembler1 We tested the extension with MPICH v4.2.2 and were unable to reproduce the error. Could you please test the extension by compiling it without using the nvc compiler? We suspect the issue might be related to NVIDIA tools.
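For example (assuming the usual CMake build of h5bench and Cray PE module names, which vary by site and are only a guess here), switching away from the NVIDIA compiler would look roughly like:

module swap PrgEnv-nvhpc PrgEnv-gnu      # load a GCC-based programming environment
CC=cc cmake .. && make                   # reconfigure and rebuild with the new compiler wrapper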

@TheAssembler1

@jeanbez @arcturus5340 I am not sure I can help with this anymore; I no longer have access to the machine (Polaris) I was using.

Labels: documentation, enhancement, new benchmark, tests
Projects: None yet
5 participants