Rearrange repo, add docs, update results #17

Open
wants to merge 44 commits into base: main

Commits (44)
4e442ee
moved systems
kessel Oct 8, 2021
b8d6f08
added everything to Helmholtz AI folder
kessel Oct 8, 2021
46d7320
modified for submission
kessel Oct 8, 2021
25076b8
fixed logs
kessel Oct 8, 2021
7587bff
added missing systems
kessel Oct 8, 2021
0e31c9b
added docs
kessel Oct 8, 2021
598632e
added docs
kessel Oct 8, 2021
52c3475
added docs
kessel Oct 8, 2021
598aed8
Update cosmoflow results
janEbert Oct 8, 2021
922ce2d
Filter updated results
janEbert Oct 8, 2021
a0d48ba
Add 10-instance weak scaling script
janEbert Oct 8, 2021
f196d14
Clean up whitespace
janEbert Oct 8, 2021
45f1573
Fix typos
janEbert Oct 8, 2021
9dc8dc1
Clean up whitespace
janEbert Oct 8, 2021
cf67c2d
Fix typos
janEbert Oct 8, 2021
bb770bb
Fix dissimilarities from DeepCAM README
janEbert Oct 8, 2021
2466258
Adjust DeepCAM code paths
janEbert Nov 8, 2021
62b025e
Adjust cached scratch location
janEbert May 10, 2022
8448eab
Adjust CosmoFlow code paths
janEbert May 10, 2022
230b0de
Adjust project name
janEbert May 10, 2022
6bf1010
Remove erroneous path separator
janEbert May 10, 2022
b46862a
Move run scripts to separate directory
janEbert May 10, 2022
72434b7
Always create output directories
janEbert May 10, 2022
28a6fba
Automatically format output directory
janEbert May 10, 2022
d227831
Fix account name
janEbert May 10, 2022
201aa77
Update HDF5 conversion scripts
janEbert May 10, 2022
0223609
Use same number of GPUs as tasks per node
janEbert May 18, 2022
eea680f
Refactor IME conditionals
janEbert May 18, 2022
f4ebedf
Adjust paths
janEbert May 18, 2022
fd555e6
Add optional IME usage
janEbert May 18, 2022
152c6e2
Adjust data paths
janEbert May 18, 2022
2c25cce
Allow running with invalid configuration
janEbert May 18, 2022
e58aec3
Increase number of nodes required for prestaging
janEbert May 19, 2022
46927c9
Set different seed automatically
janEbert May 19, 2022
e8ec724
Do not switch preshuffling if using HDF5
janEbert May 19, 2022
b70a69a
Make sure data is in cache
janEbert May 19, 2022
f1426d8
total final updates for 2021
coquelin77 May 19, 2022
18c0a17
Execute container runscript
janEbert May 19, 2022
3bc5916
Merge pull request #5 from janEbert/main
kessel May 19, 2022
cdd514b
Rename Singularity -> Apptainer
janEbert May 24, 2022
3861325
Adjust directories to be user-specific
janEbert May 24, 2022
448bbe0
Automatically select JUWELS Booster partition
janEbert May 25, 2022
6ff0f2d
Print IME usage
janEbert May 25, 2022
cec0cf9
Do not use HDF5 file locking
janEbert May 25, 2022
27 changes: 27 additions & 0 deletions HelmholtzAI/benchmarks/implementations/cosmoflow/README.md
@@ -0,0 +1,27 @@
# How To Run

This repo holds the code and configs for running on the JUWELS Booster as well as on HoreKa at KIT.
To run a training, use the `start_training_run.sh` script from within the `run_scripts` directory.
It triggers the other scripts in this order: `start_**_training.sh`, then `run_and_time.sh`, where `**`
is the training system (`jb` for JUWELS Booster, `horeka` for HoreKa).

Example usage:
```bash
./start_training_run.sh --system booster --nodes 34 --time 01:00:00 --config "config_file_path"
```
On JUWELS Booster, a job can also be started by submitting e.g. `run_cosmoflow_256x4x1.sbatch`. This queues a
job with 256 nodes, 4 GPUs per node, and a local batch size of 1.
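For example, from the directory containing the batch scripts (a sketch; adjust the file name to the configuration you want):
```bash
sbatch run_cosmoflow_256x4x1.sbatch
```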

# Repo Structure

This implementation is based on NVIDIA's implementation and their containers. We use Apptainer with an image
that is almost identical to NVIDIA's, except that we added the packages h5py and pmi to the containers and then
converted them to `.sif` files for Apptainer. We copied NVIDIA's Python code for the benchmark out of the
container and adjusted it to our needs; the modified code is found in the directory `cosmoflow`.
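For illustration only, converting a locally available Docker image into such a `.sif` file could look like the following; the image tag below is a placeholder, not the actual NVIDIA benchmark image:
```bash
# Placeholder tag; substitute the NVIDIA MLPerf container with h5py and pmi added.
apptainer build cosmoflow.sif docker-daemon://mlperf-cosmoflow:latest
```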

The directory `hdf5_io` contains our conversion to HDF5; see below.

# HDF5 IO
We use HDF5 to store the input data. The conversion is implemented in `convert_cosmoflow_to_hdf5.py`, which creates
one large `.h5` file each for training and validation.
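As a quick sanity check, the converted files can be opened directly with h5py; the dataset names `data` and `label` match what `convert_cosmoflow_to_hdf5.py` (shown further below) creates, and the path is only an example:
```python
# Minimal inspection sketch; point the path at your converted output directory.
import h5py

with h5py.File('/p/scratch/hai_mlperf/cosmoflow/train.h5', 'r') as f:
    print(f['data'].shape, f['data'].dtype)    # one entry per training sample
    print(f['label'].shape, f['label'].dtype)  # corresponding regression targets
```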
@@ -47,6 +47,13 @@ def stage_files(
        read_chunk_size: int,
) -> Tuple[List[str], List[str], Callable]:
    number_of_nodes = dist_desc.size // dist_desc.local_size // shard_mult
    if number_of_nodes == 0:
        print(
            'WARNING: Number of nodes is too low for the data split. '
            'Either discard this result or use it only for debugging.'
        )
        number_of_nodes = 1

    current_node = dist_desc.rank // dist_desc.local_size // shard_mult
    files_per_node = len(data_filenames) // number_of_nodes
    assert (
@@ -151,6 +158,13 @@ def __init__(
        if shard_type == 'local':
            number_of_nodes = \
                dist_desc.size // dist_desc.local_size // shard_mult
            if number_of_nodes == 0:
                print(
                    'WARNING: Number of nodes is too low for the data split. '
                    'Either discard this result or use it only for debugging.'
                )
                number_of_nodes = 1

            current_node = dist_desc.rank // dist_desc.local_size // shard_mult

            if self.preshuffle:
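The clamp above guards against the node count rounding down to zero on small allocations. A standalone illustration of the arithmetic with made-up numbers (not part of the benchmark code):
```python
# Hypothetical values: 1 node, 4 local ranks, shard multiplier 8.
size, local_size, shard_mult = 4, 4, 8
number_of_nodes = size // local_size // shard_mult
print(number_of_nodes)  # 0: the integer division rounds down to zero
if number_of_nodes == 0:
    # Fall back to 1 so the later `len(data_filenames) // number_of_nodes`
    # does not divide by zero; such a run is only useful for debugging.
    number_of_nodes = 1
print(512 // number_of_nodes)  # e.g. all 512 files end up on the one node
```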
@@ -0,0 +1,179 @@
import io
import os
import time

import h5py
from mpi4py import MPI
import numpy as np
import tarfile


def load_numpy(tar_file, tar_member):
    with tar_file.extractfile(tar_member) as f:
        return np.load(io.BytesIO(f.read()))


# We do _not_ want to preshuffle! CosmoFlow is very sensitive to data
# order/distribution.
preshuffle = False
seed = 42

project_name = 'hai_mlperf'
# Try to use CSCRATCH, fall back to SCRATCH.
using_cscratch = ('CSCRATCH_' + project_name) in os.environ
project_dir = os.getenv(
    'CSCRATCH_' + project_name,
    os.getenv('SCRATCH_' + project_name),
)

# Read file
tar_file_name = os.path.join(
project_dir, "cosmoUniverse_2019_05_4parE_tf_v2_numpy.tar")
out_dir = os.path.join(project_dir, "cosmoflow")
if MPI.COMM_WORLD.rank == 0:
    os.makedirs(out_dir, exist_ok=True)

np.random.seed(seed)

print("this is the h5py file we are using:", h5py.__file__)

for data_subset in ['train', 'validation']:
    # Write files
    hdf5_file_name = os.path.join(out_dir, f"{data_subset}.h5")
    files_file_name = os.path.join(out_dir, f"{data_subset}.h5.files")

    if using_cscratch and MPI.COMM_WORLD.rank == 0:
        # Initialize IME cache.
        os.system('ime-ctl --prestage ' + tar_file_name)
        # os.system('ime-ctl --prestage ' + hdf5_file_name)
        # os.system('ime-ctl --prestage ' + files_file_name)
    # Make sure IME cache is initialized before continuing.
    MPI.COMM_WORLD.Barrier()

    with tarfile.open(tar_file_name, 'r') as tar_f:
        start_time = time.perf_counter()
        files = [
            n
            for n in tar_f.getmembers()
            if n.name.startswith(
                f'cosmoUniverse_2019_05_4parE_tf_v2_numpy/{data_subset}')
        ]
        print(f'{MPI.COMM_WORLD.rank}: reading members took',
              time.perf_counter() - start_time, 'seconds')

        data_files = list(filter(lambda x: x.name.endswith("data.npy"), files))
        data_files.sort(key=lambda x: x.name)
        data_files = np.array(data_files)
        label_files = list(filter(
            lambda x: x.name.endswith("label.npy"), files))
        label_files.sort(key=lambda x: x.name)
        label_files = np.array(label_files)
        if preshuffle:
            perm = np.random.permutation(len(label_files))
            data_files = data_files[perm]
            label_files = label_files[perm]

        no_shards = MPI.COMM_WORLD.size
        data_files_filtered = data_files[:]
        label_files_filtered = label_files[:]
        data_files_shards = []
        label_files_shards = []
        for i in range(no_shards):
            shard_size = int(np.ceil(len(data_files_filtered)/no_shards))
            start = i * shard_size
            end = min((i + 1) * shard_size, len(data_files_filtered))
            data_files_shards.append(data_files_filtered[start:end])
            label_files_shards.append(label_files_filtered[start:end])

        start_entries = np.cumsum([len(x) for x in data_files_shards])
        start_entries = ([0] + list(start_entries))[:-1]

        first_data = load_numpy(tar_f, data_files[0])
        data_shape = first_data.shape
        data_dtype = first_data.dtype
        first_label = load_numpy(tar_f, label_files[0])
        label_shape = first_label.shape
        label_dtype = first_label.dtype
        all_data_shape = (len(data_files_filtered),) + data_shape
        all_label_shape = (len(data_files_filtered),) + label_shape

        def write_to_h5_file(
                data_files,
                label_files,
                tfname,
                start_entry,
                all_data_shape,
                all_label_shape,
                tell_progress=10,
        ):
            if MPI.COMM_WORLD.rank == 0 and os.path.isfile(tfname):
                # Refuse to overwrite an existing output file.
                print("file exists")
                exit(1)
            MPI.COMM_WORLD.Barrier()
            print("creating file")
            with h5py.File(
                tfname,
                'w',
                driver='mpio',
                comm=MPI.COMM_WORLD,
                libver='latest',
            ) as fi:
                print("creating dset")
                dset = fi.create_dataset(
                    'data', all_data_shape, dtype=data_dtype)
                print("creating lset")
                lset = fi.create_dataset(
                    'label', all_label_shape, dtype=label_dtype)

                start = time.time()
                for i, (f, l) in enumerate(zip(data_files, label_files)):
                    data = load_numpy(tar_f, f)
                    label = load_numpy(tar_f, l)
                    dset[start_entry + i] = data
                    lset[start_entry + i] = label
                    now = time.time()
                    time_remaining = len(data_files) * (now-start) / (i + 1)
                    if i % tell_progress == 0:
                        print(i, time_remaining/60, f.name, l.name)
            fi.close()

        my_data_files = [f for f in data_files_shards[MPI.COMM_WORLD.rank]]
        my_label_files = [f for f in label_files_shards[MPI.COMM_WORLD.rank]]
        start_entry = start_entries[MPI.COMM_WORLD.rank]

        all_data_files = np.concatenate(MPI.COMM_WORLD.allgather(
            [m.name for m in my_data_files]))

        print(
            len(np.unique(all_data_files)),
            "unique files, and",
            len(all_data_files),
            "total files.",
        )
        if len(np.unique(all_data_files)) != len(all_data_files):
            print("There is an error with the file distribution")

        if MPI.COMM_WORLD.rank == 0:
            with open(files_file_name, "w") as g:
                g.write("\n".join(all_data_files) + '\n')

        write_start_time = time.perf_counter()
        write_to_h5_file(
            [f for f in my_data_files],
            [f for f in my_label_files],
            hdf5_file_name,
            start_entry,
            all_data_shape,
            all_label_shape,
        )

        print('finished', data_subset, 'on', MPI.COMM_WORLD.rank, 'after',
              time.perf_counter() - write_start_time, 'seconds')

    MPI.COMM_WORLD.Barrier()
    if using_cscratch and MPI.COMM_WORLD.rank == 0:
        # Synchronize file system with IME cache.
        os.system('ime-ctl --sync ' + hdf5_file_name)
        os.system('ime-ctl --sync ' + files_file_name)
    MPI.COMM_WORLD.Barrier()
@@ -0,0 +1,13 @@
#!/usr/bin/env bash

#SBATCH --nodes=4
#SBATCH --tasks-per-node=8
#SBATCH --gres=gpu:0
#SBATCH --time=23:59:59
#SBATCH --partition=booster
#SBATCH --account=hai_mlperf

ml purge
ml Stages/2022 GCC/11.2.0 OpenMPI/4.1.2 Python/3.9.6 HDF5/1.12.1 mpi4py/3.1.3 h5py/3.5.0

srun python convert_cosmoflow_to_hdf5.py
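Assuming the batch script above is saved as `convert_cosmoflow_to_hdf5.sbatch` (a hypothetical name), the conversion is submitted with:
```bash
sbatch convert_cosmoflow_to_hdf5.sbatch
```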
@@ -31,6 +31,9 @@ DATA_SHARD_MULTIPLIER=${DATA_SHARD_MULTIPLIER:-1}

APPLY_LOG_TRANSFORM=${APPLY_LOG_TRANSFORM:-"1"}
APPLY_SHUFFLE=${APPLY_SHUFFLE:-"1"}
# If `USE_H5=1`, setting this to 1 has a large performance impact
# (either only on the staging part or on the whole run depending on
# `APPLY_PRESTAGE`).
APPLY_PRESHUFFLE=${APPLY_PRESHUFFLE:-"1"}
APPLY_PRESTAGE=${APPLY_PRESTAGE:-"1"}
NUM_TRAINING_SAMPLES=${NUM_TRAINING_SAMPLES:-"-1"}
@@ -40,15 +43,10 @@ INSTANCES=${INSTANCES:-1}
USE_H5=${USE_H5:-"1"}
READ_CHUNK_SIZE=${READ_CHUNK_SIZE:-"32"}

# Our HDF5 data is already pre-shuffled. If `USE_H5=1`, setting this
# to 1 has a large performance impact (either only on the staging part
# or on the whole run depending on `APPLY_PRESTAGE`).
APPLY_PRESHUFFLE=$(if [ "$USE_H5" -ge 1 ]; then echo 0; else echo 1; fi)

# Only apply prestaging when we have enough nodes to be able to
# support the memory requirements.
export APPLY_PRESTAGE=$(
if [ "$(($SLURM_NNODES / $INSTANCES))" -ge 64 ]; then
if [ "$(($SLURM_NNODES / $INSTANCES))" -gt 64 ]; then
echo 1
else
echo 0
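For reference, these defaults can be overridden from the environment when launching a run; the sketch below assumes the launcher scripts pass the environment through to this config, and the values are purely illustrative:
```bash
USE_H5=1 APPLY_PRESHUFFLE=0 READ_CHUNK_SIZE=64 \
    ./start_training_run.sh --system booster --nodes 128 --time 01:00:00 \
    --config ../cosmoflow/configs/config_DGXA100_64x8x1_reference.sh
```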
@@ -0,0 +1,10 @@
#!/bin/bash

NODES=128
RESULTS_ROOT="/p/scratch/hai_mlperf/cosmoflow/results/$NODES"
mkdir -p "$RESULTS_ROOT"
export SEED="$(date +%s)"

./start_training_run.sh --system booster --nodes "$NODES" \
--time 01:00:00 \
--config ../cosmoflow/configs/config_DGXA100_64x8x1_reference.sh
@@ -0,0 +1,10 @@
#!/bin/bash

NODES=256
RESULTS_ROOT="/p/scratch/hai_mlperf/cosmoflow/results/$NODES"
mkdir -p "$RESULTS_ROOT"
export SEED="$(date +%s)"

./start_training_run.sh --system booster --nodes "$NODES" \
--time 01:00:00 \
--config ../cosmoflow/configs/config_DGXA100_128x8x1.sh
@@ -0,0 +1,10 @@
#!/bin/bash

NODES=64
RESULTS_ROOT="/p/scratch/hai_mlperf/cosmoflow/results/$NODES"
mkdir -p "$RESULTS_ROOT"
export SEED="$(date +%s)"

./start_training_run.sh --system booster --nodes "$NODES" \
--time 01:00:00 \
--config ../cosmoflow/configs/config_DGXA100_32x8x2_handrolled.sh
@@ -0,0 +1,12 @@
#!/bin/bash

export INSTANCES=10
NODES_PER_INSTANCE=64
NODES=$((INSTANCES * NODES_PER_INSTANCE))
RESULTS_ROOT="/p/scratch/hai_mlperf/cosmoflow/results_weak/$INSTANCES/$NODES"
mkdir -p "$RESULTS_ROOT"
export SEED="$(date +%s)"

./start_training_run.sh --system booster \
--nodes "$((INSTANCES * NODES_PER_INSTANCE))" \
--time 02:00:00 \
--config ../cosmoflow/configs/config_DGXA100_32x8x2_handrolled.sh
@@ -2,10 +2,11 @@

export INSTANCES=2
NODES_PER_INSTANCE=64
RESULTS_ROOT="/p/scratch/jb_benchmark/cosmoflow/results_weak/$INSTANCES/$NODES"
RESULTS_ROOT="/p/scratch/hai_mlperf/cosmoflow/results_weak/$INSTANCES/$NODES"
mkdir -p "$RESULTS_ROOT"
export SEED="$(date +%s)"

./start_training_run.sh --system booster \
--nodes "$((INSTANCES * NODES_PER_INSTANCE))" \
--time 02:00:00 \
--config cosmoflow/configs/config_DGXA100_32x8x2_handrolled.sh
--config ../cosmoflow/configs/config_DGXA100_32x8x2_handrolled.sh