Merge pull request #70 from mlcommons/v1.0-branch
* Merge v1.0 benchmark changes (#68)

* Adding v1.0 changes

Signed-off-by: Johnu George <johnu.george@nutanix.com>

* Add report logging (#69)

Signed-off-by: Johnu George <johnu.george@nutanix.com>
johnugeorge authored Jul 11, 2024
2 parents 232f871 + cd848ba commit ae0e53b
Showing 10 changed files with 109 additions and 104 deletions.
26 changes: 14 additions & 12 deletions README.md
@@ -74,7 +74,7 @@ sudo apt-get install mpich
Clone the latest release from the [MLCommons Storage](https://github.com/mlcommons/storage) repository and install the Python dependencies.

```bash
-git clone -b v1.0-rc1 --recurse-submodules https://github.com/mlcommons/storage.git
+git clone -b v1.0 --recurse-submodules https://github.com/mlcommons/storage.git
cd storage
pip3 install -r dlio_benchmark/requirements.txt
```
@@ -218,7 +218,7 @@ Currently, the storage benchmark suite supports benchmarking of 3 deep learning workloads
Calculate the minimum dataset size required for the benchmark run, based on your client configuration.

```bash
-./benchmark.sh datasize --workload unet3d --accelerator-type a100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
+./benchmark.sh datasize --workload unet3d --accelerator-type h100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
```
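
The minimum size scales with aggregate client host memory. As a rough sketch of the arithmetic (assuming the 5x rule suggested by `HOST_MEMORY_MULTIPLIER=5` in benchmark.sh, intended to keep reads from being served out of the client page cache — an assumption, not documented behavior):

```bash
# Hypothetical back-of-the-envelope check, assuming the dataset must be at
# least HOST_MEMORY_MULTIPLIER (5) times the combined client host memory.
num_client_hosts=2
memory_per_host_gb=128
multiplier=5
echo "minimum dataset size: $(( num_client_hosts * memory_per_host_gb * multiplier )) GB"
# -> minimum dataset size: 1280 GB
```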

Generate data for the benchmark run
@@ -230,21 +230,21 @@ Generate data for the benchmark run
Run the benchmark.

```bash
-./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload unet3d --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir --param dataset.num_files_train=1200 --param dataset.data_folder=unet3d_data
+./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload unet3d --accelerator-type h100 --num-accelerators 2 --results-dir unet3d_h100 --param dataset.num_files_train=1200 --param dataset.data_folder=unet3d_data
```

All results will be stored in the directory configured using the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
-./benchmark.sh reportgen --results-dir resultsdir
+./benchmark.sh reportgen --results-dir unet3d_h100
```

### ResNet-50

Calculate the minimum dataset size required for the benchmark run, based on your client configuration.

```bash
-./benchmark.sh datasize --workload resnet50 --accelerator-type a100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
+./benchmark.sh datasize --workload resnet50 --accelerator-type h100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
```

Generate data for the benchmark run
@@ -256,21 +256,21 @@ Generate data for the benchmark run
Run the benchmark.

```bash
-./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload resnet50 --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir --param dataset.num_files_train=1200 --param dataset.data_folder=resnet50_data
+./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload resnet50 --accelerator-type h100 --num-accelerators 2 --results-dir resnet50_h100 --param dataset.num_files_train=1200 --param dataset.data_folder=resnet50_data
```

All results will be stored in the directory configured using the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
-./benchmark.sh reportgen --results-dir resultsdir
+./benchmark.sh reportgen --results-dir resnet50_h100
```

### CosmoFlow

Calculate the minimum dataset size required for the benchmark run, based on your client configuration.

```bash
-./benchmark.sh datasize --workload cosmoflow --accelerator-type a100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
+./benchmark.sh datasize --workload cosmoflow --accelerator-type h100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
```

Generate data for the benchmark run
@@ -282,13 +282,13 @@ Generate data for the benchmark run
Run the benchmark.

```bash
-./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload cosmoflow --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir --param dataset.num_files_train=1200 --param dataset.data_folder=cosmoflow_data
+./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload cosmoflow --accelerator-type h100 --num-accelerators 2 --results-dir cosmoflow_h100 --param dataset.num_files_train=1200 --param dataset.data_folder=cosmoflow_data
```

All results will be stored in the directory configured using the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
-./benchmark.sh reportgen --results-dir resultsdir
+./benchmark.sh reportgen --results-dir cosmoflow_h100
```

## Parameters
@@ -304,7 +304,9 @@ Below table displays the list of configurable parameters for the benchmark in the CLOSED category
| dataset.data_folder | The path where dataset is stored | -- |
| **Reader params** | | |
| reader.read_threads | Number of threads to load the data | -- |
-| reader.computation_threads | Number of threads to preprocess the data (for TensorFlow) | -- |
+| reader.computation_threads | Number of threads to preprocess the data (for TensorFlow) | 1 |
+| reader.prefetch_size | Number of batches to prefetch | 2 |
+| reader.transfer_size | Number of bytes in the read buffer (only for TensorFlow) | |
| **Checkpoint params** | | |
| checkpoint.checkpoint_folder | The folder to save the checkpoints | -- |
| **Storage params** | | |
@@ -323,7 +325,7 @@ In addition to what can be changed in the CLOSED category, the following parameters
| dataset.num_samples_per_file | Number of samples per file (only for TensorFlow using tfrecord datasets) | 1 for 3D U-Net |
| **Reader params** | | |
| reader.data_loader | Data loader type (TensorFlow, PyTorch, or custom) | PyTorch for 3D U-Net |
-| reader.transfer_size | Number of bytes in the read buffer (only for TensorFlow) | |
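
For illustration, parameters from either table are passed at run time as repeated `--param key=value` flags; a hypothetical CLOSED-category invocation (reusing the hosts and file count from the examples above — the specific tuning values here are assumptions, not recommendations) might look like:

```bash
# Hypothetical tuning run; reader.prefetch_size may not exceed 2 -- the
# validation added to benchmark.sh in this commit rejects larger values.
./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload unet3d \
  --accelerator-type h100 --num-accelerators 2 --results-dir unet3d_h100 \
  --param dataset.num_files_train=1200 \
  --param reader.read_threads=8 \
  --param reader.prefetch_size=2
```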


## Submission Rules

26 changes: 11 additions & 15 deletions benchmark.sh
@@ -20,10 +20,10 @@ CLOSED_CATEGORY_PARAMS=(
  # dataset params
  "dataset.num_files_train" "dataset.num_subfolders_train" "dataset.data_folder"
  # reader params
-  "reader.read_threads" "reader.computation_threads"
+  "reader.read_threads" "reader.computation_threads" "reader.transfer_size" "reader.prefetch_size"
  # checkpoint params
  "checkpoint.checkpoint_folder"
-  #storage params
+  # storage params
  "storage.storage_type" "storage.storage_root")

OPEN_CATEGORY_PARAMS=(
@@ -34,7 +34,7 @@ OPEN_CATEGORY_PARAMS=(
  # dataset params
  "dataset.format" "dataset.num_samples_per_file"
  # reader params
-  "reader.data_loader" "reader.transfer_size"
+  "reader.data_loader"
)
HYDRA_OUTPUT_CONFIG_DIR="configs"
EXTRA_PARAMS=(
@@ -46,7 +46,7 @@ EXTRA_PARAMS=(
)

ACCELERATOR_TYPES=("a100" "h100")
-STEPS_PER_EPOCH=100
+STEPS_PER_EPOCH=500
# host memory multiplier for dataset generation
HOST_MEMORY_MULTIPLIER=5

@@ -142,13 +142,17 @@ validate_params() {
  for param in "${params[@]}"
  do
    param_name=$(echo $param | cut -d '=' -f 1)
+    param_value=$(echo $param | cut -d '=' -f 2)
+    validate_non_empty $param_name $param_value
    if [[ " ${category} " =~ " open " ]]; then
      validate_in_list "params" $param_name "${OPEN_CATEGORY_PARAMS[@]}"
    elif [[ " ${category} " =~ " closed " ]]; then
      validate_in_list "params" $param_name "${CLOSED_CATEGORY_PARAMS[@]}"
+      if [[ "$param_name" == "reader.prefetch_size" && "$param_value" -gt 2 ]]; then
+        echo "reader.prefetch_size value should not exceed 2"
+        exit 1
+      fi
    fi
-    param_value=$(echo $param | cut -d '=' -f 2)
-    validate_non_empty $param_name $param_value
  done
}
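
`validate_in_list` and `validate_non_empty` are helpers defined elsewhere in benchmark.sh and are not part of this diff; a minimal sketch of the membership check they suggest (names and messages here are assumptions, not the actual implementation) is:

```bash
# Hypothetical sketch of a list-membership validator in the style used above:
#   validate_in_list <list name> <value> <allowed values...>
validate_in_list() {
  local list_name=$1 value=$2
  shift 2
  for allowed in "$@"; do
    # Accept the value as soon as it matches one of the allowed entries.
    [[ "$value" == "$allowed" ]] && return 0
  done
  echo "Invalid $list_name: $value"
  exit 1
}
```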

@@ -386,15 +390,7 @@ main() {
      esac
    done
    validate_non_empty "results-dir" $results_dir
-    if [ -e "$results_dir/summary.json" ]; then
-      timestamp=$(date "+%Y%m%d%H%M%S")
-      submission_pkg="submission-$timestamp.tar.gz"
-      tar -czvf "$submission_pkg" "$results_dir"
-      echo "Submission package created: $submission_pkg"
-    else
-      echo "Error: File 'summary.json' not found in the result directory '$results_dir'."
-      echo "The report must be generated from the first host in the hosts argument"
-    fi
+    python3 ${SCRIPT_DIR}/report.py --result-dir $results_dir
  else
    usage; exit 1
  fi
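
With this change, `reportgen` no longer packages a submission tarball itself; it delegates to the report generator, so the following two commands should now be equivalent (paths assumed relative to the repository root):

```bash
./benchmark.sh reportgen --results-dir unet3d_h100
# ...which now effectively runs:
python3 report.py --result-dir unet3d_h100
```
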
2 changes: 1 addition & 1 deletion dlio_benchmark
Submodule dlio_benchmark updated 79 files
+22 −13 .github/workflows/python-package-conda.yml
+4 −2 Dockerfile
+1 −1 dlio_benchmark/checkpointing/base_checkpointing.py
+1 −1 dlio_benchmark/checkpointing/checkpointing_factory.py
+2 −2 dlio_benchmark/checkpointing/pytorch_checkpointing.py
+2 −2 dlio_benchmark/checkpointing/tf_checkpointing.py
+1 −1 dlio_benchmark/common/constants.py
+6 −2 dlio_benchmark/common/enumerations.py
+1 −1 dlio_benchmark/common/error_code.py
+1 −1 dlio_benchmark/computation/asynchronous_computation.py
+1 −1 dlio_benchmark/computation/computation_factory.py
+1 −1 dlio_benchmark/computation/computation_handler.py
+1 −1 dlio_benchmark/computation/no_computation.py
+1 −1 dlio_benchmark/computation/synchronous_computation.py
+8 −5 dlio_benchmark/configs/workload/cosmoflow_a100.yaml
+9 −6 dlio_benchmark/configs/workload/cosmoflow_h100.yaml
+4 −7 dlio_benchmark/configs/workload/cosmoflow_v100.yaml
+6 −3 dlio_benchmark/configs/workload/resnet50_a100.yaml
+6 −3 dlio_benchmark/configs/workload/resnet50_h100.yaml
+2 −2 dlio_benchmark/configs/workload/resnet50_v100.yaml
+3 −0 dlio_benchmark/configs/workload/unet3d_a100.yaml
+3 −0 dlio_benchmark/configs/workload/unet3d_h100.yaml
+1 −1 dlio_benchmark/data_generator/csv_generator.py
+1 −1 dlio_benchmark/data_generator/data_generator.py
+4 −1 dlio_benchmark/data_generator/generator_factory.py
+2 −2 dlio_benchmark/data_generator/hdf5_generator.py
+3 −2 dlio_benchmark/data_generator/indexed_binary_generator.py
+2 −2 dlio_benchmark/data_generator/jpeg_generator.py
+2 −2 dlio_benchmark/data_generator/npy_generator.py
+2 −2 dlio_benchmark/data_generator/npz_generator.py
+2 −2 dlio_benchmark/data_generator/png_generator.py
+53 −0 dlio_benchmark/data_generator/synthetic_generator.py
+2 −2 dlio_benchmark/data_generator/tf_generator.py
+1 −1 dlio_benchmark/data_loader/base_data_loader.py
+41 −17 dlio_benchmark/data_loader/dali_data_loader.py
+4 −1 dlio_benchmark/data_loader/data_loader_factory.py
+15 −8 dlio_benchmark/data_loader/native_dali_data_loader.py
+59 −0 dlio_benchmark/data_loader/synthetic_data_loader.py
+18 −14 dlio_benchmark/data_loader/tf_data_loader.py
+2 −2 dlio_benchmark/data_loader/torch_data_loader.py
+1 −1 dlio_benchmark/framework/framework.py
+1 −1 dlio_benchmark/framework/framework_factory.py
+2 −2 dlio_benchmark/framework/tf_framework.py
+2 −2 dlio_benchmark/framework/torch_framework.py
+17 −15 dlio_benchmark/main.py
+1 −1 dlio_benchmark/profiler/darshan_profiler.py
+1 −1 dlio_benchmark/profiler/io_profiler.py
+1 −1 dlio_benchmark/profiler/iostat_profiler.py
+1 −1 dlio_benchmark/profiler/no_profiler.py
+1 −1 dlio_benchmark/profiler/profiler_factory.py
+1 −1 dlio_benchmark/profiler/tf_profiler.py
+9 −3 dlio_benchmark/reader/csv_reader.py
+8 −2 dlio_benchmark/reader/dali_image_reader.py
+9 −4 dlio_benchmark/reader/dali_npy_reader.py
+13 −7 dlio_benchmark/reader/dali_tfrecord_reader.py
+8 −2 dlio_benchmark/reader/hdf5_reader.py
+14 −2 dlio_benchmark/reader/image_reader.py
+8 −2 dlio_benchmark/reader/indexed_binary_mmap_reader.py
+8 −2 dlio_benchmark/reader/indexed_binary_reader.py
+8 −2 dlio_benchmark/reader/npy_reader.py
+10 −3 dlio_benchmark/reader/npz_reader.py
+5 −1 dlio_benchmark/reader/reader_factory.py
+10 −2 dlio_benchmark/reader/reader_handler.py
+69 −0 dlio_benchmark/reader/synthetic_reader.py
+46 −28 dlio_benchmark/reader/tf_reader.py
+2 −2 dlio_benchmark/storage/file_storage.py
+2 −2 dlio_benchmark/storage/s3_storage.py
+1 −1 dlio_benchmark/storage/storage_factory.py
+1 −1 dlio_benchmark/storage/storage_handler.py
+10 −6 dlio_benchmark/utils/config.py
+71 −5 dlio_benchmark/utils/statscounter.py
+50 −1 dlio_benchmark/utils/utility.py
+1 −1 docs/source/conf.py
+8 −5 docs/source/config.rst
+2 −2 docs/source/copyright.rst
+3 −1 docs/source/install.rst
+1 −1 docs/source/license.rst
+56 −58 requirements.txt
+8 −3 setup.py
107 changes: 48 additions & 59 deletions report.py
@@ -9,17 +9,15 @@
# final report created by Storage benchmark run
REPORT_FILE = "mlperf_storage_report.json"

-# accelerator utilization has to meet AU_THRESHOLD
-AU_THRESHOLD = 90
-
# summary file created by DLIO in the results folder after every run
SUMMARY_FILE = "summary.json"

+# config files containing workload details
+CONFIG_OVERRIDES_FILE = "configs/overrides.yaml"
+
# minimum runs required for the submission
REQUIRED_BENCHMARK_RUNS = 5
-
-# Maximum start time gap between host runs in seconds
-MAX_START_TIMESTAMP_GAP = 10

def find_file_path(directory):
    found_files = []
@@ -41,20 +39,27 @@ def check_unique(list_arg):
    else:
        return False

-def check_timestamps(start_timestamps):
-    ts = list(map(lambda x: parser.parse(x),start_timestamps))
-    max_ts = max(ts)
-    min_ts = min(ts)
-    if (max_ts-min_ts).total_seconds() > MAX_START_TIMESTAMP_GAP:
-        return False
-    return True
-
# read summary for DLIO summary file
-def get_summary(summary_file):
+def get_summary_details(summary_file):
    f = open(summary_file)
    summary = json.load(f)
    return summary

+def get_workload_details(config_file):
+    with open(config_file, 'r') as file:
+        lines = file.readlines()
+        workload_str="workload="
+        for line in lines:
+            if workload_str in line:
+                workload_l = line.split(workload_str)[1].strip()
+                workload_details = workload_l.split('_')
+                workload = workload_details[0]
+                accelerator_type = workload_details[1]
+                return workload, accelerator_type
+    return "", ""



class StorageReport(object):

    def __init__(self, args):
@@ -79,22 +84,14 @@ def generate_report(self):
        for summary_file in summary_files:
            run_path = os.path.relpath(summary_file, self.result_dir)
            run_dir = run_path.split("/")
-            if len(run_dir) != 3:
-                logging.error(f"Error: Directory structure {summary_file} is not correct. It has be in format result_dir/run(1..n)/host(1..n)/summary.json")
+            if len(run_dir) != 2:
+                logging.error(f"Error: Directory structure {summary_file} is not correct. It has be in format result_dir/run(1..n)/summary.json")
                sys.exit(1)
            run_name = run_dir[0]
-            if run_name not in runs:
-                runs[run_name] = [summary_file]
-            else:
-                runs[run_name].append(summary_file)
+            runs[run_name] = summary_file
        if len(runs) != REQUIRED_BENCHMARK_RUNS:
            logging.error(f"Error: Results are reported only for {len(runs)} runs. {REQUIRED_BENCHMARK_RUNS} runs are required.")
            sys.exit(1)
-        host_arr = [len(runs[run_name]) for run_name in runs]
-        if len(set(host_arr)) != 1:
-            logging.error("Error: Number of participating hosts must be same across all runs")
-            sys.exit(1)
-        num_hosts = host_arr[0]
        for run_name in runs:
            models = []
            num_acclerators = []
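
The v1.0 layout therefore drops the per-host level: one `summary.json` per run directory, and five run directories per submission. A hypothetical results tree that would pass these checks (directory names are assumptions, reusing the `unet3d_h100` example from the README):

```bash
# Hypothetical directory layout expected by the v1.0 report generator:
#   unet3d_h100/
#   +-- run1/
#   |   +-- summary.json             # written by DLIO
#   |   +-- configs/overrides.yaml   # Hydra overrides, parsed for the workload
#   +-- run2/ ... run5/              # REQUIRED_BENCHMARK_RUNS = 5
find unet3d_h100 -maxdepth 2 -name summary.json
```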
@@ -105,56 +102,48 @@ def generate_report(self):
            num_samples_per_file = []
-            start_host_timestamp = []
            results["runs"][run_name] ={}
-            for summary_file in runs[run_name]:
-                summary = get_summary(summary_file)
-                au = summary['metric']['train_au_mean_percentage']
-                if float(au) < AU_THRESHOLD:
-                    logging.error(f"Error: AU value didn't pass the threshold in the run reported by {summary_file}")
-                    sys.exit(1)
-                models.append(summary['model'])
-                num_acclerators.append(summary['num_accelerators'])
-                train_throughput_sps.append(summary['metric']['train_throughput_mean_samples_per_second'])
-                train_throughput_mps.append(summary['metric']['train_io_mean_MB_per_second'])
-                host_names.append(summary['hostname'])
-                num_files_train.append(summary['num_files_train'])
-                num_samples_per_file.append(summary['num_samples_per_file'])
-                start_host_timestamp.append(summary['start'])
-            if len(set(host_names)) != len(host_names):
-                logging.warning(f"Warning: Hostnames in results of run {run_name} are not unique.")

-            if not check_unique(models):
-                logging.error(f"Error: The model name is different across hosts")
-                sys.exit(1)
-            if not check_unique(num_acclerators):
-                logging.error(f"Error: The number of accelerators is different across hosts")
-                sys.exit(1)
-            if not check_unique(num_files_train):
-                logging.error(f"Error: The number of training files is different across hosts")
-                sys.exit(1)
-            if not check_unique(num_samples_per_file):
-                logging.error(f"Error: The number of samples per file is different across hosts")
-                sys.exit(1)
-            if not check_timestamps(start_host_timestamp):
-                logging.error(f"Error: Start timestamps of all hosts in each run must be within {MAX_START_TIMESTAMP_GAP} sec")

+            summary_file = runs[run_name]
+            config_file = os.path.join(os.path.dirname(summary_file), CONFIG_OVERRIDES_FILE)
+            model, accelerator = get_workload_details(config_file)
+            if not model or not accelerator:
+                logging.error("workload missing in the config file", CONFIG_OVERRIDES_FILE)
+                sys.exit(1)
+
+            summary = get_summary_details(runs[run_name])
+            au = summary['metric']['train_au_mean_percentage']
+            if summary['metric']['train_au_meet_expectation'] == "fail":
+                logging.error(f"Error: AU value {au} didn't pass the threshold in the run reported by {summary_file}")
+                sys.exit(1)
+            num_acclerators.append(summary['num_accelerators'])
+            train_throughput_sps.append(summary['metric']['train_throughput_mean_samples_per_second'])
+            train_throughput_mps.append(summary['metric']['train_io_mean_MB_per_second'])
+            num_files_train.append(summary['num_files_train'])
+            num_samples_per_file.append(summary['num_samples_per_file'])

            results["runs"][run_name]["train_throughput_samples_per_second"] = np.sum(np.array(train_throughput_sps))
            results["runs"][run_name]["train_throughput_MB_per_second"] = np.sum(np.array(train_throughput_mps))
            results["runs"][run_name]["train_num_accelerators"] = np.sum(np.array(num_acclerators))
-            results["runs"][run_name]["model"] = models[0]
+            results["runs"][run_name]["model"] = model
+            results["runs"][run_name]["accelerator"] = accelerator
            results["runs"][run_name]["num_files_train"] = num_files_train[0]
            results["runs"][run_name]["num_samples_per_file"] = num_samples_per_file[0]


        overall_train_throughput_sps = [results["runs"][run_name]["train_throughput_samples_per_second"] for run_name in results["runs"]]
        overall_train_throughput_mps = [results["runs"][run_name]["train_throughput_MB_per_second"] for run_name in results["runs"]]
        overall_model = [results["runs"][run_name]["model"] for run_name in results["runs"]]
+        overall_accelerator = [results["runs"][run_name]["accelerator"] for run_name in results["runs"]]
        overall_train_num_accelerators = [results["runs"][run_name]["train_num_accelerators"] for run_name in results["runs"]]
        overall_num_files_train = [results["runs"][run_name]["num_files_train"] for run_name in results["runs"]]
        overall_num_samples_per_file = [results["runs"][run_name]["num_samples_per_file"] for run_name in results["runs"]]

        if not check_unique(overall_model):
            logging.error(f"Error: The model name is different across runs")
            sys.exit(1)
+        if not check_unique(overall_accelerator):
+            logging.error(f"Error: The accelerator name is different across runs")
+            sys.exit(1)
        if not check_unique(overall_train_num_accelerators):
            logging.error(f"Error: The number of accelerators is different across runs")
            sys.exit(1)
@@ -166,7 +155,7 @@ def generate_report(self):
            sys.exit(1)

        results["overall"]["model"] = overall_model[0]
-        results["overall"]["num_client_hosts"] = num_hosts
+        results["overall"]["accelerator"] = overall_accelerator[0]
        results["overall"]["num_benchmark_runs"] = len(results["runs"])
        results["overall"]["train_num_accelerators"] = overall_train_num_accelerators[0]
        results["overall"]["num_files_train"] = overall_num_files_train[0]
@@ -177,7 +166,7 @@ def generate_report(self):
results["overall"]["train_throughput_stdev_MB_per_second"] = np.std(overall_train_throughput_mps)
logging.info("------------------------------")
logging.info(f'Model: {results["overall"]["model"]}')
logging.info(f'Number of client hosts: {results["overall"]["num_client_hosts"]}')
logging.info(f'Accelerator: {results["overall"]["accelerator"]}')
logging.info(f'Number of benchmark runs: {results["overall"]["num_benchmark_runs"]}')
logging.info(f'Overall number of accelerators: {results["overall"]["train_num_accelerators"]}')
logging.info(f'Overall Training Throughput (samples/second): {results["overall"]["train_throughput_mean_samples_per_second"]:.2f} ({results["overall"]["train_throughput_stdev_samples_per_second"]})')
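
Based on the keys populated above, the emitted `mlperf_storage_report.json` plausibly takes the following shape (a sketch inferred from the assignments in `generate_report()`; field values are illustrative placeholders, not real results):

```bash
# Hypothetical report shape; key names mirror the assignments in generate_report().
cat <<'EOF'
{
  "overall": {
    "model": "unet3d",
    "accelerator": "h100",
    "num_benchmark_runs": 5,
    "train_num_accelerators": 2,
    "train_throughput_mean_samples_per_second": 0.0,
    "train_throughput_stdev_samples_per_second": 0.0
  },
  "runs": {
    "run1": {
      "model": "unet3d",
      "train_throughput_samples_per_second": 0.0,
      "train_throughput_MB_per_second": 0.0
    }
  }
}
EOF
```
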
13 changes: 8 additions & 5 deletions storage-conf/workload/cosmoflow_a100.yaml
@@ -1,27 +1,30 @@
-model: cosmoflow_pt
+model: cosmoflow

-framework: pytorch
+framework: tensorflow

workflow:
  generate_data: False
  train: True

dataset:
-  data_folder: data/cosmoflow_pt
+  data_folder: data/cosmoflow
  num_files_train: 524288
  num_samples_per_file: 1
  record_length: 2828486
  record_length_stdev: 71311
  format: tfrecord

reader:
-  data_loader: dali
+  data_loader: tensorflow
  read_threads: 4
  batch_size: 1
  dont_use_mmap: True
  file_shuffle: seed
  sample_shuffle: seed
  shuffle_size: 2

train:
  epochs: 5
  computation_time: 0.00551

metric:
  au: 0.70