Merge pull request #70 from mlcommons/v1.0-branch
* Merge v1.0 benchmark changes (#68)

* Adding v1.0 changes

Signed-off-by: Johnu George <johnu.george@nutanix.com>

* Add report logging (#69)

Signed-off-by: Johnu George <johnu.george@nutanix.com>
johnugeorge authored Jul 11, 2024
2 parents 232f871 + cd848ba commit ae0e53b
Showing 10 changed files with 109 additions and 104 deletions.
26 changes: 14 additions & 12 deletions README.md
@@ -74,7 +74,7 @@ sudo apt-get install mpich
Clone the latest release from the [MLCommons Storage](https://github.com/mlcommons/storage) repository and install the Python dependencies.

```bash
-git clone -b v1.0-rc1 --recurse-submodules https://github.com/mlcommons/storage.git
+git clone -b v1.0 --recurse-submodules https://github.com/mlcommons/storage.git
cd storage
pip3 install -r dlio_benchmark/requirements.txt
```
@@ -218,7 +218,7 @@ Currently, the storage benchmark suite supports benchmarking of 3 deep learning workloads
Calculate the minimum dataset size required for the benchmark run, based on your client configuration.

```bash
-./benchmark.sh datasize --workload unet3d --accelerator-type a100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
+./benchmark.sh datasize --workload unet3d --accelerator-type h100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
```
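
The minimum size scales with aggregate client host memory. As a rough sketch of the arithmetic (assuming the 5x rule suggested by `HOST_MEMORY_MULTIPLIER=5` in benchmark.sh, intended to keep reads from being served out of the client page cache — an assumption, not documented behavior):

```bash
# Hypothetical back-of-the-envelope check, assuming the dataset must be at
# least HOST_MEMORY_MULTIPLIER (5) times the combined client host memory.
num_client_hosts=2
memory_per_host_gb=128
multiplier=5
echo "minimum dataset size: $(( num_client_hosts * memory_per_host_gb * multiplier )) GB"
# -> minimum dataset size: 1280 GB
```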

Generate data for the benchmark run
@@ -230,21 +230,21 @@ Generate data for the benchmark run
Run the benchmark.

```bash
-./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload unet3d --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir --param dataset.num_files_train=1200 --param dataset.data_folder=unet3d_data
+./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload unet3d --accelerator-type h100 --num-accelerators 2 --results-dir unet3d_h100 --param dataset.num_files_train=1200 --param dataset.data_folder=unet3d_data
```

All results will be stored in the directory configured using the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
-./benchmark.sh reportgen --results-dir resultsdir
+./benchmark.sh reportgen --results-dir unet3d_h100
```

### ResNet-50

Calculate the minimum dataset size required for the benchmark run, based on your client configuration.

```bash
-./benchmark.sh datasize --workload resnet50 --accelerator-type a100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
+./benchmark.sh datasize --workload resnet50 --accelerator-type h100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
```

Generate data for the benchmark run
@@ -256,21 +256,21 @@ Generate data for the benchmark run
Run the benchmark.

```bash
-./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload resnet50 --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir --param dataset.num_files_train=1200 --param dataset.data_folder=resnet50_data
+./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload resnet50 --accelerator-type h100 --num-accelerators 2 --results-dir resnet50_h100 --param dataset.num_files_train=1200 --param dataset.data_folder=resnet50_data
```

All results will be stored in the directory configured using the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
-./benchmark.sh reportgen --results-dir resultsdir
+./benchmark.sh reportgen --results-dir resnet50_h100
```

### CosmoFlow

Calculate the minimum dataset size required for the benchmark run, based on your client configuration.

```bash
-./benchmark.sh datasize --workload cosmoflow --accelerator-type a100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
+./benchmark.sh datasize --workload cosmoflow --accelerator-type h100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
```

Generate data for the benchmark run
@@ -282,13 +282,13 @@ Generate data for the benchmark run
Run the benchmark.

```bash
-./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload cosmoflow --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir --param dataset.num_files_train=1200 --param dataset.data_folder=cosmoflow_data
+./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload cosmoflow --accelerator-type h100 --num-accelerators 2 --results-dir cosmoflow_h100 --param dataset.num_files_train=1200 --param dataset.data_folder=cosmoflow_data
```

All results will be stored in the directory configured using the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
-./benchmark.sh reportgen --results-dir resultsdir
+./benchmark.sh reportgen --results-dir cosmoflow_h100
```

## Parameters
@@ -304,7 +304,9 @@ Below table displays the list of configurable parameters for the benchmark in the CLOSED category
| dataset.data_folder | The path where dataset is stored | -- |
| **Reader params** | | |
| reader.read_threads | Number of threads to load the data | -- |
-| reader.computation_threads | Number of threads to preprocess the data (for TensorFlow) | -- |
+| reader.computation_threads | Number of threads to preprocess the data (for TensorFlow) | 1 |
+| reader.prefetch_size | Number of batches to prefetch | 2 |
+| reader.transfer_size | Number of bytes in the read buffer (only for TensorFlow) | |
| **Checkpoint params** | | |
| checkpoint.checkpoint_folder | The folder to save the checkpoints | -- |
| **Storage params** | | |
@@ -323,7 +325,7 @@ In addition to what can be changed in the CLOSED category, the following parameters
| dataset.num_samples_per_file | Number of samples per file (only for TensorFlow using tfrecord datasets) | 1 for 3D U-Net |
| **Reader params** | | |
| reader.data_loader | Data loader type (TensorFlow, PyTorch, or custom) | PyTorch for 3D U-Net |
-| reader.transfer_size | Number of bytes in the read buffer (only for TensorFlow) | |
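
For illustration, parameters from either table are passed at run time as repeated `--param key=value` flags; a hypothetical CLOSED-category invocation (reusing the hosts and file count from the examples above — the specific tuning values here are assumptions, not recommendations) might look like:

```bash
# Hypothetical tuning run; reader.prefetch_size may not exceed 2 -- the
# validation added to benchmark.sh in this commit rejects larger values.
./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload unet3d \
  --accelerator-type h100 --num-accelerators 2 --results-dir unet3d_h100 \
  --param dataset.num_files_train=1200 \
  --param reader.read_threads=8 \
  --param reader.prefetch_size=2
```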


## Submission Rules

26 changes: 11 additions & 15 deletions benchmark.sh
@@ -20,10 +20,10 @@ CLOSED_CATEGORY_PARAMS=(
  # dataset params
  "dataset.num_files_train" "dataset.num_subfolders_train" "dataset.data_folder"
  # reader params
-  "reader.read_threads" "reader.computation_threads"
+  "reader.read_threads" "reader.computation_threads" "reader.transfer_size" "reader.prefetch_size"
  # checkpoint params
  "checkpoint.checkpoint_folder"
-  #storage params
+  # storage params
  "storage.storage_type" "storage.storage_root")

OPEN_CATEGORY_PARAMS=(
@@ -34,7 +34,7 @@ OPEN_CATEGORY_PARAMS=(
  # dataset params
  "dataset.format" "dataset.num_samples_per_file"
  # reader params
-  "reader.data_loader" "reader.transfer_size"
+  "reader.data_loader"
)
HYDRA_OUTPUT_CONFIG_DIR="configs"
EXTRA_PARAMS=(
@@ -46,7 +46,7 @@ EXTRA_PARAMS=(
)

ACCELERATOR_TYPES=("a100" "h100")
-STEPS_PER_EPOCH=100
+STEPS_PER_EPOCH=500
# host memory multiplier for dataset generation
HOST_MEMORY_MULTIPLIER=5

@@ -142,13 +142,17 @@ validate_params() {
  for param in "${params[@]}"
  do
    param_name=$(echo $param | cut -d '=' -f 1)
+    param_value=$(echo $param | cut -d '=' -f 2)
+    validate_non_empty $param_name $param_value
    if [[ " ${category} " =~ " open " ]]; then
      validate_in_list "params" $param_name "${OPEN_CATEGORY_PARAMS[@]}"
    elif [[ " ${category} " =~ " closed " ]]; then
      validate_in_list "params" $param_name "${CLOSED_CATEGORY_PARAMS[@]}"
+      if [[ "$param_name" == "reader.prefetch_size" && "$param_value" -gt 2 ]]; then
+        echo "reader.prefetch_size value should not exceed 2"
+        exit 1
+      fi
    fi
-    param_value=$(echo $param | cut -d '=' -f 2)
-    validate_non_empty $param_name $param_value
  done
}
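
`validate_in_list` and `validate_non_empty` are helpers defined elsewhere in benchmark.sh and are not part of this diff; a minimal sketch of the membership check they suggest (names and messages here are assumptions, not the actual implementation) is:

```bash
# Hypothetical sketch of a list-membership validator in the style used above:
#   validate_in_list <list name> <value> <allowed values...>
validate_in_list() {
  local list_name=$1 value=$2
  shift 2
  for allowed in "$@"; do
    # Accept the value as soon as it matches one of the allowed entries.
    [[ "$value" == "$allowed" ]] && return 0
  done
  echo "Invalid $list_name: $value"
  exit 1
}
```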

@@ -386,15 +390,7 @@ main() {
      esac
    done
    validate_non_empty "results-dir" $results_dir
-    if [ -e "$results_dir/summary.json" ]; then
-      timestamp=$(date "+%Y%m%d%H%M%S")
-      submission_pkg="submission-$timestamp.tar.gz"
-      tar -czvf "$submission_pkg" "$results_dir"
-      echo "Submission package created: $submission_pkg"
-    else
-      echo "Error: File 'summary.json' not found in the result directory '$results_dir'."
-      echo "The report must be generated from the first host in the hosts argument"
-    fi
+    python3 ${SCRIPT_DIR}/report.py --result-dir $results_dir
  else
    usage; exit 1
  fi
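
With this change, `reportgen` no longer packages a submission tarball itself; it delegates to the report generator, so the following two commands should now be equivalent (paths assumed relative to the repository root):

```bash
./benchmark.sh reportgen --results-dir unet3d_h100
# ...which now effectively runs:
python3 report.py --result-dir unet3d_h100
```
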
2 changes: 1 addition & 1 deletion dlio_benchmark
Submodule dlio_benchmark updated 79 files
+22 −13 .github/workflows/python-package-conda.yml
+4 −2 Dockerfile
+1 −1 dlio_benchmark/checkpointing/base_checkpointing.py
+1 −1 dlio_benchmark/checkpointing/checkpointing_factory.py
+2 −2 dlio_benchmark/checkpointing/pytorch_checkpointing.py
+2 −2 dlio_benchmark/checkpointing/tf_checkpointing.py
+1 −1 dlio_benchmark/common/constants.py
+6 −2 dlio_benchmark/common/enumerations.py
+1 −1 dlio_benchmark/common/error_code.py
+1 −1 dlio_benchmark/computation/asynchronous_computation.py
+1 −1 dlio_benchmark/computation/computation_factory.py
+1 −1 dlio_benchmark/computation/computation_handler.py
+1 −1 dlio_benchmark/computation/no_computation.py
+1 −1 dlio_benchmark/computation/synchronous_computation.py
+8 −5 dlio_benchmark/configs/workload/cosmoflow_a100.yaml
+9 −6 dlio_benchmark/configs/workload/cosmoflow_h100.yaml
+4 −7 dlio_benchmark/configs/workload/cosmoflow_v100.yaml
+6 −3 dlio_benchmark/configs/workload/resnet50_a100.yaml
+6 −3 dlio_benchmark/configs/workload/resnet50_h100.yaml
+2 −2 dlio_benchmark/configs/workload/resnet50_v100.yaml
+3 −0 dlio_benchmark/configs/workload/unet3d_a100.yaml
+3 −0 dlio_benchmark/configs/workload/unet3d_h100.yaml
+1 −1 dlio_benchmark/data_generator/csv_generator.py
+1 −1 dlio_benchmark/data_generator/data_generator.py
+4 −1 dlio_benchmark/data_generator/generator_factory.py
+2 −2 dlio_benchmark/data_generator/hdf5_generator.py
+3 −2 dlio_benchmark/data_generator/indexed_binary_generator.py
+2 −2 dlio_benchmark/data_generator/jpeg_generator.py
+2 −2 dlio_benchmark/data_generator/npy_generator.py
+2 −2 dlio_benchmark/data_generator/npz_generator.py
+2 −2 dlio_benchmark/data_generator/png_generator.py
+53 −0 dlio_benchmark/data_generator/synthetic_generator.py
+2 −2 dlio_benchmark/data_generator/tf_generator.py
+1 −1 dlio_benchmark/data_loader/base_data_loader.py
+41 −17 dlio_benchmark/data_loader/dali_data_loader.py
+4 −1 dlio_benchmark/data_loader/data_loader_factory.py
+15 −8 dlio_benchmark/data_loader/native_dali_data_loader.py
+59 −0 dlio_benchmark/data_loader/synthetic_data_loader.py
+18 −14 dlio_benchmark/data_loader/tf_data_loader.py
+2 −2 dlio_benchmark/data_loader/torch_data_loader.py
+1 −1 dlio_benchmark/framework/framework.py
+1 −1 dlio_benchmark/framework/framework_factory.py
+2 −2 dlio_benchmark/framework/tf_framework.py
+2 −2 dlio_benchmark/framework/torch_framework.py
+17 −15 dlio_benchmark/main.py
+1 −1 dlio_benchmark/profiler/darshan_profiler.py
+1 −1 dlio_benchmark/profiler/io_profiler.py
+1 −1 dlio_benchmark/profiler/iostat_profiler.py
+1 −1 dlio_benchmark/profiler/no_profiler.py
+1 −1 dlio_benchmark/profiler/profiler_factory.py
+1 −1 dlio_benchmark/profiler/tf_profiler.py
+9 −3 dlio_benchmark/reader/csv_reader.py
+8 −2 dlio_benchmark/reader/dali_image_reader.py
+9 −4 dlio_benchmark/reader/dali_npy_reader.py
+13 −7 dlio_benchmark/reader/dali_tfrecord_reader.py
+8 −2 dlio_benchmark/reader/hdf5_reader.py
+14 −2 dlio_benchmark/reader/image_reader.py
+8 −2 dlio_benchmark/reader/indexed_binary_mmap_reader.py
+8 −2 dlio_benchmark/reader/indexed_binary_reader.py
+8 −2 dlio_benchmark/reader/npy_reader.py
+10 −3 dlio_benchmark/reader/npz_reader.py
+5 −1 dlio_benchmark/reader/reader_factory.py
+10 −2 dlio_benchmark/reader/reader_handler.py
+69 −0 dlio_benchmark/reader/synthetic_reader.py
+46 −28 dlio_benchmark/reader/tf_reader.py
+2 −2 dlio_benchmark/storage/file_storage.py
+2 −2 dlio_benchmark/storage/s3_storage.py
+1 −1 dlio_benchmark/storage/storage_factory.py
+1 −1 dlio_benchmark/storage/storage_handler.py
+10 −6 dlio_benchmark/utils/config.py
+71 −5 dlio_benchmark/utils/statscounter.py
+50 −1 dlio_benchmark/utils/utility.py
+1 −1 docs/source/conf.py
+8 −5 docs/source/config.rst
+2 −2 docs/source/copyright.rst
+3 −1 docs/source/install.rst
+1 −1 docs/source/license.rst
+56 −58 requirements.txt
+8 −3 setup.py
107 changes: 48 additions & 59 deletions report.py
@@ -9,17 +9,15 @@
# final report created by Storage benchmark run
REPORT_FILE = "mlperf_storage_report.json"

-# accelerator utilization has to meet AU_THRESHOLD
-AU_THRESHOLD = 90
-
# summary file created by DLIO in the results folder after every run
SUMMARY_FILE = "summary.json"

+# config files containing workload details
+CONFIG_OVERRIDES_FILE = "configs/overrides.yaml"
+
# minimum runs required for the submission
REQUIRED_BENCHMARK_RUNS = 5
-
-# Maximum start time gap between host runs in seconds
-MAX_START_TIMESTAMP_GAP = 10

def find_file_path(directory):
    found_files = []
@@ -41,20 +39,27 @@ def check_unique(list_arg):
    else:
        return False

-def check_timestamps(start_timestamps):
-    ts = list(map(lambda x: parser.parse(x),start_timestamps))
-    max_ts = max(ts)
-    min_ts = min(ts)
-    if (max_ts-min_ts).total_seconds() > MAX_START_TIMESTAMP_GAP:
-        return False
-    return True
-
# read summary for DLIO summary file
-def get_summary(summary_file):
+def get_summary_details(summary_file):
    f = open(summary_file)
    summary = json.load(f)
    return summary

+def get_workload_details(config_file):
+    with open(config_file, 'r') as file:
+        lines = file.readlines()
+        workload_str="workload="
+        for line in lines:
+            if workload_str in line:
+                workload_l = line.split(workload_str)[1].strip()
+                workload_details = workload_l.split('_')
+                workload = workload_details[0]
+                accelerator_type = workload_details[1]
+                return workload, accelerator_type
+    return "", ""



class StorageReport(object):

    def __init__(self, args):
@@ -79,22 +84,14 @@ def generate_report(self):
        for summary_file in summary_files:
            run_path = os.path.relpath(summary_file, self.result_dir)
            run_dir = run_path.split("/")
-            if len(run_dir) != 3:
-                logging.error(f"Error: Directory structure {summary_file} is not correct. It has be in format result_dir/run(1..n)/host(1..n)/summary.json")
+            if len(run_dir) != 2:
+                logging.error(f"Error: Directory structure {summary_file} is not correct. It has be in format result_dir/run(1..n)/summary.json")
                sys.exit(1)
            run_name = run_dir[0]
-            if run_name not in runs:
-                runs[run_name] = [summary_file]
-            else:
-                runs[run_name].append(summary_file)
+            runs[run_name] = summary_file
        if len(runs) != REQUIRED_BENCHMARK_RUNS:
            logging.error(f"Error: Results are reported only for {len(runs)} runs. {REQUIRED_BENCHMARK_RUNS} runs are required.")
            sys.exit(1)
-        host_arr = [len(runs[run_name]) for run_name in runs]
-        if len(set(host_arr)) != 1:
-            logging.error("Error: Number of participating hosts must be same across all runs")
-            sys.exit(1)
-        num_hosts = host_arr[0]
        for run_name in runs:
            models = []
            num_acclerators = []
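
The v1.0 layout therefore drops the per-host level: one `summary.json` per run directory, and five run directories per submission. A hypothetical results tree that would pass these checks (directory names are assumptions, reusing the `unet3d_h100` example from the README):

```bash
# Hypothetical directory layout expected by the v1.0 report generator:
#   unet3d_h100/
#   +-- run1/
#   |   +-- summary.json             # written by DLIO
#   |   +-- configs/overrides.yaml   # Hydra overrides, parsed for the workload
#   +-- run2/ ... run5/              # REQUIRED_BENCHMARK_RUNS = 5
find unet3d_h100 -maxdepth 2 -name summary.json
```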
@@ -105,56 +102,48 @@ def generate_report(self):
            num_samples_per_file = []
-            start_host_timestamp = []
            results["runs"][run_name] ={}
-            for summary_file in runs[run_name]:
-                summary = get_summary(summary_file)
-                au = summary['metric']['train_au_mean_percentage']
-                if float(au) < AU_THRESHOLD:
-                    logging.error(f"Error: AU value didn't pass the threshold in the run reported by {summary_file}")
-                    sys.exit(1)
-                models.append(summary['model'])
-                num_acclerators.append(summary['num_accelerators'])
-                train_throughput_sps.append(summary['metric']['train_throughput_mean_samples_per_second'])
-                train_throughput_mps.append(summary['metric']['train_io_mean_MB_per_second'])
-                host_names.append(summary['hostname'])
-                num_files_train.append(summary['num_files_train'])
-                num_samples_per_file.append(summary['num_samples_per_file'])
-                start_host_timestamp.append(summary['start'])
-            if len(set(host_names)) != len(host_names):
-                logging.warning(f"Warning: Hostnames in results of run {run_name} are not unique.")

-            if not check_unique(models):
-                logging.error(f"Error: The model name is different across hosts")
-                sys.exit(1)
-            if not check_unique(num_acclerators):
-                logging.error(f"Error: The number of accelerators is different across hosts")
-                sys.exit(1)
-            if not check_unique(num_files_train):
-                logging.error(f"Error: The number of training files is different across hosts")
-                sys.exit(1)
-            if not check_unique(num_samples_per_file):
-                logging.error(f"Error: The number of samples per file is different across hosts")
-                sys.exit(1)
-            if not check_timestamps(start_host_timestamp):
-                logging.error(f"Error: Start timestamps of all hosts in each run must be within {MAX_START_TIMESTAMP_GAP} sec")

+            summary_file = runs[run_name]
+            config_file = os.path.join(os.path.dirname(summary_file), CONFIG_OVERRIDES_FILE)
+            model, accelerator = get_workload_details(config_file)
+            if not model or not accelerator:
+                logging.error("workload missing in the config file", CONFIG_OVERRIDES_FILE)
+                sys.exit(1)
+
+            summary = get_summary_details(runs[run_name])
+            au = summary['metric']['train_au_mean_percentage']
+            if summary['metric']['train_au_meet_expectation'] == "fail":
+                logging.error(f"Error: AU value {au} didn't pass the threshold in the run reported by {summary_file}")
+                sys.exit(1)
+            num_acclerators.append(summary['num_accelerators'])
+            train_throughput_sps.append(summary['metric']['train_throughput_mean_samples_per_second'])
+            train_throughput_mps.append(summary['metric']['train_io_mean_MB_per_second'])
+            num_files_train.append(summary['num_files_train'])
+            num_samples_per_file.append(summary['num_samples_per_file'])

            results["runs"][run_name]["train_throughput_samples_per_second"] = np.sum(np.array(train_throughput_sps))
            results["runs"][run_name]["train_throughput_MB_per_second"] = np.sum(np.array(train_throughput_mps))
            results["runs"][run_name]["train_num_accelerators"] = np.sum(np.array(num_acclerators))
-            results["runs"][run_name]["model"] = models[0]
+            results["runs"][run_name]["model"] = model
+            results["runs"][run_name]["accelerator"] = accelerator
            results["runs"][run_name]["num_files_train"] = num_files_train[0]
            results["runs"][run_name]["num_samples_per_file"] = num_samples_per_file[0]


        overall_train_throughput_sps = [results["runs"][run_name]["train_throughput_samples_per_second"] for run_name in results["runs"]]
        overall_train_throughput_mps = [results["runs"][run_name]["train_throughput_MB_per_second"] for run_name in results["runs"]]
        overall_model = [results["runs"][run_name]["model"] for run_name in results["runs"]]
+        overall_accelerator = [results["runs"][run_name]["accelerator"] for run_name in results["runs"]]
        overall_train_num_accelerators = [results["runs"][run_name]["train_num_accelerators"] for run_name in results["runs"]]
        overall_num_files_train = [results["runs"][run_name]["num_files_train"] for run_name in results["runs"]]
        overall_num_samples_per_file = [results["runs"][run_name]["num_samples_per_file"] for run_name in results["runs"]]

        if not check_unique(overall_model):
            logging.error(f"Error: The model name is different across runs")
            sys.exit(1)
+        if not check_unique(overall_accelerator):
+            logging.error(f"Error: The accelerator name is different across runs")
+            sys.exit(1)
        if not check_unique(overall_train_num_accelerators):
            logging.error(f"Error: The number of accelerators is different across runs")
            sys.exit(1)
@@ -166,7 +155,7 @@ def generate_report(self):
            sys.exit(1)

        results["overall"]["model"] = overall_model[0]
-        results["overall"]["num_client_hosts"] = num_hosts
+        results["overall"]["accelerator"] = overall_accelerator[0]
        results["overall"]["num_benchmark_runs"] = len(results["runs"])
        results["overall"]["train_num_accelerators"] = overall_train_num_accelerators[0]
        results["overall"]["num_files_train"] = overall_num_files_train[0]
@@ -177,7 +166,7 @@ def generate_report(self):
results["overall"]["train_throughput_stdev_MB_per_second"] = np.std(overall_train_throughput_mps)
logging.info("------------------------------")
logging.info(f'Model: {results["overall"]["model"]}')
logging.info(f'Number of client hosts: {results["overall"]["num_client_hosts"]}')
logging.info(f'Accelerator: {results["overall"]["accelerator"]}')
logging.info(f'Number of benchmark runs: {results["overall"]["num_benchmark_runs"]}')
logging.info(f'Overall number of accelerators: {results["overall"]["train_num_accelerators"]}')
logging.info(f'Overall Training Throughput (samples/second): {results["overall"]["train_throughput_mean_samples_per_second"]:.2f} ({results["overall"]["train_throughput_stdev_samples_per_second"]})')
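
Based on the keys populated above, the emitted `mlperf_storage_report.json` plausibly takes the following shape (a sketch inferred from the assignments in `generate_report()`; field values are illustrative placeholders, not real results):

```bash
# Hypothetical report shape; key names mirror the assignments in generate_report().
cat <<'EOF'
{
  "overall": {
    "model": "unet3d",
    "accelerator": "h100",
    "num_benchmark_runs": 5,
    "train_num_accelerators": 2,
    "train_throughput_mean_samples_per_second": 0.0,
    "train_throughput_stdev_samples_per_second": 0.0
  },
  "runs": {
    "run1": {
      "model": "unet3d",
      "train_throughput_samples_per_second": 0.0,
      "train_throughput_MB_per_second": 0.0
    }
  }
}
EOF
```
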
13 changes: 8 additions & 5 deletions storage-conf/workload/cosmoflow_a100.yaml
@@ -1,27 +1,30 @@
-model: cosmoflow_pt
+model: cosmoflow

-framework: pytorch
+framework: tensorflow

workflow:
  generate_data: False
  train: True

dataset:
-  data_folder: data/cosmoflow_pt
+  data_folder: data/cosmoflow
  num_files_train: 524288
  num_samples_per_file: 1
  record_length: 2828486
  record_length_stdev: 71311
  format: tfrecord

reader:
-  data_loader: dali
+  data_loader: tensorflow
  read_threads: 4
  batch_size: 1
  dont_use_mmap: True
  file_shuffle: seed
  sample_shuffle: seed
  shuffle_size: 2

train:
  epochs: 5
  computation_time: 0.00551

metric:
  au: 0.70