- Version 0.5
- August 1, 2022
Points of contact:
- Geoffrey Fox ([email protected])
- Tony Hey ([email protected])
- Gregor von Laszewski ([email protected])
- Juri Papay ([email protected])
- Jeyan Thiyagalingam ([email protected])
The MLPerf and MLCommons® name and logo are trademarks. In order to refer to a result using the MLPerf and MLCommons® name, the result must conform to the letter and spirit of the rules specified in this document. The MLCommons® organization reserves the right to solely determine if a use of its name or logo is acceptable.
Our goal is better science accuracy through benchmarks. Our main metric is the accuracy of the science. However, we will have secondary metrics that report time, space, and resources such as energy and temperature behavior.
The Science WG will use training as the primary benchmark. Here we specify the rules for training submissions.
The Science WG will have Closed and Open divisions and submissions to these divisions must be separate although the same activity could qualify for both. The Open division is expected to be the primary focus.
A Closed division submission should report system performance as the result and give the logging information outlined in MLPerf HPC Rules. The stopping criterion will be the value of loss specified in the benchmark. Power and temperature measurements may also be supplied.
An Open division submission aims to improve scientific discovery from the dataset specified in the benchmark, which will specify one or more scientific measurements to be calculated in the submission. The result will be the value of these specified measurements from the submitted model. This model can be based on the supplied reference model or be distinct from it. Data augmentation is allowed, and all hyperparameters of the reference model, if used, can be changed. The result should be a GitHub (markdown) document starting with a table listing Measurement name, Reference model value, and Submitted model value. For benchmarks with more than one measurement, an average difference between submitted and reference measurements should be given. Power and performance values are optional but encouraged in the results document. The resulting document should give enough details on the submitted model and any data augmentation so that the review team can evaluate its scientific soundness. Citations should be included to describe the scientific basis of the approach. Other rules for the Open division are as described in the MLCommons® Training Rules; however, there are no special rules for the Open Science division.
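The results table and the average difference described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function names are hypothetical, the measurement names and values are placeholders rather than real results, and "average difference" is assumed here to mean the arithmetic mean of (submitted value minus reference value), which the rules do not define precisely.

```python
from statistics import mean


def results_table(measurements):
    """Render (name, reference_value, submitted_value) rows as a markdown table."""
    lines = [
        "| Measurement name | Reference model value | Submitted model value |",
        "|---|---|---|",
    ]
    for name, ref, sub in measurements:
        lines.append(f"| {name} | {ref:.4f} | {sub:.4f} |")
    return "\n".join(lines)


def average_difference(measurements):
    """Mean of (submitted - reference); this definition is an assumption."""
    return mean(sub - ref for _, ref, sub in measurements)


# Placeholder values for illustration only; these are not benchmark results.
example = [("metric_a", 0.80, 0.85), ("metric_b", 0.60, 0.58)]
print(results_table(example))
print(f"Average difference: {average_difference(example):+.4f}")
```

If the measurements have different scales, a relative difference per measurement may be more informative; that choice would need to be stated in the results document.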
To showcase the various aspects of the benchmarks and contrast them with the existing training benchmarks from MLCommons® at the time of writing of this document, we have included Table 1. The first row, Non Science Training Closed, refers to training conducted by other MLCommons® groups. The rows with Science in the Division show the target attributes for that division so the focus of the benchmarks can easily be contrasted. Each column showcases an attribute and how it differs between the divisions and the other MLCommons® non-science benchmarks.
Table 1: Targeted aspects of the MLCommons® Science benchmark.
| Division | Hardware System | Training data | Model | Test data / method | Primary Metric | Secondary Metrics (optional) |
|---|---|---|---|---|---|---|
| Non Science Training Closed | Anything | Fixed | Fixed | Fixed | Speed | None |
| Science Closed | Anything | Fixed | Variable | Fixed | "Science Quality", Accuracy | Time, memory, power |
| Science Open | Anything | Variable | Variable | Fixed | "Science Quality", Accuracy, Optional user-defined metric | Time, memory, power |
The secondary metrics include:
- Space
- Time
- Energy
- Different datasets
All rules are taken from the MLPerf Training Rules except for those that are overridden here.
The HPC working group focuses on infrastructure and the Closed division, whereas the Science working group focuses on science and the Open division.
While in HPC the focus of the benchmarks is on infrastructure, the focus here is on scientific accuracy. Nevertheless, the scientific benchmark applications could be used in some cases for HPC evaluation.
The benchmark suite consists of the benchmarks shown in the following table.
| Problem | Dataset | Quality Target |
|---|---|---|
| Earthquake Prediction | Earthquake data from USGS. | Normalized Nash–Sutcliffe model efficiency (NNSE), |
| CloudMask | Multispectral image data from Sea and Land Surface Temperature Radiometer (SLSTR) instrument. | convergence target |
| STEMDL Classification | Convergent Beam Electron Diffraction (CBED) patterns. | The scientific metric for this problem is the top1 classification accuracy and F1-score (the higher the better). The main challenge is to predict 3D geometry from its 3 projections (2D images). Information about the best accuracy so far for this dataset can be found in [4] |
| UNO | Molecular features of tumor cells across multiple data sources. | Score: |
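For readers unfamiliar with the Earthquake quality measure named in the table, the following sketch computes the Nash–Sutcliffe efficiency (NSE) and its normalized form, NNSE = 1 / (2 - NSE), using the standard textbook definitions. The function names are hypothetical, and the reference implementation may aggregate the measure differently (for example over locations or time windows), so treat this as an illustration of the formula only.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - sum((o - s)^2) / sum((o - mean(o))^2)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def nnse(observed, simulated):
    """Normalized NSE, mapping the range (-inf, 1] of NSE onto (0, 1]."""
    return 1.0 / (2.0 - nse(observed, simulated))


observed = [1.0, 2.0, 3.0, 4.0]
print(nnse(observed, observed))               # perfect prediction
print(nnse(observed, [2.5, 2.5, 2.5, 2.5]))   # predicting the mean of the observations
```

A perfect prediction gives NNSE = 1, and a model that always predicts the mean of the observations gives NNSE = 0.5, which is why higher values indicate better science quality.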
There are two divisions of the Science Benchmark Suite, the Closed division and the Open division.
The Closed division requires using the same preprocessing, model, and training method as the reference implementation.
The closed division models are:
| Problem | Repository |
|---|---|
| EarthQuake | |
| CloudMask | |
| STEMDL | |
| CANDLE UNO | |
Allowed hyperparameter and optimizer settings are specified in the section Hyperparameters and Optimizer. For anything not explicitly mentioned there, submissions must match the behavior and settings of the reference implementations.
In order to simplify the complex setup for scientific benchmarks, we recommend that all parameters are included in the config file when available. We recommend a YAML format for the config file.
Hyperparameters and optimizers may be freely changed. For Science benchmarks this is the most important division as the goal is to improve the science and identify algorithms that optimize the science. For this reason, any algorithm and hyperparameter specification for that algorithm is allowed.
As this may include new algorithms, we would like to collect them as discussed in the Contribution section.
When specifying new algorithms, please provide us with the set of hyperparameters as defined by the examples given in this document.
Algorithms in the Open Division must be properly documented and archived in a GitHub repository with a tagged version so they can easily be reproduced. To be fully included the code must be archived in the official MLCommons® Science GitHub repository.
As the algorithms provided here can also be used in the open division we place the same rules on them as other algorithms.
Most importantly, the scientific accuracy must be measured in the same fashion so that alternative implementations and hyperparameter choices can be compared with each other. Each science application provides a single well-defined measure or a set of comparative measures to evaluate the scientific accuracy. The measure(s) should be widely accepted by the science community.
Algorithms that are not open source do not qualify for the science benchmarks as reproducibility and reviews are limited.
Each reference implementation includes a download script or broadly available method to acquire and verify the dataset.
The data at the start of the benchmark run should reside on a parallel file system that is persistent (>= 1 month, not subject to eviction by other users), can be downloaded to / accessed by the user, and can be shared among users at the facility. Any staging to node-local disk or memory or system burst buffer should be included in the benchmark time measurement.
You must flush/reset the on-node caches prior to running each instance of the benchmark. Due to practicality issues, you are not required to reset off-node system-level caches.
We otherwise follow the training rule Data State at Start of Run on consistency with the reference implementation preprocessing and allowance for reformatting.
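Because staging to node-local storage counts toward the benchmark time, a timing harness needs to start its clock before the staging step. The sketch below shows one way to do this; `stage_data` and `train` are hypothetical callables standing in for the real staging and training code, and the returned field names are illustrative rather than part of the logging rules.

```python
import time


def timed_benchmark(stage_data, train):
    """Time a run so that data staging is included in the total benchmark time."""
    start = time.perf_counter()
    staged = stage_data()            # e.g. copy from the parallel file system to node-local disk
    staged_at = time.perf_counter()
    result = train(staged)
    end = time.perf_counter()
    return {
        "result": result,
        "staging_seconds": staged_at - start,
        "training_seconds": end - staged_at,
        "total_seconds": end - start,  # the reported time covers staging and training
    }


# Stand-in callables for illustration only.
report = timed_benchmark(lambda: [1, 2, 3], lambda data: sum(data))
print(report["total_seconds"] >= report["staging_seconds"])
```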
For the closed division, we have a number of defined data sets that can be used for obtaining scientific results. This makes review easier.
For the open division, we also allow open data sets to be part of the submission if the submitter considers that data augmentation achieves better science. The submitter must give us the ability to review the dataset and must supply instructions for replication. We will be introducing unique identifiers for the model and data to allow convenient identification of the input data and models.
All benchmark sources are contained in a GitHub repository and a tagged version is provided for all benchmarked applications. In addition, all data will use a tagging mechanism and will be part of the benchmark submission. If the data fits in GitHub, we will use GitHub. Otherwise, we will place it in a data archive that is openly accessible.
We support the DataPerf MLCommons® working group studies to integrate such identifiers and when available will evaluate their integration.
Our focus is training, but it may take considerable effort to prepare the data for the training loop. Such preparation and its performance are integrated into the benchmark.
Each application has its own hyperparameters and optimizer configurations. They can be controlled with the parameters listed for each application.
| Model | Name | Constraint | Definition | Reference Configuration |
|---|---|---|---|---|
| Earthquake | TFTTransformerepochs | | num_epochs | |
| Earthquake | TFTTransformerbatch_size | | Batch size used to split the training data into batches, which are used to calculate model error and update model coefficients | |
| Earthquake | TFTTransformertestvalbatch_size | | This is a range between min and max for batch size | |
| Earthquake | TFTd_model | | Number of hidden layers in model | |
| Earthquake | Tseq | | Number of encoder steps. The size of the sequence window, i.e. the number of days included in that section of data | |
| Earthquake | TFTdropout_rate | | Dropout rate: the rate at which nodes are randomly dropped from the neural network during training to prevent overfitting | |
| Earthquake | learning_rate | | How quickly the model adapts to the problem. A larger rate means faster convergence but potentially less optimal solutions; a smaller rate means slower convergence but potentially more optimal solutions, and training may fail if the learning rate is too small. In general, a variable learning rate is best: start larger and decrease it as returns diminish or as the solution converges. | |
| Earthquake | early_stopping_patience | | Early stopping parameter for Keras; a way to prevent overfitting or metric degradation | |
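To make the role of `early_stopping_patience` concrete, here is a minimal plain-Python sketch of patience-based early stopping. It does not use Keras and is not the reference implementation; the function name is hypothetical, the loss values are made up, and the Keras callback offers further options (such as a minimum improvement threshold) that are omitted here.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training stops, or None if it runs to the end.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses for illustration only.
losses = [1.0, 0.9, 0.95, 0.92, 0.91]
print(stopping_epoch(losses, patience=2))
```

With these losses, the best value is reached at epoch 1 and the next two epochs do not improve on it, so a patience of 2 stops training at epoch 3.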
| Model | Name | Constraint | Definition | Reference Configuration |
|---|---|---|---|---|
| CloudMask | epochs | | Number of epochs | |
| CloudMask | learning_rate | | Learning rate | |
| CloudMask | batch_size | | Batch size | |
| CloudMask | MIN_SST | | Min allowable Sea Surface Temperature | |
| CloudMask | PATCH_SIZE | | Size of image patches | |
| CloudMask | seed | | Random seed | |
| Model | Name | Constraint | Definition | Reference Configuration |
|---|---|---|---|---|
| STEMDL | num_epochs | | Number of epochs | |
| STEMDL | learning_rate | | Learning rate | |
| STEMDL | batch_size | | Batch size | |
MLCommons® Science Benchmark Suite submissions consist of the following three metrics: metric 1 is considered mandatory for a complete submission, whereas metrics 2 and 3 are considered optional.
This is a mandatory metric (see MLPerf Training Run Results). The same rules apply here.
We follow the MLPerf Training Benchmark Results rule along with the following required number of runs per benchmark. Note that since run-to-run variability is already captured by spatial multiplexing in the case of metric 3, we use the adjusted requirement that the number of trained instances has to be at least equal to the number of runs for metrics 1 and 2.
The numbers given below reflect the minimum number of repeated runs required to produce repeatable metrics. In the case of the Earthquake benchmark, we have reduced the number of runs to 1 for metric 1, as the runs take a long time (between 5 and 12 hours on NVIDIA GPUs).
| Benchmark | Number of Runs: Metric 1 | Number of Runs: Metric 2 | Number of Runs: Metric 3 |
|---|---|---|---|
| Earthquake | 1 | 5 | >=5 |
| CloudMask | 10 | 10 | >=10 |
| STEMDL Classification | 5 | 5 | >=5 |
| CANDLE UNO | 5 | 5 | >=5 |
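The minimum run counts in the table above lend themselves to an automated pre-submission check. The sketch below encodes the table as a dictionary and tests whether a given number of runs is sufficient; the function name and the dictionary layout are hypothetical, and the short application keys are borrowed from the directory naming used later in this document.

```python
# Minimum number of runs per benchmark for metrics 1, 2 and 3, copied from the table above.
MIN_RUNS = {
    "earthquake": (1, 5, 5),
    "cloudmask": (10, 10, 10),
    "stemdl": (5, 5, 5),
    "uno": (5, 5, 5),
}


def meets_run_requirement(application, metric, n_runs):
    """Return True if n_runs satisfies the minimum for the given metric (1, 2 or 3)."""
    if metric not in (1, 2, 3):
        raise ValueError(f"unknown metric: {metric}")
    return n_runs >= MIN_RUNS[application][metric - 1]


print(meets_run_requirement("earthquake", 1, 1))   # a single run suffices for metric 1
print(meets_run_requirement("cloudmask", 2, 5))    # fewer than the 10 runs required
```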
For the closed division, we will have one or more sample submission results.
The results are packaged as a tar archive and submitted through the MLCommons® submission process.
To identify a benchmark, the user must add the following information at the beginning of the submission (we use the Earthquake benchmark as an example):
    name: Earthquake
    user: Gregor von Laszewski
    e-mail: [email protected]
    organisation: University of Virginia
    division: BII
    status: submission
    platform: rivanna shared memory
This can easily be achieved through a configuration file and inclusion into the benchmark with the MLCommons® logging library.
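As an illustration of generating the identifying information from a configuration, the following sketch checks that the fields shown in the Earthquake example are present and renders them as `key: value` lines. It does not use the MLCommons® logging library; the function name is hypothetical, treating every field as required is an assumption, and the e-mail value is a placeholder.

```python
# Field names taken from the Earthquake example above; treating all of them
# as required is an assumption of this sketch.
REQUIRED_FIELDS = ("name", "user", "e-mail", "organisation", "division", "status", "platform")


def submission_header(info):
    """Render the identifying information as 'key: value' lines, one per field."""
    missing = [field for field in REQUIRED_FIELDS if field not in info]
    if missing:
        raise ValueError(f"missing submission fields: {', '.join(missing)}")
    return "\n".join(f"{field}: {info[field]}" for field in REQUIRED_FIELDS)


example = {
    "name": "Earthquake",
    "user": "Gregor von Laszewski",
    "e-mail": "user@example.org",  # placeholder address
    "organisation": "University of Virginia",
    "division": "BII",
    "status": "submission",
    "platform": "rivanna shared memory",
}
print(submission_header(example))
```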
We expect that over time additional benchmarks will be contributed. At this time we have adopted the following best practice for contribution:
- The initial benchmark is hosted on a group-accessible GitHub repository, where members have full access rights. These may be different repositories. Currently, we have one repository at [10].
- A new version is first made available in that group repository on a branch.
- A new candidate version is created and merged into main.
- The candidate version is internally tested by the group members to evaluate expected behavior.
- Once passed, the code is uploaded to the MLCommons® Science GitHub Repository [9].
- Announcements are made to solicit submissions.
- Submissions are checked and integrated according to the MLCommons® rules and policies.
The links to the current development repositories are as follows:
| Problem | MLCommons® Repository | Development Repository |
|---|---|---|
| EarthQuake | | |
| CloudMask | | |
| STEMDL | | |
| CANDLE UNO | | |
Augmentation of codes for consideration into the inclusion of the science benchmarks must use the
An alternative library that internally produces MLCommons® events for logging is the
This library has the advantage of generating a human-readable summary table in addition to the MLCommons® log events.
In this section we document the directory structure for submissions. We introduce the following variables, denoted by { } around the variable name. Brackets [ ] are used to denote a list.
- {organization} ::= The organization submitting the benchmark
- {application} ::= The application, a value from [cloudmask,earthquake,uno,stemdl]
- {system} ::= Defines the system used for this benchmark
- {descriptor} ::= The descriptor of the experiment
- {n} ::= number of repeated experiments
All results are stored in a directory such as
{organization}/{application}/{system}/{descriptor}/result-{n}
Within this directory, all parameters for that experiment are stored, so that all information for the experiment is self-contained within the experiment.
This includes:
- result.txt ::= The result logs for the n-th run with the parameters defined by config.yaml
- config.yaml ::= A configuration file that contains all hyperparameters and other parameters to define a run. This configuration file contains an entry that uniquely describes the version of the code that is run. The version must be included in the mlcommons benchmark repository.
      github:
        repo:
        branch:
        version:
        tag:
an additional README.md and sufficient information to create such runs need to be provided in the
A number of scripts that are used to run the particular benchmark on the specified system to allow reproducibility.
A README.md file that describes how to run it.
In some cases, a program may be used to run multiple experiments and create such a directory automatically. Enough information must be included in the directory, so that such parameterized runs can be conducted while also replicating the appropriate directory structure. We require a separate subdirectory for each result so that output notebooks and comments can be submitted for each of the results if needed. This is especially the case when Jupyter notebooks are used as the benchmark to be executed, allowing the notebook with all its cells to be submitted along with the results.txt file.
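The directory layout described in this section can be sketched with `pathlib`. This is an illustrative helper rather than an official tool: the function names are hypothetical, the organization, system and descriptor values in the usage line are placeholders, and the required file names `result.txt` and `config.yaml` are taken from the list above.

```python
from pathlib import Path

# Allowed values for {application}, as listed in this section.
APPLICATIONS = ("cloudmask", "earthquake", "uno", "stemdl")


def result_dir(organization, application, system, descriptor, n, root="."):
    """Build {organization}/{application}/{system}/{descriptor}/result-{n} under root."""
    if application not in APPLICATIONS:
        raise ValueError(f"unknown application: {application}")
    return Path(root) / organization / application / system / descriptor / f"result-{n}"


def missing_files(directory, required=("result.txt", "config.yaml")):
    """List the required files that are not present in a result directory."""
    return [name for name in required if not (Path(directory) / name).is_file()]


# Placeholder organization, system and descriptor names.
path = result_dir("example-org", "earthquake", "example-system", "baseline", 1)
print(path.as_posix())
print(missing_files(path))
```

A program that launches several parameterized runs could call `result_dir` once per run to create each subdirectory, then use `missing_files` as a final check before packaging the submission.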
We include here a list of supporting and related documents:
- [1] Overview presentation of the MLScience Group, Barrett, Wahid Bhimji, Bala Desinghu, Murali Emani, Geoffrey Fox, Grigori Fursin, Tony Hey, David Kanter, Christine Kirkpatrick, Hai Ah Nam, Juri Papay, Amit Ruhela, Mallikarjun Shankar, Jeyan Thiyagalingam, Aristeidis Tsaris, Gregor von Laszewski, Feiyi Wang, Junqi Yin, MLCommons® Community Meeting (also available in Google docs), December 9, 2021.
- [2] AI Benchmarking for Science: Efforts from the MLCommons® Science Working Group, Jeyan Thiyagalingam, Gregor von Laszewski, Junqi Yin, Murali Emani, Juri Papay, Gregg Barrett, Piotr Luszczek, Aristeidis Tsaris, Christine Kirkpatrick, Feiyi Wang, Tom Gibbs, Venkatram Vishwanath, Mallikarjun Shankar, Geoffrey Fox, Tony Hey, June 2022.
- [3] Earthquake Nowcasting with Deep Learning, Fox, G., Rundle, J., Donnellan, A., Feng, B., Geohazards 3(2), 199, April 2022.
- [4] Probability Flow for Classifying Crystallographic Space Groups, Pan, J., In: Nichols, J., Verastegui, B., Maccabe, A., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds) Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI. SMC 2020. Communications in Computer and Information Science, vol 1315. Springer, Cham, 2022.
- [10] Science Development GitHub Repository to prepare release candidates for the MLCommons® repository.