This repository contains a case study of parallel I/O kernel from the E3SM climate simulation model. E3SM is one of the Department of Energy (DOE) mission applications designed to run on the DOE leadership parallel computers. The E3SM I/O module, Scorpio, can make use of existing I/O libraries, such as PnetCDF, NetCDF-4, HDF5, and ADIOS. The benchmark program in this repository is developed to evaluate the E3SM I/O kernel performance using the above mentioned libraries. Achieving a good I/O performance for E3SM on HPC systems is challenging because its data access pattern consists of a large number of small, unordered, non-contiguous write requests on each process.
The problem domain in E3SM simulation is represented by a cubed sphere grid which is partitioned among multiple processes along only X and Y axes that cover the surface of the problem domain. Domain partitioning among processes uses the Hilbert space curve algorithm to first linearize the 2D cubed sphere grid into subgrids where data points in a subgrid are physically closer to each other, and then divide the linearized subgrids evenly among the processes. This partitioning strategy produces in each process a long list of small, noncontiguous write requests and the file offsets of any two consecutive requests may not be in an increasing order in the file space.
The data partitioning patterns (describing the decomposition of the problem domain represented by multi-dimensional arrays) were captured by the Scorpio library during the E3SM production runs. There can be multiple decomposition maps used by different variables. A data decomposition map records the positions (offsets) of array elements written by each MPI process. The access offsets are stored in a text file, referred to as the "decomposition map file".
This benchmark currently studies three cases from E3SM, namely F, G and I cases, simulating the atmospheric, oceanic, and land components, respectively. Information about the climate variables written in these three case studies and their decomposition maps can be found in variables.md. Table below shows the information about decomposition maps, numbers of variables, and the maximum and minimum numbers of non-contiguous write requests among all processes.
Case | F | G | I |
---|---|---|---|
Number of MPI processes | 21600 | 9600 | 1344 |
Number of decomposition (partitioning) maps | 3 | 6 | 5 |
Number of partitioned variables | 387 | 41 | 546 |
Number of non-partitioned variables | 27 | 11 | 14 |
Total number of variables | 414 | 52 | 560 |
MAX no. noncontig writes among processes | 184,644 | 21,110 | 41,400 |
MIN no. noncontig writes among processes | 174,926 | 18,821 | 33,120 |
- See INSTALL.md. It also describes the command-line run options in details.
- Current build/test status:
There are several I/O methods implemented in this case study, including two data layouts (canonical vs. log) and five parallel I/O libraries of PnetCDF, NetCDF-4, HDF5, Log VOL, and ADIOS. Table below summarizes the supported combinations of libraries and data layouts. For the full list of I/O options and more detailed descriptions, readers are referred to INSTALL.md.
layout \ library | PnetCDF | HDF5 | Log VOL | ADIOS | NetCDF4 |
---|---|---|---|---|---|
canonical | yes | yes | no | no | yes |
log (blob) | yes | yes | yes | yes | no |
For the log layout options available in this benchmark, users are referred to BLOB_IO.md for their designs and implementations. Log I/O methods store write requests by appending the write data one after another, like a time log, regardless the data's position relative to its global structure, e.g. a subarray of a multi-dimensional array. Thus data stored in the file does not follow the dimensional canonical order. On the other hand, storing data in the canonical order requires an expensive communication to organize the data among the processes. As the number of processes increases, the communication cost can become significant. All I/O methods that store data in the log layout defers the expensive inter-process communication to the data consumer applications. Usually, "replay" utility programs are made available for users to convert a file in the log layout to the canonical layout.
-
Below shows the execution times of four log-layout based I/O methods collected on July 2022 on Cori at NERSC.
-
Below shows the execution times of four log-layout based I/O methods collected in September 2022 on Summit at OLCF.
- Qiao Kang, Sunwoo Lee, Kai-Yuan Hou, Robert Ross, Ankit Agrawal, Alok Choudhary, and Wei-keng Liao. Improving MPI Collective I/O for High Volume Non-Contiguous Requests With Intra-Node Aggregation. In the IEEE Transactions on Parallel and Distributed Systems (TPDS), 31(11):2682–2695, November 2020.
- Qiao Kang, Rob Ross, Rob Latham, Sunwoo Lee, Ankit Agrawal, Alok Choudhary, and Wei-keng Liao. Improving All-to-Many Personalized Communication in Two-Phase I/O. In the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2020.
- Qiao Kang, Scot Breitenfeld, Kai-Yuan Hou, Wei-keng Liao, Robert Ross, and Suren Byna. Optimizing Performance of Parallel I/O Accesses to Non-contiguous Blocks in Multiple Array Variables. In the International Conference on Big Data, December 2021.
- Kai-Yuan Hou, Qiao Kang, Sunwoo Lee, Ankit Agrawal, Alok Choudhary, and Wei-keng Liao. Supporting Data Compression in PnetCDF. In the International Conference on Big Data, December 2021.
- Wei-keng Liao <[email protected]>
- Kai-yuan Hou <[email protected]>
- Zanhua Huang <[email protected]>
Copyright (C) 2021, Northwestern University. See COPYRIGHT notice in top-level directory.
This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative.