Poor Man's vs Rich Man's Parallel I/O
This document uses the terms Poor Man’s and Rich Man’s parallel I/O to refer to two different modalities of handling parallel I/O. These terms have fallen out of favor and have been replaced by Multiple Independent File (MIF) and Single Shared File (SSF) parallel I/O. The new terminology is intended to capture where, and by what software, concurrency is handled. In MIF, concurrency is handled explicitly by the application in writing independent files. In SSF, because a single file is produced, concurrency must be handled implicitly by the file system.
Poor Man’s and Rich Man’s Parallel I/O differ in one important respect: Poor Man’s achieves concurrent parallelism by writing to multiple, different files while Rich Man’s writes to a single, shared file. That’s basically it in a nutshell.
People often confuse Poor Man’s parallel I/O with file-per-processor I/O. The two are not the same. While file-per-processor does represent one extreme end-point in the spectrum of Poor Man’s use cases, there is no fundamental reason to restrict Poor Man’s to file-per-processor. In fact, there are a variety of reasons for not doing it that way. Indeed, codes like Ale3d have a knob to control the number of files used for concurrent I/O, and that knob is entirely independent of the number of processors the code is run on. Using PMPIO in Ale3d, you can run on 1024 processors and write to 10 files, or 12, or 7, or 128. Typically, the number of files is chosen based on the number of I/O nodes the code can see during its execution and on how capable those I/O nodes are at handling I/O requests from multiple processors.
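To make the pattern concrete, here is a minimal sketch of the baton-passing, multiple-independent-file approach, written with plain MPI and the ordinary serial HDF5 interface. It is not the actual PMPIO API (see Silo's pmpio.h for that), and the file count, group assignment, and all names are illustrative assumptions; the point is simply that the number of files is a free parameter, independent of the number of MPI ranks.

```c
#include <mpi.h>
#include <hdf5.h>
#include <stdio.h>

#define NUM_FILES 8   /* illustrative: tune independently of the number of ranks */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Assign each rank to one of NUM_FILES groups; each group shares one file. */
    int group         = rank % NUM_FILES;
    int rank_in_group = rank / NUM_FILES;

    char fname[64], dsetname[64];
    snprintf(fname, sizeof(fname), "dump_%03d.h5", group);
    snprintf(dsetname, sizeof(dsetname), "domain_%05d", rank);

    /* Baton pass: wait until the previous rank in this group is done with the file. */
    int baton = 0;
    if (rank_in_group > 0)
        MPI_Recv(&baton, 1, MPI_INT, rank - NUM_FILES, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    /* The first rank in the group creates the file; later ranks append to it. */
    hid_t file = (rank_in_group == 0)
        ? H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT)
        : H5Fopen(fname, H5F_ACC_RDWR, H5P_DEFAULT);

    /* Write this rank's piece of the data using the ordinary serial HDF5 API. */
    double data[4] = {0, 1, 2, 3};
    hsize_t dims[1] = {4};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, dsetname, H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);

    /* Hand the baton to the next rank in this group, if any. */
    if (rank + NUM_FILES < size)
        MPI_Send(&baton, 1, MPI_INT, rank + NUM_FILES, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

Each file has at most one writer at any instant, so no parallel VFD or collective coordination is required inside HDF5 itself; up to NUM_FILES files are being written concurrently at any time.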
What are the strengths and weaknesses here? If we are standing in the filesystem looking upwards and watching I/O activity, is there any fundamental difference in the two?
With the exception of additional metadata requirements due to additional filesystem object names (e.g. files), as you stand in the filesystem and look upwards at the I/O requests pouring down from an application, there is little to distinguish the two I/O paradigms.
- PMPIO involves more filesystem metadata. So, PMPIO is more susceptible to bad performance on filesystems where metadata is handled poorly or unscalably.
- See discussion below
- Decomposition into multiple files may undermine the self-describing nature of an HDF5 file. Keeping all the data in one file may give future consumers of the data a better chance of understanding the information stored.
- If the close, handoff, open cycle does not perform well, that may dominate the performance of accessing PMPIO files.
- The performance of the close, handoff, open cycle is influenced both by filesystem’s performance as well as the HDF5 library (as well as any I/O libraries layered above it).
- Metadata that accumulates in the file can grow such that each successive processor has more metadata to process, causing scaling issues.
- In terms of programming, it’s dirt simple, regardless of the wildly varying I/O needs of each processor: a processor doesn’t have to worry about what other processors are doing. For large, multi-physics applications, where data requirements can vary wildly from processor to processor, this is significant. For monolithic science codes it is less of a concern, because such a code is designed around one or a very small number of shared data structures whose distribution across processors is predictable and easily understood.
- HDF5’s all-collective-all-the-time dataset creation interface makes handling such applications with the current parallel interface unwieldy at best (a sketch following this list illustrates the constraint).
- Some simple work-arounds are possible, though. One we’ve considered is deferred dataset creation, where processors can create datasets locally without requiring parallel communication, but all operations on such datasets are suspended until the application calls a collective sync operation. A similar approach built on top of HDF5 is H5MButil.
- Compression is easy here. It’s nothing special, since access is through a serial VFD, and it can be used today in HDF5 (a compression sketch also follows this list).
- No implied re-organization of the data into a consistent global view on write and no re-distribution on read. The data is decomposed into pieces and it stays in those pieces in perpetuity.
- Parallel performance demands very little in the way of extra/advanced features from the underlying I/O components and filesystem. There isn’t any massive reshuffle or re-distribution of the data on read/write. The distribution of data as it comes from the application is essentially piggybacked to serve as the parallel distribution of the data across the interconnect, through the I/O nodes and onto the filesystem’s magnetic media. A relatively dumb filesystem can get it right and do it well.
- There is no artificial imposition of reorganization of data as the number of processors is varied from run to run for a given dataset. There is, of course, an assumption that applications interacting with the data are designed to support a domain-overload sort of execution model, where each processor can be handed one or more than one domain (e.g. chunk) of data and the work it does is a loop over domains. There is, of course, some overhead an application pays for this as, for example, these domain chunks are maintained distinctly instead of merged into a larger, single coherent domain chunk that is operated on in one fell swoop. But a win is that such applications can run the same problem on any number of processors up to the number of domain chunks the problem is decomposed into. For example, a problem broken into 60 chunks can be run on 2, 4, 5, 6, 10, 12, 15, 20, 30, or 60 processors, where each processor is assigned 30, 15, 12, 10, 6, 5, 4, 3, 2, or 1 domain chunks, respectively. It can also run on 17 or 37 processors, although the load will not be as well balanced in that case because the number of processors does not divide the number of chunks evenly. But that still provides flexibility in scheduling and running the application.
- Application controlled throttling of I/O is easily supported with PMPIO. That is, explicitly controlling the number of concurrent writers is easily handled in a PMPIO setting. This allows an application to avoid situations where its I/O requests, if let loose with reckless abandon, can easily overwhelm the underlying I/O subsystems.
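To illustrate the point above about HDF5's collective dataset creation, here is a minimal sketch of the single-shared-file path, assuming an HDF5 build with MPI-IO/parallel support; the file and dataset names and sizes are illustrative assumptions. The essential constraint is that every rank must participate in every H5Dcreate2 call, even for datasets it contributes no data to.

```c
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Open a single shared file with the MPI-IO virtual file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Dataset creation is collective: even if only rank 0 has "extra_physics"
       data, every rank must make this call with identical arguments. */
    hsize_t dims[1] = {100};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "extra_physics", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Raw data writes, by contrast, may be independent; ranks that have no
       data for this dataset can simply skip the write. */
    if (rank == 0) {
        double buf[100] = {0};
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
    }

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```

For a multi-physics code in which each processor may need a different set of datasets, every one of those creations has to be replicated collectively across all ranks; that is the burden the deferred-creation work-arounds mentioned above try to relieve.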
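And to make the compression point concrete: because each PMPIO file is written through a serial VFD, the standard chunking-plus-filter machinery applies unchanged. A minimal sketch follows; the helper name, chunk size, and compression level are illustrative assumptions.

```c
#include <hdf5.h>

/* Write one rank's piece of data, gzip-compressed, into its PMPIO file.
   'file' is an HDF5 file handle obtained through the usual serial
   H5Fcreate/H5Fopen calls inside the baton-passing cycle. */
static void write_compressed_piece(hid_t file, const char *name,
                                   const double *data, hsize_t n)
{
    hsize_t chunk = (n < 1024) ? n : 1024;    /* illustrative chunk size */

    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);            /* filters require chunking */
    H5Pset_deflate(dcpl, 6);                  /* gzip, level 6 */

    hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
}
```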
When a large, parallel object is stored in its decomposed state (either to separate files or to separate objects within a single shared file), each piece winds up getting a unique name string that differs from the other pieces’ name strings in only a small portion of its characters. In other words, there are a lot of very similar strings. Relative to the total raw data being stored, the storage cost for these unique name strings is not significant.
What is significant, however, is the potential scaling issues of the underlying software responsible for managing all these unique but highly similar name strings. At the filesystem level, Lustre winds up having to manage hundreds of thousands of filenames for a given single dataset. Likewise, in an HDF5 file, the HDF5 library winds up having to manage hundreds of thousands of object names. Again, the storage cost is not so much the issue as the scaling of the data structures the software uses to manage these names.
In the Silo library, a caller is responsible for constructing a multi-block object that holds all the individual object piece names. This introduces a scaling problem: only one processor can be responsible for writing Silo’s multi-block objects to the root file, so all the names need to be created in one large array on that processor and then written to the file. To correct for this, Silo was recently enhanced to support an sprintf-like namescheme pattern that defines a rule for computing an object name from its position (index or offset) in the list of names. This enhanced multi-block object avoids the need to gather a large number of similarly named strings into one linear list. However, it still does not address the issue that all the component objects in the file still have names associated with them that need to be managed by the underlying software (HDF5 and/or Lustre).
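The namescheme idea can be sketched generically: instead of materializing a linear list of hundreds of thousands of nearly identical strings, store only a small printf-style rule plus a count, and compute any member's name on demand from its index. The rule and counts below are illustrative assumptions, not Silo's actual namescheme syntax.

```c
#include <stdio.h>

/* Compute the name of block i from a rule rather than storing all names.
   Illustrative rule: block i lives in file (i / blocks_per_file) under a
   directory named after its global block index. */
static void block_name(char *buf, size_t len, int i, int blocks_per_file)
{
    snprintf(buf, len, "dump_%03d.h5:/domain_%05d/mesh",
             i / blocks_per_file, i);
}

int main(void)
{
    char name[128];
    /* Only the rule and the counts (say, 100000 blocks, 256 per file) need
       to be stored in the root file; any name is recomputed on demand. */
    for (int i = 0; i < 5; i++) {
        block_name(name, sizeof(name), i, 256);
        printf("%s\n", name);
    }
    return 0;
}
```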
The question arises whether Lustre or HDF5 could be enhanced to support the names of files/objects in a given collection in a similar way, thereby avoiding whatever scaling issues may arise as the number of unique but highly similar name strings grows.