Ideal Block-Based VFD
Some of the characteristics listed below may be mutually exclusive; which ones, I don't know. Let's elaborate as we flesh out what this thing looks like.
What is the ideal, block-based Virtual File Driver to support PMPIO?
- Blocks are either pure meta-data (MD) or pure-raw data (RD)
- In HDF5-1.8.4, Mark M. tried all sorts of lib parameters affecting small block size and MD cache size, and none of these had the expected effect, in the VFD, of causing blocks of VFD file space to be 100% MD or 100% RD.
- MD blocks can be written throughout the file (i.e. we don't have to let MD grow without bound and write it all at close)
- Block size of RD controlled independently of MD
- Application can specify how many MD blocks and RD blocks are allowed to be kept in memory at any one time and/or the total memory the VFD is allowed to use to cache/buffer blocks.
- When file is closed, whatever blocks are still in memory are written in increasing file index order.
- Furthermore, blocks that are adjacent to each other in the file are written in larger I/O requests
- May require copying into a larger buffer just before the write
- Produces a single file out the bottom (not one for RD and one for MD)
- Can be re-opened correctly by any standard HDF5 VFD (e.g. sec2)
- PMPIO baton handoff is performed on the open file.
- Means the in-memory state of the file is message-passed to the next processor rather than written on proc i as the result of a close and then read back on proc i+1 as the result of an open. However, from an API-utilization standpoint, it might look to the application like it is closing and then later opening the file; the message-passing optimization occurs transparently under the covers.
- Means that whatever state is passed around does NOT grow as the number of processors the baton is passed between grows.
- Is informed by HDF5's higher-level MD cache of the N hottest MD regions (hot spots), where N is a variable chosen by the caller
- In the presence of a VFD that already does LRU preemption on MD blocks, this may not be worth the effort to implement (Mainzer), particularly if it results in a file that can be opened only by the creator/writer VFD.
- Employs a least-recently-used (LRU) MD block preemption algorithm for deciding which MD blocks to page out to disk and when (a rough sketch appears after this list)
- This may be a viable alternative to maintaining explicit knowledge of MD hot spots in the VFD
- Can handle MD async. (And likely RD async)
- A perfect block-based VFD decorrelates chunks from I/O requests.
- A block-based VFD winds up decorrelating HDF5 chunks from actual I/O requests. So, where before I strongly objected to HDF5 chunking my data because of the artificial fragmentation of I/O into chunks that occurs (by the way, am I wrong, or does HDF5 wind up issuing I/O requests chunk-at-a-time for chunked datasets?), with a block-based VFD like this I don't much care anymore, because the VFD will aggregate those into larger real I/O requests anyway.
- Computes diagnostic statistics for performance debugging (e.g. like Silo’s VFD currently does)
- Can use MPI under the covers to aggregate blocks from different MPI tasks' files into a single, shared file on disk.
- Option to ship blocks off processor via MPI message to…
- Other processors sitting idle within MPI_WORLD_COMM but set aside explicitly to handle I/O
- Special service software running on the actual I/O nodes of the system
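To make the block-caching ideas above concrete (application-chosen limits on resident MD/RD blocks, independent MD/RD block sizes, and LRU preemption of MD blocks), here is a minimal sketch in C. Every name in it (block_cache_t, insert_block, and so on) is hypothetical; nothing here is an existing HDF5 VFD interface, and a real driver would do real I/O and coalesce adjacent blocks at close as described above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch -- none of these names exist in HDF5.  Each cached
 * block is either pure metadata (MD) or pure raw data (RD). */
typedef enum { BLK_MD = 0, BLK_RD = 1 } blk_class_t;

typedef struct block {
    unsigned long  addr;            /* file offset of the block            */
    blk_class_t    cls;             /* MD or RD -- never mixed             */
    void          *buf;             /* in-memory copy of the block         */
    struct block  *prev, *next;     /* LRU list links; head = most recent  */
} block_t;

typedef struct {
    size_t   blk_size[2];           /* independent MD and RD block sizes   */
    size_t   max_resident[2];       /* app-chosen limits on cached blocks  */
    size_t   n_resident[2];
    block_t *head, *tail;           /* one LRU list over all cached blocks */
} block_cache_t;

/* Stand-in for the real "page out to disk" path (pwrite, MPI-IO, ...).
 * A real driver would also flush all remaining blocks in increasing
 * file-offset order at close, coalescing adjacent blocks.                */
static void write_block(const block_cache_t *bc, const block_t *b)
{
    printf("evict %s block at offset %lu (%zu bytes)\n",
           b->cls == BLK_MD ? "MD" : "RD", b->addr, bc->blk_size[b->cls]);
}

static void unlink_block(block_cache_t *bc, block_t *b)
{
    if (b->prev) b->prev->next = b->next; else bc->head = b->next;
    if (b->next) b->next->prev = b->prev; else bc->tail = b->prev;
}

static void push_head(block_cache_t *bc, block_t *b)
{
    b->prev = NULL;
    b->next = bc->head;
    if (bc->head) bc->head->prev = b;
    bc->head = b;
    if (!bc->tail) bc->tail = b;
}

/* Mark an already-cached block most-recently-used (called on every hit). */
static void touch_block(block_cache_t *bc, block_t *b)
{
    unlink_block(bc, b);
    push_head(bc, b);
}

/* Cache a new block; if the application-specified limit for its class is
 * exceeded, page out the least-recently-used block of that same class.   */
static block_t *insert_block(block_cache_t *bc, unsigned long addr, blk_class_t cls)
{
    block_t *b = calloc(1, sizeof(*b));
    b->addr = addr;
    b->cls  = cls;
    b->buf  = calloc(1, bc->blk_size[cls]);
    push_head(bc, b);
    bc->n_resident[cls]++;

    if (bc->n_resident[cls] > bc->max_resident[cls]) {
        block_t *victim = bc->tail;              /* least recently used... */
        while (victim && victim->cls != cls)     /* ...of the same class   */
            victim = victim->prev;
        if (victim) {
            write_block(bc, victim);
            unlink_block(bc, victim);
            free(victim->buf);
            free(victim);
            bc->n_resident[cls]--;
        }
    }
    return b;
}

int main(void)
{
    block_cache_t bc = { .blk_size     = { 4096, 1 << 20 },  /* MD 4 KiB, RD 1 MiB */
                         .max_resident = { 2, 2 } };

    block_t *b0 = insert_block(&bc, 0, BLK_MD);
    insert_block(&bc, 4096, BLK_MD);
    touch_block(&bc, b0);                  /* block 0 becomes most recent    */
    insert_block(&bc, 8192, BLK_MD);       /* over the MD limit: evicts 4096 */
    return 0;
}
```

The per-class limits are the knobs the application would set; the eviction path is where hot-spot hints from the library (see the out-of-band messaging discussion below) could exempt certain MD blocks from preemption.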
The internal parts of the HDF5 lib can communicate directly with the VFD by adding what amounts to out-of-band read/write messages to the VFD. Currently, there is a mem type tag on each message that indicates the type of memory HDF5 is sending to or requesting from the VFD. We could add new types to this enum to support messages sent between the HDF5 lib proper and the VFD. For example, to send information about hot spots in MD, the HDF5 lib could write data to the VFD with a mem_type of MD_HOT_SPOTS. The VFD would advertise to HDF5 whether, and what kind of, out-of-band messaging it supports, so HDF5 would only send such messages to VFDs that claim to support them. This way, it's possible for the HDF5 lib proper to communicate with the VFD without changing the existing VFD API. (QAK: Cool idea!)
Likewise, the HDF5 lib could request information from the VFD via a read with an appropriate mem_type.
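A standalone sketch of that out-of-band messaging idea follows. The enum values, capability flags, and structs are mock-ups that only mimic the shape of HDF5's mem-type tag and VFD write callback; they are not the real H5FD_mem_t or H5FD_class_t, and adding such message types would require a change to the HDF5 library itself.

```c
#include <stdio.h>
#include <string.h>

/* Mock mem-type tag: the first few values stand in for the existing
 * "kinds of memory"; the last two are the proposed out-of-band types. */
typedef enum {
    MEM_SUPER,
    MEM_BTREE,
    MEM_DRAW,
    /* --- proposed out-of-band message types ------------------------ */
    MEM_MD_HOT_SPOTS,   /* library -> VFD: the N hottest MD regions     */
    MEM_STATS_QUERY     /* library <- VFD: I/O statistics on request    */
} mem_type_t;

/* Capability mask the VFD advertises at registration time, so the
 * library only sends out-of-band messages to drivers that opt in.      */
#define VFD_CAP_OOB_HOT_SPOTS  0x1u
#define VFD_CAP_OOB_STATS      0x2u

typedef struct {
    unsigned long addr;   /* start of a hot MD region */
    unsigned long len;
} hot_spot_t;

typedef struct {
    unsigned    capabilities;
    hot_spot_t  hot_spots[16];
    size_t      n_hot_spots;
} vfd_t;

/* VFD "write" entry point: ordinary mem types fall through to normal
 * block I/O; the new types are consumed as out-of-band messages.       */
static int vfd_write(vfd_t *f, mem_type_t type,
                     unsigned long addr, size_t size, const void *buf)
{
    switch (type) {
    case MEM_MD_HOT_SPOTS:
        /* Library is telling us where the hot MD is: keep those blocks
         * resident and exempt them from LRU eviction.                  */
        f->n_hot_spots = size / sizeof(hot_spot_t);
        memcpy(f->hot_spots, buf, size);
        return 0;
    default:
        printf("normal write: type=%d addr=%lu size=%zu\n",
               (int)type, addr, size);
        return 0;
    }
}

int main(void)
{
    vfd_t f = { .capabilities = VFD_CAP_OOB_HOT_SPOTS };
    hot_spot_t spots[2] = { { 0, 4096 }, { 65536, 8192 } };

    /* Library side: only send the message if the VFD opted in.         */
    if (f.capabilities & VFD_CAP_OOB_HOT_SPOTS)
        vfd_write(&f, MEM_MD_HOT_SPOTS, 0, sizeof(spots), spots);
    return 0;
}
```

The capability mask is how the VFD "advertises" which messages it understands, so the library never sends out-of-band traffic to a driver that would misinterpret it as ordinary data.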
Given the variety of functionality listed above and the desire for good software engineering practices and results, it seems likely that teasing apart the different kinds of functionality into multiple aspects, tied together by a common VFD framework, would be desirable (a rough sketch of such a composition appears after the list below). The HDF Group has tackled this before, and nearly finished a prototype. We should resurrect that project and implement it, so that these features can be combined flexibly by application developers.
Possible VFD aspects, from characteristics above:
- Accessing and caching pages from the file (i.e. “page buffering”)
- PMPIO baton passing/state transfer between processes participating in accessing a file
- Performance/diagnostic logging of I/O operations
- Asynchronous I/O
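As a rough, purely hypothetical sketch of how such a framework might compose aspects, each aspect could be a pass-through driver that implements one concern and delegates everything else to the driver below it (none of these types or names exist in HDF5):

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical "aspect" interface: a stack of pass-through drivers,
 * each handling one concern and delegating the rest downward.         */
typedef struct vfd_aspect {
    const char        *name;
    struct vfd_aspect *next;   /* the aspect (or terminal VFD) below us */
    int (*write)(struct vfd_aspect *self, unsigned long addr,
                 size_t size, const void *buf);
} vfd_aspect_t;

/* Terminal driver: actually hits storage (stubbed here).              */
static int sec2_write(vfd_aspect_t *self, unsigned long addr,
                      size_t size, const void *buf)
{
    (void)buf;
    printf("[%s] write addr=%lu size=%zu\n", self->name, addr, size);
    return 0;
}

/* Logging aspect: record the operation, then delegate downward.       */
static int log_write(vfd_aspect_t *self, unsigned long addr,
                     size_t size, const void *buf)
{
    printf("[%s] logging write addr=%lu size=%zu\n", self->name, addr, size);
    return self->next->write(self->next, addr, size, buf);
}

/* Page-buffering aspect: here it just delegates; a real one would
 * aggregate writes into MD/RD blocks as described above.              */
static int pagebuf_write(vfd_aspect_t *self, unsigned long addr,
                         size_t size, const void *buf)
{
    return self->next->write(self->next, addr, size, buf);
}

int main(void)
{
    vfd_aspect_t sec2    = { "sec2",    NULL,    sec2_write    };
    vfd_aspect_t logger  = { "log",     &sec2,   log_write     };
    vfd_aspect_t pagebuf = { "pagebuf", &logger, pagebuf_write };

    /* Application picks the stack: page buffering over logging over sec2. */
    unsigned char data[64] = { 0 };
    pagebuf.write(&pagebuf, 4096, sizeof(data), data);
    return 0;
}
```

An application could then stack, say, page buffering over logging over sec2, or swap in a PMPIO baton-passing or asynchronous-I/O aspect, without any aspect needing to know about the others.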