improve Darshan modules' record layout
Currently, each module receives its own distinct memory buffer when registering with Darshan, and that buffer stores only records belonging to that module. At shutdown time, each module's memory buffer is first sorted to place all shared records in a contiguous region so that reductions can be run on them. After that, the buffers are compressed and written out collectively. A log file header records the extents in the output log file that correspond to each module so that their data can easily be retrieved.
Some of our initial performance testing has shown that this is not the optimal design. Specifically, the decision to compress and write out memory buffers on a per-module basis leads to reduced compression efficiency and longer shutdown times, since a collective write is performed for each active module. It would be preferable to compress and write everything once, rather than once per module.
To avoid this issue, we would need to modify Darshan to manage a single contiguous memory buffer that is used to store records from all modules. Whenever modules register a record with Darshan, it is appended to this buffer and an address to refer to the record is returned. At shutdown time, Darshan would only have to compress this single buffer and write it out all at once.
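As a rough illustration of the registration path described above, here is a minimal sketch of an append-only record buffer. All names (`rec_buf`, `rec_buf_register`, `REC_BUF_SIZE`) are hypothetical and not part of Darshan's actual API, and the 8-byte rounding is an assumption made so that struct records stay aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capacity; a real limit would come from Darshan's config. */
#define REC_BUF_SIZE 4096

/* One contiguous buffer shared by all modules (illustrative only). */
struct rec_buf {
    unsigned char data[REC_BUF_SIZE];
    size_t used; /* bytes handed out so far */
};

/*
 * Reserve space for one record at the end of the buffer and return its
 * address, or NULL if the record does not fit.
 */
static void *rec_buf_register(struct rec_buf *b, size_t rec_size)
{
    assert(b != NULL);

    /* Reject oversized requests before rounding to avoid overflow. */
    if (rec_size > REC_BUF_SIZE)
        return NULL;

    /* Assumption: round up to 8 bytes so each record stays aligned. */
    size_t sz = (rec_size + 7) & ~(size_t)7;
    if (sz > REC_BUF_SIZE - b->used)
        return NULL;

    void *rec = b->data + b->used;
    memset(rec, 0, sz); /* hand back a zeroed record */
    b->used += sz;
    return rec;
}
```

A module would call `rec_buf_register(&buf, sizeof(its_record_struct))` once per new record and keep the returned pointer. At shutdown only `buf.data[0 .. buf.used)` would need to be compressed and written.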
A couple of problems are evident from the proposed approach:
1.) How would a log consumer be able to extract data from the log file without knowing what order records were stored in? I.e., given a contiguous buffer of records from numerous modules, how do we know which addresses correspond to which types of records? (Each module has its own record struct, so records from different modules will likely have different lengths.)
2.) How do we handle shared file reductions? Data records from active modules will be stored together, meaning we likely won't have a contiguous range of records to reduce.
For issue 1), we have a simple solution: prefix every record with a "Darshan base record" that gives the record identifier and the associated module identifier (and perhaps other data that is likely to be used by all modules, such as the corresponding rank) for the record that follows. This turns out to be a natural refactor, as all current modules already store this type of info at the beginning of each record (sans the module identifier, which is the most critical piece for this mechanism). Then, when a consumer is reading a log file, it can work through the buffer of module data, peeking at each record's module identifier to determine exactly how to read what follows. This solution gains flexibility in how records are stored at runtime at the cost of more complexity in the log reading utilities, which is completely fine.
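To make the "peek at the module identifier" idea concrete, here is a minimal sketch of a consumer walking a mixed buffer. The struct layout, the two example modules (`MOD_A`, `MOD_B`), and the size table are all hypothetical. The sketch also assumes that each module's records have a fixed size that the reader can look up from the module identifier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical base record that prefixes every module record. */
struct base_rec {
    uint64_t id;     /* record identifier */
    int64_t  rank;   /* owning rank */
    uint8_t  mod_id; /* module identifier (1 byte) */
};

/* Two made-up modules with differently sized records. */
enum { MOD_A = 0, MOD_B = 1, MOD_COUNT };
struct mod_a_rec { struct base_rec base; int64_t opens; };
struct mod_b_rec { struct base_rec base; int64_t reads; int64_t writes; };

/* Assumption: record size is fully determined by the module id. */
static const size_t mod_rec_size[MOD_COUNT] = {
    sizeof(struct mod_a_rec),
    sizeof(struct mod_b_rec),
};

/*
 * Walk a buffer of mixed records and count the records of each module.
 * Returns 0 on success, -1 if the buffer is truncated or malformed.
 */
static int walk_records(const unsigned char *buf, size_t len,
                        size_t counts[MOD_COUNT])
{
    assert(buf != NULL && counts != NULL);
    memset(counts, 0, MOD_COUNT * sizeof(counts[0]));

    size_t off = 0;
    while (off < len) {
        struct base_rec base;
        if (len - off < sizeof(base))
            return -1;
        /* Copy the prefix out so we never read through a misaligned pointer. */
        memcpy(&base, buf + off, sizeof(base));
        if (base.mod_id >= MOD_COUNT)
            return -1;

        size_t sz = mod_rec_size[base.mod_id];
        if (len - off < sz)
            return -1;
        counts[base.mod_id]++;
        off += sz; /* skip to the next record's base prefix */
    }
    return 0;
}
```

A real reader would dispatch to a per-module parsing routine instead of counting, but the control flow (read the prefix, look up the module, advance by that module's record size) would be the same.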
For issue 2), there are a couple of solutions: sort Darshan's record buffer in a way that places shared records from the same module in a contiguous memory region, or use custom MPI datatypes to do a reduction on noncontiguous records. I'm not sure how the overhead of each method compares, but intuitively the first option (sorting) seems like it would be easier and more efficient. In fact, if we use a sorting algorithm that puts all records from the same module in a contiguous memory region, compression efficiency should increase (since we would be compressing sequences of records with identical structure), making this an even more appealing option.
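The sorting option could be sketched as follows. Because records from different modules have different sizes, this sketch sorts an index of pointers to each record's base prefix rather than the records themselves; packing them into a contiguous output buffer in the sorted order would be a separate step. The names are hypothetical, and treating `rank == -1` as the marker for a shared record is an assumption of the sketch. It also uses a compound key (module, then shared flag, then record id), which is one possible way to get both orderings from a single `qsort` call. Whether that is actually cheaper than two passes would need to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical base record that prefixes every module record. */
struct base_rec {
    uint64_t id;     /* record identifier */
    int64_t  rank;   /* assumption: -1 marks a shared record */
    uint8_t  mod_id; /* module identifier */
};

/*
 * Compare two index entries: group by module, put shared records first
 * within each module, then order by record id so that every rank ends up
 * with its shared records in the same order.
 */
static int rec_cmp(const void *pa, const void *pb)
{
    const struct base_rec *a = *(const struct base_rec *const *)pa;
    const struct base_rec *b = *(const struct base_rec *const *)pb;

    if (a->mod_id != b->mod_id)
        return a->mod_id < b->mod_id ? -1 : 1;

    int a_shared = (a->rank == -1);
    int b_shared = (b->rank == -1);
    if (a_shared != b_shared)
        return a_shared ? -1 : 1;

    if (a->id != b->id)
        return a->id < b->id ? -1 : 1;
    return 0;
}

/* Sort an index of record pointers; the records themselves do not move. */
static void sort_record_index(const struct base_rec **index, size_t n)
{
    assert(index != NULL || n == 0);
    qsort(index, n, sizeof(index[0]), rec_cmp);
}
```

After sorting, each module's shared records occupy one contiguous run of the index, which is the property the reduction step needs once the records are packed in that order.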
So, what we need to do:
- refactor module records to begin with a Darshan "base record"
- update Darshan's record registration and memory management to use a single buffer for all module records
- implement a sort algorithm to sort records by module identifier
- compress and collectively write out this big record buffer
- update darshan logutils API to be able to consume the new record format
Some potential advantages:
- improved compression efficiency
- reduced number of collective I/O ops at shutdown
- better memory utilization, as there will no longer be the fragmentation or wasted buffer space that occurs when each module has its own distinct buffer
- no longer any need for module offset/extent pairs in the Darshan header, which will greatly reduce the fixed size of this header
Drawbacks:
- increased record size to include module identifier (1 byte...which is likely offset by increased compression efficiency and much smaller header, so probably not really a drawback)
- the need to sort a big array of records at runtime, probably twice (once to get modules in order, and again to get each module's shared records in order)
- development effort :)
The key to deciding whether this redesign is a good idea is measuring how the overhead of sorting all records at shutdown time and doing a single compress/collective write compares to doing independent compress/collective writes for each module.