Commit c8124a33 authored by rjzamora

Adding ccio_overview.md, but the documentation is old. I will replace the file with up-to-date documentation in the next commit (this commit is just saving older work).

parent 4894c43b
# The Custom-Collective IO (CCIO) Virtual File-Driver Layer (VFDL)
This document is intended to provide an overview of the custom-collective IO (CCIO) virtual file-driver layer (VFDL) for parallel I/O operations in HDF5. The purpose of CCIO is to provide a dedicated optimization layer, on *top* of the existing MPIO virtual file driver (VFD), for the implementation of collective optimizations outside of the MPI-IO library. As currently implemented, the CCIO VFDL assumes that an MPI library is available on the system. When enabled, calls to the MPIO VFD can be rerouted into the specialized CCIO routines, which can ultimately use MPI-IO and/or POSIX/GNI/OFI to interact with the underlying parallel file system (PFS).
The CCIO VFDL is designed to support/modify HDF5 `select_write` and `select_read` operations. For example, in the MPIO VFD, a `select_write` call will utilize `H5FD_mpio_custom_write`, which transfers data between *flatbuf* structures in both file and memory space. When such an operation is performed, the default HDF5 call path can be overridden by a custom CCIO algorithm.
An ideal CCIO end product will contain a suite of optimizations for system-aware data aggregation and PFS access. Since, by default, parallel HDF5 relies on the underlying MPI-IO library for collective I/O performance, the key advantage of this new layer is that it provides a dedicated module for the implementation of algorithms that may be missing from the available MPI-IO libraries on a given system (and/or may have unnecessary overhead within the existing HDF5 MPIO-VFD). The optimizations implemented in this new layer may be system dependent, application dependent, and/or experimental.
![alt text](softstack.png "Software Stack")
**Fig 1.** *Relative position of the CCIO VFDL within the larger HDF5 software stack. The new optimization layer is closely integrated with the MPIO VFD. From the perspective of the user-application, CCIO-based optimizations are simply modified versions of default MPIO-driver routines.*
Currently, the existing CCIO implementation includes:
1. An optimized aggregation algorithm for *two-phase* I/O (for both *read* and *write* operations). This optimization leverages **one-sided** (RMA) communication to minimize aggregation overhead and ultimately improve scalability. The implementation also avoids the overhead of constructing/deconstructing `MPI_Datatype` objects.
2. Topology-aware aggregator selection, based on the TAPIOCA library [1].
As illustrated in **Fig 1**, the CCIO VFDL is implemented on top of the MPIO VFD, and it is designed to intercept I/O calls to the MPIO VFD, and apply custom optimizations. The new layer can be understood as an organized collection of *new* MPIO-VFD driver routines, mostly contained in the `H5FDmpio_ccio.c` file (and headers).
![alt text](alt_diagram.png "Detailed Diagram")
**Fig 2.** *More-detailed illustration of the larger HDF5 software stack, and how CCIO fits in.*
## Call Stack Overview
An example of a typical CCIO call stack is illustrated in **Fig 3** for the case of an `H5Dwrite` call with the one-sided aggregation optimization. Currently, activation of this optimization is accomplished by setting the `HDF5_CUSTOM_AGG_WR` environment variable to `yes`. After an `H5Dwrite` call, this environment variable is first queried in `H5D__contig_collective_write`, where the usual call to `H5D__inter_collective_io` is redirected to `H5F_select_write` (if `HDF5_CUSTOM_AGG_WR=yes`).
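For reference, a minimal user-side sketch of enabling this path looks roughly like the following (the dataset and dataspace handles are placeholders; only the environment variable and the collective transfer property list come from the description above):
```
#include <stdlib.h>
#include "hdf5.h"

/* Hypothetical sketch: route a collective write through the CCIO one-sided
 * path for an existing parallel-HDF5 dataset (all handles are placeholders). */
void write_with_ccio(hid_t dset_id, hid_t mem_space, hid_t file_space,
                     const double *buf)
{
    /* Enable the CCIO one-sided write path (normally exported in the job
     * script before launching the application). */
    setenv("HDF5_CUSTOM_AGG_WR", "yes", 1);

    /* Collective transfer property list, as for any MPIO-VFD write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, mem_space, file_space, dxpl, buf);
    H5Pclose(dxpl);
}
```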
The `H5F_select_write` function takes a data selection in memory and passes it along to the appropriate VFD (the dispatch happens inside `H5FD_select_write`). Since CCIO interfaces with the HDF5 MPIO VFD, the corresponding *select_write* call is explicitly defined in the `H5FD_mpio_g` structure (which is defined in `H5FDmpio.c`):
```
static const H5FD_class_mpi_t H5FD_mpio_g = {
{ /* Start of superclass information */
... ...
H5FD_mpio_custom_write, /*select_write*/
... ...
}, /* End of superclass information */
... ...
};
```
Here, the MPIO VFD version of *select_write* is defined as `H5FD_mpio_custom_write` in `H5FD_mpio_g`. Currently, this `H5FD_mpio_custom_write` function acts as the entry point to the optimized one-sided write algorithm, and will also be a natural entry point for other CCIO-based write optimizations in the future. For read operations, a similar `H5FD_mpio_custom_read` function has been added.
Aside from the `H5FDmpio.c` file, most of the source code for CCIO is located in the `H5FDmpio_ccio.c` file (and `H5topology.h` for topology-aware aggregator selection).
![alt text](stack.png "Call Stack")
**Fig 3.** *Typical call stack for an `H5Dwrite` call in a user application (with the `HDF5_CUSTOM_AGG_WR` environment variable set to `yes`). The source files with key changes for the CCIO implementation are highlighted in orange. In `H5Dmpio.c`, the `H5D__<*>_collective_write` call will be routed to the `contig` or `chunk` routine, depending on the storage option chosen by the user application. The lion's share of the CCIO changes is in the `H5FDmpio_ccio.c` file.*
### Current Limitations
- The one-sided collective algorithm is only implemented for contiguous file storage. Chunked storage will use the one-sided algorithm if the user sets the `opt` mode to `H5FD_MPIO_CHUNK_MULTI_IO`, but this approach is not performant (and not recommended). If the `MULTI` chunk mode is *not* set, the default MPIO driver is used.
  - **Note:** To set the `MULTI` chunk mode: `H5Pset_dxpl_mpio_chunk_opt(xferPropList, H5FD_MPIO_CHUNK_MULTI_IO);` (a fuller sketch follows this list).
- Both the *write* and *read* implementations of the one-sided aggregation algorithm can be improved; in particular, there is still room to overlap the aggregation and I/O stages.
- On ALCF's **Theta** machine (Cray XC40), the one-sided implementation may require `MPICH_MPIIO_CB_ALIGN=3`. Therefore, the ROMIO `ad_lustre` MPI-IO file driver is used (which is *not* supported by Cray).
  - **Note:** This setting is not strictly required. `MPICH_MPIIO_CB_ALIGN=2` may work fine, but the initial implementation is MPICH-focused, and the Cray implementation is not yet well tested.
- The topology-aware aggregator selection algorithm is very simple and can only be performed at file-open time. For Theta, the distance to I/O nodes is **not** yet considered during the cost optimization.
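For reference, a minimal sketch of a transfer property list that requests collective I/O and forces the `MULTI` chunk mode (the property-list name `xferPropList` matches the note above; everything else is a generic placeholder):
```
#include "hdf5.h"

/* Sketch: build a collective transfer property list that forces the
 * multi-chunk path, so chunked datasets can reach the one-sided algorithm
 * (contiguous storage needs no extra step). */
hid_t make_ccio_chunked_dxpl(void)
{
    hid_t xferPropList = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(xferPropList, H5FD_MPIO_COLLECTIVE);
    H5Pset_dxpl_mpio_chunk_opt(xferPropList, H5FD_MPIO_CHUNK_MULTI_IO);
    return xferPropList;   /* caller passes this to H5Dwrite and closes it */
}
```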
## List of Modified Files
### `H5FDmpio_ccio.c`
This file is entirely new, and dedicated to CCIO algorithms.
### `H5FDmpio.c`
The `H5FD_ccio_setup` function is added to define the appropriate variable structures when the `HDF5_CUSTOM_AGG_WR` or `HDF5_CUSTOM_AGG_RD` environment variable is set to `yes`. This function is also responsible for selecting the aggregator ranks. On BGQ, this is currently achieved by using the aggregator list provided by MPICH. On Theta/Lustre, a simple `generateRanklist` function (defined in `H5FDmpio_ccio.c`) is used. If `HDF5_CUSTOM_TOPO=yes`, the topology API is used to make an optimal selection (assuming, for now, that all ranks send equal amounts of data to all aggregators). Since smart aggregator selection may be more beneficial when the number of aggregators is small (compared to the total number of ranks or nodes), the user can also specify the number of aggregators to use by setting the `HDF5_CB_NODES_OVERRIDE` environment variable.
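As an illustration only (the actual `generateRanklist` logic in `H5FDmpio_ccio.c` may differ), an evenly spaced aggregator selection could be sketched as follows:
```
/* Hypothetical sketch of an evenly spaced aggregator-rank selection:
 * choose cb_nodes aggregator ranks out of nprocs MPI ranks. The actual
 * generateRanklist routine in H5FDmpio_ccio.c may use different logic. */
static void generate_ranklist_sketch(int nprocs, int cb_nodes, int *ranklist)
{
    int stride = nprocs / cb_nodes;        /* assumes cb_nodes <= nprocs */
    for (int i = 0; i < cb_nodes; i++)
        ranklist[i] = i * stride;          /* one aggregator per "stride" of ranks */
}
```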
The call to `H5FD_ccio_setup` is performed inside the `H5FD_mpio_open` function, when the file is opened. However, the `file->custom_agg_data.ranklist` is also passed to the appropriate `ADIOI_*_WriteStridedColl_CA` function (in `H5FDmpio_ccio.c`) during each `H5Dwrite` operation, so the aggregator ranklist can still be modified between operations for topology-aware aggregator placement.
We also define the `H5FD_mpio_custom_write` function (and set it as the appropriate *select_write* call in `H5FD_mpio_g`), which is responsible for actually calling the custom aggregation procedures in `H5FDmpio_ccio.c`. The `custom_write` implementation uses a helper `H5FD_mpio_setup_flatbuf` function to define the appropriate *flatbuf* structure.
### `H5Dmpio.c`
Some *write* (and *read*) functions are modified to check for the `HDF5_CUSTOM_AGG_WR` and `HDF5_CUSTOM_AGG_RD` environment variables, and the call is redirected to `H5F_select_<read,write>` if the value of the corresponding variable is `yes`.
### `H5Smpio.c`
Definition of new functions (that are declared in `H5Sprivate.h`) to make some `H5S_t`-structure variables available to the new `H5FD_mpio_custom_write` function (defined in `H5FDmpio.c`, outside the `H5S` module):
- `H5S_mpio_return_space_rank_and_extent`
- `H5S_mpio_return_space_extent_and_select_type`
Note that `H5FDmpio.c` has access to `H5Sprivate.h` through the included `H5Dprivate.h` header file.
### `H5Dprivate.h`
See above.
### `H5Sprivate.h`
In order to expose the existing `H5S_<hyper,point,all>_get_seq_list` routines to the new `H5FD_mpio_custom_write` function, their declarations were moved to `H5Sprivate.h` from `H5Shyper.c`, `H5Spoint.c`, and `H5Sall.c`.
We also define the `H5S_flatbuf_t` structure in this file, and declare `H5S_mpio_return_space_rank_and_extent` and `H5S_mpio_return_space_extent_and_select_type`. These last two functions are defined in `H5Smpio.c` (see above).
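For orientation, a sketch of what a flatbuf typically carries is shown below; only the `indices` and `blocklens` members are documented in this overview, so the remaining field name and exact types are assumptions modeled on ROMIO's flattened-datatype lists:
```
#include <stddef.h>
#include "hdf5.h"   /* hsize_t */

/* Illustrative sketch of a flattened selection ("flatbuf"): a list of
 * (offset, length) pairs describing a selection in memory or in the file.
 * Only the indices and blocklens members are documented here; the count
 * field and exact types are assumptions. */
typedef struct H5S_flatbuf_sketch_t {
    size_t   count;      /* number of contiguous (offset, length) pairs */
    hsize_t *indices;    /* starting offset of each contiguous block    */
    size_t  *blocklens;  /* length (in bytes) of each contiguous block  */
} H5S_flatbuf_sketch_t;
```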
#### `H5Shyper.c`
Commenting out `H5S_hyper_get_seq_list` declaration, because it was moved to `H5Sprivate.h`.
#### `H5Spoint.c`
Commenting out `H5S_point_get_seq_list` declaration, because it was moved to `H5Sprivate.h`.
#### `H5Sall.c`
Commenting out `H5S_all_get_seq_list` declaration, because it was moved to `H5Sprivate.h`.
## One-sided Aggregation Optimization
The CCIO VFDL currently includes a one-sided implementation of the conventional two-phase collective write algorithm. Much of the implementation is borrowed from the MPICH-ROMIO version of MPI-IO. Implementing the algorithm directly in HDF5 avoids the need to rely on the available version of MPI-IO to achieve good collective I/O performance. **Its HDF5-based implementation also avoids the performance overhead incurred by the construction/deconstruction of `MPI_Datatype` structures.** In the usual MPIO VFD, HDF5 constructs an `MPI_Datatype` from the dataset selection, and the MPI-IO implementation then needs to deconstruct the datatype to recover a set of offset/length pairs (for highly discontiguous data, this can be expensive).
The one-sided aggregation optimization is called from the `H5FD_mpio_custom_<read,write>` function in `H5FDmpio.c` (which is used only if the `HDF5_CUSTOM_AGG_<RD,WR>` environment variable is set to `yes`). An extended `if`/`else if` control sequence spans most of the code in this function. Each branch corresponds to a different combination of file- and memory-space selection types: `H5S_SEL_NONE`, `H5S_SEL_ALL`, `H5S_SEL_POINTS`, or `H5S_SEL_HYPERSLABS`. Depending on the values of `mem_space_sel_type` and `file_space_sel_type`, the *memory* and *file* flatbuf structures are defined and the appropriate `H5S_<point,all,hyper>_get_seq_list` function is called to create a list of offsets and lengths for the selection. These offsets and lengths are stored in the `indices` and `blocklens` members of the `*_flatbuf` structures.
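As a public-API analogue of this flattening step (illustrative only; the CCIO code itself works on internal `H5S_t` structures via the `get_seq_list` helpers), a hyperslab selection can be expanded into a block list as follows:
```
#include <stdlib.h>
#include "hdf5.h"

/* Public-API illustration (not the CCIO code itself) of "flattening" a
 * hyperslab selection into a list of blocks -- conceptually similar to the
 * offset/length lists that the get_seq_list helpers place in the indices
 * and blocklens members of a flatbuf. */
static hsize_t *hyperslab_blocklist(hid_t space_id, hssize_t *nblocks_out)
{
    int      rank    = H5Sget_simple_extent_ndims(space_id);
    hssize_t nblocks = H5Sget_select_hyper_nblocks(space_id);

    /* Each block is described by a "start" corner and an "end" corner,
     * so the list holds 2 * rank coordinates per block. */
    hsize_t *blocks = malloc((size_t)nblocks * 2 * (size_t)rank * sizeof(hsize_t));
    H5Sget_select_hyper_blocklist(space_id, 0, (hsize_t)nblocks, blocks);

    *nblocks_out = nblocks;
    return blocks;   /* caller frees */
}
```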
Once the `*_flatbuf` structures are defined, the `ADIOI_<LUSTRE,GPFS>_<Read,Write>StridedColl_CA` function is called (which performs the actual *aggregation* and *<read,write>* algorithms defined in the `H5FDmpio_ccio.c` file). Currently, the plan for any/all future CCIO optimizations is to implement them in the `H5FDmpio_ccio.c` file, and to call them from a `H5FD_mpio_custom_<read,write>` function. This means that all CCIO optimizations will be designed to **write** a **memory** flatbuf into a file flatbuf, or **read** a **file** flatbuf into a memory flatbuf. For now, the one-sided algorithm is activated by setting the `HDF5_CUSTOM_AGG_<RD,WR>` environment variable. However, in the future, the specific CCIO options should probably be set within a **transfer property list** (or something else that is more flexible than a collection of environment variables).
### The `ADIOI_LUSTRE_WriteStridedColl_CA` Function [ Write, LUSTRE ]
This function is mostly borrowed from the MPICH-ROMIO implementation of a function by the same name (without the `_CA` suffix): [ad_lustre](https://github.com/pmodels/mpich/blob/0c1b194db4ddd47350b2dbe93c5a572fc3d01af8/src/mpi/romio/adio/ad_lustre/ad_lustre_wrcoll.c). The procedure still looks very similar to the MPICH implementation, with `_CA` appended to the name of other functions that are also borrowed from MPICH:
- Call `ADIOI_Calc_my_off_len_CA` to return a list of absolute file offsets and associated lengths of contiguous accesses.
- Use `MPI_Allreduce` to synchronize the offsets.
- Use `ADIOI_LUSTRE_Get_striping_info_CA` to get the MPI-IO hint information for the Lustre striping settings.
- Finally, call `ADIOI_LUSTRE_IterateOneSidedWrite_CA` to iteratively pack stripes of data into the collective buffer and then flush the collective buffer to the file when it is fully packed (repeating this process until all the data has been written to the file).
### The `ADIOI_LUSTRE_IterateOneSidedWrite_CA` Function [ Write, LUSTRE ]
The general algorithm is to divide the file into segments (a segment being a contiguous region of the file containing up to one occurrence of each stripe), with the data for each stripe written out by a particular aggregator. The `segmentLen` variable is the maximum size in bytes of each segment (`stripeSize` × the number of aggregators). The function iteratively calls `ADIOI_OneSidedWriteAggregation_CA`, once per segment, to aggregate the data into the collective buffers. However, the aggregators only perform the actual write (via the `flushCB` stripe parameter) once `stripesPerAgg` stripes have been packed, or once aggregation of all the data is complete (minimizing synchronization overhead).
Each iteration of data packing (into the collective buffer) will be referred to here as an aggregation *round*.
The difference between calls to `ADIOI_OneSidedWriteAggregation_CA` is based on whether the `buftype` is contiguous. The algorithm tracks the position in the source buffer when called multiple times. For contiguous data, this is simple and can be externalized with a buffer offset. For non-contiguous data, it is more complex, and the state must be tracked internally (so there is no external buffer offset). Care was taken to minimize changes to `ADIOI_OneSidedWriteAggregation_CA` at the expense of some added complexity in the caller.
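A heavily simplified, standalone sketch of this segment/flush pattern is shown below; all names and the packing/flush callbacks are illustrative, and the real routine also manages flatbufs, RMA windows, and error handling:
```
#include <stddef.h>

/* Illustrative-only skeleton of the segment loop in
 * ADIOI_LUSTRE_IterateOneSidedWrite_CA: pack one segment per round and
 * flush the collective buffer once stripes_per_agg stripes are packed
 * (per aggregator) or the data is exhausted. The pack/flush callbacks
 * stand in for the real aggregation and write calls. */
typedef void (*pack_fn)(size_t segment, size_t segment_len);
typedef void (*flush_fn)(void);

static void iterate_one_sided_write_sketch(size_t total_bytes, size_t stripe_size,
                                           size_t naggs, size_t stripes_per_agg,
                                           pack_fn pack_segment,
                                           flush_fn flush_collective_buffer)
{
    size_t segment_len = stripe_size * naggs;                  /* bytes per segment */
    size_t nsegments   = (total_bytes + segment_len - 1) / segment_len;
    size_t packed      = 0;                                    /* stripes packed per aggregator */

    for (size_t seg = 0; seg < nsegments; seg++) {
        pack_segment(seg, segment_len);   /* aggregation round (one-sided puts/gets) */
        packed++;

        /* Flush once each collective buffer holds stripes_per_agg stripes,
         * or when this was the last segment. */
        if (packed == stripes_per_agg || seg == nsegments - 1) {
            flush_collective_buffer();    /* the actual write to the file system */
            packed = 0;
        }
    }
}
```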
### The `ADIOI_GPFS_WriteStridedColl_CA` Function [ Write, GPFS ]
This function is mostly borrowed from the MPICH-ROMIO implementation of a function by the same name (without the `_CA` suffix): [ad_gpfs](https://github.com/pmodels/mpich/blob/0c1b194db4ddd47350b2dbe93c5a572fc3d01af8/src/mpi/romio/adio/ad_gpfs/ad_gpfs_wrcoll.c). Like the *Lustre* version of the same algorithm, the procedure still looks very similar to the MPICH implementation, with `_CA` appended to the name of other functions that are also borrowed from MPICH.
- Call `ADIOI_Calc_my_off_len_CA` to return a list of absolute file offsets and associated lengths of contiguous accesses.
- Use `MPI_Allreduce` to synchronize the offsets.
- Use `ADIOI_GPFS_Calc_file_domains_CA` to compute a dynamic, access-range-based file-domain partition among the I/O aggregators, aligned to the GPFS block size (see the sketch after this list).
  - The I/O workload is divided among `nprocs_for_coll` processes. This is done by (logically) dividing the file into file domains (FDs); each process may directly access only its own file domain. Additional effort ensures that each I/O aggregator gets a file domain that aligns to the GPFS block size, so there is no false sharing of GPFS file blocks among multiple I/O nodes.
- Finally, call `ADIOI_OneSidedWriteAggregation_CA` once for the entire write operation (note that there is no GPFS version of the `ADIOI_LUSTRE_IterateOneSidedWrite_CA` function).
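As an illustration of the block-alignment idea only (not the actual `ADIOI_GPFS_Calc_file_domains_CA` code), a byte range can be partitioned into block-aligned file domains roughly as follows:
```
#include <stddef.h>

/* Illustrative sketch only (not the real ADIOI_GPFS_Calc_file_domains_CA):
 * split the byte range [start, end) into naggs file domains whose interior
 * boundaries fall on GPFS block-size multiples, so that no two aggregators
 * touch the same file block. Assumes naggs <= number of blocks in the range. */
static void calc_block_aligned_domains(size_t start, size_t end, size_t block_size,
                                       size_t naggs, size_t *fd_start, size_t *fd_end)
{
    size_t first_block = start / block_size;
    size_t last_block  = (end + block_size - 1) / block_size;   /* exclusive */
    size_t nblocks     = last_block - first_block;

    size_t blocks_per_agg = nblocks / naggs;
    size_t extra          = nblocks % naggs;   /* first 'extra' aggregators get one more block */

    size_t cursor = first_block;
    for (size_t a = 0; a < naggs; a++) {
        size_t nb = blocks_per_agg + (a < extra ? 1 : 0);
        fd_start[a] = cursor * block_size;                 /* block-aligned boundary */
        fd_end[a]   = (cursor + nb) * block_size;
        cursor += nb;
    }
    /* The outermost boundaries are clamped to the actual access range. */
    fd_start[0]       = start;
    fd_end[naggs - 1] = end;
}
```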
### The `ADIOI_OneSidedWriteAggregation_CA` Function [ *Mostly* FS-Agnostic ]
Much of this function is borrowed from the MPICH-ROMIO implementation of a function by the same name (without the `_CA` suffix): [`onesided_aggregation.c`](https://github.com/pmodels/mpich/blob/a750d6f6aae22f7ab0408c437b2ea58fe6d13716/src/mpi/romio/adio/common/onesided_aggregation.c).
The `ADIOI_OneSidedWriteAggregation_CA` algorithm is called once for each segment of data, with a segment defined as a contiguous region of the file whose size is one striping unit times the number of aggregators. For Lustre, the striping unit corresponds to the actual file stripe; for GPFS, it is the file domain. Each call effectively packs one striping unit of data into the collective buffer on each aggregator, with additional parameters that govern when to flush the collective buffer to the file. In practice, a collective write (for a file system like Lustre, on a dataset composed of multiple segments) calls the algorithm several times without a flush parameter to fill the collective buffers with multiple stripes of data, before calling it again to flush the collective buffer to the file system. In this fashion, synchronization is minimized, because the flush only needs to occur during the actual read from or write to the file system. **In the case of GPFS, this function is called just once.** The `ADIOI_OneSidedStripeParms_CA` parameter is used to save state and reuse variables across repeated calls, which helps the Lustre implementation avoid costly recomputation. For consistency, GPFS uses these variables as well, but does not use all aspects of them.
Note that this function was originally written for GPFS only and then modified to support Lustre (following a very similar approach to MPICH).
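To make the "one-sided" aspect concrete, the following is a minimal standalone MPI sketch of the underlying data-movement pattern (each rank deposits its contribution directly into an aggregator's collective buffer with `MPI_Put`); it is not the CCIO code itself:
```
#include <stdlib.h>
#include <mpi.h>

/* Minimal standalone illustration of one-sided aggregation (not CCIO code):
 * rank 0 plays the aggregator and exposes a collective buffer through an
 * MPI window; every rank deposits its chunk with MPI_Put instead of the
 * two-sided exchange used by classic two-phase I/O. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 4;              /* ints contributed per rank */
    int *cb = NULL;                   /* collective buffer (aggregator only) */
    MPI_Aint cb_size = 0;

    if (rank == 0) {
        cb = malloc((size_t)nprocs * chunk * sizeof(int));
        cb_size = (MPI_Aint)((size_t)nprocs * chunk * sizeof(int));
    }

    MPI_Win win;
    MPI_Win_create(cb, cb_size, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int mine[4] = { rank, rank, rank, rank };   /* this rank's contribution */

    MPI_Win_fence(0, win);
    /* Deposit this rank's chunk at its offset in the aggregator's buffer. */
    MPI_Put(mine, chunk, MPI_INT, 0, (MPI_Aint)rank * chunk, chunk, MPI_INT, win);
    MPI_Win_fence(0, win);

    /* Here the aggregator (rank 0) would write cb to the file system. */
    MPI_Win_free(&win);
    free(cb);
    MPI_Finalize();
    return 0;
}
```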
The implementation of `ADIOI_OneSidedWriteAggregation_CA` currently spans **~1240** lines in the `H5FDmpio_ccio.c` file. To ease maintenance and the addition of new features, the function should most likely be refactored; however, the MPICH version of a similar algorithm is also very long.
## References
[1] François Tessier, Venkatram Vishwanath, and Emmanuel Jeannot. "TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers." IEEE Cluster 2017, Honolulu, HI, September 2017.
@@ -3,12 +3,22 @@
**Authors:**
- Richard J. Zamora (rzamora@anl.gov)
- Paul Coffman (pcoffman@anl.gov)
- Venkatram Vishwanath (venkat@anl.gov)
**Overview:**
This repository is intended for **documentation** and **benchmark code** for the HDF5 library. The code/documentation in here has been collected through the ExaHDF5 ECP Software and Technology project. The Exerciser benchmark was specifically designed to aid the development of a new Custom-Collective I/O (CCIO) MPIO VFD. For this reason, some build/run instructions will directly refer to the CCIO version of HDF5.
**Benchmarks:**
- HDF5 Exerciser (`Exerciser/exerciser.c`)
**Documentation:**
- *This* README file
- `CCIO_Documentation/ccio_overview.md`
- `Exerciser/README.md` (Main Exerciser README file)
- `Exerciser/Theta/README.md` (Theta HDF5/Exerciser build/run instructions)
- `Exerciser/BGQ/VESTA_XL/README.md` (Vesta HDF5/Exerciser build/run instructions - IBM XL Version)
- `Exerciser/BGQ/VESTA_CH4/README.md` (Vesta HDF5/Exerciser build/run instructions - MPICH-CH4 Version)