Commit 4daaa7c8 authored by rjzamora's avatar rjzamora

major exerciser cleanup - getting ready to merge with master and there needs to be very clear documentation to point to in topology-aware VFD milestone report
parent 3e040d8c
# The Custom-Collective IO (CCIO) Virtual File-Driver Layer (VFDL)
This document is intended to provide an overview of the custom-collective IO (CCIO) virtual file-driver layer (VFDL) for parallel I/O operations in HDF5. The purpose of CCIO is to provide a dedicated optimization layer, on *top* of the existing MPIO virtual file driver (VFD), for the implementation of collective optimizations outside of the MPI-IO library. As it is currently implemented, the CCIO VFDL requires that an MPI library is available on the system. When enabled, calls to the MPIO VFD can be rerouted into the specialized CCIO routines, which can ultimately use either MPI-IO and/or POSIX/GNI/OFI to interact with the underlying parallel file system (PFS).
The CCIO VFDL is designed to support/modify HDF5 `select_write` and `select_read` operations. For example, in the MPIO VFD, a `select_write` call will utilize `H5FD_mpio_custom_write`, which transfers data between *flatbuf* structures in both file and memory space. When such an operation is performed, the default HDF5 call path can be overridden by a custom CCIO algorithm.
An ideal CCIO end product will contain a suite of optimizations for system-aware data aggregation and PFS access. Since, by default, parallel HDF5 relies on the underlying MPI-IO library for collective I/O performance, the key advantage of this new layer is that it provides a dedicated module for the implementation of algorithms that may be missing from the available MPI-IO libraries on a given system (and/or may have unnecessary overhead within the existing HDF5 MPIO-VFD). The optimizations implemented in this new layer may be system dependent, application dependent, and/or experimental.
![alt text](softstack.png "Software Stack")
**Fig 1.** *Relative position of the CCIO VFDL within the larger HDF5 software stack. The new optimization layer is closely integrated with the MPIO VFD. From the perspective of the user-application, CCIO-based optimizations are simply modified versions of default MPIO-driver routines.*
Currently, the existing CCIO implementation includes:
1. An optimized aggregation algorithm for *two-phase* I/O (For both *read* and *write* operations). This optimization leverages **one-sided** (RMA) communication to minimize aggregation overhead, and ultimately improve scalability. The implementation also avoids the overhead of constructing/deconstructing `MPI_Datatype` objects.
2. Topology-aware aggregator selection, based on the TAPIOCA library [1].
As illustrated in **Fig 1**, the CCIO VFDL is implemented on top of the MPIO VFD, and it is designed to intercept I/O calls to the MPIO VFD, and apply custom optimizations. The new layer can be understood as an organized collection of *new* MPIO-VFD driver routines, mostly contained in the `H5FDmpio_ccio.c` file (and headers).
![alt text](alt_diagram.png "Detailed Diagram")
**Fig 2.** *More-detailed illustration of the larger HDF5 software stack, and how CCIO fits in.*
## Call Stack Overview
An example of a typical CCIO call-stack is illustrated in **Fig 3** for the case of an `H5DWrite` call with the one-sided aggregation optimization. Currently, activation of this optimization is accomplished by setting the `HDF5_CUSTOM_AGG_WR` environment variable to `yes`. After an `H5DWrite` call, this environment variable is first queried in `H5D__contig_collective_write`, where the usual call to `H5D__inter_collective_io` is redirected to `H5F_select_write` (if `HDF5_CUSTOM_AGG_WR=yes`).
The `H5F_select_write` function takes in a data selection from memory, and passes it along to the appropriate VFD (the appropriate call occurring inside `H5FD_select_write`). Since CCIO interfaces with the HDF5 MPIO VFD, the appropriate *select_write* call is explicitly defined in the `H5FD_mpio_g` structure (which is defined in `H5FDmpio.c`):
```c
static const H5FD_class_mpi_t H5FD_mpio_g = {
    {   /* Start of superclass information */
        ...
        H5FD_mpio_custom_write,   /* select_write */
        ...
    },  /* End of superclass information */
    ...
};
```
Here, the MPIO VFD version of *select_write* is defined as `H5FD_mpio_custom_write` in `H5FD_mpio_g`. Currently, this `H5FD_mpio_custom_write` function acts as the entry point to the optimized one-sided write algorithm, and will also be a natural entry point for other CCIO-based write optimizations in the future. For read operations, a similar `H5FD_mpio_custom_read` function has been added.
Aside from the `H5FDmpio.c` file, most of the source code for CCIO is located in the `H5FDmpio_ccio.c` file (and `H5topology.h` for topology-aware aggregator selection).
![alt text](stack.png "Call Stack")
**Fig 3.** *Typical call stack for an `H5DWrite` call in a user application (with `HDF5_CUSTOM_AGG_WR` environment variable set to `yes`). The source files with key changes for the CCIO implementation are highlighted in orange. In `H5Dmpio.c`, the `H5D__<*>_collective_write` call will be routed to the `contig` or `chunk` routine, depending on the storage option chosen by the user application. The lion's share of all CCIO changes are in the `H5FDmpio_ccio.c` file.*
### Current Limitations
- The one-sided collective algorithm is only implemented for contiguous file storage. Chunked storage will use the one-sided algorithm if the `opt` mode is set to `H5FD_MPIO_CHUNK_MULTI_IO` by the user, but this approach is not performant (and not recommended). If the `MULTI` chunk mode is *not* set, the default MPIO driver will be used.
- **Note:** To set the `MULTI` chunk mode: `H5Pset_dxpl_mpio_chunk_opt(xferPropList, H5FD_MPIO_CHUNK_MULTI_IO);`
- Both *write* and *read* implementations of the one-sided aggregation algorithm can be improved. That is, there is still more room for overlapping of aggregation and IO stages.
- On ALCF's **Theta** machine (Cray XC40), the one-sided implementation may require `MPICH_MPIIO_CB_ALIGN=3`. Therefore, the ROMIO `ad_lustre` MPI-IO file driver is used (which is *not* supported by Cray).
- **Note:** This setting is not strictly required. `MPICH_MPIIO_CB_ALIGN=2` may work fine, but the initial implementation is MPICH-focused, and the Cray implementation is not well-tested yet.
- The topology-aware aggregator selection algorithm is very simple, and can only be performed at file-open time. For Theta, the distance to IO nodes is **not** yet considered during the cost optimization.
## List of Modified Files
### `H5FDmpio_ccio.c`
This file is entirely new, and dedicated to CCIO algorithms.
### `H5FDmpio.c`
The `H5FD_ccio_setup` function is added to define the appropriate variable structures if the `HDF5_CUSTOM_AGG_WR` or `HDF5_CUSTOM_AGG_RD` environment variables are set to `yes`. This function is also responsible for selecting the aggregator ranks. For BGQ, this is currently achieved by using the aggregator list provided by MPICH. For Theta/Lustre, a simple `generateRanklist` function (defined in `H5FDmpio_ccio.c`) is used. If `HDF5_CUSTOM_TOPO=yes`, the topology API is used to make an optimal selection (assuming, for now, that all ranks will send equal data to all aggregators). Since smart aggregator selection may be more beneficial when there is a smaller number of aggregators (compared to the total ranks or nodes), the user can also specify the number of aggregators to use by setting the `HDF5_CB_NODES_OVERRIDE` environment variable.
The call to `H5FD_ccio_setup` is performed inside the `H5FD_mpio_open` function, when opening the file. However, the `file->custom_agg_data.ranklist` is also passed to the appropriate `ADIO_*_WriteStridedColl_CA` function (in `H5FDmpio_ccio.c`) during each `H5DWrite` operation. Therefore, the aggregator ranklist can be further modified between operations for topology-aware aggregator placement.
We also define the `H5FD_mpio_custom_write` function (and set it as the appropriate *select_write* call in `H5FD_mpio_g`), which is responsible for actually calling the custom aggregation procedures in `H5FDmpio_ccio.c`. The `custom_write` implementation uses a helper `H5FD_mpio_setup_flatbuf` function to define the appropriate *flatbuf* structure.
### `H5Dmpio.c`
Some collective *write*/*read* functions are modified to check for the `HDF5_CUSTOM_AGG_WR` and `HDF5_CUSTOM_AGG_RD` environment variables, and the call is redirected to `H5F_select_<read,write>` if the value of the environment variable is equal to `yes`.
### `H5Smpio.c`
Definition of new functions (that are declared in `H5Sprivate.h`) to make some `H5S_t`-structure variables available to the new `H5FD_mpio_custom_write` function (defined in `H5FDmpio.c`, outside the `H5S` module):
- `H5S_mpio_return_space_rank_and_extent`
- `H5S_mpio_return_space_extent_and_select_type`
Note that `H5FDmpio.c` has access to `H5Sprivate.h` through the included `H5Dprivate.h` header file.
### `H5Dprivate.h`
See above.
### `H5Sprivate.h`
In order to expose the existing `H5S_<hyper,point,all>_get_seq_list` routines to the new `H5FD_mpio_custom_write` function, their declarations were moved to `H5Sprivate.h` from `H5Shyper.c`, `H5Spoint.c` and `H5Sall.c`.
We also define the `H5S_flatbuf_t` structure in this file, and declare `H5S_mpio_return_space_rank_and_extent` and `H5S_mpio_return_space_extent_and_select_type`. These last two functions are defined in `H5Smpio.c` (see above).
#### `H5Shyper.c`
The `H5S_hyper_get_seq_list` declaration is commented out, because it was moved to `H5Sprivate.h`.
#### `H5Spoint.c`
The `H5S_point_get_seq_list` declaration is commented out, because it was moved to `H5Sprivate.h`.
#### `H5Sall.c`
The `H5S_all_get_seq_list` declaration is commented out, because it was moved to `H5Sprivate.h`.
## One-sided Aggregation Optimization
The CCIO VFDL currently includes a one-sided implementation of the conventional two-phase collective write algorithm. Much of the implementation is borrowed from the MPICH-ROMIO version of MPI-IO. The direct implementation of the algorithm in HDF5 avoids the need to rely on the available version of MPI-IO to achieve good collective I/O performance. **Its HDF5-based implementation also avoids the performance overhead incurred by the construction/deconstruction of `MPI_Datatype` structures.** In the usual MPIO VFD, HDF5 constructs an `MPI_Datatype` from the dataset selection, and the MPI-IO implementation must then deconstruct the datatype to get a set of offset/length pairs (for highly discontiguous data, this can be expensive).
The one-sided aggregation optimization is called from the `H5FD_mpio_custom_<read,write>` function in `H5FDmpio.c` (which is used only if the `HDF5_CUSTOM_AGG_<RD,WR>` environment variable is set to `yes`). An extended `if`-`else if` control sequence spans most of the code in this function call. Each if block corresponds to a different type of file- and memory-space selections to write to: `H5S_SEL_NONE`, `H5S_SEL_ALL`, `H5S_SEL_POINTS`, or `H5S_SEL_HYPERSLABS`. Depending on the values of `mem_space_sel_type` and `file_space_sel_type`, the *memory* and *file* flatbuf structures are defined and the appropriate `H5S_<point,all,hyper>_get_seq_list` function is called to create a list of offsets and lengths for a selection. These offsets and lengths are stored in the `indices` and `blocklens` members of the `*_flatbuf` structures.
Once the `*_flatbuf` structures are defined, the `ADIOI_<LUSTRE,GPFS>_<Read,Write>StridedColl_CA` function is called (which performs the actual *aggregation* and *<read,write>* algorithms defined in the `H5FDmpio_ccio.c` file). Currently, the plan for any/all future CCIO optimizations is to implement them in the `H5FDmpio_ccio.c` file, and to call them from a `H5FD_mpio_custom_<read,write>` function. This means that all CCIO optimizations will be designed to **write** a **memory** flatbuf into a file flatbuf, or **read** a **file** flatbuf into a memory flatbuf. For now, the one-sided algorithm is activated by setting the `HDF5_CUSTOM_AGG_<RD,WR>` environment variable. However, in the future, the specific CCIO options should probably be set within a **transfer property list** (or something else that is more flexible than a collection of environment variables).
### The `ADIOI_LUSTRE_WriteStridedColl_CA` Function [ Write, LUSTRE ]
This function is mostly borrowed from the MPICH-ROMIO implementation of the function by the same name (without the `_CA` suffix) in `ad_lustre`. The procedure still looks very similar to the MPICH implementation, with `_CA` appended to the names of other functions that are also borrowed from MPICH:
- Call `ADIOI_Calc_my_off_len_CA` to return a list of absolute file offsets and associated lengths of contiguous accesses.
- Use `MPI_Allreduce` to synchronize the offsets
- Use `ADIOI_LUSTRE_Get_striping_info_CA` to get the MPI-IO hints information for the LUSTRE settings
- Finally, call `ADIOI_LUSTRE_IterateOneSidedWrite_CA` to iteratively pack stripes of data into the collective buffer and then flush the collective buffer to the file when fully packed (repeating this process until all the data is written to the file).
### The `ADIOI_LUSTRE_IterateOneSidedWrite_CA` Function [ Write, LUSTRE ]
The general algorithm here is to divide the file up into segments (with a segment being defined as a contiguous region of the file having up to one occurrence of each stripe), and the data for each stripe is written out by a particular aggregator. The `segmentLen` variable is the maximum size in bytes of each segment (`stripeSize` X number of aggs). Iteratively, the function calls `ADIOI_OneSidedWriteAggregation_CA`, for each segment, to aggregate the data into the collective buffers. However, the aggregators only do the actual write (via the `flushCB` stripe parameter) once `stripesPerAgg` stripes have been packed, or the aggregation for all the data is complete (minimizing synchronization overhead).
Each iteration of data packing (into the collective buffer) will be referred to here as an aggregation *round*.
The difference in calls to `ADIOI_OneSidedWriteAggregation_CA` is based on whether the `buftype` is contiguous. The algorithm tracks the position in the source buffer when called multiple times. In the case of contiguous data, this procedure is simple, and can be externalized with a buffer offset. In the case of non-contiguous data, it is complex, and the state must be tracked internally (therefore no external buffer offset). Care was taken to minimize changes to `ADIOI_OneSidedWriteAggregation_CA`, at the expense of some added complexity in the caller.
### The `ADIOI_GPFS_WriteStridedColl_CA` Function [ Write, GPFS ]
This function is mostly borrowed from the MPICH-ROMIO implementation of the function by the same name (without the `_CA` suffix) in `ad_gpfs`. Like the *Lustre* version of the same algorithm, the procedure still looks very similar to the MPICH implementation, with `_CA` appended to the names of other functions that are also borrowed from MPICH.
- Call `ADIOI_Calc_my_off_len_CA` to return a list of absolute file offsets and associated lengths of contiguous accesses.
- Use `MPI_Allreduce` to synchronize the offsets
- Use `ADIOI_GPFS_Calc_file_domains_CA` to compute a dynamic access-range-based file domain partition among I/O aggregators, which align to the GPFS block size.
- The I/O workload is divided among `nprocs_for_coll` processes. This is done by (logically) dividing the file into file domains (FDs); each process may directly access only its own file domain. Additional effort makes sure that each I/O aggregator gets a file domain that aligns to the GPFS block size. So, there will not be any false sharing of GPFS file blocks among multiple I/O nodes.
- Finally, call `ADIOI_OneSidedWriteAggregation_CA` once for the entire write operation (Note that there is no GPFS-version of the `ADIOI_LUSTRE_IterateOneSidedWrite_CA` function).
### The `ADIOI_OneSidedWriteAggregation_CA` Function [ *Mostly* FS-Agnostic ]
Much of this function is borrowed from the MPICH-ROMIO implementation of the function by the same name (without the `_CA` suffix), in `onesided_aggregation.c`.
The `ADIOI_OneSidedWriteAggregation_CA` algorithm is called once for each segment of data, with a segment being defined as a contiguous region of the file which is the size of one striping unit times the number of aggregators. For Lustre, the striping unit corresponds to the actual file stripe; for GPFS, it corresponds to the file domain. Each call effectively packs one striping unit of data into the collective buffer on each aggregator, with additional parameters that govern when to flush the collective buffer to the file. Therefore, in practice, a collective write (on a file system like Lustre, for a dataset composed of multiple segments) would call the algorithm several times without a flush parameter to fill the collective buffers with multiple stripes of data, before calling it again to flush the collective buffers to the file system. In this fashion, synchronization can be minimized, because the flush only needs to occur during the actual read from or write to the file system. **In the case of GPFS, this function is called just once.** The `ADIOI_OneSidedStripeParms_CA` parameter is used to save state and re-use variables across the repeated calls, helping the Lustre implementation avoid costly recomputation. For consistency, GPFS uses these variables as well, though it does not use all of them.
Note that this function was originally written for GPFS only and then modified to support Lustre (following a very similar approach to MPICH).
The implementation of `ADIOI_OneSidedWriteAggregation_CA` currently spans **~1240** lines in the `H5FDmpio_ccio.c` file. To ease maintenance and future feature additions, the function should most likely be refactored. However, the MPICH version of a similar algorithm is also very long.
## References
[1] François Tessier, Venkatram Vishwanath, Emmanuel Jeannot - TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers - IEEE Cluster 2017, Honolulu, HI (Sept. 2017)
`Makefile-debug` (BG/Q):

```makefile
default: ${EXE}

exerciser.o: exerciser.c
	mpixlc -c -g -O3 -qlanglvl=extc99 -I${HDF5_INSTALL_DIR}/include exerciser.c -o exerciser.o

${EXE}: exerciser.o
	mpixlc exerciser.o -o ${EXE} -L${HDF5_INSTALL_DIR}/lib -lhdf5 -L/soft/libraries/alcf/current/xl/ZLIB/lib -lz -ldl -lm

clean:
	rm -f exerciser.o
	rm -f ${EXE}
```
## Exerciser (ALCF VESTA)
These are instructions for building the parallel HDF5 exerciser, and
running a basic exerciser test on Vesta (ALCF). Similar instructions can be used for other BG/Q systems (e.g. ALCF Mira).
### (Optional) Building CCIO Version of HDF5
```shell
export HDF5_ROOT=<your-desired-root-directory>
mkdir $HDF5_ROOT
cd $HDF5_ROOT
mkdir exerciser
mkdir library
mkdir gitrepos
cd gitrepos
git clone
git clone
cd hdf5-ccio-develop
git checkout ccio
cd ../library
mkdir build
mkdir install
cd install
mkdir opt
mkdir debug
cd ../build
mkdir opt
mkdir debug
cd debug
cp -rL $HDF5_ROOT/xgitlabrepos/CustomCollectiveIO/* .
cp $HDF5_ROOT/xgitlabrepos/BuildAndTest/Exerciser/BGQ/do-configure-debug .
cp $HDF5_ROOT/xgitlabrepos/BuildAndTest/Exerciser/BGQ/do-make .
cp $HDF5_ROOT/xgitlabrepos/BuildAndTest/Exerciser/BGQ/do-mpirun .
```
Then modify `do-configure-debug`: change the '' to your email id, and set the install prefix to your install dir - eg:
Then run autogen and configure:

```shell
export PATH=/soft/buildtools/autotools/feb2015/bin:$PATH
```

The configure script must be run in COBALT, since it needs to run MPI programs on the back-end during configuration -- for this example, the Performance allocation is used:

```shell
qsub -A Performance ./do-configure-debug
```

Once that completes, modify `do-make` and change the email id to yours, then run it to build:

```shell
qsub -A Performance ./do-make
```

Then run the install on the front-end:

```shell
make install
```
Once the HDF5 library is built - it should be here:
Go ahead and build the exerciser:

```shell
cd $HDF5_ROOT/exerciser
mkdir run
mkdir build
cd build
cp $HDF5_ROOT/xgitlabrepos/BuildAndTest/Exerciser/BGQ/Makefile-debug .
cp $HDF5_ROOT/xgitlabrepos/BuildAndTest/Exerciser/exerciser.c .
make HDF5_INSTALL_DIR=$HDF5_ROOT/library/install/debug -f Makefile-debug
cp hdf5Exerciser-ofi-debug-mpitrace ../run
```
Then a sample run on, say, 32 nodes:

```shell
cd ../run
qsub -A Performance -t 29 --nodecount 32 --mode c16 --cwd $HDF5_ROOT/exerciser/run --env RUNJOB_LABEL=short:HDF5_CUSTOM_AGG_DEBUG=yes:HDF5_CUSTOM_AGG=yes ./hdf5Exerciser-ofi-debug-mpitrace --metacoll --derivedtype --addattr --minbuf 256 --maxbuf 4194304
```

Note the `HDF5_CUSTOM_AGG` and `HDF5_CUSTOM_AGG_DEBUG` env vars, which should have yes/no values.
I have followed these instructions and set everything up here, to use as a reference:

```shell
cd /home/zamora/hdf5_root_dir/library/build/opt-g-ccio-xl
cp -rL /home/zamora/hdf5_root_dir/xgitlabrepos/hdf5-ccio-develop/* .
export PATH=/soft/buildtools/autotools/feb2015/bin:$PATH
# If the do-scripts are available.. (should add these to BuildAndTest repo)
cp ../do-scripts/do-configure .
cp ../do-scripts/do-make .
cp ../do-scripts/do-mpirun .
qsub -A datascience ./do-configure
qsub -A datascience ./do-make
make install
```
### Building the exerciser
In the ``Makefile`` (in the same directory as this README file), you need to set ``$HDF5_INSTALL_DIR`` to the location where you have installed HDF5. Then, you can build the exerciser in a root directory of your choice (``$HDF5_ROOT``):

```shell
mkdir $HDF5_ROOT
cd $HDF5_ROOT
git clone
mkdir exerciser
cd exerciser
mkdir run
mkdir build
cd build
cp $HDF5_ROOT/BuildAndTest/Exerciser/Cray/Makefile .
ln -s $HDF5_ROOT/BuildAndTest/Exerciser/exerciser.c .
```
**Note**: For the remaining instructions, we assume that ``$HDF5_ROOT`` (and therefore ``$HDF5_ROOT/exerciser/run``) was created on the *Lustre* file system. If this is not true, the same instructions should be followed using a stand-alone ``run/`` directory that is in a ``/projects`` subdirectory, rather than ``$HDF5_ROOT``.
### Running a simple example script
A simple COBALT script example can be found at:
To use this script, you can copy it to the ``run`` directory, where you can also create a symbolic link to the exerciser executable (or copy it there, if you prefer):
```shell
cd $HDF5_ROOT/exerciser/run
ln -s $HDF5_ROOT/exerciser/build/hdf5Exerciser .
cp $HDF5_ROOT/BuildAndTest/Exerciser/Cray/ .
```
Since this is a Lustre file system, we need to define the desired stripe settings. For example, we can make a directory with stripe size 8 MB and stripe count 32:

```shell
mkdir stripecount32size8
cd stripecount32size8
lfs setstripe -c 32 -S 8m .
```
Then call the sample script as follows (to run on 8 nodes at 32 ppn in the debug-cache-quad queue, for example):

```shell
qsub --mode script -n 8 -t 15 -A Performance -q debug-cache-quad ../ ../hdf5Exerciser 32
```
The `do-configure` script:

```shell
#COBALT -q default -t 60 -n 1 -O LOG.configure
# !!!!!!
# BG/Q This is intended to be submitted via "qsub do-configure"
if [ -z $COBALT_PARTNAME ] ; then
  echo "Not in job, submitting self"
  exec qsub $0 "$@"
fi
echo "pwd is" `pwd`
echo "------Env begin-------------------"
echo "------Env end-------------------"
echo "Driver version:"
/bin/ls -l /bgsys/drivers/ppcfloor
# echo
# echo "Saving output"
# >LOG.configure.efix_version
export CC=mpixlc_r
#export CFLAGS='-O3 -qlanglvl=extended'
export CFLAGS='-O3 -qlanglvl=extc99'
# expecting INCLUDE_PATH to be e.g. /bgsys/drivers/ppcfloor/comm/xl/include
echo "C compiler version:"
$CC -qversion=verbose
export F9X=mpixlf90_r
export FFLAGS="-O3 -qsuffix=f=f90"
echo "F9X compiler version:"
$F9X -qversion=verbose
export hdf5_cv_gettimeofday_tz=yes
export hdf5_cv_system_scope_threads=no
# Note:
# RUNSERIAL is used in src/ and points to
# a wrapper "do-mpirun".
# Depending on whether we are working inside a script job or
# from a normal login node, "do-mpirun" can be set to
# perform a cobalt-mpirun or a bg_run, respectively.
export RUNSERIAL="$PWD/do-mpirun --np 1 : "
export RUNPARALLEL="$PWD/do-mpirun --np 4 : "
# echo "****Using do-mpirun.configure"
# cp -f do-mpirun.configure do-mpirun
# ./configure --build powerpc32-unknown-gnu --host powerpc-suse-linux \
#   --without-pthread --disable-shared --enable-parallel \
#   --disable-cxx --disable-stream-vfd --enable-fortran
#   --enable-threadsafe --with-pthread=DIR
#   --enable-cxx --disable-parallel \
set -x
/home/zamora/hdf5_root_dir/library/build/opt-g-ccio-xl/configure --without-pthread --disable-shared --enable-fortran \
  --disable-cxx --enable-parallel \
  --with-zlib=/soft/libraries/alcf/current/xl/ZLIB \
  --prefix=/home/zamora/hdf5_root_dir/library/install/opt-g-ccio-xl
status=$?
set +x
echo "configure is finished with status $status"
# echo "Calling cobalt-mpirun -free wait"
# cobalt-mpirun -free wait
# echo "Done calling cobalt-mpirun -free wait"
exit $status
```
The `do-make` script:

```shell
#COBALT -n 1 -t 60 -q default -O LOG.make
if [ -z $COBALT_PARTNAME ] ; then
  echo "Not in job, submitting self"
  exec qsub $0 "$@"
fi
make 2>&1 $1
status=$?
echo "do-make: make completed with status: $status"
exit $status
```
The `do-mpirun` wrapper:

```shell
# use this within scripts
set -x
# do-mpirun --np NNNN : prog args ...
runjob --block $COBALT_PARTNAME --verbose 2 --envs PAMID_VERBOSE=1 -p 16 "$@"
# bg_run needs a patch, don't use this now
# TIME=30 NODES=64 QUEUE=default MODE=smp bg_run $*
# important - exit status of this script must be exit status of the runjob
```
The purpose of this repo is to contain code and instructions for building
and running performance tests exercising various HDF5 optimizations across
various architectures.
# HDF5 Exerciser Benchmark
- Richard J. Zamora (
- Paul Coffman (
- December 13th 2018 (Version 2.0)
## Exerciser Overview
The **HDF5 Exerciser Benchmark** creates an HDF5 use case with some ideas/code borrowed from other benchmarks (namely `IOR`, `VPICIO` and `FLASHIO`). Currently, the algorithm does the following in parallel over all MPI ranks:
- For each rank, a local data buffer (with dimensions given by `numDims`) is initialized with `minNEls` double-precision elements in each dimension.
- If the `--derivedtype` flag is used, a second local dataset is also specified, with a *derived* data type assigned to each element.
- For a given number of iterations (hardcoded as `NUM_ITERATIONS`):
- Open a file, create a top group, set the `MPI-IO` transfer property, and (optionally) add a simple attribute string to the top group
- Create *memory* and *file* dataspaces with hyperslab selections for simple rank-ordered offsets into the file. The `-rshift` option can be used to specify the number of rank positions to shift the write position in the file (the read will be shifted *twice* this amount to avoid client-side caching effects)
- Write the data and close the file
- Open the file, read in the data, and check correctness (if dataset is small enough)
- Close the dataset (but not the file)
- If the second (derived-type) data set is specified: (1) create a derived type, (2) open a new data set with the same number of elements and dimension, (3) write the data and (4) close everything.
- Each dimension of `curNEls` is then multiplied by each dimension of `bufMult`, and the previous steps (the loop over `NUM_ITERATIONS`) are repeated. This outer loop over local buffer sizes is repeated a total of `nsizes` times.
### Command-line Arguments (Options)
#### Required
- **``--numdims <x>``:** Number of dimensions for the datasets written to the HDF5 file
- **``--minels <x> ... <x>``:** Min number of double elements to write in each dim of the dataset (one value for each dimension)
- **``--nsizes <x>``:** How many buffer sizes to use (Code will start with ``minbuf`` and loop through ``nsizes`` iterations, with the buffer size multiplied by ``bufmult`` in each dim, for each iteration)
#### *Optional*
- ``--bufmult <x> ... <x>``: Constant, for each dimension, used to multiply the buffer [default: *2* *2* ... ]
- ``--metacoll``: Whether to set meta data collective usage [default: *False*]
- ``--derivedtype``: Whether to create a second data set containing a derived type [default: *False*]
- ``--indepio``: Whether to use independent I/O (not MPI-IO) [default: *False*]
- ``--keepfile``: Whether to keep the file around after the program ends for further analysis, otherwise deletes it [default: *False*]
- ``--usechunked``: Whether to *chunk* the data when reading/writing [default: *False*]
- ``--perf``: Whether to collect detailed custom aggregation performance info (works for CCIO branch of HDF5 only) [default: *False*]
- ``--nowrite``: Whether to skip the `MPI_File_write_at` calls within the CCIO algorithm (works for CCIO branch of HDF5 only, DISABLES read and verification tests) [default: *False*]
- ``--usemem <d>``: Fraction of memory to allocate (to mimic a memory-hungry application). Note that, for now, the size of the write/read buffers is subtracted from this value (**EXPERIMENTAL**) [default: *0.0*]
- ``--maxcheck <x>``: Maximum buffer size (in bytes) to validate. Note that **all** buffers will be validated if this option is **not** set by this command-line argument [default: *Inf*]
- ``--memblock <x>``: Define the block size to use in the local memory buffer (local buffer is always 1D for now; *Note*: this currently applies to the 'double' dataset only) [default: *local buffer size*]
- ``--memstride <x>``: Define the stride of the local memory buffer (local buffer is always 1D for now; *Note*: this currently applies to the 'double' dataset only) [default: *local buffer size*]
The exerciser also allows the MPI decomposition to be explicitly defined:
- ``--dimranks <x>...<x>``: (one value for each dimension) mpi-rank division in each dimension. Note that, if not set, decomposition will be in 1st dimension only
### Exerciser Basics
In the simplest case, the Exerciser code will simply write and then read an n-dimensional double-precision dataset in parallel (with all the necessary HDF5 steps in between). At a minimum, the user must specify the **number** of dimensions to use for this dataset (using the `--numdims` flag), and the **size** of each dimension (using the `--minels` flag). By default, the maximum number of dimensions allowed by the code is set by `MAX_DIM` (currently 4, but it can be modified easily). Note that the user is specifying the number of elements to use in each dimension with `--minels`. Therefore, the local buffer size is the product of the dimension sizes and `sizeof(double)` (and the global dataset in the file is a product of the total MPI ranks and the local buffer size). As illustrated in **Fig. 1**, the mapping of ranks to hyper-slabs in the global dataset can be specified with the `--dimranks` flag (here, *Example 1* is the default decomposition, while *Example 2* corresponds to: `--dimranks 2 2`). This flag simply allows the user to list the number of spatial decompositions in each dimension of the global dataset, and it requires the product of the inputs to equal the total number of MPI ranks.
**Fig. 1 - Illustration of different local-to-global dataset mapping options:**
![alt text](dimranks.png "Illustration of local-to-global dataset mapping.")
If the user wants to loop through a range of buffer sizes, the `--nsizes` flag can be used to specify how many sizes to measure, and the `--bufmult` flag can be used to specify the multiplication factor for each dimension between each loop. For example, if the user wanted to test 64x64, 128x128, and 256x256-element local datasets on 32 ranks, they could use the following command to run the code:

```shell
mpirun -np 32 ./hdf5Exerciser --numdims 2 --minels 64 64 --nsizes 3 --bufmult 2 2 --dimranks 8 4
```
When executed for a single local-buffer size (default), the Exerciser output will look something like this:
```
useMetaDataCollectives: 0 addDerivedTypeDataset: 0 addAttributes: 0 useIndependentIO: 0 numDims: 1 useChunked: 0 rankShift: 4096
Metric   Bufsize   H5DWrite   RawWrBDWTH    H5Dread   RawRdBDWTH   Dataset    Group    Attribute   H5Fopen   H5Fclose  H5Fflush  OtherClose
Min      32768     0.134616   3058.154823   0.191049  2534.613015  0.361010  0.551608  0.000001   0.224550  0.127877  0.210821  0.000755
Med      32768     0.143874   3554.180478   0.191684  2670.829718  0.379858  0.612309  0.000001   0.236735  0.132450  0.228889  0.000761
Max      32768     0.167421   3803.418460   0.202003  2679.939135  0.405620  0.679779  0.000002   0.268622  0.138463  0.270188  0.000785
Avg      32768     0.146435   3506.598052   0.192068  2666.021346  0.379799  0.616157  0.000001   0.237219  0.132410  0.233730  0.000763
Std      32768     0.008055    185.366133   0.002090    27.665058  0.010248  0.026048  0.000000   0.008915  0.002650  0.017362  0.000006
```
Using `NUM_ITERATIONS` samples for each local buffer size (`Bufsize`), the minimum, median, maximum, average, and standard deviation of all metrics will be reported in distinct rows of the output. The `Bufsize` values are reported in **bytes**, while the `RawWrBDWTH` and `RawRdBDWTH` are in **MB/s**, and all other metrics are in **seconds**.
## Building Exerciser
Given the path to a parallel HDF5 installation, building the Exerciser benchmark is straightforward. The following Makefile can be used as a reference:
```makefile
default: hdf5Exerciser

exerciser.o: exerciser.c
	mpicc -c -g -DMETACOLOK -I${HDF5_INSTALL_DIR}/include exerciser.c -o exerciser.o

hdf5Exerciser: exerciser.o
	mpicc exerciser.o -o hdf5Exerciser -L${HDF5_INSTALL_DIR}/lib -lhdf5 -lz

clean:
	rm -f exerciser.o
	rm -f hdf5Exerciser
```