Commit 23804dd9 authored by Huihuo Zheng
added readme

parent 807b579e
# Incorporating node-local storage in HDF5
This folder contains a prototype of system-aware HDF5 that incorporates node-local
storage. We developed it for workflows that perform multiple reads.
## Source files
* H5Dio_cache.c, H5Dio_cache.h -- source code for incorporating node-local storage into parallel HDF5 read and write.
* test_read_cache.cpp -- test code for read
* test_write_cache.cpp -- test code for write
* prepare_dataset.cpp -- prepares the dataset for the read test.
## Function APIs
* H5Dwrite_cache -- writes data to node-local storage; a pthread then moves the data to the parallel file system
* H5Dread_to_cache -- reads data from the parallel file system and writes the buffer to the local storage
* H5Dread_from_cache -- reads data directly from the local storage
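
The data flow behind these three functions can be illustrated with a small Python sketch. It is not the code in H5Dio_cache.c: the function names, the use of plain files instead of HDF5 datasets, and the two temporary directories standing in for node-local storage and the parallel file system are all assumptions made for illustration.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def write_with_cache(data: bytes, name: str, local_dir: Path, pfs_dir: Path) -> threading.Thread:
    """Write to node-local storage, then copy to the parallel file system in a thread."""
    local_path = local_dir / name
    local_path.write_bytes(data)  # fast write to node-local storage

    def flush() -> None:
        # Background step: move the cached data to the (slower) parallel file system.
        shutil.copyfile(local_path, pfs_dir / name)

    worker = threading.Thread(target=flush)
    worker.start()
    return worker  # join this before relying on the copy in pfs_dir


def read_to_cache(name: str, local_dir: Path, pfs_dir: Path) -> bytes:
    """Read from the parallel file system and keep a copy on node-local storage."""
    data = (pfs_dir / name).read_bytes()
    (local_dir / name).write_bytes(data)
    return data


def read_from_cache(name: str, local_dir: Path) -> bytes:
    """Read directly from node-local storage."""
    return (local_dir / name).read_bytes()


# Hypothetical stand-ins for the two storage tiers.
_tmp = tempfile.TemporaryDirectory()
local_dir = Path(_tmp.name) / "local"
pfs_dir = Path(_tmp.name) / "pfs"
local_dir.mkdir()
pfs_dir.mkdir()

worker = write_with_cache(b"B0", "block.bin", local_dir, pfs_dir)
worker.join()  # wait for the background copy to finish
```

The caller gets control back as soon as the local write finishes, which is the point of the write path; the join marks where the data is guaranteed to be on the parallel file system.
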
## Parallel HDF5 Write incorporating node-local storage
test_write_cache.cpp is the benchmark code for evaluating the write performance. In this test case, each MPI rank I has a local
buffer BI to be written into an HDF5 file organized in the following way (shown for four ranks): [B0|B1|B2|B3]|[B0|B1|B2|B3]|...|[B0|B1|B2|B3]. The number of repetitions of [B0|B1|B2|B3] is the number of iterations.
* --dim: dimension of the 2D array BI (this is the local buffer size)
* --niter: number of iterations. Note that the data is written to the file cumulatively.
* --scratch: the location of the raw data
* --sleep: sleep between iterations

Environment variables:
* SSD_CACH=yes -- environmental variable to turn on the cache
* SSD_PATH -- environmental variable setting the path of the
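
The file layout described in this section can be made concrete with a short sketch. The helper names and the assumption that every rank writes a block of the same number of elements are illustrative and not taken from test_write_cache.cpp.

```python
def block_offset(iteration: int, rank: int, nranks: int, block_elems: int) -> int:
    """Element offset of buffer B<rank> written at a given iteration.

    Assumes every rank writes a block of block_elems elements and that blocks
    are laid out rank by rank within an iteration, iteration after iteration.
    """
    return (iteration * nranks + rank) * block_elems


def layout(nranks: int, niter: int) -> str:
    """Render the layout in the notation used in this section."""
    group = "[" + "|".join(f"B{r}" for r in range(nranks)) + "]"
    return "|".join([group] * niter)


# Four ranks, two iterations, 10 elements per block (hypothetical sizes).
print(layout(nranks=4, niter=2))
print(block_offset(iteration=1, rank=2, nranks=4, block_elems=10))
```

With these sizes, rank 2 in the second iteration starts after the four blocks of the first iteration plus two blocks of the second, that is at element 60.
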
## Parallel HDF5 Read incorporating node-local storage
**Preparing the dataset**

The benchmark relies on a dataset stored in an HDF5 file. One can generate the
dataset using prepare_dataset.py or prepare_dataset.cpp. For example,

    python prepare_dataset.py --num_images 8192 --sz 224 --output images.h5

This will generate an HDF5 file, images.h5, which contains 8192 samples, each of shape 224 x 224 x 3 (an image-based dataset).
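
To get a feel for how large this dataset is, the arithmetic can be worked through as below. The element count follows from the numbers above; the 4 bytes per element (for example a 32-bit float) is an assumption, since the element type is not stated here.

```python
num_images = 8192
sz = 224
channels = 3

# Elements per sample and in the whole dataset.
sample_elems = sz * sz * channels
total_elems = num_images * sample_elems

# Assumption: 4 bytes per element; adjust to the actual datatype.
bytes_per_elem = 4
total_bytes = total_elems * bytes_per_elem
total_gib = total_bytes / 2**30

print(sample_elems, total_elems, total_bytes, total_gib)
```

Under that assumption the file holds about 4.6 GiB of raw data, which is useful to compare against the capacity of the node-local storage.
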
**Benchmarks**
test_read_cache.cpp is the benchmark code for evaluating the performance.
* --input: HDF5 file
* --dataset: the name of the dataset in the HDF5 file
* --num_epochs [Default: 2]: Number of epochs (at each epoch/iteration, we sweep through the dataset)
* --num_batches [Default: 16]: Number of batches to read per epoch
* --batch_size [Default: 32]: Number of samples per batch
* --shuffle [Default: False]: Whether to shuffle the samples at the beginning of each epoch.
* --cache [Default: False]: Whether the local storage cache is turned on. If False, every epoch reads from the file system.
* --local_storage [Default: ./]: The path of the local storage.
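
How these options interact can be sketched with a small simulation that only counts where each sample would be read from. It is a simplified model, not the logic of test_read_cache.cpp: the function name, the in-memory set standing in for the local storage, and the rule that a sample is cached on its first read are assumptions.

```python
import random


def run_epochs(num_samples, num_epochs=2, num_batches=16, batch_size=32,
               shuffle=False, cache=False, seed=0):
    """Count reads served by the file system vs. the local storage."""
    rng = random.Random(seed)
    cached = set()  # stand-in for samples already on local storage
    counts = {"file_system": 0, "local_storage": 0}
    for _ in range(num_epochs):
        order = list(range(num_samples))
        if shuffle:
            rng.shuffle(order)  # reshuffle at the beginning of each epoch
        for b in range(num_batches):
            batch = order[b * batch_size:(b + 1) * batch_size]
            for idx in batch:
                if cache and idx in cached:
                    counts["local_storage"] += 1
                else:
                    counts["file_system"] += 1
                    if cache:
                        cached.add(idx)  # assumed: cache on first read
    return counts


# Defaults: 16 batches x 32 samples = 512 samples per epoch, 2 epochs.
print(run_epochs(512, cache=True))
print(run_epochs(512, cache=False))
```

In this model, with the cache on, only the first epoch touches the file system; with the cache off, every epoch does.
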