Darshan modularization branch development notes
===============================================

== Introduction

Darshan is a lightweight toolkit for characterizing the I/O performance of instrumented
HPC applications.

Darshan was originally designed to gather I/O data from a static set of sources.
Adding instrumentation for additional sources of I/O data was only possible through
manual modification of the Darshan log file format, which consequently breaks
any other utilities reliant on that format.

Starting with version 3.0.0, the Darshan runtime environment and log file format have
been redesigned such that new "instrumentation modules" can be added without breaking
existing tools. Developers are given a framework to implement arbitrary instrumentation
modules, each of which is responsible for gathering I/O data from a specific system component
(for example, an I/O library or a platform-specific data source). Darshan can then
manage these modules at runtime and create a valid Darshan log regardless of how many
or what types of modules are used.

== Checking out and building the modularization branch

Developers can clone the Darshan source repository using the following methods:

* "git clone git://git.mcs.anl.gov/radix/darshan" (read-only access)

* "git clone \git@git.mcs.anl.gov:radix/darshan" (authenticated access)

After cloning the Darshan source, it is necessary to check out the modularization
development branch:

----
git clone git@git.mcs.anl.gov:radix/darshan
cd darshan
git checkout dev-modular
----

For details on configuring and building the Darshan runtime and utility repositories,
consult the documentation from previous versions
(http://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html[darshan-runtime] and
http://www.mcs.anl.gov/research/projects/darshan/docs/darshan-util.html[darshan-util]) -- the
necessary steps for building these repositories should not have changed in the new version of
Darshan.

== Darshan dev-modular overview

The Darshan source tree is organized into two primary components:

* *darshan-runtime*: Darshan runtime environment necessary for instrumenting MPI
applications and generating I/O characterization logs.

* *darshan-util*: Darshan utilities for analyzing the contents of a given Darshan
I/O characterization log.

The following subsections provide an overview of each of these components with specific
attention to how new instrumentation modules may be integrated into Darshan.

=== Darshan-runtime

The primary responsibilities of the darshan-runtime component are:

* intercepting I/O functions of interest from a target application;

* extracting statistics, timing information, and other data characterizing the application's I/O workload;

* compressing I/O characterization data and corresponding metadata;

* logging the compressed I/O characterization to file for future evaluation.

The first two responsibilities are the burden of the instrumentation module developer, while the last
two are handled automatically by Darshan.

==== Instrumentation modules

The wrapper functions used to intercept I/O function calls of interest are central to the design of
any Darshan instrumentation module. These wrappers extract pertinent I/O data from
the function call and persist this data in a state structure maintained by the module. The wrappers
are inserted at link time for statically linked executables (e.g., using the linker's `--wrap`
mechanism) and at runtime for dynamically linked executables (using `LD_PRELOAD`).
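As a rough sketch of the wrapper pattern, the following self-contained C example wraps a stand-in I/O function and records data about each call in module state. The names `example_real_write`, `example_wrap_write`, and `struct example_state` are hypothetical and are not part of Darshan. A real module would have its wrapper wired in through the linker's `--wrap` option or `LD_PRELOAD` rather than called directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-module state; not a Darshan structure. */
struct example_state {
    int64_t write_calls;    /* number of intercepted calls */
    int64_t bytes_written;  /* total bytes reported by the real function */
};

static struct example_state example_state_g;

/* Stand-in for the real I/O function that would be intercepted. */
static long example_real_write(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);
    return (long)len;
}

/* Wrapper: call the real function, then record data about the call.
 * Only in-memory bookkeeping happens here, no I/O or communication. */
long example_wrap_write(char *dst, const char *src, size_t len)
{
    long ret = example_real_write(dst, src, len);

    example_state_g.write_calls++;
    if (ret > 0)
        example_state_g.bytes_written += ret;
    return ret;
}

/* Accessors so callers can inspect the recorded counters. */
int64_t example_write_calls(void) { return example_state_g.write_calls; }
int64_t example_bytes_written(void) { return example_state_g.bytes_written; }
```

The wrapper returns whatever the wrapped function returned, so the instrumented application should see the same behavior with or without the instrumentation.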

*NOTE*: Modules should not perform any I/O or communication within wrapper functions. Darshan records
I/O data independently on each application process, then merges the data from all processes when the
job is shutting down. This defers expensive I/O and communication operations to the shutdown process,
limiting Darshan's impact on application I/O performance.

When the instrumented application terminates and Darshan begins its shutdown procedure, it requires
a way to interface with any active modules that have data to contribute to the output I/O characterization.
Darshan requires that module developers implement the following functions to allow the Darshan runtime
environment to coordinate with modules while shutting down:

[source,c]
struct darshan_module_funcs
{
    void (*disable_instrumentation)(void);
    void (*prepare_for_reduction)(
        darshan_record_id *shared_recs,
        int *shared_rec_count,
        void **send_buf,
        void **recv_buf,
        int *rec_size
    );
    void (*reduce_records)(
        void* a,
        void* b,
        int *len,
        MPI_Datatype *datatype
    );
    void (*get_output_data)(
        void** buf,
        int* size
    );
    void (*shutdown)(void);
};

`disable_instrumentation()`

This function informs the module that Darshan is about to begin shutting down. It should disable
all wrappers and stop updating internal data structures to ensure data consistency and avoid
other race conditions.

`prepare_for_reduction()`

Since Darshan aggregates shared data records (i.e., records which all application processes
accessed) into a single record, module developers must provide mechanisms for performing a reduction
on these records.

This function is used to prepare a module for performing a reduction operation. In general, this
just involves providing the input buffers to the reduction, and (on rank 0 only) providing output
buffer space to store the result of the reduction.

* _shared_recs_ is a set of Darshan record identifiers which are associated with this module.
These are the records which need to be reduced into single shared data records.

* _shared_rec_count_ is a pointer to an integer storing the number of shared records that will
be reduced by this module. When the function is called, this variable points to the number
of shared records detected by Darshan, but the module may choose to exclude any number
of these records from the reduction. Upon completion of the function, this variable should point to the number
of shared records to perform reductions on (i.e., the size of the input and output buffers).

* _send_buf_ is a pointer to the address of the send buffer used for performing the reduction
operation. Upon completion, this variable should point to a buffer containing *_shared_rec_count_
records that will be reduced.

* _recv_buf_ is a pointer to the address of the receive buffer used for performing the reduction
operation. Upon completion, this variable should point to a buffer with space to store the
*_shared_rec_count_ records that result from the reduction. This variable is only valid on the root process (rank 0). This
buffer address needs to be stored with module state, as it will be needed when retrieving
the final output buffers from this module.

* _rec_size_ is the size of the record structure being reduced for this module.

`reduce_records()`

This is the function which performs the actual shared record reduction operation. The prototype
of this function matches that of the user function provided to the MPI_Op_create function. Refer
to the http://www.mpich.org/static/docs/v3.1/www3/MPI_Op_create.html[documentation] for further
details.
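To illustrate the shape of such a reduction function, here is a minimal sketch that merges per-process records by summing counters and keeping the largest timer value. The record layout `struct example_record` and the function name are hypothetical, and `MPI_Datatype` is replaced by a stand-in typedef so the example compiles without `mpi.h`. The sketch also assumes that both buffers list the shared records in the same order.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in so the sketch compiles without mpi.h; real code uses MPI's type. */
typedef int MPI_Datatype;

/* Hypothetical record layout; a real module defines its own. */
struct example_record {
    uint64_t rec_id;
    int64_t  reads;
    int64_t  bytes_read;
    double   max_read_time;
};

/* Combine each record in invec into the matching record in inoutvec,
 * following the argument order of an MPI user-defined reduction function. */
void example_reduce_records(void *invec, void *inoutvec, int *len,
                            MPI_Datatype *datatype)
{
    struct example_record *in = invec;
    struct example_record *inout = inoutvec;
    int i;

    (void)datatype; /* unused in this sketch */

    for (i = 0; i < *len; i++) {
        /* Assumption: both buffers hold the same records in the same order. */
        assert(in[i].rec_id == inout[i].rec_id);

        inout[i].reads += in[i].reads;
        inout[i].bytes_read += in[i].bytes_read;
        if (in[i].max_read_time > inout[i].max_read_time)
            inout[i].max_read_time = in[i].max_read_time;
    }
}
```

Which fields are summed and which take a minimum or maximum depends on what each counter means in a given module.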

`get_output_data()`

This function is responsible for passing back a single buffer storing all data this module is
contributing to the output I/O characterization. On rank 0, this may involve copying the results
of the shared record reduction into the output buffer.

* _buf_ is a pointer to the address of the buffer this module is contributing to the I/O
characterization. 

* _size_ is the size of this module's output buffer.

`shutdown()`

This function is a signal from Darshan that it is safe to shut down. It should clean up and free
all internal data structures.
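Putting these pieces together, a toy module might fill in the function table roughly as follows. This is a sketch under stated assumptions: `darshan_record_id` and `MPI_Datatype` are stand-in typedefs so the example compiles without `darshan.h` or `mpi.h`, every `example_*` name is hypothetical, and the module opts out of reducing any shared records by setting the count to zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins so the sketch compiles without darshan.h or mpi.h. */
typedef uint64_t darshan_record_id;
typedef int MPI_Datatype;

struct darshan_module_funcs {
    void (*disable_instrumentation)(void);
    void (*prepare_for_reduction)(darshan_record_id *shared_recs,
        int *shared_rec_count, void **send_buf, void **recv_buf,
        int *rec_size);
    void (*reduce_records)(void *a, void *b, int *len,
        MPI_Datatype *datatype);
    void (*get_output_data)(void **buf, int *size);
    void (*shutdown)(void);
};

/* Hypothetical module state. */
struct example_record { darshan_record_id id; int64_t ops; };
static struct example_record *example_recs;
static int example_rec_count;
static int example_enabled;

/* Stop wrappers from touching module state. */
static void example_disable(void) { example_enabled = 0; }

static void example_prepare(darshan_record_id *shared_recs,
    int *shared_rec_count, void **send_buf, void **recv_buf, int *rec_size)
{
    (void)shared_recs;
    *shared_rec_count = 0; /* this toy module reduces no shared records */
    *send_buf = NULL;
    *recv_buf = NULL;
    *rec_size = (int)sizeof(struct example_record);
}

static void example_reduce(void *a, void *b, int *len, MPI_Datatype *dt)
{
    (void)a; (void)b; (void)len; (void)dt; /* nothing to reduce */
}

/* Hand back one buffer holding all of this module's records. */
static void example_get_output(void **buf, int *size)
{
    *buf = example_recs;
    *size = example_rec_count * (int)sizeof(struct example_record);
}

static void example_shutdown(void)
{
    free(example_recs);
    example_recs = NULL;
    example_rec_count = 0;
}

struct darshan_module_funcs example_funcs = {
    .disable_instrumentation = example_disable,
    .prepare_for_reduction = example_prepare,
    .reduce_records = example_reduce,
    .get_output_data = example_get_output,
    .shutdown = example_shutdown
};

/* Allocate state for n records; returns 0 on success, -1 on failure. */
int example_init(int n)
{
    example_recs = calloc((size_t)n, sizeof(*example_recs));
    if (!example_recs)
        return -1;
    example_rec_count = n;
    example_enabled = 1;
    return 0;
}

int example_is_enabled(void) { return example_enabled; }
```

The call order mirrors the descriptions above: instrumentation is disabled first, output is collected next, and memory is released only in `shutdown()`.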

==== darshan-core

Within darshan-runtime, the darshan-core component manages the initialization and shutdown of the
Darshan environment, provides instrumentation module developers an interface for registering modules
with Darshan, and manages the compression and writing of the resultant I/O characterization.
Each of the functions defined by this interface is explained in detail below.

[source,c]
void darshan_core_register_module(
    darshan_module_id mod_id,
    struct darshan_module_funcs *funcs,
    int *runtime_mem_limit);

The `darshan_core_register_module` function registers Darshan instrumentation modules with the
darshan-core runtime environment. This function needs to be called at least once for any module
that will contribute data to Darshan's final I/O characterization. 

* _mod_id_ is a unique identifier for the given module, which is defined in the Darshan log
format header file (darshan-log-format.h).

* _funcs_ is the structure of function pointers (as described above) that a module developer must
provide to interface with the darshan-core runtime. 

* _runtime_mem_limit_ is a pointer to an integer which will store the amount of memory Darshan
allows this module to use at runtime. Currently, darshan-core hardcodes this value to 2 MiB,
but in the future this may be changed to optimize Darshan's memory footprint. Note that Darshan
does not allocate any memory for modules; it only informs a module how much memory it can use.
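One way a module might act on the reported limit is to cap the number of records it will track. The sketch below assumes a hypothetical record layout (`struct example_record`) and helper name; only the idea of dividing the limit by the record size is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout used only for sizing in this sketch. */
struct example_record {
    uint64_t id;
    int64_t  counters[8];
    double   timers[4];
};

/* Number of records that fit within the limit reported at registration. */
size_t example_max_records(int runtime_mem_limit)
{
    if (runtime_mem_limit <= 0)
        return 0;
    return (size_t)runtime_mem_limit / sizeof(struct example_record);
}
```

A module following this approach would stop creating new records once the cap is reached, while continuing to update the records it already holds.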

[source,c]
void darshan_core_unregister_module(
    darshan_module_id mod_id);

The `darshan_core_unregister_module` function disassociates the given module from the
darshan-core runtime. Consequently, Darshan does not interface with the given module at
shutdown time and will not log any I/O data from the module. This function should only be used
if a module registers itself with darshan-core but later decides it does not want to contribute
any I/O data.

* _mod_id_ is the unique identifier for the module being unregistered.

[source,c]
void darshan_core_register_record(
    void *name,
    int len,
    int printable_flag,
    darshan_module_id mod_id,
    darshan_record_id *rec_id);

The `darshan_core_register_record` function registers some data record with the darshan-core
runtime. This record could reference a POSIX file or perhaps an object identifier for an
object storage system, for instance.  A unique identifier for the given record name is
generated by Darshan, which should then be used by the module for referencing the corresponding
record.  This allows multiple modules to refer to a specific data record in a consistent manner
and also provides a mechanism for mapping these records back to important metadata stored by
darshan-core. It is safe (and likely necessary) to call this function many times for the same
record -- darshan-core will just set the corresponding record identifier if the record has
been previously registered.

* _name_ is just the name of the data record, which could be a file path, object ID, etc.

* _len_ is the size of the input record name. For string record names, this would just be the
string length, but for nonprintable record names (e.g., an integer object identifier), this
is the size of the record name type.

* _printable_flag_ indicates whether the input record name is a printable ASCII string.

* _mod_id_ is the identifier for the module attempting to register this record.

* _rec_id_ is a pointer to a variable which will store the unique record identifier generated
by Darshan.
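The lookup-or-create behavior described above can be sketched with a small in-memory registry. Everything here is an illustration rather than Darshan's implementation: `example_register_record` is a hypothetical stand-in with a simplified signature, `darshan_record_id` is a stand-in typedef, and the FNV-1a hash is an assumption chosen for brevity, not necessarily the hash Darshan uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t darshan_record_id; /* stand-in for the Darshan type */

#define EXAMPLE_MAX_RECS 64
#define EXAMPLE_MAX_NAME 256

struct example_entry {
    darshan_record_id id;
    int len;
    unsigned char name[EXAMPLE_MAX_NAME];
};

static struct example_entry example_table[EXAMPLE_MAX_RECS];
static int example_count;

/* FNV-1a over the name bytes; works for printable and binary names. */
static darshan_record_id example_hash(const void *name, int len)
{
    const unsigned char *p = name;
    uint64_t h = 14695981039346656037ULL;
    int i;

    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Look up or create a record for name; returns 0 on success, -1 on error. */
int example_register_record(const void *name, int len,
                            darshan_record_id *rec_id)
{
    struct example_entry *e;
    int i;

    if (len <= 0 || len > EXAMPLE_MAX_NAME)
        return -1;

    /* Already registered: just hand back the existing identifier. */
    for (i = 0; i < example_count; i++) {
        e = &example_table[i];
        if (e->len == len && memcmp(e->name, name, (size_t)len) == 0) {
            *rec_id = e->id;
            return 0;
        }
    }
    if (example_count == EXAMPLE_MAX_RECS)
        return -1;

    e = &example_table[example_count++];
    memcpy(e->name, name, (size_t)len);
    e->len = len;
    e->id = example_hash(name, len);
    *rec_id = e->id;
    return 0;
}

int example_record_count(void) { return example_count; }
```

Because the name is treated as raw bytes with an explicit length, the same routine handles both file paths and nonprintable names such as integer object identifiers.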

[source,c]
void darshan_core_unregister_record(
    darshan_record_id rec_id,
    darshan_module_id mod_id);

The `darshan_core_unregister_record` function disassociates the given module identifier from the
given record identifier. If no other modules are associated with the given record identifier, then
Darshan removes all internal references to the record. This function should only be used if a
module registers a record with darshan-core, but later decides not to store the record internally.

* _rec_id_ is the record identifier we want to unregister.

* _mod_id_ is the module identifier that is unregistering _rec_id_.

[source,c]
double darshan_core_wtime(void);

The `darshan_core_wtime` function simply returns a floating point number of seconds since
Darshan was initialized. This functionality can be used to time the duration of application
I/O calls or to store timestamps of when functions of interest were called.
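A wrapper might use such a timer along the following lines. Since `darshan_core_wtime` is only available inside Darshan, this sketch substitutes a hypothetical `example_wtime` built on the POSIX `clock_gettime` call; `struct example_timing`, `example_time_call`, and `example_sample_op` are likewise illustrative names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

static struct timespec example_init_time;

/* Record the reference point that later timestamps are measured from. */
void example_wtime_init(void)
{
    clock_gettime(CLOCK_MONOTONIC, &example_init_time);
}

/* Stand-in for darshan_core_wtime(): seconds since example_wtime_init(). */
double example_wtime(void)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (double)(now.tv_sec - example_init_time.tv_sec) +
           (double)(now.tv_nsec - example_init_time.tv_nsec) / 1e9;
}

/* Hypothetical timing fields a module might keep per record. */
struct example_timing {
    double start;       /* timestamp when the most recent call began */
    double end;         /* timestamp when the most recent call ended */
    double cumulative;  /* total time spent across all timed calls */
};

/* Time one call to op and accumulate its duration. */
void example_time_call(struct example_timing *t, void (*op)(void))
{
    t->start = example_wtime();
    op();
    t->end = example_wtime();
    t->cumulative += t->end - t->start;
}

/* Small amount of work standing in for an I/O call. */
void example_sample_op(void)
{
    volatile int sink = 0;
    int i;

    for (i = 0; i < 1000; i++)
        sink += i;
}
```

Keeping both the timestamps and the cumulative duration covers the two uses mentioned above: measuring how long calls take and recording when they happened.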

==== darshan-common

darshan-common is a utility component of darshan-runtime, providing module developers with
general functions that are likely to be reused across multiple modules. These functions are
distinct from darshan-core functions since they do not require access to internal Darshan
state.

[source,c]
char* darshan_clean_file_path(
    const char* path);

The `darshan_clean_file_path` function just cleans up the input path string, converting
relative paths to absolute paths and suppressing any potential noise within the string.

* _path_ is the input path string to be cleaned up.

As more modules are contributed, it is likely that more functionality can be refactored out
of module implementations and maintained in darshan-common, facilitating code reuse and
simplifying maintenance.

=== Darshan-util

Text.

== Adding new instrumentation modules

In this section we outline each step necessary for adding a module to Darshan.

=== Log format headers

The following modifications to Darshan log format headers are required for defining
the module's record structure:

* Add a module identifier to the `darshan_module_id` enum and add the module's string name to the
`darshan_module_name` array in `darshan-log-format.h`.

* Add a top-level header that defines a data record structure for the module. An exemplar
log header for the POSIX instrumentation module is given in `darshan-posix-log-format.h`.
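The two header changes above might look roughly like the following. All identifiers in this sketch are hypothetical placeholders (prefixed `example_`/`EXAMPLE_`) rather than the actual contents of `darshan-log-format.h`, and the record fields are only a guess at what a module could store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical identifiers; the real list lives in darshan-log-format.h. */
typedef enum {
    EXAMPLE_POSIX_MOD = 0,
    EXAMPLE_NEW_MOD,      /* identifier for the newly added module */
    EXAMPLE_MAX_MODS
} example_module_id;

/* Names indexed by module identifier, kept in the same order as the enum. */
static const char * const example_module_names[EXAMPLE_MAX_MODS] = {
    "POSIX",
    "NEW_MODULE"
};

/* Hypothetical record structure that would go in the module's own header. */
struct example_new_mod_record {
    uint64_t rec_id;        /* record identifier generated by Darshan */
    int64_t  rank;          /* process that produced the record */
    int64_t  counters[4];   /* integer statistics */
    double   fcounters[2];  /* floating point statistics, e.g. timers */
};

/* Map a module identifier to its name, guarding against bad input. */
const char *example_module_name(example_module_id id)
{
    if ((int)id < 0 || (int)id >= (int)EXAMPLE_MAX_MODS)
        return "UNKNOWN";
    return example_module_names[id];
}
```

Keeping the enum and the name array in the same order lets utilities translate an identifier stored in a log back into a readable module name.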

=== Darshan-runtime

==== Build modifications

The following modifications to the darshan-runtime build system are necessary to integrate
new instrumentation modules:

* Necessary linker flags for wrapping this module's functions need to be added to the definition
of `CP_WRAPPERS` in `darshan-config.in`.

* Targets must be added to `Makefile.in` to build static and shared objects for the module's
source files, which will be stored in the `lib/` directory. The prerequisites to building
static and dynamic versions of `lib-darshan` must be updated to include these objects, as well.

It is necessary to rerun the `prepare` script and reconfigure darshan-runtime for these changes
to take effect.

==== Instrumentation module implementation

An exemplar instrumentation module for POSIX I/O functions is given in `lib/darshan-posix.c` as
reference. In addition to the development notes from above and the reference POSIX module, we
provide the following notes to assist module developers:

* Modules only need to include the `darshan.h` header to interface with darshan-core.

* Lacking a way to bootstrap themselves, modules will have to include some logic in their
wrappers to initialize necessary module state if initialization has not already occurred.
    - Part of this initialization process should be registering the module with darshan-core,
    since this informs the module how much memory it may allocate.

* The file record identifier given when registering a record with darshan-core can be used
to store the record structure in a hash table or some other structure.
    - The `darshan_core_register_record` function is really more like a lookup function. It
    may be called multiple times for the same record -- if the record already exists, the function
    simply returns its record ID.
    - It may be necessary to maintain a separate hash table for other handles which the module
    may use to refer to a given record. For instance, the POSIX module may need to look up a
    file record based on a given file descriptor, rather than a path name.
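The lazy initialization and secondary lookup described in these notes can be sketched as follows. The names are hypothetical, `darshan_record_id` is a stand-in typedef, and a directly indexed array takes the place of a hash table to keep the example short. The sketch also ignores thread safety, which a real module would need to consider.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t darshan_record_id; /* stand-in for the Darshan type */

#define EXAMPLE_MAX_FDS 1024

/* Hypothetical module runtime state. */
struct example_runtime {
    int initialized;
    int init_calls;
    unsigned char fd_valid[EXAMPLE_MAX_FDS];
    darshan_record_id fd_to_rec[EXAMPLE_MAX_FDS];
};

static struct example_runtime example_rt;

/* One-time setup, invoked from every wrapper before touching state.
 * A real module would also register itself with darshan-core here. */
static void example_ensure_init(void)
{
    if (example_rt.initialized)
        return;
    example_rt.initialized = 1;
    example_rt.init_calls++;
}

/* Called from an open-style wrapper: remember which record an fd refers to. */
int example_track_fd(int fd, darshan_record_id rec_id)
{
    example_ensure_init();
    if (fd < 0 || fd >= EXAMPLE_MAX_FDS)
        return -1;
    example_rt.fd_to_rec[fd] = rec_id;
    example_rt.fd_valid[fd] = 1;
    return 0;
}

/* Called from read/write-style wrappers: find the record for an fd. */
int example_lookup_fd(int fd, darshan_record_id *rec_id)
{
    example_ensure_init();
    if (fd < 0 || fd >= EXAMPLE_MAX_FDS || !example_rt.fd_valid[fd])
        return -1;
    *rec_id = example_rt.fd_to_rec[fd];
    return 0;
}

int example_init_calls(void) { return example_rt.init_calls; }
```

Every entry point calls `example_ensure_init()` first, so whichever wrapper happens to run first performs the setup and later calls skip it.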

=== Darshan-util

==== Build modifications

Text

== Other resources

* http://www.mcs.anl.gov/research/projects/darshan/[Darshan website]
* http://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html[darshan-runtime documentation]
* http://www.mcs.anl.gov/research/projects/darshan/docs/darshan-util.html[darshan-util documentation]
* https://lists.mcs.anl.gov/mailman/listinfo/darshan-users[Darshan-users mailing list]
* https://trac.mcs.anl.gov/projects/darshan/report[Darshan trac page]