Newer Older
1 2 3 4
This document serves the following purposes:
* Document project resources (repository links, etc.)
* Introduce and present an overview of the key components making up the CODES
5 6
* Introduce and present an overview of the key components making up the CODES
  networking library "model-net", along with available models and utilities.
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
* Walk through the CODES example model, which shows the majority of CODES
  features currently available.

= CODES/ROSS resources

CODES and ROSS share a mailing list. It is at:


* main site: 
* repositories:
  * "base" (this repository):
  * codes-net (networking component of CODES):
* bug tracking:


* main site, repository, etc.:
  * both the site and repository contain good documentation as well - refer to
    it for an in-depth introduction and overview of ROSS proper

= Components of CODES 

== Configuration

The configuration of LPs, LP parameterization, and miscellaneous simulation
parameters are specified by the CODES configuration system, which uses a
structured configuration file. The configuration format allows categories, and
optionally subgroups within the category, of key-value pairs for configuration.
The LPGROUPS category defines the LP configuration. The PARAMS category is
currently used for networking and ROSS-specific parameters. User-defined
categories can also be used. 

The configuration system additionally allows LP specialization via the usage of
"annotations". This allows two otherwise identical LPs to have different
parameterizations. Annotations have a simple "@" syntax appended to the LP
fields, and are optional.

CODES currently exposes a small number of ROSS simulation engine options within
the configuration file. These are "PARAMS:message_size" and
"PARAMS:pe_mem_factor". The "message_size" parameter indicates the upper bound
of event sizes that ROSS is expected to handle, while the "pe_mem_factor"
parameter indicates the multiplier to the number of events allocated per ROSS
PE (an MPI rank). Both of these exist to support the static allocation scheme
ROSS uses for efficiency - both LP states and the maximum population of events
are allocated statically at the beginning of the simulation.

The API is located at codes/configuration.h, which provides various types of
access into the simulation configuration. Detailed configuration files can be
found at doc/example/example.conf and doc/example_heterogeneous/example.conf.

== LP mapping

The codes-mapping API maps user LPs to global LP IDs, providing numerous
options for modulating the namespace under which the mapping is conducted.
Mapping is performed on a per-group or per-LP-type basis, with numerous further
filtering options including on an LPs annotation. Finally, the mapping API
provides LP counts using the aforementioned filtering options.

The API can be found at codes/codes_mapping.h. doc/example/example.c shows a
simple example of the mapping functionality, while the test program
tests/mapping_test.c with configuration file tests/conf/mapping_test.conf
Jonathan Jenkins's avatar
Jonathan Jenkins committed
extensively demonstrates the mapping API.

72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
=== LP mapping context

In many cases (the resource and local storage model LPs, and modelnet in
codes-net), the mapping from caller to callee LPs are implicit. The
codes-mapping context API (codes/codes-mapping-context.h) provides various
options for influencing these mappings, as well as providing the capability to
bypass codes-mapping and send messages directly to an LP. The test file
tests/map-ctx-test.c, along with the config file tests/conf/map-ctx-test.conf
provides the best usage examples.

== LP messaging conventions

Models in CODES that have a request-response form of communication resembling
that of RPCs follow a specific convention, defined in codes/codes-callback.h.
The header documents the convention. The current best example usage is in the
CODES sources themselves, i.e. in codes/resource-lp.h, src/util/resource-lp.c,
codes/local-storage-model.h, and src/util/local-storage-model.c.

90 91 92 93 94 95 96 97
== Workload generator(s)

codes-workload is an in-development abstraction layer for feeding I/O / network
workloads into a simulation. It supports multiple back-ends for generating I/O
and network events; data could come from a trace file, from Darshan, or from a
synthetic description.

The workload generator is currently a work in progress, and the API is subject
Jonathan Jenkins's avatar
Jonathan Jenkins committed
98 99 100
to change. The I/O generation component exposes a "POSIX-ish"
open/close/read/write interface, while the network generation component exposes
an "MPI-ish" send/recv/barrier/collective interface.
101 102

As an additional utility, we provide a simple debug program,
Jonathan Jenkins's avatar
Jonathan Jenkins committed
src/workload/codes-workload-dump, that processes a given workload and prints to
104 105 106 107 108
standard out.

=== IO

We currently have initial support for extrapolating (lossy) Darshan logs
Jonathan Jenkins's avatar
Jonathan Jenkins committed
109 110 111
(, a simple synthetic IO
kernel language, and IO Recorder (
112 113 114 115 116

==== Synthetic IO language

The synthetic IO language is a simple, interpreted set of IO and basic
arithmetic commands meant to simplify the specification and running of
application workloads.
118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144

The input for the workload generator consists of an IO kernel metadata file and
a number of IO kernel files. The former specifies a set of kernel files to run
and logical client IDs to participate in the workload, while the latter
describes the IO to be performed.

The format of the metadata file is a set of lines containing:
  <group ID> <start ID> <end ID inclusive> <kernel file>
* <group ID> is the ID of this group (see restrictions)
* <start ID> and <end ID> form the range of logical client IDs that will 
  perform the given workload. Note that the end ID is inclusive, so a start,
  end pair of 0, 3 will include IDs 0, 1, 2, and 3. An <end ID> of -1 indicates
  to use the remaining number of clients as specified by the user.
* <kernel file> is the path to the IO kernel workload. It may either be an
  absolute or relative path.

The IO kernel file contains a set of commands performed on a per-client
basis. Like the workload generator interface, files are represented by integer
IDs, and the standard set of "POSIX-ish" operations can be applied (e.g., open,
close, sync, write, read) and have a similar argument list (file ID, [length],
[offset] where applicable). pread/pwrite equivalents are given by

More detailed documentation on the language is ongoing, but for now a general
example can be seen at doc/workload, which shows a simple out-of-core data
shuffle. Braver souls may wish to visit the implementation at src/iokernellang
and src/workload/codes-iolang-wrkld.c.
146 147 148 149 150 151 152 153 154 155

The following restrictions currently apply to the IO language:
* all user-defined variables must be a single, lower-case letter (the symbol
  table from the code we inherited is an array of 26 chars)
* the implementation of "groups" is currently broken. We have gotten around
  this by hard-coding in the group size and client ID into the parser when a
  kernel file is loaded (parsing currently occurs on a per-client basis).
  Hence, getgroupid should be completely ignored and getgrouprank and
  getgroupsize ignore the group ID parameter passed in.

Jonathan Jenkins's avatar
Jonathan Jenkins committed
156 157 158 159 160 161 162 163
===== Limitations

The IO language is frozen and no future development will be happening with it,
so keep the following limitations in mind when using it.

* There is currently no way to specify a "create" flag to open.
* Variables are expected to be a single lowercase character.

Jonathan Jenkins's avatar
Jonathan Jenkins committed
164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187
==== "Mock" IO workload

The mock IO workload generator creates a sequential workload of N requests of
size M. The generated file ID is either an optional input or 0 - there's also
an option to add a (simulated) processes rank to the file IDs, giving in effect
a unique file per rank. Relevant configuration parameters are:
* mock_num_requests - the number of requests
* mock_request_size - the size of each request
* mock_request_type - the type of request ("read" or "write")
* mock_file_id (optional) - the file ID to use, default 0
* mock_use_unique_file_ids (optional) - if non-zero, add the workload
  processor's rank to the file ID. Default is 0.
* mock_rank_table_size (optional) - the hash table size to store the ranks in.
  For minimal collisions, choose a value larger than the expected workload
  number of ranks.

==== Recorder IO workload


==== Checkpoint IO workload


188 189
=== Network

Jonathan Jenkins's avatar
Jonathan Jenkins committed
190 191
Our primary network workload generator is via the DUMPI tool
( DUMPI collects and reads events from
192 193 194 195 196 197 198 199
MPI applications. See the DUMPI documentation for how to generate traces. There
are additionally publically-available traces at

Note on trace reading - the input file prefix to the dumpi workload generator
should be everything up to the rank number. E.g., if the dumpi files are of the
form "dumpi-YYYY.MM.DD.HH.MM.SS-XXXX.bin", then the input should be

Jonathan Jenkins's avatar
Jonathan Jenkins committed
201 202 203 204 205 206 207 208 209
=== Workload generator helpers

The codes-jobmap API (codes/codes-jobmap.h) specifies mechanisms to initialize
and run multiple jobs with unique job IDs and linear namespaces, used to allow
multiple concurrent workloads to be run within the same simulation. Example
usage can be seen in tests/jobmap-test.c and, for the list allocation method,
in tests/conf/jobmap-test-list.conf. Note that only the DUMPI trace generator
has been tested using this interface at this time.

210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
== LP-IO

LP-IO is a set of simple reverse-computation-aware routines for conditionally
outputting data on a per-LP basis. As the focus is on convenient, small-scale
data output, data written via LP-IO remains in memory until the end of the
simulation, or freed upon reverse computation. Large-scale,
reverse-computation-aware IO is a feature we're thinking about for future

The API can be found at codes/lp-io.h and is fairly self-explanatory.

== CODES configurator

The configurator is a set of scripts intended to make the auto-generation of
multiple CODES configuration files easier, for the purposes of performing
parameter sweeps of simulations. The configuration file defining the parameters
in the parameter sweep is defined by a python source file with well-defined
field names, to maximize flexibility and enable some essential features for
flexible parameter sweeps (disabling certain combinations of parameters,
deriving parameters from other parameters in the sweep). The actual replacement
is driven by token replacement defined by the values in the configuration file.

An exhaustive example can be found at scripts/example. The scripts themselves
are,, and, each with detailed usage info. These scripts have
heavily-overlapping functionality, so in the future these may be merged.

== Miscellaneous utilities

=== Workload display utility

For debugging and experimentation purposes, a plain-text "dump" of an IO
workload can be seen using the utility src/workload/codes-workload-dump
(it gets installed into $bindir).

=== LP template (src/util/templates)

As writing ROSS/CODES models currently entail a not-insignificant amount of
boilerplate for defining LPs and hooking them into ROSS, we have a template
model for use at src/util/templates/lp_template.* .

=== Generic message header (see best practices)

We recommend the use of codes/lp-msg.h to standardize LP event headers, making it
easier to identify messages.

= Utility models

== Local storage model

The local storage model (LSM) is fairly simple in design but is sufficient for
many simulations with reasonable I/O access patterns. It is an
overhead/latency/bandwidth model that tracks file and offset accesses to
determine whether to apply seeking penalties to the performance of the access.
It uses a simple histogram-based approach to parameterization:
overhead/latency/bandwidth numbers are given relative to different access
sizes. To gather such parameters, well-known I/O benchmarks such as fio
(;a=summary) can be used.

The LP name used in configuration is "lsm" and the configuration is expected to
be in a similarly named standalone group, an example of which is shown below:

    # in bytes
    request_sizes   = ( "4096","8192","16384","32768","65536","131072","262144","524288","1048576","2097152","4194304" );
    # in MiB/s (2^20 bytes / s)
    write_rates     = ( "1511.7","1511.7","1511.7","1511.7","1511.7","1511.7","1511.7","1511.7","1511.7","1511.7","1511.7" );
    read_rates      = ( "1542.1","1542.1","1542.1","1542.1","1542.1","1542.1","1542.1","1542.1","1542.1","1542.1","1542.1" );
    # in microseconds
    write_seeks     = ( "499.5","509.0","514.7","525.9","546.4","588.3","663.1","621.8","539.1","3179.5","6108.8" );
    read_seeks      = ( "3475.6","3470.0","3486.2","3531.2","3608.6","3741.0","3988.9","4530.2","5644.2","7922.0","11700.3" );
    write_overheads = ( "29.67","29.67","29.67","29.67","29.67","29.67","29.67","29.67","29.67","29.67","29.67" );
    read_overheads  = ( "23.67","23.67","23.67","23.67","23.67","23.67","23.67","23.67","23.67","23.67","23.67" );

The API can be found at codes/local-storage-model.h and example usage can be
seen in tests/local-storage-model-test.c and tests/conf/lsm-test.conf. 

289 290 291 292 293 294
The queueing policy of LSM is currently FIFO, and the default mode uses an
implicit queue, simply incrementing counters and scheduling future events when
I/O requests come in. Additionally, an explicit queue has been added and
provides a simple FIFO+priority mechanism. To use, in the "lsm" group set
"enable_scheduler" to the value "1".

295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314
== Resource model

The resource model presents a simple integer counter representing some finite
resource (e.g., bytes of memory available). LPs request some number of units of
the resource, receiving a success/failure completion message via a callback
mechanism. Optional "blocking" can be used to defer the completion message
until the request is successfully completed.

The configuration LP name is "resource" and the parameters are given in a
similarly-named group. An example is shown below:


The API for the underlying resource data structure can be found in
codes/resource.h. The user-facing API for communicating with the LP can be
found in codes/resource-lp.h.

315 316 317 318 319 320 321
= model-net overview

model-net is a set of networking APIs and models intended to provide
transparent message passing support between LPs, allowing the underlying
network LPs to do the work of routing while user LPs model their applications.
It consists of a number of both simple and complex network models as well as
configuration utilities and communication APIs. A somewhat stale overview is
also given at src/networks/model-net/doc/README.
323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369

= Components of model-net

== messaging API

model-net provides a few messaging routines for use by LPs. Each specify a
message payload and an event to execute upon message receipt. They allow for:
* push-style messaging, e.g., MPI_Send with an asynchronous MPI_Recv at the
* pull-style messaging (e.g., RMA). E.g., LP x issues a pull message event,
  which reaches LP y. Upon receipt, LP y sends the specified message payload
  back to LP x. LP x executes the event it specified once the message payload
  is fully received. Note that, at the modeler level, The pull message event is
  not seen by the destination LP (LP y in this example).
* (experimental) initial collective communication support. This is currently
  being evaluated - more documentation on this feature will follow.

The messaging types, with the exception of the collective routines,
support specializing by LP annotation, only communicating through model-net LPs
containing the given annotation. Additionally, there is initial support for
different message scheduling algorithms - see codes/model-net-sched.h and the
following section. In particular, message-specific parameters can be set via
the model_net_set_msg_param function in codes/model-net.h, which currently is
only used with the "priority" scheduler.

The messaging API is given by codes/model-net.h, through the
model_net_event family of functions. Usage can be found in the example program
of the codes-base repository.

== LP configuration / mapping

model-net LPs, like any other LP in CODES/ROSS, require specification in the
configuration file as well as LP-specific parameters. To identify model-net LPs
to CODES, we use the naming scheme of prefixing model-net LP names with the
string "modelnet_". For example, the "simplenet" LP (see "model-net models")
would be specified in the LPGROUPS section as "modelnet_simplenet". 

Currently, model-net LPs expect their configuration values in the PARAMS
section. This is due to historical precedent and may be changed in the future
to use their own dedicated section. There are a combination of general
model-net parameters and model-specific parameters. The general parameters are:
* modelnet_order - this is the only required option. model-net assigns integer
  IDs to the different types of networks used (independent of annotation) so
  that the user can differentiate between different network topologies without
  knowing the particular models used. All models used in the CODES
  configuration file must be referenced in the parameter. See
  doc/example_heterogeneous in the codes-base repository.
* packet_size - the size of packets in bytes. model-net will break messages into
371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392
* modelnet_scheduler - the algorithm for scheduling packets when multiple
  concurrent messages are being processed. Options can be found in the
  SCHEDULER_TYPES macro in codes/model-net-sched.h). In particular, the
  "priority" scheduler requires two additional parameters in PARAMS:
  "prio-sched-num-prios" and "prio-sched-sub-sched", the former of which sets
  the number of priorities to use and the latter of which sets the scheduler
  used for messages with the same priority.

== Statistics tracking

model-net tracks basic statistics for each LP such as bytes sent/received and
time spent sending/receiving messages. If LP-IO has been enabled in the
program, they will be printed to the specified directory.

= model-net models

Currently, model-net contains a combination of analytical models and specific
models for high-performance networking, with an HPC bent. 

Configuration files for each model can be found in tests/conf, under the name

Jonathan Jenkins's avatar
Jonathan Jenkins committed
== Simplenet/SimpleP2P

Jonathan Jenkins's avatar
Jonathan Jenkins committed
395 396
The Simplenet and SimpleP2P models (model-net LP names "simplenet" and
"simplep2p") are basic queued point-to-point latency/bandwidth models, assuming
infinite packet buffering when routing. These are best used for models that
Jonathan Jenkins's avatar
Jonathan Jenkins committed
require little fidelity out of the network performance. SimpleP2P is the same
399 400 401 402 403 404
model as Simplenet, except it provides heterogeneous link capacities from point
to point. Rather than a single entry, it requires files containing a matrix of
point-to-point bandwidths and latencies. 

Simplenet models require two configuration parameters: "net_startup_ns" and
"net_bw_mbps", which define the startup and bandwidth costs in nanoseconds and
Jonathan Jenkins's avatar
Jonathan Jenkins committed
405 406
megabytes per second, respectively. SimpleP2P requires two configuration files
for the latency and bandwidth costs: "net_latency_ns_file" and
407 408 409

More details about the models can be found at
410 411
src/networks/model-net/doc/README.simplenet.txt and
src/networks/model-net/doc/README.simplep2p.txt, respectively.
412 413 414 415 416 417 418 419 420 421 422 423 424

== LogGP

The LogGP model (model-net LP name: "loggp"), based on the LogP
model, is a more advanced analytical model, containing point-to-point latency
("L"), CPU overhead ("o"), a minimum time between consecutive messages ("gap",
or "g"), and a time-per-byte-sent ("G") independent from the gap.

The only configuration entry the LogGP model requires is
"PARAMS:net_config_file", which points to a file with path *relative* to the
configuration file.

For more details on gathering parameters for the LogGP model, as well as it's
usage and caveats, see the document src/model-net/doc/README.loggp.txt. 
426 427 428 429 430

== Torus

The torus model (model-net LP name: "torus") represents a torus topology with
arbitrary dimension, used primarily in HPC systems. It is based on the work
Jonathan Jenkins's avatar
Jonathan Jenkins committed
431 432 433 434 435
performed in:
    M. Mubarak, C. D. Carothers, R. B. Ross, P. Carns. A case study in using
    massively parallel simulation for extreme-scale torus network co-design,
    ACM SIGSIM conference on Principles of Advanced Discrete Simulations
    (PADS), 2014.

The configuration and model setup can be found at:
439 440 441

== Dragonfly

The dragonfly model (model-net LP name: "dragonfly") is a network
443 444 445
topology that utilizes the concept of virtual routers to produce systems with
very high virtual radix out of network components with a lower radix. The
topology itself and the simulation model are both described in
447 448 449 450 451 452 453 454 455

The configuration parameters are a little trickier here, as additional LPs
other than the "modelnet_dragonfly" LP must be specified. "modelnet_dragonfly"
represent the node LPs (terminals), while a second type "dragonfly_router"
represents a physical router. At least one "dragonfly_router" LP must be
present in every LP group with a "modelnet_dragonfly" LP. 

Further configuration and model setup can be found at
456 457 458
A newer version of dragonfly model that supports a Cray style network is also
available. Instructions can be found at src/model-net/doc/README.dragonfly-custom.txt.

460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556
= CODES example model

An example model representing most of the functionality present in CODES is
available in doc/example. In this scenario, we have a certain number of storage
servers, identified through indices 0, ... , n-1 where each server has a
network interface card (NIC) associated with it. The servers exchange messages
with their neighboring server via their NIC card (i.e., server i pings server
i+1, rolling over the index if necessary). When the neighboring server receives
the message, it sends an acknowledgement message to the sending server in
response. Upon receiving the acknowledgement, the sending server issues another
message. This process continues until some number of messages have been sent.
For simplicity, it is assumed that each server has a direct link to its
neighbor, and no network congestion occurs due to concurrent messages being

The model is relatively simple to simulate through the usage of ROSS. There are
two distinct LP types in the simulation: the server and the NIC. Refer to
example.c for data structure definitions. The server LPs are in charge of
issuing/acknowledging the messages, while the NIC LPs (implemented via CODES's
model-net component, available in the codes-net repository) transmit the data
and inform their corresponding servers upon completion. This LP decomposition
strategy is generally preferred for ROSS-based simulations: have
single-purpose, simple LPs representing logical system components.

In this program, CODES is used in the following four ways: to provide
configuration utilities for the program (example.conf), to logically separate
and provide lookup functionality for multiple LP types, to automate LP
placement on KPs/PEs, and to simplify/modularize the underlying network
structure. The configuration API is used for the first use-case, the
mapping API is used for the second and third use-cases, and the
model-net API is used for the fourth use-case. The following sections
discuss these while covering necessary ROSS-specific information.

== Configuration and mapping

In the example program, there are one server LP and one
"modelnet_simplenet" LP type in a group and this combination is
repeated 16 times (repetitions="16") for a total of 32 LPs. The section
"server_pings" is server-LP-specific and defines the number of rounds of
communication and the payload for each round.

We use the simple-net LP provided by model-net as the underlying network
model. The simple-net parameters are specified by the user in the PARAMS
section of the example.conf config file.

== Server state and event handlers

The server LP state maintains a count of the number of remote messages it has
sent and received as well as the number of local completion messages.   

For the server event message, we have four message types: KICKOFF, REQ, ACK and
LOCAL. With a KICKOFF event, each LP sends a message to itself to begin the
simulation proper. To avoid event ties, we add a small amount of random noise
using codes_local_latency. The REQ message is sent by a server to its
neighboring server and when received, neighboring server sends back a message
of type ACK. We've shown a hard-coded direct communication method which
directly computes the LP ID, and a codes-mapping API-based method. 

== Server reverse computation

ROSS has the capability for optimistic parallel simulation, but instead of
saving the state of each LP, they instead require users to perform reverse
computation. That is, while the event messages are themselves preserved (until
the Global Virtual Time (GVT) algorithm renders the messages unneeded), the LP
state is not preserved. Hence, it is up to the simulation developer to provide
functionality to reverse the LP state, given the event to be reversed. ROSS
makes this simpler in that events will always be rolled back in exactly the
order they were applied. Note that ROSS also has both serial and parallel
conservative modes, so reverse computation may not be necessary if the
simulation is not compute- or memory-intensive.

For our example program, recall the "forward" event handlers. They perform the
* Kickoff: send a message to the peer server, and increment sender LP's
    count of sent messages.
* Request (received from peer server): increment receiver count of
    received messages, and send an acknowledgement to the sender.
* Acknowledgement (received from message receiver): send the next
    message to the receiver and increment messages sent count. Set a flag
    indicating whether a message has been sent.  
* Local model-net callback: increment the local model-net
    received messages count.

In terms of LP state, the four operations are simply modifying counts. Hence,
the "reverse" event handlers need to merely roll back those changes: 
* Kickoff: decrement sender LP's count of sent messages.
* Request (received from peer server): decrement receiver count of
    received messages.
* Acknowledgement (received from message receiver): decrement messages
    sent count if flag indicating a message has been sent has not been
* Local model-net callback: decrement the local model-net
    received messages count.

For more complex LP states (such as maintaining queues), reverse event
processing becomes similarly more complex. Refer to the best practices document
for strategies of coping with the increase in complexity.