This document serves the following purposes:

* Document project resources (repository links, etc.)
* Introduce and present an overview of the key components making up the CODES library
* Introduce and present an overview of the key components making up the CODES networking library "model-net", along with available models and utilities.
* Walk through the CODES example model, which shows the majority of CODES features currently available.

= CODES/ROSS resources

CODES and ROSS share a mailing list. It is at:
https://lists.mcs.anl.gov/mailman/listinfo/codes-ross-users

== CODES

* main site: http://www.mcs.anl.gov/projects/codes/
* repositories:
  * "base" (this repository): git.mcs.anl.gov:radix/codes-base
  * codes-net (networking component of CODES): git.mcs.anl.gov:radix/codes-net
* bug tracking: https://trac.mcs.anl.gov/projects/CODES

== ROSS

* main site, repository, etc.: https://github.com/ross-org/ROSS
* both the site and repository contain good documentation as well - refer to it for an in-depth introduction and overview of ROSS proper

= Components of CODES

== Configuration

The configuration of LPs, LP parameterization, and miscellaneous simulation parameters are specified by the CODES configuration system, which uses a structured configuration file. The configuration format allows categories, and optionally subgroups within the category, of key-value pairs for configuration. The LPGROUPS category defines the LP configuration. The PARAMS category is currently used for networking and ROSS-specific parameters. User-defined categories can also be used.

The configuration system additionally allows LP specialization via the usage of "annotations". This allows two otherwise identical LPs to have different parameterizations. Annotations have a simple "@" syntax appended to the LP fields, and are optional.

CODES currently exposes a small number of ROSS simulation engine options within the configuration file. These are "PARAMS:message_size" and "PARAMS:pe_mem_factor". The "message_size" parameter indicates the upper bound of event sizes that ROSS is expected to handle, while the "pe_mem_factor" parameter indicates the multiplier to the number of events allocated per ROSS PE (an MPI rank). Both of these exist to support the static allocation scheme ROSS uses for efficiency - both LP states and the maximum population of events are allocated statically at the beginning of the simulation.

The API is located at codes/configuration.h, which provides various types of access into the simulation configuration. Detailed configuration files can be found at doc/example/example.conf and doc/example_heterogeneous/example.conf.
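To make the format concrete, the sketch below shows the overall shape of such a file. The group name, LP names, repetition count, and the "@fast" annotation are purely illustrative; refer to doc/example/example.conf and doc/example_heterogeneous/example.conf for real, complete files.

    LPGROUPS
    {
        # illustrative group: 16 repetitions of one server LP and one network LP,
        # plus an annotated network LP that can be parameterized separately
        EXAMPLE_GRP
        {
            repetitions="16";
            server="1";
            modelnet_simplenet="1";
            modelnet_simplenet@fast="1";
        }
    }
    PARAMS
    {
        # ROSS engine options exposed by CODES (values are illustrative)
        message_size="256";
        pe_mem_factor="512";
    }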
== LP mapping

The codes-mapping API maps user LPs to global LP IDs, providing numerous options for modulating the namespace under which the mapping is conducted. Mapping is performed on a per-group or per-LP-type basis, with further filtering options including filtering on an LP's annotation. Finally, the mapping API provides LP counts using the aforementioned filtering options. The API can be found at codes/codes_mapping.h. doc/example/example.c shows a simple example of the mapping functionality, while the test program tests/mapping_test.c with configuration file tests/conf/mapping_test.conf demonstrates the mapping API extensively.

=== LP mapping context

In many cases (the resource and local storage model LPs, and modelnet in codes-net), the mapping from caller to callee LPs is implicit. The codes-mapping context API (codes/codes-mapping-context.h) provides various options for influencing these mappings, as well as the capability to bypass codes-mapping and send messages directly to an LP. The test file tests/map-ctx-test.c, along with the configuration file tests/conf/map-ctx-test.conf, provides the best usage examples.

== LP messaging conventions

Models in CODES that have a request-response form of communication resembling that of RPCs follow a specific convention, defined in codes/codes-callback.h. The header documents the convention. The current best example usage is in the CODES sources themselves, i.e., in codes/resource-lp.h, src/util/resource-lp.c, codes/local-storage-model.h, and src/util/local-storage-model.c.

== Workload generator(s)

codes-workload is an in-development abstraction layer for feeding I/O and network workloads into a simulation. It supports multiple back-ends for generating I/O and network events; data could come from a trace file, from Darshan, or from a synthetic description. The workload generator is currently a work in progress, and the API is subject to change.

The I/O generation component exposes a "POSIX-ish" open/close/read/write interface, while the network generation component exposes an "MPI-ish" send/recv/barrier/collective interface. As an additional utility, we provide a simple debug program, src/workload/codes-workload-dump, that processes a given workload and prints it to standard output.

=== IO

We currently have initial support for extrapolating (lossy) Darshan logs (https://www.mcs.anl.gov/research/projects/darshan/), a simple synthetic IO kernel language, and IO Recorder (https://github.com/babakbehzad/Recorder) traces.

==== Synthetic IO language

The synthetic IO language is a simple, interpreted set of IO and basic arithmetic commands meant to simplify the specification and running of application workloads. The input for the workload generator consists of an IO kernel metadata file and a number of IO kernel files. The former specifies a set of kernel files to run and the logical client IDs that participate in the workload, while the latter describes the IO to be performed.

The format of the metadata file is a set of lines containing:

    <group id> <start id> <end id> <kernel path>

where:
* <group id> is the ID of this group (see restrictions)
* <start id> and <end id> form the range of logical client IDs that will perform the given workload. Note that the end ID is inclusive, so a start, end pair of 0, 3 will include IDs 0, 1, 2, and 3. An <end id> of -1 indicates to use the remaining number of clients as specified by the user.
* <kernel path> is the path to the IO kernel workload. It may either be an absolute or relative path.

The IO kernel file contains a set of commands performed on a per-client basis. Like the workload generator interface, files are represented by integer IDs, and the standard set of "POSIX-ish" operations can be applied (e.g., open, close, sync, write, read) with a similar argument list (file ID, [length], [offset] where applicable). pread/pwrite equivalents are given by readat/writeat. More detailed documentation on the language is ongoing, but for now a general example can be seen at doc/workload, which shows a simple out-of-core data shuffle. Braver souls may wish to visit the implementation at src/iokernellang and src/workload/codes-iolang-wrkld.c.

The following restrictions currently apply to the IO language:
* all user-defined variables must be a single, lower-case letter (the symbol table from the code we inherited is an array of 26 chars)
* the implementation of "groups" is currently broken. We have gotten around this by hard-coding the group size and client ID into the parser when a kernel file is loaded (parsing currently occurs on a per-client basis). Hence, getgroupid should be completely ignored, and getgrouprank and getgroupsize ignore the group ID parameter passed in.
===== Limitations

The IO language is frozen and no further development is planned for it, so keep the following limitations in mind when using it:
* There is currently no way to specify a "create" flag to open.
* Variables are expected to be a single lowercase character.

==== "Mock" IO workload

The mock IO workload generator creates a sequential workload of N requests of size M. The generated file ID is either an optional input or 0; there is also an option to add a (simulated) process's rank to the file ID, giving in effect a unique file per rank. The relevant configuration parameters are:
* mock_num_requests - the number of requests
* mock_request_size - the size of each request
* mock_request_type - the type of request ("read" or "write")
* mock_file_id (optional) - the file ID to use, default 0
* mock_use_unique_file_ids (optional) - if non-zero, add the workload processor's rank to the file ID. Default is 0.
* mock_rank_table_size (optional) - the size of the hash table used to store the ranks. For minimal collisions, choose a value larger than the expected number of ranks in the workload.

==== Recorder IO workload

TODO...

==== Checkpoint IO workload

TODO...

=== Network

An online workload generator has been added that replays MPI-like calls on the network models. The SWM workloads are closed-source right now, but integration with the Conceptual communication library is in progress.

Our primary network workload generator is the DUMPI tool (http://sst.sandia.gov/about_dumpi.html). DUMPI collects and reads events from MPI applications. See the DUMPI documentation for how to generate traces. There are additionally publicly available traces at http://portal.nersc.gov/project/CAL/doe-miniapps.htm.

Note on trace reading: the input file prefix to the dumpi workload generator should be everything up to the rank number. E.g., if the dumpi files are of the form "dumpi-YYYY.MM.DD.HH.MM.SS-XXXX.bin", then the input should be "dumpi-YYYY.MM.DD.HH.MM.SS-".

=== Quality of Service

Two models (dragonfly-dally.C and dragonfly-plus.C) can now support traffic differentiation and prioritization. The models support quality of service by directing network traffic onto separate classes of virtual channels. Additional documentation on using traffic classes can be found at the wiki link:
https://xgitlab.cels.anl.gov/codes/codes/wikis/Quality-of-Service

=== Workload generator helpers

The codes-jobmap API (codes/codes-jobmap.h) specifies mechanisms to initialize and run multiple jobs with unique job IDs and linear namespaces, used to allow multiple concurrent workloads to be run within the same simulation. Example usage can be seen in tests/jobmap-test.c and, for the list allocation method, in tests/conf/jobmap-test-list.conf. Note that only the DUMPI trace generator has been tested with this interface at this time.

== LP-IO

LP-IO is a set of simple reverse-computation-aware routines for conditionally outputting data on a per-LP basis. As the focus is on convenient, small-scale data output, data written via LP-IO remains in memory until the end of the simulation, or is freed upon reverse computation. Large-scale, reverse-computation-aware IO is a feature we're thinking about for future usage. The API can be found at codes/lp-io.h and is fairly self-explanatory.
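As a rough illustration (not code taken from the CODES sources), the fragment below shows the typical pattern: the driver prepares an LP-IO handle before tw_run() and flushes it afterwards, and an LP conditionally buffers a small record, e.g. in its finalize function. It assumes the lp_io_prepare/lp_io_write/lp_io_flush entry points declared in codes/lp-io.h; check that header for the exact flags and the reverse-computation counterparts. The output directory name, "server-stats" identifier, and ns->msg_sent_count field are illustrative.

    #include <codes/lp-io.h>   /* lp_io_prepare, lp_io_write, lp_io_flush */

    /* driver side, around tw_run(): */
    lp_io_handle h;
    lp_io_prepare("example-output", LP_IO_UNIQ_SUFFIX, &h, MPI_COMM_WORLD);
    tw_run();
    lp_io_flush(h, MPI_COMM_WORLD);

    /* LP side, e.g. in a finalize function: buffer a small per-LP record;
     * "server-stats" becomes a file of that name in the output directory */
    char buf[64];
    int n = snprintf(buf, sizeof(buf), "%llu sent %d msgs\n",
                     (unsigned long long)lp->gid, ns->msg_sent_count);
    lp_io_write(lp->gid, "server-stats", n, buf);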
== CODES configurator

The configurator is a set of scripts intended to ease the auto-generation of multiple CODES configuration files, for the purpose of performing parameter sweeps of simulations. The parameter sweep itself is defined by a Python source file with well-defined field names, to maximize flexibility and enable some essential features for flexible parameter sweeps (disabling certain combinations of parameters, deriving parameters from other parameters in the sweep). The generation of each output configuration is driven by token replacement using the values defined in that file. An exhaustive example can be found at scripts/example. The scripts themselves are codes_configurator.py, codes_filter_configs.py, and codes_config_get_vals.py, each with detailed usage information. These scripts have heavily overlapping functionality, so they may be merged in the future.

== Miscellaneous utilities

=== Workload display utility

For debugging and experimentation purposes, a plain-text "dump" of an IO workload can be produced using the utility src/workload/codes-workload-dump (it gets installed into $bindir).

=== LP template (src/util/templates)

As writing ROSS/CODES models currently entails a not-insignificant amount of boilerplate for defining LPs and hooking them into ROSS, we provide a template model at src/util/templates/lp_template.* .

=== Generic message header (see best practices)

We recommend the use of codes/lp-msg.h to standardize LP event headers, making it easier to identify messages.

= Utility models

== Local storage model

The local storage model (LSM) is fairly simple in design but is sufficient for many simulations with reasonable I/O access patterns. It is an overhead/latency/bandwidth model that tracks file and offset accesses to determine whether to apply seek penalties to the performance of an access. It uses a simple histogram-based approach to parameterization: overhead/latency/bandwidth numbers are given relative to different access sizes. To gather such parameters, well-known I/O benchmarks such as fio (http://git.kernel.dk/?p=fio.git;a=summary) can be used.

The LP name used in configuration is "lsm" and the configuration is expected to be in a similarly named standalone group, an example of which is shown below:

    lsm
    {
        # in bytes
        request_sizes = ( "4096","8192","16384","32768","65536","131072","262144","524288","1048576","2097152","4194304" );
        # in MiB/s (2^20 bytes / s)
        write_rates = ( "1511.7","1511.7","1511.7","1511.7","1511.7","1511.7","1511.7","1511.7","1511.7","1511.7","1511.7" );
        read_rates = ( "1542.1","1542.1","1542.1","1542.1","1542.1","1542.1","1542.1","1542.1","1542.1","1542.1","1542.1" );
        # in microseconds
        write_seeks = ( "499.5","509.0","514.7","525.9","546.4","588.3","663.1","621.8","539.1","3179.5","6108.8" );
        read_seeks = ( "3475.6","3470.0","3486.2","3531.2","3608.6","3741.0","3988.9","4530.2","5644.2","7922.0","11700.3" );
        write_overheads = ( "29.67","29.67","29.67","29.67","29.67","29.67","29.67","29.67","29.67","29.67","29.67" );
        read_overheads = ( "23.67","23.67","23.67","23.67","23.67","23.67","23.67","23.67","23.67","23.67","23.67" );
    }

The API can be found at codes/local-storage-model.h and example usage can be seen in tests/local-storage-model-test.c and tests/conf/lsm-test.conf.

The queueing policy of LSM is currently FIFO, and the default mode uses an implicit queue, simply incrementing counters and scheduling future events as I/O requests come in. Additionally, an explicit queue has been added that provides a simple FIFO+priority mechanism. To use it, set "enable_scheduler" to the value "1" in the "lsm" group, as shown below.
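For example, extending the "lsm" group above (the rate, seek, and overhead entries are elided here for brevity):

    lsm
    {
        # request_sizes, rates, seeks, and overheads as in the example above
        ...
        # use the explicit FIFO+priority scheduler instead of the implicit queue
        enable_scheduler = "1";
    }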
== Resource model

The resource model presents a simple integer counter representing some finite resource (e.g., bytes of memory available). LPs request some number of units of the resource, receiving a success/failure completion message via a callback mechanism. Optional "blocking" can be used to defer the completion message until the request is successfully completed. The configuration LP name is "resource" and the parameters are given in a similarly named group. An example is shown below:

    resource
    {
        available="8192";
    }

The API for the underlying resource data structure can be found in codes/resource.h. The user-facing API for communicating with the LP can be found in codes/resource-lp.h.

= model-net overview

model-net is a set of networking APIs and models intended to provide transparent message passing support between LPs, allowing the underlying network LPs to do the work of routing while user LPs model their applications. It consists of a number of both simple and complex network models, as well as configuration utilities and communication APIs. A somewhat stale overview is also given at src/networks/model-net/doc/README.

= Components of model-net

== messaging API

model-net provides a few messaging routines for use by LPs. Each specifies a message payload and an event to execute upon message receipt. They allow for:
* push-style messaging, e.g., MPI_Send with an asynchronous MPI_Recv at the receiver.
* pull-style messaging (e.g., RMA): LP x issues a pull message event, which reaches LP y. Upon receipt, LP y sends the specified message payload back to LP x. LP x executes the event it specified once the message payload is fully received. Note that, at the modeler level, the pull message event is not seen by the destination LP (LP y in this example).
* (experimental) initial collective communication support. This is currently being evaluated - more documentation on this feature will follow.

The messaging types, with the exception of the collective routines, support specialization by LP annotation, communicating only through model-net LPs containing the given annotation. Additionally, there is initial support for different message scheduling algorithms - see codes/model-net-sched.h and the following section. In particular, message-specific parameters can be set via the model_net_set_msg_param function in codes/model-net.h, which is currently only used with the "priority" scheduler.

The messaging API is given by codes/model-net.h, through the model_net_event family of functions. Usage can be found in the example program of the codes-base repository; a rough sketch of a single send follows.
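The fragment below sketches a minimal push-style send. It is illustrative rather than code from the repository: the svr_msg struct, event type values, helper function, and the 128-byte payload size are made up for this sketch, and net_id is assumed to have been obtained at setup time (typically via model_net_configure; see codes/model-net.h, which declares the model_net_event prototype used here).

    #include <codes/model-net.h>

    /* hypothetical message type carried to the destination and back to the sender */
    typedef struct svr_msg { int event_type; tw_lpid src; } svr_msg;

    static void send_req(int net_id, tw_lpid dest_lp, tw_lp *sender)
    {
        svr_msg remote;                    /* event handed to the destination LP */
        svr_msg local;                     /* event delivered back to the sender */
        remote.event_type = 1 /* REQ   */; remote.src = sender->gid;
        local.event_type  = 4 /* LOCAL */; local.src  = sender->gid;

        /* queue a 128-byte payload on network net_id; model-net packetizes and
         * routes it, delivering "remote" at dest_lp and "local" at the sender
         * once the message has left the sending NIC */
        model_net_event(net_id, "example", dest_lp, 128, 0.0,
                        sizeof(remote), &remote, sizeof(local), &local, sender);
    }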
== LP configuration / mapping

model-net LPs, like any other LP in CODES/ROSS, require specification in the configuration file as well as LP-specific parameters. To identify model-net LPs to CODES, we use the naming scheme of prefixing model-net LP names with the string "modelnet_". For example, the "simplenet" LP (see "model-net models") would be specified in the LPGROUPS section as "modelnet_simplenet". Currently, model-net LPs expect their configuration values in the PARAMS section. This is due to historical precedent and may be changed in the future to use a dedicated section.

The parameters are a combination of general model-net parameters and model-specific parameters. The general parameters are:
* modelnet_order - this is the only required option. model-net assigns integer IDs to the different types of networks used (independent of annotation) so that the user can differentiate between different network topologies without knowing the particular models used. All models used in the CODES configuration file must be referenced in this parameter. See doc/example_heterogeneous in the codes-base repository.
* packet_size - the size of packets in bytes. model-net will break messages into packets of this size.
* modelnet_scheduler - the algorithm for scheduling packets when multiple concurrent messages are being processed. Options can be found in the SCHEDULER_TYPES macro in codes/model-net-sched.h. In particular, the "priority" scheduler requires two additional parameters in PARAMS: "prio-sched-num-prios" and "prio-sched-sub-sched", the former of which sets the number of priorities to use and the latter of which sets the scheduler used for messages of equal priority.

== Statistics tracking

model-net tracks basic statistics for each LP, such as bytes sent/received and time spent sending/receiving messages. If LP-IO has been enabled in the program, they will be printed to the specified directory.

= model-net models

Currently, model-net contains a combination of analytical models and specific models for high-performance networking, with an HPC bent. Configuration files for each model can be found in tests/conf, under the name "model-net-test*.conf".

== Simplenet/SimpleP2P

The Simplenet and SimpleP2P models (model-net LP names "simplenet" and "simplep2p") are basic queued point-to-point latency/bandwidth models, assuming infinite packet buffering when routing. They are best used for models that require little fidelity from the network performance. SimpleP2P is the same model as Simplenet, except that it provides heterogeneous link capacities from point to point: rather than a single entry, it requires files containing matrices of point-to-point bandwidths and latencies.

Simplenet models require two configuration parameters: "net_startup_ns" and "net_bw_mbps", which define the startup and bandwidth costs in nanoseconds and megabytes per second, respectively. SimpleP2P requires two configuration files for the latency and bandwidth costs: "net_latency_ns_file" and "net_bw_mbps_file". More details about the models can be found at src/networks/model-net/doc/README.simplenet.txt and src/networks/model-net/doc/README.simplep2p.txt, respectively. A sketch of a simplenet PARAMS section is shown below.
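The numeric values and the scheduler choice in this sketch are illustrative only; valid scheduler names are those listed in the SCHEDULER_TYPES macro in codes/model-net-sched.h ("fcfs" is assumed here).

    PARAMS
    {
        # ROSS/CODES engine option (see the Configuration section)
        message_size="256";
        # general model-net parameters
        packet_size="512";
        modelnet_order=( "simplenet" );
        modelnet_scheduler="fcfs";
        # simplenet-specific parameters: startup cost (ns) and bandwidth (MB/s)
        net_startup_ns="1.5";
        net_bw_mbps="20000";
    }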
== LogGP

The LogGP model (model-net LP name: "loggp"), based on the LogP model, is a more advanced analytical model containing point-to-point latency ("L"), CPU overhead ("o"), a minimum time between consecutive messages ("gap", or "g"), and a time-per-byte-sent ("G") independent of the gap. The only configuration entry the LogGP model requires is "PARAMS:net_config_file", which points to a file with a path *relative* to the configuration file. For more details on gathering parameters for the LogGP model, as well as its usage and caveats, see the document src/model-net/doc/README.loggp.txt.

== Torus

The torus model (model-net LP name: "torus") represents a torus topology with arbitrary dimension, used primarily in HPC systems. It is based on the work performed in:

M. Mubarak, C. D. Carothers, R. B. Ross, P. Carns. "A case study in using massively parallel simulation for extreme-scale torus network co-design", ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS), 2014.

The configuration and model setup can be found at src/model-net/doc/README.torus.txt.

== Dragonfly

The dragonfly model (model-net LP name: "dragonfly") is a network topology that utilizes the concept of virtual routers to produce systems with a very high virtual radix out of network components with a lower radix. The topology itself and the simulation model are both described in src/networks/model-net/doc/README.dragonfly.txt.

The configuration parameters are a little trickier here, as additional LPs beyond the "modelnet_dragonfly" LP must be specified. "modelnet_dragonfly" LPs represent the node LPs (terminals), while a second type, "dragonfly_router", represents a physical router. At least one "dragonfly_router" LP must be present in every LP group with a "modelnet_dragonfly" LP. Further configuration and model setup details can be found at src/model-net/doc/README.dragonfly.txt.

A newer version of the dragonfly model that supports a Cray-style network is also available. Instructions can be found at src/model-net/doc/README.dragonfly-custom.txt.

= CODES example model

An example model representing most of the functionality present in CODES is available in doc/example. In this scenario, we have a certain number of storage servers, identified through indices 0, ... , n-1, where each server has a network interface card (NIC) associated with it. The servers exchange messages with their neighboring server via their NIC (i.e., server i pings server i+1, rolling over the index if necessary). When the neighboring server receives the message, it sends an acknowledgement message to the sending server in response. Upon receiving the acknowledgement, the sending server issues another message. This process continues until some number of messages have been sent. For simplicity, it is assumed that each server has a direct link to its neighbor, and that no network congestion occurs due to concurrent messages being sent.

The model is relatively simple to simulate using ROSS. There are two distinct LP types in the simulation: the server and the NIC. Refer to example.c for data structure definitions. The server LPs are in charge of issuing/acknowledging the messages, while the NIC LPs (implemented via CODES's model-net component, available in the codes-net repository) transmit the data and inform their corresponding servers upon completion. This LP decomposition strategy is generally preferred for ROSS-based simulations: have single-purpose, simple LPs representing logical system components.

In this program, CODES is used in the following four ways: to provide configuration utilities for the program (example.conf), to logically separate and provide lookup functionality for multiple LP types, to automate LP placement on KPs/PEs, and to simplify/modularize the underlying network structure. The configuration API is used for the first use case, the mapping API for the second and third use cases, and the model-net API for the fourth use case. The following sections discuss these while covering necessary ROSS-specific information.
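Before looking at the individual pieces, it helps to see the overall shape of the driver. The sketch below follows the usual CODES program structure (configuration load, LP type registration, mapping setup, model-net configuration); it is a condensed, illustrative outline rather than a copy of example.c, and it assumes the configuration file path is passed in argv[2] and that svr_add_lp_type() is a helper that registers the server LP type with ROSS. "config" is the global handle declared in codes/configuration.h.

    #include <stdlib.h>
    #include <mpi.h>
    #include <ross.h>
    #include <codes/configuration.h>
    #include <codes/codes_mapping.h>
    #include <codes/model-net.h>

    extern void svr_add_lp_type(void);  /* defined alongside the server LP code */
    static int net_id;                  /* network ID used when calling model_net_event */

    int main(int argc, char *argv[])
    {
        int num_nets, *net_ids;

        tw_init(&argc, &argv);                      /* ROSS (and MPI) startup          */
        configuration_load(argv[2], MPI_COMM_WORLD, /* parse the .conf file into the   */
                           &config);                /* global CODES config handle      */

        model_net_register();                       /* register model-net LP types     */
        svr_add_lp_type();                          /* register the server LP type     */
        codes_mapping_setup();                      /* place LPs according to LPGROUPS */

        net_ids = model_net_configure(&num_nets);   /* read PARAMS; returns network    */
        net_id = *net_ids;                          /* IDs in modelnet_order order     */
        free(net_ids);

        tw_run();                                   /* run the simulation              */
        tw_end();
        return 0;
    }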
== Configuration and mapping

In the example program, there is one server LP and one "modelnet_simplenet" LP type in a group, and this combination is repeated 16 times (repetitions="16") for a total of 32 LPs. The section "server_pings" is server-LP-specific and defines the number of rounds of communication and the payload for each round. We use the simple-net LP provided by model-net as the underlying network model. The simple-net parameters are specified by the user in the PARAMS section of the example.conf configuration file.

== Server state and event handlers

The server LP state maintains a count of the number of remote messages it has sent and received, as well as the number of local completion messages. The server event message has four message types: KICKOFF, REQ, ACK, and LOCAL. With a KICKOFF event, each LP sends a message to itself to begin the simulation proper; to avoid event ties, we add a small amount of random noise using codes_local_latency. A REQ message is sent by a server to its neighboring server, and when it is received, the neighboring server sends back a message of type ACK. We show both a hard-coded direct communication method, which directly computes the destination LP ID, and a codes-mapping-API-based method.

== Server reverse computation

ROSS has the capability for optimistic parallel simulation, but instead of saving the state of each LP, it requires users to perform reverse computation. That is, while the event messages themselves are preserved (until the Global Virtual Time (GVT) algorithm renders them unneeded), the LP state is not preserved. Hence, it is up to the simulation developer to provide functionality to reverse the LP state, given the event to be reversed. ROSS makes this simpler in that events will always be rolled back in exactly the reverse of the order in which they were applied. Note that ROSS also has both serial and parallel conservative modes, so reverse computation may not be necessary if the simulation is not compute- or memory-intensive.

For our example program, recall the "forward" event handlers. They perform the following:
* Kickoff: send a message to the peer server and increment the sender LP's count of sent messages.
* Request (received from a peer server): increment the receiver's count of received messages and send an acknowledgement to the sender.
* Acknowledgement (received from the message receiver): if more messages remain to be sent, send the next message to the receiver and increment the messages-sent count; set a flag in the event indicating whether a message was sent.
* Local model-net callback: increment the local count of model-net completion messages.

In terms of LP state, the four operations simply modify counts. Hence, the "reverse" event handlers merely roll back those changes:
* Kickoff: decrement the sender LP's count of sent messages.
* Request (received from a peer server): decrement the receiver's count of received messages.
* Acknowledgement (received from the message receiver): decrement the messages-sent count, but only if the flag indicates a message was sent in the forward handler.
* Local model-net callback: decrement the local count of model-net completion messages.

For more complex LP states (such as maintained queues), reverse event processing becomes correspondingly more complex. Refer to the best practices document for strategies for coping with the increase in complexity. A sketch of a forward/reverse handler pair in this style is shown below.
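The fragment below sketches the acknowledgement handler pair described above. It is illustrative rather than a copy of example.c: the state/message structs and their field names are assumptions, net_id, num_reqs, and payload_sz are assumed globals set up from the configuration, send_req is the helper sketched in the messaging API section, and model_net_event_rc is the reverse-computation counterpart of model_net_event declared in codes/model-net.h.

    /* illustrative state and message types */
    typedef struct { int msg_sent_count; int msg_recvd_count; } svr_state;
    typedef struct { int event_type; int incremented_flag; tw_lpid src; } svr_msg;

    /* forward handler for an ACK: send the next request if any remain */
    static void handle_ack(svr_state *ns, svr_msg *m, tw_lp *lp)
    {
        if (ns->msg_sent_count < num_reqs) {
            send_req(net_id, m->src, lp);  /* wraps model_net_event, as sketched earlier */
            ns->msg_sent_count++;
            m->incremented_flag = 1;       /* record the decision in the event itself */
        } else {
            m->incremented_flag = 0;
        }
    }

    /* reverse handler for an ACK: undo exactly what the forward handler did */
    static void handle_ack_rev(svr_state *ns, svr_msg *m, tw_lp *lp)
    {
        if (m->incremented_flag) {
            model_net_event_rc(net_id, lp, payload_sz);  /* roll back the network send */
            ns->msg_sent_count--;
        }
    }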