Commit 75a2ddb9 authored by Xin Wang

update with master branch to fix ROSS instrumentation issue

parents b4aa6ce8 84bd42a0
@@ -13,6 +13,8 @@ EXTRA_PROGRAMS =
CLEANFILES = $(bin_SCRIPTS)
EXTRA_DIST =
BUILT_SOURCES =
AM_LDFLAGS =
# pkgconfig files
pkgconfigdir = $(libdir)/pkgconfig
@@ -47,7 +49,7 @@ endif
if USE_DARSHAN
AM_CPPFLAGS += ${DARSHAN_CFLAGS} -DUSE_DARSHAN=1
src_libcodes_la_SOURCES += src/workload/methods/codes-darshan-io-wrkld.c
src_libcodes_la_SOURCES += src/workload/methods/codes-darshan3-io-wrkld.c
LDADD += ${DARSHAN_LIBS}
TESTS += tests/workload/darshan-dump.sh
endif
@@ -91,3 +93,7 @@ endif
LDADD += ${DUMPI_LIBS}
endif
if USE_RDAMARIS
AM_CPPFLAGS += ${ROSS_Damaris_CFLAGS} -DUSE_RDAMARIS=1
LDADD += ${ROSS_Damaris_LIBS}
endif
## README for using ROSS instrumentation with CODES
For details about the ROSS instrumentation, see the [ROSS Instrumentation blog post](http://carothersc.github.io/ROSS/feature/instrumentation.html)
For details about the ROSS instrumentation, see the [ROSS Instrumentation blog post](http://carothersc.github.io/ROSS/instrumentation/instrumentation.html)
on the ROSS webpage.
There are currently 3 types of instrumentation: GVT-based, real time, and event tracing. See the ROSS documentation for more info on
the specific options or use `--help` with your model. To collect data about the simulation engine, no changes are needed to model code
for any of the instrumentation modes. Some additions to the model code is needed in order to turn on any model-level data collection.
See the "Model-level data sampling" section on [ROSS Instrumentation blog post](http://carothersc.github.io/ROSS/feature/instrumentation.html).
There are currently 4 types of instrumentation: GVT-based, real time sampling, virtual time sampling, and event tracing.
See the ROSS documentation for more info on the specific options or use `--help` with your model.
To collect data about the simulation engine, no changes are needed to model code for any of the instrumentation modes.
Some additions to the model code are needed in order to turn on any model-level data collection.
See the "Model-level data sampling" section on [ROSS Instrumentation blog post](http://carothersc.github.io/ROSS/instrumentation/instrumentation.html).
Here we describe CODES-specific details.
### Register Instrumentation Callback Functions
@@ -17,15 +18,11 @@ The examples here are based on the dragonfly router and terminal LPs (`src/netwo
As described in the ROSS Vis documentation, we need to create a `st_model_types` struct with the pointer and size information.
```C
st_model_types dragonfly_model_types[] = {
{(rbev_trace_f) dragonfly_event_collect,
sizeof(int),
(ev_trace_f) dragonfly_event_collect,
{(ev_trace_f) dragonfly_event_collect,
sizeof(int),
(model_stat_f) dragonfly_model_stat_collect,
sizeof(tw_lpid) + sizeof(long) * 2 + sizeof(double) + sizeof(tw_stime) * 2},
{(rbev_trace_f) dragonfly_event_collect,
sizeof(int),
(ev_trace_f) dragonfly_event_collect,
{(ev_trace_f) dragonfly_event_collect,
sizeof(int),
(model_stat_f) dfly_router_model_stat_collect,
0}, // updated in router_setup()
@@ -33,20 +30,17 @@ st_model_types dragonfly_model_types[] = {
}
```
`dragonfly_model_types[0]` holds the function pointers for the terminal LP and `dragonfly_model_types[1]` holds those for the router LP.
For the first two function pointers for each LP, we use the same `dragonfly_event_collec()` because right now we just collect the event type, so
it's the same for both of these LPs. You can change these if you want to use different functions for different LP types or if you want a different
function for the full event tracing than that used for the rollback event trace (`rbev_trace_f` is for the event tracing of rollback triggering events only,
while `ev_trace_f` is for the full event tracing).
The number following each function pointer is the size of the data that will be saved when the function is called.
The third pointer is for the data to be sampled at the GVT or real time sampling points.
For the first function pointer for each LP type, we use the same `dragonfly_event_collect()` because right now we just collect the event type, so it's the same for both of these LP types.
You can change these if you want to use different functions for different LP types.
The number following that function pointer is the size of the data that will be saved when the function is called.
The second pointer is for the data to be sampled at the GVT or real time sampling points.
In this case the LPs have different function pointers since we want to collect different types of data for the two LP types.
For the terminal, I set the appropriate size of the data to be collected, but for the router, the size of the data is dependent on the radix for the
dragonfly configuration being used, which isn't known until runtime.
For the terminal, I set the appropriate size of the data to be collected, but for the router, the size of the data is dependent on the radix for the dragonfly configuration being used, which isn't known until runtime.
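The runtime sizing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `sketch_model_types`, `router_sample_size()`, `sketch_router_setup()`, and the per-port layout (one counter and one time value per port) are hypothetical stand-ins, not the actual ROSS or CODES definitions, and `uint64_t`/`double` are assumed in place of `tw_lpid`/`tw_stime`.

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one st_model_types entry: only the size
 * field that must be patched at runtime is modeled here. */
typedef struct {
    size_t mstat_sz; /* bytes saved per model-level sample */
} sketch_model_types;

/* Index 0 is the terminal (size known at compile time), index 1 is the
 * router (size left at 0 until setup). */
static sketch_model_types sketch_types[2] = {
    { sizeof(uint64_t) + sizeof(long) * 2 + sizeof(double) + sizeof(double) * 2 },
    { 0 } /* updated in sketch_router_setup() */
};

/* Assumed layout: one LP id followed by a counter and a time value for
 * every router port. */
static size_t router_sample_size(int radix)
{
    assert(radix > 0);
    return sizeof(uint64_t) + (sizeof(long) + sizeof(double)) * (size_t) radix;
}

/* Meant to be called from the router's setup code, once the
 * configuration (and therefore the radix) has been read. */
static void sketch_router_setup(int radix)
{
    sketch_types[1].mstat_sz = router_sample_size(radix);
}
```

The point of the sketch is the ordering: the size has to be written into the table before the instrumentation layer reads it, which is why the update lives in the setup routine rather than in the static initializer.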
*Note*: You can only reuse the function for event tracing for LPs that use the same type of message struct.
For example, the dragonfly terminal and router LPs both use the `terminal_message` struct, so they can
use the same functions for event tracing. However the model net base LP uses the `model_net_wrap_msg` struct, so it gets its own event collection function and
`st_trace_type` struct, in order to read the event type correctly from the model.
use the same functions for event tracing.
However, the model net base LP uses the `model_net_wrap_msg` struct, so it gets its own event collection function and `st_trace_type` struct, in order to read the event type correctly from the model.
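To make the note concrete, here is a minimal sketch of two event-type collection callbacks. Everything in it is hypothetical: `sketch_terminal_message`, `sketch_wrap_msg`, and both collect functions are simplified stand-ins, and the real callback signature is defined by ROSS and is not reproduced here.

```C
#include <string.h>

/* Hypothetical message layouts: the event type lives in a different
 * field of each struct, so one collect function cannot read both. */
typedef struct {
    double travel_start_time;
    short  type;        /* event type of a terminal/router message */
} sketch_terminal_message;

typedef struct {
    int magic;
    int event_type;     /* event type of the wrapper message */
} sketch_wrap_msg;

/* Copies the event type into the trace buffer as an int; the size
 * registered for this callback would then be sizeof(int). */
static void sketch_terminal_event_collect(const sketch_terminal_message *m,
                                          char *buffer)
{
    int type = (int) m->type;
    memcpy(buffer, &type, sizeof(type));
}

/* Same idea for the wrapper message, reading its own type field. */
static void sketch_wrap_event_collect(const sketch_wrap_msg *m, char *buffer)
{
    int type = m->event_type;
    memcpy(buffer, &type, sizeof(type));
}
```

Both functions write the same `sizeof(int)` payload, but each one has to know the layout of the message it is handed, which is why LP types can share a callback only when they share a message struct.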
In the ROSS instrumentation documentation, there are two methods provided for letting ROSS know about these `st_model_types` structs.
In CODES, this step is a little different, as `codes_mapping_setup()` calls `tw_lp_settype()`.
@@ -106,9 +100,7 @@ Using the synthetic workload LP for dragonfly as an example (`src/network-worklo
In the main function, you call the register function *before* calling `codes_mapping_setup()`.
```C
st_model_types svr_model_types[] = {
{(rbev_trace_f) svr_event_collect,
sizeof(int),
(ev_trace_f) svr_event_collect,
{(ev_trace_f) svr_event_collect,
sizeof(int),
(model_stat_f) svr_model_stat_collect,
0}, // at the moment, we're not actually collecting any data about this LP
@@ -143,10 +135,13 @@ modes are collecting model-level data as well.
### CODES LPs that currently have event type collection implemented:
If you're using any of the following CODES models, you don't have to add anything, unless you want to change the data that's being collected.
- nw-lp (model-net-mpi-replay.c)
- original dragonfly router and terminal LPs (dragonfly.c)
- dfly server LP (model-net-synthetic.c)
- model-net-base-lp (model-net-lp.c)
- custom dfly server LP (model-net-synthetic-custom-dfly.c)
- fat tree server LP (model-net-synthetic-fattree.c)
- slimfly server LP (model-net-synthetic-slimfly.c)
- original dragonfly router and terminal LPs (dragonfly.c)
- dragonfly custom router and terminal LPs (dragonfly-custom.C)
- slimfly router and terminal LPs (slimfly.c)
- fat tree switch and terminal LPs (fat-tree.c)
- model-net-base-lp (model-net-lp.c)
The fat-tree terminal and switch LPs (fattree.c) are only partially implemented at the moment. A full implementation needs two `model_net_method` structs,
but currently both terminal and switch LPs use the same `fattree_method` struct.
@@ -61,7 +61,7 @@ struct iolang_params
struct darshan_params
{
char log_file_path[MAX_NAME_LENGTH_WKLD];
int64_t aggregator_cnt;
int app_cnt;
};
struct recorder_params
@@ -155,7 +155,24 @@ enum codes_workload_op_type
/* for workloads that have events not yet handled
* (eg the workload language) */
CODES_WK_IGNORE
CODES_WK_IGNORE,
/* extended IO workload operations: MPI */
/* open */
CODES_WK_MPI_OPEN,
/* close */
CODES_WK_MPI_CLOSE,
/* write */
CODES_WK_MPI_WRITE,
/* read */
CODES_WK_MPI_READ,
/* collective open */
CODES_WK_MPI_COLL_OPEN,
/* collective_write */
CODES_WK_MPI_COLL_WRITE,
/* collective_read */
CODES_WK_MPI_COLL_READ,
};
/* I/O operation parameters */
@@ -166,7 +183,7 @@ struct codes_workload_op
*/
/* what type of operation this is */
int op_type;
enum codes_workload_op_type op_type;
/* currently only used by network workloads */
double start_time;
double end_time;
@@ -329,6 +346,11 @@ void codes_workload_print_op(
int app_id,
int rank);
int codes_workload_get_time(const char *type,
const char * params,
int app_id,
int rank, double *read_time, double *write_time, int64_t *read_bytes, int64_t *written_bytes);
/* implementation structure */
struct codes_workload_method
{
@@ -341,6 +363,8 @@ struct codes_workload_method
void (*codes_workload_get_next_rc2)(int app_id, int rank);
int (*codes_workload_get_rank_cnt)(const char* params, int app_id);
int (*codes_workload_finalize)(const char* params, int app_id, int rank);
/* added to get the total read and write times and byte counts */
int (*codes_workload_get_time)(const char * params, int app_id, int rank, double *read_time, double *write_time, int64_t *read_bytes, int64_t *written_bytes);
};
......
@@ -97,6 +97,8 @@ static inline void codes_local_latency_reverse(tw_lp *lp)
return;
}
void codes_comm_update();
#ifdef __cplusplus
}
#endif
......
#ifndef CONNECTION_MANAGER_H
#define CONNECTION_MANAGER_H
/**
* connection-manager.h -- Simple, Readable, Connection management interface
* Neil McGlohon
*
* Copyright (c) 2018 Rensselaer Polytechnic Institute
*/
#include <map>
#include <vector>
#include <set>
#include <cassert> // assert() is used in the implementation below
#include "codes/codes.h"
#include "codes/model-net.h"
using namespace std;
/**
* @brief Enum differentiating local router connection types from global.
* Local connections will have router IDs ranging from [0,num_router_per_group)
* whereas global connections will have router IDs ranging from [0,total_routers)
*/
enum ConnectionType
{
CONN_LOCAL = 1,
CONN_GLOBAL = 2,
CONN_TERMINAL = 3
};
/**
* @brief Struct for connection information.
*/
struct Connection
{
int port; //port ID of the connection
int src_lid; //local id of the source
int src_gid; //global id of the source
int src_group_id; //group id of the source
int dest_lid; //local id of the destination
int dest_gid; //global id of the destination
int dest_group_id; //group id of the destination
ConnectionType conn_type; //type of the connection: CONN_LOCAL, CONN_GLOBAL, or CONN_TERMINAL
};
inline bool operator<(const Connection& lhs, const Connection& rhs)
{
return lhs.port < rhs.port;
}
/**
* @class ConnectionManager
*
* @brief
* This class is meant to make organization of the connections between routers more
* streamlined. It provides a simple, readable interface which helps reduce
* semantic errors during development.
*
* @note
* This class was designed with dragonfly type topologies in mind. Certain parts may not
* make sense for other types of topologies; they might work fine, but there are no guarantees.
*
* @note
* The property intermediateRouterToGroupMap and its related methods are implemented, but the
* logistics of getting this information from the input file are more complicated than it's worth, so I have commented
* them out.
*
* @note
* This class assumes that each router group has the same number of routers in it: _num_routers_per_group.
*/
class ConnectionManager {
map< int, vector< Connection > > intraGroupConnections; //direct connections within a group - IDs are group local - maps local id to list of connections to it
map< int, vector< Connection > > globalConnections; //direct connections between routers not in same group - IDs are global router IDs - maps global id to list of connections to it
map< int, vector< Connection > > terminalConnections; //direct connections between this router and its compute node terminals - maps terminal id to connections to it
map< int, Connection > _portMap; //Mapper for ports to connections
vector< int > _other_groups_i_connect_to;
set< int > _other_groups_i_connect_to_set;
map< int, vector< Connection > > _connections_to_groups_map; //maps group ID to connections to said group
map< int, vector< Connection > > _all_conns_by_type_map;
// map< int, vector< Connection > > intermediateRouterToGroupMap; //maps group id to list of routers that connect to it.
// //ex: intermediateRouterToGroupMap[3] returns a vector
// //of connections from this router to routers that have
// //direct connections to group 3
int _source_id_local; //local id (within group) of owner of this connection manager
int _source_id_global; //global id (not lp gid) of owner of this connection manager
int _source_group; //group id of the owner of this connection manager
int _used_intra_ports; //number of used ports for intra connections
int _used_inter_ports; //number of used ports for inter connections
int _used_terminal_ports; //number of used ports for terminal connections
int _max_intra_ports; //maximum number of ports for intra connections
int _max_inter_ports; //maximum number of ports for inter connections
int _max_terminal_ports; //maximum number of ports for terminal connections.
int _num_routers_per_group; //number of routers per group - used for turning global ID into local and back
public:
ConnectionManager(int src_id_local, int src_id_global, int src_group, int max_intra, int max_inter, int max_term, int num_router_per_group);
/**
* @brief Adds a connection to the manager
* @param dest_gid the global ID of the destination router
* @param type the type of the connection, CONN_LOCAL, CONN_GLOBAL, or CONN_TERMINAL
*/
void add_connection(int dest_gid, ConnectionType type);
// /**
// * @brief adds knowledge of what next hop routers have connections to specific groups
// * @param local_intm_id the local intra group id of the router that has the connection to dest_group_id
// * @param dest_group_id the id of the group that the connection goes to
// */
// void add_route_to_group(int local_intm_id, int dest_group_id);
// /**
// * @brief returns a vector of connections to routers that have direct connections to the specified group id
// * @param dest_group_id the id of the destination group that all connections returned have a direct connection to
// */
// vector< Connection > get_intm_conns_to_group(int dest_group_id);
// /**
// * @brief returns a vector of local router ids that have direct connections to the specified group id
// * @param dest_group_id the id of the destination group that all routers returned have a direct connection to
// * @note if a router has multiple intra group connections to a single router and that router has a connection
// * to the dest group then that router will appear multiple times in the returned vector.
// */
// vector< int > get_intm_routers_to_group(int dest_group_id)
/**
* @brief get the source ID of the owner of the manager
* @param type the type of the connection, CONN_LOCAL, CONN_GLOBAL, or CONN_TERMINAL
*/
int get_source_id(ConnectionType type);
/**
* @brief get the port(s) associated with a specific destination ID
* @param dest_id the ID (local or global depending on type) of the destination
* @param type the type of the connection, CONN_LOCAL, CONN_GLOBAL, or CONN_TERMINAL
*/
vector<int> get_ports(int dest_id, ConnectionType type);
/**
* @brief get the connection associated with a specific port number
* @param port the enumeration of the port in question
*/
Connection get_connection_on_port(int port);
/**
* @brief returns true if a connection exists in the manager from the source to the specified destination ID BY TYPE
* @param dest_id the ID of the destination depending on the type
* @param type the type of the connection, CONN_LOCAL, CONN_GLOBAL, or CONN_TERMINAL
* @note Will not return true if dest_id is within own group and type is CONN_GLOBAL, see is_any_connection_to()
*/
bool is_connected_to_by_type(int dest_id, ConnectionType type);
/**
* @brief returns true if any connection exists in the manager from the source to the specified global destination ID
* @param dest_global_id the global id of the destination
* @note This is meant to allow for a developer to determine connectivity just from the global ID, even if the two entities
* are connected by a local or terminal connection.
*/
bool is_any_connection_to(int dest_global_id);
/**
* @brief returns the total number of used ports by the owner of the manager
*/
int get_total_used_ports();
/**
* @brief returns the number of used ports for a specific connection type
* @param type the type of the connection, CONN_LOCAL, CONN_GLOBAL, or CONN_TERMINAL
*/
int get_used_ports_for(ConnectionType type);
/**
* @brief returns the type of connection associated with said port
* @param port_num the number of the port in question
*/
ConnectionType get_port_type(int port_num);
/**
* @brief returns a vector of connections to the destination ID based on the connection type
* @param dest_id the ID of the destination depending on the type
* @param type the type of the connection, CONN_LOCAL, CONN_GLOBAL, or CONN_TERMINAL
*/
vector< Connection > get_connections_to_gid(int dest_id, ConnectionType type);
/**
* @brief returns a vector of connections to the destination group. connections will be of type CONN_GLOBAL
* @param dest_group_id the id of the destination group
*/
vector< Connection > get_connections_to_group(int dest_group_id);
/**
* @brief returns a vector of all connections to routers via type specified.
* @param type the type of the connection, CONN_LOCAL, CONN_GLOBAL, or CONN_TERMINAL
* @note this will return connections to same destination on different ports as individual connections
*/
vector< Connection > get_connections_by_type(ConnectionType type);
/**
* @brief returns a vector of all group IDs that the router has a global connection to
* @note this does not include the router's own group as that is a given
*/
vector< int > get_connected_group_ids();
/**
*
*/
void solidify_connections();
/**
* @brief prints out the state of the connection manager
*/
void print_connections();
};
//******************* BEGIN IMPLEMENTATION ********************************************************
//******************* Connection Manager Implementation *******************************************
ConnectionManager::ConnectionManager(int src_id_local, int src_id_global, int src_group, int max_intra, int max_inter, int max_term, int num_router_per_group)
{
_source_id_local = src_id_local;
_source_id_global = src_id_global;
_source_group = src_group;
_used_intra_ports = 0;
_used_inter_ports = 0;
_used_terminal_ports = 0;
_max_intra_ports = max_intra;
_max_inter_ports = max_inter;
_max_terminal_ports = max_term;
_num_routers_per_group = num_router_per_group;
}
void ConnectionManager::add_connection(int dest_gid, ConnectionType type)
{
Connection conn;
conn.src_lid = _source_id_local;
conn.src_gid = _source_id_global;
conn.src_group_id = _source_group;
conn.conn_type = type;
conn.dest_lid = dest_gid % _num_routers_per_group;
conn.dest_gid = dest_gid;
conn.dest_group_id = dest_gid / _num_routers_per_group;
switch (type)
{
case CONN_LOCAL:
conn.port = this->get_used_ports_for(CONN_LOCAL);
intraGroupConnections[conn.dest_lid].push_back(conn);
_used_intra_ports++;
break;
case CONN_GLOBAL:
conn.port = _max_intra_ports + this->get_used_ports_for(CONN_GLOBAL);
globalConnections[conn.dest_gid].push_back(conn);
_used_inter_ports++;
break;
case CONN_TERMINAL:
conn.port = _max_intra_ports + _max_inter_ports + this->get_used_ports_for(CONN_TERMINAL);
conn.dest_group_id = _source_group;
terminalConnections[conn.dest_gid].push_back(conn);
_used_terminal_ports++;
break;
default:
assert(false);
// TW_ERROR(TW_LOC, "add_connection(dest_id, type): Undefined connection type\n");
}
if(conn.dest_group_id != conn.src_group_id)
_other_groups_i_connect_to_set.insert(conn.dest_group_id);
_portMap[conn.port] = conn;
}
// void ConnectionManager::add_route_to_group(Connection conn, int dest_group_id)
// {
// intermediateRouterToGroupMap[dest_group_id].push_back(conn);
// }
// vector< Connection > ConnectionManager::get_intm_conns_to_group(int dest_group_id)
// {
// return intermediateRouterToGroupMap[dest_group_id];
// }
// vector< int > ConnectionManager::get_intm_routers_to_group(int dest_group_id)
// {
// vector< Connection > intm_router_conns = get_intm_conns_to_group(dest_group_id);
// vector< int > loc_intm_router_ids;
// vector< Connection >::iterator it;
// for(it = intm_router_conns.begin(); it != intm_router_conns.end(); it++)
// {
// loc_intm_router_ids.push_back((*it).other_id);
// }
// return loc_intm_router_ids;
// }
int ConnectionManager::get_source_id(ConnectionType type)
{
switch (type)
{
case CONN_LOCAL:
return _source_id_local;
case CONN_GLOBAL:
return _source_id_global;
default:
assert(false);
// TW_ERROR(TW_LOC, "get_source_id(type): Unsupported connection type\n");
}
}
vector<int> ConnectionManager::get_ports(int dest_id, ConnectionType type)
{
vector< Connection > conns = this->get_connections_to_gid(dest_id, type);
vector< int > ports_used;
vector< Connection >::iterator it = conns.begin();
for(; it != conns.end(); it++) {
ports_used.push_back((*it).port); //add port from connection list to the used ports list
}
return ports_used;
}
Connection ConnectionManager::get_connection_on_port(int port)
{
return _portMap[port];
}
bool ConnectionManager::is_connected_to_by_type(int dest_id, ConnectionType type)
{
switch (type)
{
case CONN_LOCAL:
if (intraGroupConnections.find(dest_id) != intraGroupConnections.end())
return true;
break;
case CONN_GLOBAL:
if (globalConnections.find(dest_id) != globalConnections.end())
return true;
break;
case CONN_TERMINAL:
if (terminalConnections.find(dest_id) != terminalConnections.end())
return true;
break;
default:
assert(false);
// TW_ERROR(TW_LOC, "is_connected_to_by_type(dest_id, type): Undefined connection type\n");
}
return false;
}
bool ConnectionManager::is_any_connection_to(int dest_global_id)
{
int local_id = dest_global_id % _num_routers_per_group;
if (intraGroupConnections.find(local_id) != intraGroupConnections.end())
return true;
if (globalConnections.find(dest_global_id) != globalConnections.end())
return true;
if (terminalConnections.find(dest_global_id) != terminalConnections.end())
return true;
return false;
}
int ConnectionManager::get_total_used_ports()
{
return _used_intra_ports + _used_inter_ports + _used_terminal_ports;
}
int ConnectionManager::get_used_ports_for(ConnectionType type)
{
switch (type)
{
case CONN_LOCAL:
return _used_intra_ports;
case CONN_GLOBAL:
return _used_inter_ports;
case CONN_TERMINAL:
return _used_terminal_ports;
default:
assert(false);
// TW_ERROR(TW_LOC, "get_used_ports_for(type): Undefined connection type\n");
}
}
ConnectionType ConnectionManager::get_port_type(int port_num)
{
return _portMap[port_num].conn_type;
}
vector< Connection > ConnectionManager::get_connections_to_gid(int dest_gid, ConnectionType type)
{
switch (type)
{
case CONN_LOCAL:
return intraGroupConnections[dest_gid%_num_routers_per_group];
case CONN_GLOBAL:
return globalConnections[dest_gid];
case CONN_TERMINAL:
return terminalConnections[dest_gid];
default:
assert(false);
// TW_ERROR(TW_LOC, "get_connections_to_gid(dest_gid, type): Undefined connection type\n");
}
}
vector< Connection > ConnectionManager::get_connections_to_group(int dest_group_id)
{
return _connections_to_groups_map[dest_group_id];
}
vector< Connection > ConnectionManager::get_connections_by_type(ConnectionType type)
{
switch (type)
{
case CONN_LOCAL:
return _all_conns_by_type_map[CONN_LOCAL];
break;
case CONN_GLOBAL:
return _all_conns_by_type_map[CONN_GLOBAL];
break;
case CONN_TERMINAL:
return _all_conns_by_type_map[CONN_TERMINAL];