- 04 Mar, 2015 38 commits
-
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
The implementations of sendNoncontig for intra-node communication in Nemesis and for inter-node communication in the network modules (except TCP and SCIF) assume that req->dev.segment_first is zero and that req->dev.segment_size is the size of the data, which is not always true. When we stream an RMA operation and issue only part of the derived data, req->dev.segment_first specifies the current starting location of the data and req->dev.segment_size specifies the current ending location, so the data size is (req->dev.segment_size - req->dev.segment_first). This patch corrects this issue in Nemesis and the network modules. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
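The size computation described in this commit can be sketched as follows. This is a minimal illustration, not the Nemesis code: `seg_range` and `bytes_to_send` are hypothetical names that stand in for the two `req->dev` fields and for the logic inside sendNoncontig.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two req->dev fields named in the commit. */
typedef struct {
    size_t segment_first;  /* current starting location of the data */
    size_t segment_size;   /* current ending location (not a length) */
} seg_range;

/* Bytes to send for this (possibly partial) piece of the data. */
static size_t bytes_to_send(const seg_range *r)
{
    assert(r->segment_size >= r->segment_first);
    return r->segment_size - r->segment_first;
}
```

With segment_first equal to zero this reduces to the old assumption, which is why the bug only shows up for streamed operations.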
-
The original implementation of ACC/GACC on SHM first allocates a temporary buffer with the same data layout as the target data, copies the entire origin data into that temporary buffer, and then performs the ACC computation between the temporary buffer and the target buffer. The temporary buffer can consume a potentially large amount of memory. This patch fixes this issue as follows: (1) the SHM ACC/GACC routines directly call do_accumulate_op(), which requires the origin data to be in a 'packed manner'; (2) if the origin data is a basic type, we perform do_accumulate_op() directly between the origin buffer and the target buffer; if the origin data is a derived type, we stream the origin data by copying part of it into a packed streaming buffer and performing do_accumulate_op() between the streaming buffer and the target buffer each time. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
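A rough sketch of the streaming idea, under simplifying assumptions: the derived origin type is modeled as a strided vector of int, the target is contiguous, and the operation is a sum. `STREAM_ELEMS` and `stream_accumulate` are hypothetical names; the real code works on MPI datatypes and calls do_accumulate_op().

```c
#include <assert.h>
#include <stddef.h>

#define STREAM_ELEMS 4  /* hypothetical tiny stream unit, in elements */

/* Sum a strided origin vector into a contiguous target using a fixed-size
 * packed streaming buffer instead of a full-size temporary buffer. */
static void stream_accumulate(const int *origin, size_t stride, size_t count,
                              int *target)
{
    int stream_buf[STREAM_ELEMS];

    assert(stride > 0);
    for (size_t done = 0; done < count; done += STREAM_ELEMS) {
        size_t n = (count - done < STREAM_ELEMS) ? count - done : STREAM_ELEMS;

        /* Pack the next piece of the non-contiguous origin data. */
        for (size_t i = 0; i < n; i++)
            stream_buf[i] = origin[(done + i) * stride];

        /* Combine the packed piece with the matching part of the target. */
        for (size_t i = 0; i < n; i++)
            target[done + i] += stream_buf[i];
    }
}
```

The extra memory stays bounded by the stream unit regardless of how large the operation is.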
-
For queued ACC/GACC data piggybacked with LOCK, we do not need to allocate a buffer for the entire operation; a buffer of stream unit size is enough. This patch makes that change. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
On the target side, we always allocate an SRBuf of 256K, which equals the size of a stream unit, to receive ACC/GACC data. Note that in MPIDI_CH3U_Request_load_recv_iov(), for ACC/GACC operations, since we already use an SRBuf to receive the data from the beginning, we do not use another SRBuf here, which avoids one more memory copy. Also, we pass the stream_offset in the current RMA packet to the request struct (when receiving is not finished) and to the do_accumulate_op function (when receiving is finished). Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Originally, do_accumulate_op() was used to perform the ACC computation on the target between data from the origin side and data in the target window. It required the target side to first unpack the received origin data into the same data layout as the target data before calling this function, which could consume a potentially large amount of memory. This patch changes do_accumulate_op() in the following aspects: (1) It requires that the origin data passed to the function be "in a packed manner", which means it looks as if all basic type elements in the origin data are placed one after another. Note that the origin data is not necessarily contiguous, since the basic type itself may be non-contiguous. If the basic type is contiguous, then the origin data must be contiguous. (2) It adds a new function argument, stream_offset, which specifies a starting location in the target data. This allows the origin data to work with a stream-unit-sized part of the target data. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
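To illustrate the role of stream_offset, here is a hedged sketch with a hypothetical function `accumulate_sum_int`. It assumes a contiguous int target, a sum operation, and that stream_offset is a byte offset into the target data; the real do_accumulate_op() handles arbitrary datatypes and operations.

```c
#include <assert.h>
#include <stddef.h>

/* Apply one packed stream unit of origin data to the part of the target
 * that starts stream_offset bytes into the target data. */
static void accumulate_sum_int(const int *packed_origin, size_t n_elems,
                               int *target, size_t stream_offset)
{
    assert(stream_offset % sizeof(int) == 0);  /* units hold whole elements */
    int *dst = target + stream_offset / sizeof(int);

    for (size_t i = 0; i < n_elems; i++)
        dst[i] += packed_origin[i];
}
```

Each received stream unit can then be applied as soon as it arrives, without first unpacking the whole operation.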
-
This patch adds request types for FOP operations and calls the FOP request handler after the SRBuf is unpacked. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Add a stream_offset field to ACC-related packets and the request struct to remember the current stream unit's starting position in the entire target data. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Add a counter to the op struct to remember the number of stream units that have already been issued. For example, when the first stream unit piggybacked with LOCK is issued, we temporarily stop issuing the following units. After the origin receives the ACK from the target, it can continue to issue the following units. This counter helps avoid issuing the first unit again. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
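The counter's purpose can be sketched like this. All names (`op_t`, `issue_units`, `on_lock_granted`) are hypothetical, and actually sending a packet is reduced to incrementing the counter.

```c
#include <assert.h>

/* Hypothetical, simplified op structure. */
typedef struct {
    int unit_cnt;         /* total stream units in this operation */
    int issued_unit_cnt;  /* stream units already issued */
    int lock_pending;     /* 1 while waiting for the LOCK ACK */
} op_t;

/* Issue as many units as currently allowed; returns how many were issued. */
static int issue_units(op_t *op)
{
    int issued = 0;

    /* First unit already sent with LOCK: wait for the ACK. */
    if (op->lock_pending && op->issued_unit_cnt > 0)
        return 0;

    while (op->issued_unit_cnt < op->unit_cnt) {
        op->issued_unit_cnt++;  /* stands in for sending one packet */
        issued++;
        if (op->lock_pending)
            break;              /* only the first unit goes out with LOCK */
    }
    return issued;
}

/* Called when the target acknowledges the lock. */
static void on_lock_granted(op_t *op)
{
    op->lock_pending = 0;
}
```

After the ACK, issuing resumes at issued_unit_cnt, so the first unit is not sent twice.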
-
For all stream units within one RMA operation, we only need to piggyback the flags for the first operation on the first stream unit, and the flags for the last operation on the last stream unit. Note that for operations piggybacked with the LOCK flag, we should issue only the first stream unit and wait until we receive the ACK from the target to decide whether to continue issuing the following units or to re-transmit the first unit. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
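A small sketch of the flag placement rule. The flag bits and the function `flags_for_unit` are hypothetical; the real MPIDI_CH3_PKT_FLAG_* values and the issuing code are different.

```c
#include <assert.h>

/* Hypothetical flag bits, for illustration only. */
enum { FLAG_NONE = 0, FLAG_LOCK = 1, FLAG_UNLOCK = 2 };

/* Flags carried by stream unit unit_idx out of unit_cnt units:
 * first_flags go on the first unit only, last_flags on the last unit only. */
static int flags_for_unit(int unit_idx, int unit_cnt,
                          int first_flags, int last_flags)
{
    int flags = FLAG_NONE;

    assert(unit_idx >= 0 && unit_idx < unit_cnt);
    if (unit_idx == 0)
        flags |= first_flags;
    if (unit_idx == unit_cnt - 1)
        flags |= last_flags;
    return flags;
}
```

An operation that fits in a single stream unit carries both sets of flags on that one unit.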
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
In this patch, we define the size of a streaming unit to be the same as the SRBuf size (256 * 1024 bytes), and cut the ACC/GACC packet according to this size. A streaming unit always contains complete basic type elements and never a partial one. Note that we also increment the ref counter of the pointer to the derived datatype, since multiple streaming units within one RMA operation will refer to it. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
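The cutting rule can be sketched with two hypothetical helpers, `elems_per_unit` and `num_stream_units`. The 256 * 1024 constant comes from the commit message; everything else is an illustrative assumption.

```c
#include <assert.h>
#include <stddef.h>

#define STREAM_UNIT_BYTES (256 * 1024)  /* SRBuf size, from the commit message */

/* Elements per stream unit: round down so no basic type element is split. */
static size_t elems_per_unit(size_t basic_type_size)
{
    assert(basic_type_size > 0 && basic_type_size <= STREAM_UNIT_BYTES);
    return STREAM_UNIT_BYTES / basic_type_size;
}

/* Number of stream units needed for total_elems elements (round up). */
static size_t num_stream_units(size_t total_elems, size_t basic_type_size)
{
    size_t per_unit = elems_per_unit(basic_type_size);
    return (total_elems + per_unit - 1) / per_unit;
}
```

For a 12-byte basic type, a unit holds 21845 elements and the last 4 bytes of the 256K buffer stay unused, which keeps every element whole.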
-
The stream version of issue_from_origin_buffer is used in ACC/GACC operations. It allows the user to stream the data by passing stream_offset and stream_size to the function. The normal version of issue_from_origin_buffer is used in other RMA operations. It issues all the data as a whole. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
The original implementation of create_datatype can only generate a new datatype that describes 'dtype_info + dataloop + one data layout'. It does not support generating 'dtype_info + dataloop + multiple data layouts'. This patch extends create_datatype to support that case. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
In the request handler, we should use MPIDI_CH3U_Request_complete to complete the user request instead of directly marking it as completed. This is because when one operation is cut into several packets, we must wait until all packets are completed before marking the user request as completed. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Because we may cut one RMA operation into multiple packets, and each packet needs a request object to track its completion, we use a request array instead of a single request in the RMA operation structure. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
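The request array and the completion rule from the previous commit can be sketched together. The types and functions here (`user_req`, `pkt_req`, `rma_op`, `op_init`, `pkt_complete`, `op_free`) are hypothetical simplifications; the real code uses MPID_Request objects and MPIDI_CH3U_Request_complete.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified types; the real MPICH structures differ. */
typedef struct { int cc; } user_req;          /* cc: packets still outstanding */
typedef struct { user_req *ureq; } pkt_req;   /* one request per issued packet */
typedef struct { pkt_req *reqs; int reqs_size; } rma_op;

/* Allocate one request slot per packet and arm the user request's counter. */
static int op_init(rma_op *op, user_req *ureq, int n_pkts)
{
    assert(n_pkts > 0);
    op->reqs = calloc((size_t) n_pkts, sizeof(pkt_req));
    if (op->reqs == NULL)
        return -1;
    op->reqs_size = n_pkts;
    ureq->cc = n_pkts;
    for (int i = 0; i < n_pkts; i++)
        op->reqs[i].ureq = ureq;
    return 0;
}

/* Called when one packet finishes; returns 1 once the user request is complete. */
static int pkt_complete(pkt_req *r)
{
    assert(r->ureq->cc > 0);
    return --r->ureq->cc == 0;
}

static void op_free(rma_op *op)
{
    free(op->reqs);
    op->reqs = NULL;
    op->reqs_size = 0;
}
```

Setting the user request to completed directly in the handler would correspond to ignoring the counter and reporting completion after the first packet.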
-
Increment active_req_cnt when actually issuing the packet instead of issuing the operation, since we may cut one operation into multiple packets. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
In the original implementation, issue_from_origin_buffer is used to issue one RMA packet. Since each RMA operation had only one packet, it simply attached the returned request pointer to the RMA operation structure. Now that we are going to cut one RMA operation into multiple stream packets, this function will be used to issue each streamed packet, and each RMA operation may have multiple requests. Therefore, we make this function return the request pointer and let the caller store the request in the request array of the op structure. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
After (1) issuing an op (no LOCK flag), or (2) issuing an op (with LOCK flag) and receiving an ACK that the LOCK is granted or queued, we should set the user request (ureq) to be completed. This patch wraps the work of setting the ureq into a function, and calls that function after (1) or (2) happens. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
After we issue an op, we set next_op_to_issue to the next op, and if the next op is NULL, we set sync_flag to NONE. When we receive a lock ACK saying that the lock request was discarded, we set next_op_to_issue back to the current op and reset sync_flag from NONE to the corresponding flag, since we need to re-transmit the current op. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
This patch does not change any functionality; it only makes the code structure cleaner. The original structure of perform_get_acc_in_lock_queue is hard to follow because the code dealing with the IMMED packet type and the code dealing with the normal packet type are mixed together. This patch separates these two parts. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
When the lock is not satisfied, we queue up the lock request and op data in a lock entry queue. In the lock entry struct, we use 'data_size' to remember the size of the buffer that stores the data. Since the size of the buffer is not type_size*count but might be type_extent*extent, we rename 'data_size' to 'buf_size'. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
The original implementation of RMA does not consider pair basic types (e.g. MPI_FLOAT_INT, MPI_DOUBLE_INT). It only works correctly with builtin datatypes (e.g. MPI_INT, MPI_FLOAT). This patch makes RMA work correctly with pair basic types. The bug has two parts: (1) when performing the ACC computation, the original implementation uses 'eltype' in the datatype structure, which is set only when all basic elements in the datatype have the same builtin datatype. When basic elements have different builtin datatypes, as in pair datatypes, 'eltype' is set to MPI_DATATYPE_NULL, so the ACC computation cannot work with pair types; (2) for all basic type data, the original implementation assumes that the data is contiguous and issues it in an unpacked manner with a length equal to the data size (count*type_size). This is incorrect for pair datatypes, because most pair datatypes are non-contiguous (type_extent != type_size). In the previous patch, we already made 'eltype' store the basic type instead of the builtin type. In this patch, we fix the bug by (1) modifying the ACC computation to treat 'eltype' as a basic type; (2) using the noncontig API for non-contiguous basic type data so that it is issued in a packed manner. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
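The difference between type size and type extent for a pair type can be illustrated in plain C. This is only an analogy: the struct `double_int` mirrors the value/index layout of MPI_DOUBLE_INT, and `pack_double_int` is a hypothetical helper, not the noncontig API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value/index pair with the same member layout as MPI_DOUBLE_INT. */
typedef struct { double val; int loc; } double_int;

/* Bytes of actual data per element (the "type size"). On many platforms
 * sizeof(double_int), the "extent", is larger because of padding. */
#define PACKED_SIZE (sizeof(double) + sizeof(int))

/* Copy count pairs into dst without padding; returns bytes written. */
static size_t pack_double_int(const double_int *src, size_t count,
                              unsigned char *dst)
{
    size_t off = 0;

    for (size_t i = 0; i < count; i++) {
        memcpy(dst + off, &src[i].val, sizeof(double));
        off += sizeof(double);
        memcpy(dst + off, &src[i].loc, sizeof(int));
        off += sizeof(int);
    }
    return off;
}
```

Sending count*type_size bytes straight from the unpacked array would include padding and cut off the tail of the data, which is the failure mode described in part (2) of the commit message.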
-
'eltype' in the datatype struct was originally used to store the builtin datatype. However, this is not correct when working with RMA ACC-like operations, since ACC-like operations need to work with the basic type. In this patch we make 'eltype' store the basic type. Note that (1) whenever we need the builtin type, we should call the macro MPID_Datatype_get_basic_type instead of directly accessing 'eltype'; (2) 'element_size' and 'n_elements' still represent the builtin type, whereas 'eltype' represents the basic type. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
The original implementation of PAIRTYPE_SIZE_EXTENT is not correct because it directly modifies variables internally without letting the user pass them. This patch adds those variables to the argument list. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
We piggyback the LOCK flag only on operations that do not use derived datatypes. Therefore, we delete the unnecessary code that deals with derived datatypes in the piggyback LOCK code. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
The flag MPIDI_CH3_PKT_FLAG_RMA_IMMED_RESP is used to tell the target whether the response packet of the current GET, GACC or FOP should use the IMMED packet type. We use the IMMED packet type only when the origin, target and result datatypes are all basic types. Since the target does not know the origin and result datatypes, the origin process needs to set a flag to inform the target. However, this usage is redundant for GACC and FOP packets. When we use the IMMED packet type for GACC/FOP packets, the origin, target and result datatypes must all be basic types; in that case we must use the IMMED packet type for the response packets as well, so MPIDI_CH3_PKT_FLAG_RMA_IMMED_RESP and its related code are not necessary. In short, the flag MPIDI_CH3_PKT_FLAG_RMA_IMMED_RESP is useful only for GET operations. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
The original implementation of win_free is not correct. It uses a function pointer that is initially set to the CH3 implementation and can be overridden by the channel layer if the channel provides a specific implementation. The CH3 win_free implementation first checks that all RMA communication is finished and that the epoch states are reset, then performs a global barrier, then frees the window resources allocated in CH3, and finally returns. The Nemesis win_free implementation directly frees the window resources allocated in Nemesis and calls the CH3 win_free last. This is wrong because the window resources are freed before checking whether the RMA communication is completed. To fix this issue, we add a function hook for the channel layer to free its own resources; the function hook is called from the CH3 win_free. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
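The corrected ordering can be sketched as below. All types and names are hypothetical, the barrier is only a comment, and the exact point inside the CH3 win_free at which the hook is called is an assumption; the commit message only states that the hook is called from the CH3 win_free.

```c
#include <assert.h>
#include <stddef.h>

typedef struct win win_t;

/* Hypothetical hook table filled in by the channel layer. */
typedef struct { int (*win_free)(win_t *win); } win_hooks_t;

struct win {
    int pending_ops;     /* outstanding RMA operations */
    int channel_freed;   /* set when channel resources are released */
    int ch3_freed;       /* set when CH3 resources are released */
    win_hooks_t hooks;
};

/* Example channel hook: release channel-owned resources only. */
static int channel_win_free_hook(win_t *win)
{
    win->channel_freed = 1;
    return 0;
}

/* CH3-level free: check completion first, then let the channel clean up. */
static int ch3_win_free(win_t *win)
{
    if (win->pending_ops != 0)
        return -1;  /* RMA communication not finished: free nothing */

    /* A barrier across the window's processes would go here. */

    if (win->hooks.win_free != NULL) {
        int rc = win->hooks.win_free(win);
        if (rc != 0)
            return rc;
    }
    win->ch3_freed = 1;
    return 0;
}
```

Because the channel no longer overrides the whole function, it cannot release its resources ahead of the completion check.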
-
Xin Zhao authored
In this patch, we first add a function pointer for Win_gather_info in CH3 to allow different channel layers to implement their own version of the Win_gather_info function. The function pointer is initially set to the default implementation in the CH3 layer. If the channel layer provides an implementation of Win_gather_info, it overrides the function pointer. Secondly, we provide an implementation of Win_gather_info in the Nemesis layer. In this implementation, we allocate basic_info_table[] in the SHM region, so that processes on the same node can share the same basic_info_table[]. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
There are some window attributes in the channel layer that need to be initialized during window creation. In this patch, we first add a win_hooks table that contains pointers to the channel's implementation of the function hooks. Secondly, we add a function hook 'win_init' to allow the channel layer to initialize its own attributes. The hook is called from the CH3 win_init function. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
For a given process, shm_base_addrs[] stores the base addresses (in the address space of this process) of the SHM windows on other processes. Its original size is comm_size. However, the maximum number of SHM windows that this process can access is node_size, not comm_size, so memory is wasted because most slots in the array are NULL. In this patch we reduce the size of shm_base_addrs[] from comm_size to node_size. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
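With a node_size array, lookups need to go through a node-local rank. The sketch below assumes a hypothetical rank translation table `node_rank_of` (the commit message does not say how the translation is done) and a hypothetical helper `shm_base_of`.

```c
#include <assert.h>
#include <stddef.h>

/* node_rank_of[r] is r's rank within this node, or -1 if r lives on another
 * node (hypothetical table, assumed to exist once per communicator). */
static void *shm_base_of(void *const *shm_base_addrs, const int *node_rank_of,
                         int comm_rank)
{
    int local = node_rank_of[comm_rank];

    /* Off-node processes have no SHM window mapped into this process. */
    return (local < 0) ? NULL : shm_base_addrs[local];
}
```

The per-window array then has one slot per on-node process instead of one slot per process in the communicator.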
-
Xin Zhao authored
In this patch, we gather window basic attributes of other processes (base_addr, size, disp_unit, win_handle) using a struct called "basic_info_table". By doing this, we can use one contiguous memory region to store them. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
The function MPIDI_CH3U_Win_create_gather exchanges window information among processes. It does not create a new window, so we rename it to something more suitable. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Originally, MPIDI_Alloc_mem(size, info) and MPIDI_Free_mem(base_ptr) in the CH3 layer are implemented by calling MPIU_Malloc(size) and MPIU_Free(base_ptr) internally. This prevents a channel from providing a hardware-specific implementation of Alloc_mem and Free_mem, which is necessary when registering memory for RDMA operations. This patch defines new APIs, MPIDI_CH3I_Alloc_mem(size, info) and MPIDI_CH3I_Free_mem(base_ptr), to allow channels to implement their own memory allocators. If the channel does not have its own implementation, MPICH falls back to the default implementation in the CH3 layer, which uses MPIU_Malloc and MPIU_Free. Thanks to Steffen Christgau <christgau@cs.uni-potsdam.de> for this contribution. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
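The fallback behaviour can be sketched with function pointers. This is an illustration only: `channel_mem_ops`, `alloc_mem`, `free_mem` and the demo allocator are hypothetical, the info argument is omitted, and the real MPIDI_CH3I_Alloc_mem/MPIDI_CH3I_Free_mem may be selected by a different mechanism.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical table of optional channel-provided allocators. */
typedef struct {
    void *(*alloc_mem)(size_t size);
    void (*free_mem)(void *ptr);
} channel_mem_ops;

/* Demo channel allocator: counts bytes as a stand-in for RDMA registration. */
static size_t registered_bytes = 0;

static void *demo_channel_alloc(size_t size)
{
    registered_bytes += size;
    return malloc(size);
}

static void demo_channel_free(void *ptr)
{
    free(ptr);
}

/* Use the channel's allocator if present, else fall back to malloc. */
static void *alloc_mem(const channel_mem_ops *ch, size_t size)
{
    if (ch != NULL && ch->alloc_mem != NULL)
        return ch->alloc_mem(size);
    return malloc(size);
}

/* Matching release path: channel's free if present, else free. */
static void free_mem(const channel_mem_ops *ch, void *ptr)
{
    if (ch != NULL && ch->free_mem != NULL) {
        ch->free_mem(ptr);
    } else {
        free(ptr);
    }
}
```

Memory must be released through the same path that allocated it, which is why the allocate and free entry points are introduced as a pair.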
-
- 03 Mar, 2015 2 commits
-
-
Xin Zhao authored
No reviewer.
-
Junchao Zhang authored
Signed-off-by:
Antonio J. Pena <apenya@mcs.anl.gov>
-