- 03 Nov, 2014 32 commits
-
-
Xin Zhao authored
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
During PSCW, when there are active-message operations to be issued in Win_complete, we piggyback an AT_COMPLETE flag on them so that, when the target receives the flag, it can decrement a counter on the target side and detect completion when that counter reaches zero. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
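The counter logic described in this commit can be sketched roughly as follows. This is a minimal illustration, not the MPICH code: the flag value, the struct, and the function name are all hypothetical, and the counter is assumed to start at the number of origins in the exposure epoch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit; the real packet flag encoding is not shown here. */
enum { PSCW_FLAG_AT_COMPLETE = 0x1 };

/* Hypothetical target-side state for one exposure epoch. */
typedef struct {
    int at_complete_counter; /* origins that have not yet signalled completion */
} pscw_exposure_t;

/* Called on the target for every received active-message packet.
 * Returns true once every origin has signalled completion. */
static bool pscw_on_packet(pscw_exposure_t *epoch, unsigned flags)
{
    if (flags & PSCW_FLAG_AT_COMPLETE) {
        assert(epoch->at_complete_counter > 0);
        epoch->at_complete_counter--;
    }
    return epoch->at_complete_counter == 0;
}
```

With two origins, the function returns true only after the second packet carrying the flag has arrived.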
-
Xin Zhao authored
When the origin wants to do a FLUSH sync and there are active-message operations still to be issued, we piggyback the FLUSH message with the last operation; if there are no such operations, we just send a single FLUSH packet. If the last operation is a write op (PUT, ACC) or only a single FLUSH packet is sent, the target sends back a single FLUSH_ACK packet after receiving it; if the last operation contains a read action (GET, GACC, FOP, CAS), the target piggybacks a FLUSH_ACK flag with the response packet after receiving it. After the origin receives the FLUSH_ACK packet or a response packet with the FLUSH_ACK flag, it decrements the counter that tracks the number of outgoing sync messages (FLUSH / UNLOCK). When that counter reaches zero, the origin knows that remote completion has been achieved. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
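As a rough sketch of the acknowledgement rule in this commit: the target either sends a standalone FLUSH_ACK packet or piggybacks the flag on a response, and the origin counts outstanding sync messages down to zero. All type and function names below are hypothetical; `FLUSH_NO_OP` stands for the case where only a single FLUSH packet was sent.

```c
#include <stdbool.h>

/* Hypothetical operation kinds; FLUSH_NO_OP means a bare FLUSH packet. */
typedef enum {
    FLUSH_NO_OP, FLUSH_PUT, FLUSH_ACC,
    FLUSH_GET, FLUSH_GACC, FLUSH_FOP, FLUSH_CAS
} flush_last_op_t;

typedef enum { FLUSH_ACK_SEPARATE_PACKET, FLUSH_ACK_ON_RESPONSE } flush_ack_mode_t;

/* Target side: read-like operations already produce a response packet,
 * so the FLUSH_ACK flag can ride on it; everything else needs its own packet. */
static flush_ack_mode_t flush_ack_mode(flush_last_op_t last_op)
{
    switch (last_op) {
    case FLUSH_GET:
    case FLUSH_GACC:
    case FLUSH_FOP:
    case FLUSH_CAS:
        return FLUSH_ACK_ON_RESPONSE;
    default:
        return FLUSH_ACK_SEPARATE_PACKET;
    }
}

/* Origin side: called for each FLUSH_ACK (packet or flag) that arrives.
 * Returns true when no sync messages remain outstanding, i.e. remote
 * completion has been reached. */
static bool flush_on_ack(int *outstanding_sync_msgs)
{
    (*outstanding_sync_msgs)--;
    return *outstanding_sync_msgs == 0;
}
```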
-
Xin Zhao authored
Separate the final request handler of PUT, ACC, GACC into three. Separate the derived DT request handler of ACC and GACC into two. Rename the request handlers as follows:
(1) Normal request handler: triggered on the target side when all data from the origin is received. It includes:
    ReqHandler_PutRecvComplete --- for PUT
    ReqHandler_AccumRecvComplete --- for ACC
    ReqHandler_GaccumRecvComplete --- for GACC
(2) Derived DT request handler: triggered on the target side when all derived DT info is received. It includes:
    ReqHandler_PutDerivedDTRecvComplete --- for PUT
    ReqHandler_AccumDerivedDTRecvComplete --- for ACC
    ReqHandler_GaccumDerivedDTRecvComplete --- for GACC
(3) Response request handler: triggered on the target side when the send-back process is finished in GET-like operations. It includes:
    ReqHandler_GetSendComplete --- for GET
    ReqHandler_GaccumLikeSendComplete --- for GACC, FOP, CAS
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Previously several RMA packet types shared the same structure, which was misleading when coding. Here we make different RMA packet types use different packet data structures. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
We use new algorithms for RMA synchronization functions and RMA epochs. The old implementation uses a lazy-issuing algorithm, which queues up all operations and issues them at the end. This rules out opportunities to use hardware RMA operations and can use up all memory resources when we queue up a large number of operations. Here we use a new algorithm, which initializes the synchronization at the beginning and issues operations as soon as the synchronization is finished. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
When there are too many active requests in the runtime, the internal memory might be used up. This patch prevents that situation by triggering a blocking wait loop in operation routines when the number of active requests reaches a certain threshold value. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
We no longer use the lazy-issuing model, which delays issuing all operations until the end; instead we issue them as early as possible. To achieve this, we enable making progress in RMA routines, so that RMA operations can be issued as soon as synchronization is finished. Sometimes we also need to poke the progress engine in operation routines to make sure that the target side makes enough progress on receiving packets. Here we trigger it when the number of posted operations reaches a certain threshold value. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
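A threshold-triggered poke of this kind might look like the sketch below. The struct, the function names, and the choice to reset the counter after each poke are assumptions made for illustration; the commit message does not say how the counter is maintained.

```c
/* Hypothetical per-window bookkeeping. */
typedef struct {
    int posted_since_poke; /* operations posted since the last poke */
    int poke_threshold;    /* assumed to be a configurable value */
    int pokes;             /* number of pokes, kept only for illustration */
} poke_win_t;

/* Stand-in for a call into the progress engine. */
static void poke_progress(poke_win_t *win)
{
    win->pokes++;
}

/* Called from an operation routine (PUT/GET/ACC...) after posting an op. */
static void poke_after_post(poke_win_t *win)
{
    win->posted_since_poke++;
    if (win->posted_since_poke >= win->poke_threshold) {
        poke_progress(win);
        win->posted_since_poke = 0; /* assumption: start counting again */
    }
}
```

With a threshold of 3, posting seven operations pokes the progress engine twice.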
-
Xin Zhao authored
The GET_OP function may block, and it guarantees to return an RMA operation. Inside GET_OP we first call the normal OP_ALLOC function, which tries to get a new OP from the OP pools; if that fails, we call the nonblocking GC function to clean up completed ops and then call OP_ALLOC again; if we still cannot get a new OP, we call the nonblocking FREE_OP_BEFORE_COMPLETION function if hardware ordering is provided and then call OP_ALLOC again; if that still fails, we finally call the blocking aggressive cleanup function, which guarantees to return a new OP element. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
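The fallback chain in GET_OP can be illustrated with a small simulation. Nothing here is the MPICH implementation: the pool is reduced to three counters, the helper names are hypothetical, and the blocking cleanup is modelled as "wait until one in-flight op completes and reclaim it".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool model: only counts, no real op elements. */
typedef struct {
    int free_ops;       /* available for allocation */
    int completed_ops;  /* finished but not yet garbage-collected */
    int incomplete_ops; /* issued and still in flight */
    bool hw_ordering;   /* hardware provides the required ordering */
} getop_pool_t;

static bool getop_alloc(getop_pool_t *p)
{
    if (p->free_ops == 0)
        return false;
    p->free_ops--;
    return true;
}

/* Nonblocking GC: reclaim ops that have already completed. */
static void getop_gc(getop_pool_t *p)
{
    p->free_ops += p->completed_ops;
    p->completed_ops = 0;
}

/* Nonblocking: with hardware ordering, in-flight ops can be released early. */
static void getop_free_before_completion(getop_pool_t *p)
{
    p->free_ops += p->incomplete_ops;
    p->incomplete_ops = 0;
}

/* Stand-in for the blocking cleanup: models waiting for one op to finish. */
static void getop_cleanup_blocking(getop_pool_t *p)
{
    assert(p->incomplete_ops > 0);
    p->incomplete_ops--;
    p->free_ops++;
}

/* Returns the stage (1 to 4) at which an op was obtained. */
static int getop_get(getop_pool_t *p)
{
    if (getop_alloc(p))
        return 1;
    getop_gc(p);
    if (getop_alloc(p))
        return 2;
    if (p->hw_ordering) {
        getop_free_before_completion(p);
        if (getop_alloc(p))
            return 3;
    }
    getop_cleanup_blocking(p);
    bool ok = getop_alloc(p);
    assert(ok);
    (void) ok;
    return 4;
}
```

The cheaper nonblocking steps are tried first, so the blocking path is only reached when nothing else can free an element.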
-
Xin Zhao authored
When a FLUSH sync is issued and remote completion ordering between the last FLUSH message and all previous ops is provided by the current hardware, we no longer need to maintain incomplete operations but only need to wait for the ACK of the current FLUSH. Therefore we can free those operation resources without a blocking wait. Note that if we do this, we temporarily lose the opportunity to do a real FLUSH_LOCAL until the current FLUSH ACK is received. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
When we run out of resources for operations and targets, we need to make the runtime complete some operations so that it can free some resources. For RMA operations, we do this by performing an internal FLUSH_LOCAL for one target and waiting for operation resources; for RMA targets, we do this by performing an internal FLUSH operation for one target and waiting for target resources. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Progress-making functions check whether the current synchronization is finished, change the synchronization state if possible, and issue as many pending operations on the window as possible. There are three granularities of progress-making functions: per-target, per-window and per-process. The per-target routine is used in RMA operation routines (PUT/GET/ACC...) and single passive lock calls (Win_unlock, Win_flush, Win_flush_local); the per-window routine is used in window-wide synchronization calls (Win_fence, Win_complete, Win_unlock_all, Win_flush_all, Win_flush_local_all); and the per-process routine is used in the progress engine. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Here we implement garbage collection functions for both operations and targets. There are two levels of GC functions: per-target and per-window. Per-target functions are used in single passive lock ending calls: Win_unlock; per-window functions are used in window-wide ending calls: Win_fence, Win_complete, Win_unlock_all. Garbage collection functions for RMA ops go over all incomplete operation lists in the target element and free completed operations. They also return flags indicating local completion and remote completion. Garbage collection functions for RMA targets go over all targets and free those targets whose operation lists are empty and completed. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Keep track of the number of non-empty slots on the window so that when the number is 0, there are no operations to be processed and we can ignore that window. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
We define new states to indicate the current situation of RMA synchronization. The states contain both ACCESS states and EXPOSURE states, and specify whether the synchronization is initialized (_CALLED), on-going (_ISSUED) or completed (_GRANTED). For a single lock in Passive Target, we use a per-target state, whereas the window state is set to PER_TARGET. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
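One way to picture these states is a small enum plus a check that decides whether operations may be issued. The enum below is a simplified, hypothetical stand-in: it models only one ACCESS progression (CALLED, ISSUED, GRANTED) and the PER_TARGET marker, omits EXPOSURE states, and does not reproduce the actual state names used in the code.

```c
#include <stdbool.h>

/* Hypothetical, simplified state set. */
typedef enum {
    SYNC_NONE,
    SYNC_PER_TARGET,      /* window defers to the per-target state */
    SYNC_ACCESS_CALLED,   /* synchronization initialized */
    SYNC_ACCESS_ISSUED,   /* synchronization message on the wire */
    SYNC_ACCESS_GRANTED   /* synchronization completed */
} sync_state_t;

/* Advance one step along CALLED -> ISSUED -> GRANTED; other states stay put. */
static sync_state_t sync_next(sync_state_t s)
{
    switch (s) {
    case SYNC_ACCESS_CALLED:
        return SYNC_ACCESS_ISSUED;
    case SYNC_ACCESS_ISSUED:
        return SYNC_ACCESS_GRANTED;
    default:
        return s;
    }
}

/* Operations may be issued once the relevant synchronization is granted.
 * For a single passive-target lock the window state is PER_TARGET and the
 * target's own state decides. */
static bool sync_can_issue_ops(sync_state_t win_state, sync_state_t target_state)
{
    sync_state_t s = (win_state == SYNC_PER_TARGET) ? target_state : win_state;
    return s == SYNC_ACCESS_GRANTED;
}
```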
-
Xin Zhao authored
Add a flag is_dt to the op structure, which is set when any buffer involved in an RMA operation contains derived datatype data. It makes it convenient to enqueue issued but not yet completed operations to the DT-specific list. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Add a list of created windows on this process, so that we can make progress on all windows in the progress engine. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Given an RMA op, find the correct slot and target, and enqueue the op to the pending op list in that target object. If the target does not exist, create one in that slot. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
We allocate a fixed-size targets array on the window during window creation. The size can be configured by the user via a CVAR. Each slot entry contains a list of target elements. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
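The slot array and the find-or-create lookup for targets can be sketched as a small chained hash table. This is an illustration only: the type and function names are hypothetical, the `rank % num_slots` mapping is an assumption (the commit messages do not state how a rank is mapped to a slot), and the pending op list is reduced to a counter.

```c
#include <stdlib.h>

/* Hypothetical target element; the op list is reduced to a count. */
typedef struct slot_target {
    int rank;
    int pending_ops;
    struct slot_target *next;
} slot_target_t;

/* Hypothetical window with a fixed-size array of slots. */
typedef struct {
    slot_target_t **slots;
    int num_slots;
} slot_win_t;

/* Allocate the slot array at window creation. Returns 0 on success. */
static int slot_win_init(slot_win_t *win, int num_slots)
{
    win->slots = calloc((size_t) num_slots, sizeof *win->slots);
    win->num_slots = num_slots;
    return win->slots ? 0 : -1;
}

/* Find the target for a rank, creating it in its slot if it does not exist. */
static slot_target_t *slot_find_or_create(slot_win_t *win, int rank)
{
    slot_target_t **head = &win->slots[rank % win->num_slots]; /* assumed mapping */
    for (slot_target_t *t = *head; t != NULL; t = t->next) {
        if (t->rank == rank)
            return t;
    }
    slot_target_t *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->rank = rank;
    t->pending_ops = 0;
    t->next = *head;
    *head = t;
    return t;
}

/* Enqueue one op for the given rank. Returns 0 on success. */
static int slot_enqueue_op(slot_win_t *win, int rank)
{
    slot_target_t *t = slot_find_or_create(win, rank);
    if (t == NULL)
        return -1;
    t->pending_ops++;
    return 0;
}
```

With four slots, ranks 1 and 5 share slot 1 and are kept apart by the per-slot list.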
-
Xin Zhao authored
Here we add a data structure to store information about an active target. The information includes operation lists, passive lock state, sync state, etc. The target element is created by the origin on demand, and can be freed after the remote completion of all previous operations is detected. After RMA ending synchronization calls, all target elements should be freed. Similarly to operation pools, we create two-level target pools for target elements: one per-window target pool and one global target pool. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Portals4 by itself does not provide any flow-control. This needs to be managed by an upper-layer, such as MPICH. Before this patch we were relying on a bunch of unexpected buffers that were posted to the portals library to manage unexpected messages. However, since portals asynchronously pulls out messages from the network, if the application is delayed, it might result in the unexpected buffers being filled out and the portal disabled. This would cause MPICH to abort. In this patch, we implement an initial version of flow-control that allows us to reenable the portal when it gets disabled. All this is done in the context of the "rportals" wrappers that are implemented in the rptl.* files. We create an extra control portal that is only used by rportals. When the primary data portal gets disabled, the target sends PAUSE messages to all other processes. Once each process confirms that it has no outstanding packets on the wire (i.e., all packets have either been ACKed or NACKed), it sends a PAUSE-ACK message. When the target receives PAUSE-ACK messages from all processes (thus confirming that the network traffic to itself has been quiesced), it reenables the portal and sends an UNPAUSE message to all processes. This patch still does not deal with origin-side resource exhaustion. This can happen, for example, if we run out of space on the event queue on the origin side. Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
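The PAUSE / PAUSE-ACK / UNPAUSE handshake can be summarized with a small state sketch. It is not taken from the rptl.* files: all names are hypothetical, message sending is left as comments, and the target is assumed to expect one PAUSE-ACK from every other process.

```c
#include <stdbool.h>

/* Hypothetical target-side flow-control state. */
typedef struct {
    int world_size;  /* total number of processes */
    int pause_acks;  /* PAUSE-ACKs received in the current round */
    bool paused;     /* true while the data portal is disabled */
} fc_target_t;

/* The data portal got disabled: start a pause round.
 * A real implementation would send PAUSE to all other processes here. */
static void fc_on_portal_disabled(fc_target_t *t)
{
    t->paused = true;
    t->pause_acks = 0;
}

/* Origin side: a PAUSE-ACK may be sent only when nothing is left on the
 * wire, i.e. every packet has been ACKed or NACKed. */
static bool fc_origin_can_ack(int outstanding_packets)
{
    return outstanding_packets == 0;
}

/* Target side: handle one PAUSE-ACK. Returns true when traffic is quiesced;
 * a real implementation would then re-enable the portal and send UNPAUSE. */
static bool fc_on_pause_ack(fc_target_t *t)
{
    t->pause_acks++;
    if (t->pause_acks == t->world_size - 1) {
        t->paused = false;
        return true;
    }
    return false;
}
```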
-
Huiwei Lu authored
Fixes the case where configure is run with default settings but no Fortran compiler is installed. It should give the error 'No Fortran 77/90 compiler found' but does not. This patch is related to [d4e30cc0], when configure was changed to support '--disable-fc'. Signed-off-by:
Antonio J. Pena <apenya@mcs.anl.gov>
-
Instead of allocating / deallocating RMA operations whenever an RMA op is posted by user, we allocate fixed size operation pools beforehand and take the op element from those pools when an RMA op is posted. With only a local (per-window) op pool, the number of ops allocated can increase arbitrarily if many windows are created. Alternatively, if we only use a global op pool, other windows might use up all operations thus starving the window we are working on. In this patch we create two pools: a local (per-window) pool and a global pool. Every window is guaranteed to have at least the number of operations in the local pool. If we run out of these operations, we check in the global pool to see if we have any operations left. When an operation is released, it is added back to the same pool it was allocated from. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
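The two-pool policy can be sketched with simple counters. The names below are hypothetical and the pools hold no real op elements; the point is only the allocation order (local first, then global) and the rule that a released op returns to the pool it came from.

```c
/* Which pool an op was taken from; POOL_NONE means allocation failed. */
typedef enum { POOL_NONE, POOL_LOCAL, POOL_GLOBAL } pool_src_t;

/* Hypothetical window view of the two pools. */
typedef struct {
    int local_free;    /* ops left in this window's own pool */
    int *global_free;  /* ops left in the pool shared by all windows */
} pool_win_t;

/* Try the per-window pool first, then fall back to the global pool. */
static pool_src_t pool_alloc(pool_win_t *win)
{
    if (win->local_free > 0) {
        win->local_free--;
        return POOL_LOCAL;
    }
    if (*win->global_free > 0) {
        (*win->global_free)--;
        return POOL_GLOBAL;
    }
    return POOL_NONE;
}

/* Return an op to the same pool it was allocated from. */
static void pool_release(pool_win_t *win, pool_src_t src)
{
    if (src == POOL_LOCAL)
        win->local_free++;
    else if (src == POOL_GLOBAL)
        (*win->global_free)++;
}
```

Because the local pool is never lent to other windows, each window keeps its guaranteed minimum even when the global pool is exhausted.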
-
We were duplicating information in the operation structure and in the packet structure when the message is actually issued. Since most of the information is the same anyway, this patch just embeds a packet structure into the operation structure, so that we eliminate unnecessary copies. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
The packet type MPIDI_CH3_PKT_PT_RMA_DONE is used for ACK of FLUSH / UNLOCK packets. Here we rename it to MPIDI_CH3_PKT_FLUSH_ACK and modify the related functions and data structures. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
We were adding an unnecessary dependency on VC structure declarations in the mpidpkt.h file. The only information required in the RMA lock queue is the rank, not the actual VC. Here we replace VC with rank. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Split RMA functionality into smaller files, and move functions to where they belong based on the file names. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Because we are going to rewrite the RMA infrastructure and many PVARs will no longer be used, here we temporarily remove all PVARs; we will add the needed PVARs back after the new implementation is done. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Wesley Bland authored
Rather than having a static value for the initial size of the RTS queue, have a CVAR to define it. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
When the RTS queue fills up the first time, print out a warning to let the user know that they've done it and FT won't be provided anymore. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
- 01 Nov, 2014 3 commits
-
-
Xin Zhao authored
req->dev.user_buf points to the data sent from the origin process to the target process, and for FOP it sometimes points to the IMMED area in the packet header, when the data fits in the packet header. In that case, we should not free req->dev.user_buf in the final request handler, since that data area will be freed by the runtime when the packet header is freed. In this patch we initialize user_buf to NULL when creating the request, set it to NULL when FOP is completed, and avoid freeing a NULL pointer in the final request handler. Signed-off-by:
Min Si <msi@il.is.s.u-tokyo.ac.jp>
-
Igor Ivanov authored
Signed-off-by:
Devendar Bureddy <devendar@mellanox.com> Signed-off-by:
Igor Ivanov <Igor.Ivanov@itseez.com>
-
The original implementation includes an optimization which allows Win_unlock for an exclusive lock to return without waiting for remote completion. This relies on the assumption that window memory on the target process will not be accessed by a third party until that target process finishes all RMA operations and grants the lock to other processes. However, this assumption is not correct if the user uses the assert MPI_MODE_NOCHECK. Consider the following code:

    P0                                   P1    P2
    MPI_Win_lock(P1, NULL, exclusive);
    MPI_Put(X);
    MPI_Win_unlock(P1, exclusive);
    MPI_Send (P2);
                                               MPI_Recv(P0);
                                               MPI_Win_lock(P1, MODE_NOCHECK, exclusive);
                                               MPI_Get(X);
                                               MPI_Win_unlock(P1, exclusive);

Both P0 and P2 issue an exclusive lock to P1, and P2 uses the assert MPI_MODE_NOCHECK because the lock should be granted to P2 after the synchronization between P2 and P0. However, in the original implementation, the GET operation on P2 might not get the updated value, since Win_unlock on P0 returns without waiting for remote completion. In this patch we delete this optimization. In Win_free, since every Win_unlock guarantees remote completion, the target process no longer needs to do additional counting work to detect target-side completion, but only needs to do a global barrier. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
- 31 Oct, 2014 5 commits
-
-
Antonio Pena Monferrer authored
-
Min Si authored
In reqops.c, the ring communication test assumes remote completion after MPI_Rput/MPI_Racc + MPI_Wait, which is not correct: MPI_Wait only guarantees local completion. Here we fix it by replacing MPI_Rput/MPI_Racc + MPI_Wait with MPI_Put/MPI_Acc + MPI_Win_flush. Signed-off-by:
Xin Zhao <xinzhao3@illinois.edu>
-
Sangmin Seo authored
Used discuss@mpich.org instead of mpich-discuss@mcs.anl.gov in the installers' guide. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
RTS messages (the first part of the LMT sequence) had no way of being cancelled if an error occurred. This adds a small queue that keeps track of these messages. If a failure is detected, the message is removed from the queue and the associated request is cancelled to get out of the progress engine. See #1945 Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
Added another send call to this test to take out the race condition. Now, the test should fail under any circumstances if the send request isn't being cleaned up correctly. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-