- 27 Aug, 2014 1 commit
Norio Yamaguchi authored
After one thread finishes processing all operations in the ops list, another thread may enqueue a new RMA operation while the first thread is in MPID_Progress_wait(). In such a case, the new operation has not been issued yet, and we should avoid processing it at the end of synchronization calls. This situation occurred when running test/mpi/threads/rma/multirma.c. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
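A minimal sketch of the fix, assuming a hypothetical `rma_op` record with an `issued` flag (not the actual CH3 operation structure): the end-of-synchronization pass finalizes only operations that were actually issued and leaves ones enqueued concurrently by another thread alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified RMA operation record; the real CH3 structure differs. */
struct rma_op {
    bool issued;     /* set once the op has actually been sent to the target */
    bool completed;  /* set when the op is finalized */
};

/* Finalize only issued operations. An op enqueued by another thread (for
 * example while this thread waited in the progress engine) is not issued
 * yet, so it is skipped instead of being treated as finished. */
static size_t finalize_issued_ops(struct rma_op *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    size_t finalized = 0;
    for (size_t i = 0; i < n; i++) {
        if (!ops[i].issued)
            continue;   /* leave it for the code path that will issue it */
        ops[i].completed = true;
        finalized++;
    }
    return finalized;
}
```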
-
- 30 Jul, 2014 1 commit
Xin Zhao authored
Change the default values of MPIR_CVAR_CH3_RMA_NREQUEST_NEW_THRESHOLD, MPIR_CVAR_CH3_RMA_NREQUEST_VISIT_THRESHOLD and MPIR_CVAR_CH3_RMA_NREQUEST_TEST_THRESHOLD for better performance. The new values are based on running graph500 on a single node of the BLUES and breadboard machines, with 16 or 8 processes and problem sizes from 2^16 to 2^20. We set the threshold on the number of new requests since the last attempt to complete pending requests to 0, so that the issuing code always tries to complete pending requests. We also disable the threshold on completed requests in GC and set the threshold on tested requests in GC to 100, so that GC has the opportunity to find more pending requests. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
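A rough sketch of how a zero threshold behaves, assuming a hypothetical `should_try_complete()` helper that compares the number of requests created since the last completion attempt against the threshold (the `>=` comparison is an assumption, not taken from the MPICH source):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: try to complete pending requests once at least
 * `new_threshold` requests were created since the last attempt.
 * With a threshold of 0, every issue call tries to complete requests. */
static bool should_try_complete(int new_since_last_attempt, int new_threshold)
{
    assert(new_since_last_attempt >= 0);
    return new_since_last_attempt >= new_threshold;
}
```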
-
- 18 Jul, 2014 1 commit
Pavan Balaji authored
This reverts commit 274a5a70.
-
- 17 Jul, 2014 1 commit
Pavan Balaji authored
We were duplicating information in the operation structure and in the packet structure when the message is actually issued. Since most of the information is the same anyway, this patch simply embeds a packet structure into the operation structure. Signed-off-by:
Xin Zhao <xinzhao3@illinois.edu>
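A minimal sketch of the embedding, with hypothetical `put_pkt` and `rma_op` types standing in for the real CH3 packet and operation structures: because the packet lives inside the operation, issuing can hand out a pointer to it instead of copying fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet header; the real CH3 packet types differ. */
struct put_pkt {
    int      type;
    uint64_t target_addr;
    int      count;
};

/* The operation embeds the packet, so its fields are filled in once, when
 * the user posts the operation, rather than duplicated and copied later. */
struct rma_op {
    struct rma_op *next;
    const void    *origin_addr;
    struct put_pkt pkt;
};

static const struct put_pkt *op_packet_to_send(const struct rma_op *op)
{
    assert(op != NULL);
    return &op->pkt;   /* no field-by-field copy at issue time */
}
```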
-
- 13 Jul, 2014 2 commits
See "Notes for memory barriers in RMA synchronizations" in src/mpid/ch3/src/ch3u_rma_sync.c. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
- 11 Jul, 2014 1 commit
Pavan Balaji authored
Earlier, we created the rank list for the window start group twice: once for synchronization and once for actually issuing the operations. This patch combines them into a single creation of the array. Signed-off-by:
Xin Zhao <xinzhao3@illinois.edu>
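A minimal sketch of building the list once, assuming a hypothetical `start_epoch` struct and a precomputed `group_to_win` translation array (the real code translates ranks through the group and the window's communicator): the same array serves both the synchronization and the issuing phases.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-epoch state: window ranks of the start group, built once. */
struct start_epoch {
    int *win_ranks;
    int  size;
};

static int start_epoch_open(struct start_epoch *ep, const int *group_to_win, int size)
{
    assert(size >= 0);
    ep->size = 0;
    ep->win_ranks = calloc(size > 0 ? (size_t)size : 1, sizeof *ep->win_ranks);
    if (ep->win_ranks == NULL)
        return -1;
    for (int i = 0; i < size; i++)
        ep->win_ranks[i] = group_to_win[i];
    ep->size = size;
    return 0;
}

/* Read by both the synchronization code and the operation-issuing code. */
static int start_epoch_has_target(const struct start_epoch *ep, int win_rank)
{
    for (int i = 0; i < ep->size; i++)
        if (ep->win_ranks[i] == win_rank)
            return 1;
    return 0;
}

static void start_epoch_close(struct start_epoch *ep)
{
    free(ep->win_ranks);
    ep->win_ranks = NULL;
    ep->size = 0;
}
```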
-
- 08 Jul, 2014 3 commits
Pavan Balaji authored
We need to add a memory barrier at the end of the Win_complete function, so that shared memory operations issued during the start/complete epoch are visible to other processes on the node. Signed-off-by:
Xin Zhao <xinzhao3@illinois.edu>
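The ordering requirement can be illustrated with C11 fences. This is a sketch under the assumption that the shared-memory window and the completion signal are ordinary memory visible to both sides; the names are illustrative, not MPICH's: the origin's fence makes its direct stores visible before it signals completion, and the target's fence pairs with it.

```c
#include <assert.h>
#include <stdatomic.h>

static int shm_window[4];      /* stands in for the shared-memory window */
static atomic_int epoch_done;  /* stands in for the completion signal    */

/* Origin: direct store into SHM, then a full fence before signaling. */
static void origin_put_and_complete(int idx, int value)
{
    shm_window[idx] = value;
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&epoch_done, 1, memory_order_relaxed);
}

/* Target: wait for the signal, fence, then read the window. */
static int target_read_after_wait(int idx)
{
    while (atomic_load_explicit(&epoch_done, memory_order_relaxed) == 0)
        ;
    atomic_thread_fence(memory_order_seq_cst);  /* pairs with the origin's fence */
    return shm_window[idx];
}
```

Without the origin-side fence, the signal could become visible before the store it is meant to publish.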
-
Pavan Balaji authored
When a window uses direct shared-memory operations that are issued immediately internally, we cannot avoid synchronization during the start operation. This patch synchronizes with processes that reside on the same node during start, and with processes that reside on other nodes during complete. Fixes #2041. Signed-off-by:
Xin Zhao <xinzhao3@illinois.edu>
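A minimal sketch of the split, assuming a hypothetical `target_node` array giving each start-group member's node: members on the origin's node are synchronized during start, and the rest are deferred to complete.

```c
#include <assert.h>

/* Partition start-group indices into same-node (sync at start) and
 * other-node (sync at complete). All names here are hypothetical. */
static void partition_by_node(const int *target_node, int n, int my_node,
                              int *local, int *nlocal,
                              int *remote, int *nremote)
{
    assert(n >= 0);
    *nlocal = 0;
    *nremote = 0;
    for (int i = 0; i < n; i++) {
        if (target_node[i] == my_node)
            local[(*nlocal)++] = i;    /* SHM ops start immediately: sync now */
        else
            remote[(*nremote)++] = i;  /* ops are queued: sync at complete */
    }
}
```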
-
Xin Zhao authored
When SHM is allocated for an RMA window, operations are completed eagerly (as soon as the user posts them), so the FENCE that opens an epoch needs barrier semantics to prevent SHM operations from happening on a target process before that target process has started its epoch. Note that we need a memory barrier before and after the synchronization calls in both the FENCE that starts an epoch and the FENCE that ends it, to guarantee the ordering of load/store operations relative to the synchronization. See #2041. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
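A sketch of the ordering in the FENCE that opens an epoch, using a hypothetical `trace` string instead of real barrier calls (`M` = memory barrier, `B` = process barrier). The assumption that a window without SHM can defer this synchronization is ours, not a statement about the exact CH3 code path.

```c
#include <assert.h>
#include <string.h>

static char trace[64];   /* records the order of steps for illustration */

static void step(const char *s) { strcat(trace, s); }

static void fence_open_epoch(int shm_allocated)
{
    trace[0] = '\0';
    if (!shm_allocated)
        return;   /* assumed: queued ops let synchronization be deferred */
    step("M");    /* order earlier loads/stores before the synchronization */
    step("B");    /* no SHM op may touch a target before its epoch opens */
    step("M");    /* order later loads/stores after the synchronization */
}
```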
-
- 30 Jun, 2014 6 commits
Xin Zhao authored
When cleaning up completed requests, the original RMA implementation keeps traversing the op list until it finds a completed request. This may cause significant O(N) overhead if there is no completed request in the list. We add a CVAR that lets the user cap the number of visited requests at a fixed value. Note that the default value is -1 (no limit), matching the behavior of the original implementation. Also note that if the garbage collection function finds a chain of completed RMA requests, it temporarily ignores this CVAR and collects as many consecutive completed requests as possible, until it encounters an incomplete request. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Add a CVAR that lets the user specify how many completed requests the runtime finds before it stops looking for more in the garbage collection function. A larger value may let the runtime find more completed requests, but may also cause significant overhead from visiting too many requests. Note that the default value is 1, matching the behavior of the original implementation. Also note that if the garbage collection function finds a chain of completed RMA requests, it temporarily ignores this CVAR and collects as many consecutive completed requests as possible, until it encounters an incomplete request. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
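The two CVARs above can be sketched as one loop over an array of request states. The function `rma_gc_sketch()` and its parameters are hypothetical stand-ins for the visit and completed-request thresholds; -1 means "no limit" in this sketch, and a run of consecutive completed requests is always consumed in full.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns the number of completed requests reclaimed. */
static int rma_gc_sketch(const bool *done, int n,
                         int visit_threshold, int complete_threshold)
{
    assert(n >= 0);
    int visited = 0, reclaimed = 0, i = 0;
    while (i < n) {
        if (visit_threshold >= 0 && visited >= visit_threshold)
            break;
        if (complete_threshold >= 0 && reclaimed >= complete_threshold)
            break;
        if (!done[i]) {          /* incomplete: visit it and move on */
            visited++;
            i++;
            continue;
        }
        /* Completed: consume the whole run, ignoring both thresholds. */
        while (i < n && done[i]) {
            visited++;
            reclaimed++;
            i++;
        }
    }
    return reclaimed;
}
```

With the defaults (-1, 1) and states {no, no, yes, yes, no, yes}, the loop visits two incomplete requests, reclaims the run of two completed ones, and then stops.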
-
Xin Zhao authored
Originally, the rma_list_complete() function traversed the operation list to clean up completed requests, which is what rma_list_gc() does now. So we simplify rma_list_complete() by deleting its traversal loop and simply invoking rma_list_gc() from it. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Currently, the code that pokes the progress engine to complete requests and the code that cleans up completed requests are mixed in one function, rma_list_gc(), which makes the code structure unclear. We move the progress-poking code out of rma_list_gc() and encapsulate it in a separate function, so that rma_list_gc() only does garbage collection. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
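A simplified model of the resulting split, with hypothetical `_sketch` functions and a toy `op_list` counter struct in place of the real operation list: progress poking is its own step, GC only reclaims, and the complete routine alternates the two until nothing is left.

```c
#include <assert.h>

/* Toy state: requests still in flight and finished-but-unreclaimed ones. */
struct op_list {
    int pending;
    int completed_unreclaimed;
};

/* Progress step (sketch: each poke finishes one pending request). */
static void poke_progress_sketch(struct op_list *l)
{
    if (l->pending > 0) {
        l->pending--;
        l->completed_unreclaimed++;
    }
}

/* Garbage collection only: reclaim what has already completed. */
static int rma_list_gc_sketch(struct op_list *l)
{
    int freed = l->completed_unreclaimed;
    l->completed_unreclaimed = 0;
    return freed;
}

/* Complete: alternate progress and GC until the list is empty. */
static int rma_list_complete_sketch(struct op_list *l)
{
    assert(l->pending >= 0 && l->completed_unreclaimed >= 0);
    int total = 0;
    while (l->pending > 0 || l->completed_unreclaimed > 0) {
        poke_progress_sketch(l);
        total += rma_list_gc_sketch(l);
    }
    return total;
}
```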
-
Xin Zhao authored
Rename RMAListPartialComplete to rma_list_gc and RMAListComplete to rma_list_complete. Declare both functions inline. Add error handling code to both functions. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Xin Zhao authored
Static functions should not have names starting with the prefix "MPIDI_CH3I_". We remove this prefix from function names as well as from state names. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
- 22 May, 2014 1 commit
Wesley Bland authored
There are quite a few places where the request cleanup is done via: MPIU_Object_set_ref(req, 0); MPIDI_CH3_Request_destroy(req); when it should be: MPID_Request_release(req); This makes the handling more uniform so requests are cleaned up by releasing references rather than hitting them with the destroy hammer. Fixes #1664 Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
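A minimal sketch of release-based cleanup, using a hypothetical `request` struct (not MPICH's `MPID_Request`): each holder drops its own reference, and the object is freed only when the last one is released.

```c
#include <assert.h>
#include <stdlib.h>

struct request {
    int ref_count;
};

static int live_requests = 0;   /* sketch-only counter of allocated requests */

static struct request *request_create(void)
{
    struct request *r = malloc(sizeof *r);
    if (r != NULL) {
        r->ref_count = 1;
        live_requests++;
    }
    return r;
}

static void request_add_ref(struct request *r) { r->ref_count++; }

/* Drop one reference; destroy only when none remain, rather than forcing
 * the count to zero and destroying unconditionally. */
static void request_release(struct request *r)
{
    assert(r != NULL && r->ref_count > 0);
    if (--r->ref_count == 0) {
        free(r);
        live_requests--;
    }
}
```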
-
- 19 Dec, 2013 1 commit
Fixes #1963 Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
- 17 Dec, 2013 1 commit
Junchao Zhang authored
Fixes #1962 Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov> (Reviewed by Bill Gropp)
-
- 15 Nov, 2013 4 commits
Xin Zhao authored
Delete the zero-size data transfer code in the packet handlers of Put/Accumulate/Accumulate_Immed/Get_AccumulateResp/GetResp/LockPutUnlock/LockAccumUnlock, because it is redundant. (Note that the LockPutUnlock and LockAccumUnlock packet handlers are for the single-operation optimization in passive RMA.) Zero-size data transfers are already handled when issuing RMA operations (L146, L258, L369 in src/mpid/ch3/src/ch3u_rma_ops.c and L50 in src/mpid/ch3/src/ch3u_rma_acc_ops.c): RMA operation routines exit directly if the data size is zero. Signed-off-by:
Wesley Bland <wbland@mcs.anl.gov>
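A minimal sketch of the early exit, assuming a hypothetical `rma_put_sketch()` routine and an `ops_queued` counter standing in for operation creation: a zero-byte transfer never creates an operation, so no packet handler ever sees one.

```c
#include <assert.h>
#include <stddef.h>

static int ops_queued = 0;   /* stands in for enqueuing/issuing an op */

static int rma_put_sketch(size_t count, size_t type_size)
{
    size_t data_sz = count * type_size;
    if (data_sz == 0)
        return 0;   /* nothing to transfer: exit before creating an op */
    ops_queued++;
    return 0;
}
```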
-
Antonio J. Pena authored
This reverts commit 676c29f9.
-
Antonio J. Pena authored
Addresses #1932. Includes:
  - MPI_Bsend/MPI_Ibsend
  - Several collectives
  - Some RMA operations
  - MPI_Dist_graph_create
Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
-
Xin Zhao authored
The MPIU_Assert at L2311 checks whether rma_ops_list is empty before exiting MPIDI_Win_flush. It causes /test/mpi/threads/rma/multirma to fail because, while one thread is executing the progress-engine polling loop at L2293-L2302, another thread may enqueue new RMA operations to rma_ops_list. rma_ops_list has already been checked for emptiness before exiting MPIDI_CH3I_Do_passive_target_rma (L2724) to ensure that all enqueued operations have been issued, so it does not need to be checked again here. Signed-off-by:
Wesley Bland <wbland@mcs.anl.gov>
-
- 31 Oct, 2013 1 commit
Also includes random fixes to `-Wshorten-64-to-32` warnings which might need to be teased out. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
- 26 Oct, 2013 1 commit
To adapt to the naming of control variables in MPI_T. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
- 26 Sep, 2013 11 commits
Pavan Balaji authored
The check was originally in the ch3 layer, but it doesn't seem to use any ch3-specific information. This macro will be useful at the upper layers for optimizations, e.g., in the localcopy routine. Signed-off-by:
Xin Zhao <xinzhao3@illinois.edu>
-
Pavan Balaji authored
The memory barrier ensures that all load/store operations issued directly to shared memory are complete. Signed-off-by:
Xin Zhao <xinzhao3@illinois.edu>
-
When SHM is allocated, the origin and target ranks may still be on different nodes; in that case the operations are not done yet, and win_flush cannot exit. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
Check the shm_allocated flag in win_flush to determine whether to do a full memory barrier. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
Pavan Balaji authored
During a win_flush_all, if a target does not have any operations to flush out, don't call the win_flush function at all. This reduces the number of function calls on large systems where the RMA operations are sparsely issued. Signed-off-by:
Xin Zhao <xinzhao3@illinois.edu>
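A minimal sketch of the flush-all change, assuming a hypothetical per-target `pending_ops` array and a `win_flush_sketch()` stand-in that only counts calls:

```c
#include <assert.h>

static int flush_calls = 0;

static void win_flush_sketch(int target)
{
    (void)target;
    flush_calls++;
}

static void win_flush_all_sketch(const int *pending_ops, int ntargets)
{
    assert(ntargets >= 0);
    for (int t = 0; t < ntargets; t++) {
        if (pending_ops[t] == 0)
            continue;   /* nothing to flush: skip the call entirely */
        win_flush_sketch(t);
    }
}
```

On a large job where only a few targets were touched, the loop makes one call per touched target instead of one per process.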
-
If SHM is allocated by MPI_Win_allocate and the target is on the same node as the origin, the origin needs to acquire the lock eagerly before it can immediately perform any SHM RMA operations on the target's SHM region. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
Originally, for SHM RMA operations, we created structures to queue them up and performed them lazily when closing the epoch. Because creating the queued structures causes significant performance overhead, we now perform them immediately instead of queuing them. Therefore, the MPIDI_DO_SHM_OP macro and some special checks on SHM operations (to count queued operations) are no longer needed. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
Change the condition for the full memory barrier when closing an epoch from *checking create_flavor* to *checking whether SHM is allocated*, because *SHM is allocated* means that either create_flavor is SHARED or the alloc_shm optimization is enabled for MPI_Win_allocate. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
We don't need the full memory barrier when opening an epoch; the ordering of modifications to the same window location can be protected by the full memory barrier when closing the epoch, because users can modify window locations only within an RMA epoch. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
We don't need the full memory barrier when opening an epoch; the ordering of modifications to the same window location can be protected by the full memory barrier when closing the epoch, because users can modify window locations only within an RMA epoch. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
Do a memory barrier when the window is allocated by MPI_Win_allocate_shared, if this fence is (1) not called with MPI_MODE_NOPRECEDE; (2) not the very first fence; and (3) not following a fence with MPI_MODE_NOSUCCEED. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
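The three conditions can be written as a predicate. The flag values below are local stand-ins (the real `MPI_MODE_NOPRECEDE` and `MPI_MODE_NOSUCCEED` come from `mpi.h`), and `fence_needs_mem_barrier()` is a hypothetical helper, not the CH3 function.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative assertion bits; not the values defined by mpi.h. */
enum { MODE_NOPRECEDE = 1 << 0, MODE_NOSUCCEED = 1 << 1 };

static bool fence_needs_mem_barrier(bool shared_window, int assert_flags,
                                    bool first_fence, int prev_assert_flags)
{
    if (!shared_window)
        return false;                      /* not from MPI_Win_allocate_shared */
    if (assert_flags & MODE_NOPRECEDE)
        return false;                      /* (1) no preceding epoch to close */
    if (first_fence)
        return false;                      /* (2) very first fence */
    if (prev_assert_flags & MODE_NOSUCCEED)
        return false;                      /* (3) previous fence opened no epoch */
    return true;
}
```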
-
- 08 Aug, 2013 1 commit
Initialize "list_complete" before entering MPIDI_CH3I_RMAListPartialComplete because it is used in that function. Fixes ticket #1906. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
- 01 Aug, 2013 1 commit
When determining whether the origin and target processes are on the same node, use the vc->node_id field instead of the vc->ch.is_local flag. The 'is_local' flag is not correct here because it is defined in nemesis, not in CH3, whereas 'node_id' is defined in CH3. Note that for ch3:sock, even if the origin and target are on the same node, they are not within the same SHM region. Currently, ch3:sock is filtered out by checking the shm_allocated flag first. In the future, we need to find a way to check whether the origin and target are within the same "SHM comm". Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
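A minimal sketch of the check, assuming a trimmed-down hypothetical `vc` struct that keeps only `node_id`: checking `shm_allocated` first filters out ch3:sock, and the node comparison decides whether direct SHM access applies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down virtual connection. */
struct vc {
    int node_id;
};

static bool can_use_shm(bool shm_allocated, const struct vc *origin,
                        const struct vc *target)
{
    /* shm_allocated is checked first; ch3:sock never allocates SHM. */
    return shm_allocated && origin->node_id == target->node_id;
}
```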
-
- 28 Jul, 2013 2 commits
If "alloc_shm" is set, the target process may be handling an RMA operation from a remote process while a local process concurrently performs an RMA operation on the same target and on an overlapping memory location. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
-
Remove the reference-count decrement in SHM RMA operations, and add conditions in the operation issue routines instead: if shm_allocate == 1 and the target vc is local, do not increment the reference count on datatypes, because they will not be referenced by the progress engine; the operation is completed directly by the origin. Signed-off-by:
Pavan Balaji <balaji@mcs.anl.gov>
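A minimal sketch of the issue-time rule, using a hypothetical `datatype` struct with a plain reference count: the reference is taken only when the progress engine will later use and release it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical datatype object with a reference count. */
struct datatype {
    int ref_count;
};

static void issue_op_sketch(struct datatype *dt, bool shm_allocated,
                            bool target_is_local)
{
    assert(dt != NULL);
    if (shm_allocated && target_is_local)
        return;        /* completed directly by the origin: no reference */
    dt->ref_count++;   /* held by the progress engine until the op completes */
}
```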
-