1. 03 Nov, 2014 5 commits
    • Add global / local pools of RMA ops and related APIs. · fc7617f2
      Xin Zhao authored
      
      
      Instead of allocating / deallocating an RMA operation whenever
      an RMA op is posted by the user, we allocate fixed-size operation
      pools beforehand and take the op element from those pools
      when an RMA op is posted.
      
      With only a local (per-window) op pool, the number of ops
      allocated can increase arbitrarily if many windows are created.
      Alternatively, if we only use a global op pool, other windows
      might use up all operations thus starving the window we are
      working on.
      
      In this patch we create two pools: a local (per-window) pool and a
      global pool.  Every window is guaranteed to have at least the number
      of operations in the local pool.  If we run out of these operations,
      we check in the global pool to see if we have any operations left.
      When an operation is released, it is added back to the same pool it
      was allocated from.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
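The two-level allocation described in this commit can be sketched as follows. This is a minimal illustration, not the MPICH code: the type and function names (`rma_op`, `global_pool`, `window`, `op_alloc`, `op_free`) and the pool sizes are assumptions made up for the example. The one property taken from the commit message is that an element is returned to the pool it was allocated from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pool sizes; the real values are not given in the commit message. */
#define LOCAL_POOL_SIZE  2
#define GLOBAL_POOL_SIZE 3

typedef enum { POOL_LOCAL, POOL_GLOBAL } pool_kind;

typedef struct rma_op {
    struct rma_op *next;   /* free-list link */
    pool_kind origin;      /* pool this element was carved from */
} rma_op;

typedef struct {
    rma_op slots[GLOBAL_POOL_SIZE];
    rma_op *free_head;
} global_pool;

typedef struct {
    rma_op slots[LOCAL_POOL_SIZE];   /* per-window guaranteed elements */
    rma_op *free_head;
    global_pool *global;             /* shared fallback pool */
} window;

static void push(rma_op **head, rma_op *op) { op->next = *head; *head = op; }

static rma_op *pop(rma_op **head)
{
    rma_op *op = *head;
    if (op != NULL)
        *head = op->next;
    return op;
}

void global_pool_init(global_pool *g)
{
    g->free_head = NULL;
    for (size_t i = 0; i < GLOBAL_POOL_SIZE; i++) {
        g->slots[i].origin = POOL_GLOBAL;
        push(&g->free_head, &g->slots[i]);
    }
}

void window_init(window *w, global_pool *g)
{
    w->free_head = NULL;
    w->global = g;
    for (size_t i = 0; i < LOCAL_POOL_SIZE; i++) {
        w->slots[i].origin = POOL_LOCAL;
        push(&w->free_head, &w->slots[i]);
    }
}

/* Try the window's own pool first, then fall back to the global pool.
 * Returns NULL when both are exhausted. */
rma_op *op_alloc(window *w)
{
    rma_op *op = pop(&w->free_head);
    if (op == NULL)
        op = pop(&w->global->free_head);
    return op;
}

/* Return the element to the pool it came from. */
void op_free(window *w, rma_op *op)
{
    assert(op != NULL);
    if (op->origin == POOL_LOCAL)
        push(&w->free_head, op);
    else
        push(&w->global->free_head, op);
}
```

In this sketch a second window still gets its `LOCAL_POOL_SIZE` elements even after the first window has drained the global pool, which is the starvation guarantee the commit message describes.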
    • Embedding packet structure into RMA operation structure. · b1685139
      Xin Zhao authored
      
      
      We were duplicating information in the operation structure and in the
      packet structure when the message is actually issued.  Since most of
      the information is the same anyway, this patch just embeds a packet
      structure into the operation structure, so that we eliminate the
      unnecessary copy.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
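As a rough illustration of the embedding idea, the sketch below stores the packet header directly inside the operation, so issuing the operation can hand out a pointer to the embedded packet instead of filling a second structure. All names and fields here (`put_pkt`, `rma_op`, `op_init_put`, `op_packet`) are hypothetical and much simpler than the real MPICH structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified packet header for a PUT. */
typedef struct {
    int type;            /* packet type tag (illustrative value only) */
    int target_rank;
    void *target_addr;
    size_t count;
} put_pkt;

/* The operation embeds the packet instead of duplicating its fields. */
typedef struct rma_op {
    struct rma_op *next;
    const void *origin_addr;   /* data that is only needed on the origin */
    put_pkt pkt;               /* filled once, when the op is posted */
} rma_op;

enum { PKT_PUT = 1 };

void op_init_put(rma_op *op, const void *origin_addr, int target_rank,
                 void *target_addr, size_t count)
{
    assert(op != NULL);
    op->next = NULL;
    op->origin_addr = origin_addr;
    op->pkt.type = PKT_PUT;
    op->pkt.target_rank = target_rank;
    op->pkt.target_addr = target_addr;
    op->pkt.count = count;
}

/* At issue time no copy is needed: the header is already in place. */
const put_pkt *op_packet(const rma_op *op)
{
    return &op->pkt;
}
```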
    • Avoid using VC in RMA lock queue structure. · 0eaf344b
      Xin Zhao authored
      
      
      We were adding an unnecessary dependency on VC structure
      declarations in the mpidpkt.h file. The RMA lock queue only
      needs the rank, not the actual VC, so here we replace the VC
      with the rank.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Code refactoring to clean up the RMA code. · 61f952c7
      Xin Zhao authored
      
      
      Split RMA functionality into smaller files, and move functions
      to where they belong based on the file names.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Temporarily remove all RMA PVARs. · 5c513032
      Xin Zhao authored
      
      
      Because we are going to rewrite the RMA infrastructure
      and many PVARs will no longer be used, here we temporarily
      remove all PVARs and will add the needed PVARs back after the
      new implementation is done.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  2. 01 Nov, 2014 1 commit
    • Bug-fix: always waiting for remote completion in Win_unlock. · c76aa786
      Xin Zhao authored
      
      
      The original implementation includes an optimization that
      allows Win_unlock for an exclusive lock to return without
      waiting for remote completion. This relies on the
      assumption that window memory on the target process will not
      be accessed by a third party until that target process
      finishes all RMA operations and grants the lock to other
      processes. However, this assumption does not hold if the user
      passes the assert MPI_MODE_NOCHECK. Consider the following code:
      
                P0                              P1           P2
          MPI_Win_lock(P1, NULL, exclusive);
          MPI_Put(X);
          MPI_Win_unlock(P1, exclusive);
          MPI_Send (P2);                                MPI_Recv(P0);
                                                        MPI_Win_lock(P1, MODE_NOCHECK, exclusive);
                                                        MPI_Get(X);
                                                        MPI_Win_unlock(P1, exclusive);
      
      Both P0 and P2 issue an exclusive lock to P1, and P2 uses the assert
      MPI_MODE_NOCHECK because the lock should be granted to P2 after
      synchronization between P2 and P0. However, in the original
      implementation, the GET operation on P2 might not get the updated
      value since Win_unlock on P0 returns without waiting for remote
      completion.
      
      In this patch we delete this optimization. In Win_free, since every
      Win_unlock guarantees remote completion, the target process no
      longer needs to do additional counting work to detect target-side
      completion, but only needs to do a global barrier.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  3. 30 Oct, 2014 2 commits
    • Clean up white-space and code format in RMA code. · fe283e91
      Xin Zhao authored
      No reviewer.
    • Bug-fix: trigger final req handler for receiving derived datatype. · 920661c3
      Min Si authored
      
      
      There are two request handlers used when receiving data:
      (1) OnDataAvail, which is triggered when data has arrived;
      (2) OnFinal, which is triggered when receiving the data has finished.
      
      When receiving a large derived datatype, the receiving iov can be divided
      into multiple iovs. The OnDataAvail handler is set to the iov load function
      while still waiting for remaining data. However, this handler should be
      set to OnFinal when starting to receive the last iov.
      
      The original code does not set the OnDataAvail handler to OnFinal at the end.
      This patch fixes this bug.
      
      Note that this bug only appears in RMA calls, because only the RMA
      packet handlers need to specify OnFinal.
      
      Resolve #2189.
      Signed-off-by: Xin Zhao <xinzhao3@illinois.edu>
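The handler switch described in this commit can be modelled with a small state machine. The sketch below is an assumption-laden simplification: `request`, `load_next_iov`, `count_final` and `data_arrived` are invented names, and the "iovs" are reduced to a segment counter. It only shows the rule from the commit message: while more segments remain, the data-available handler reloads the iov; once the last segment has been started, the handler is switched to the final handler.

```c
#include <assert.h>
#include <stddef.h>

typedef struct request request;
typedef int (*req_handler)(request *);

struct request {
    size_t segs_total;          /* number of iovs the receive is split into */
    size_t segs_started;        /* iovs for which receiving has been started */
    req_handler on_data_avail;  /* called when the current iov is filled */
    req_handler on_final;       /* called when the whole receive is done */
    int final_calls;            /* bookkeeping for the example only */
};

/* Example final handler: just records that it ran. */
int count_final(request *req)
{
    req->final_calls++;
    return 0;
}

/* "iov load function": start receiving the next segment and decide which
 * handler should run when that segment completes. */
int load_next_iov(request *req)
{
    req->segs_started++;
    if (req->segs_started == req->segs_total)
        req->on_data_avail = req->on_final;   /* last iov: finish afterwards */
    else
        req->on_data_avail = load_next_iov;   /* more iovs remain */
    return 0;
}

void request_start(request *req, size_t segs, req_handler on_final)
{
    assert(segs > 0);
    req->segs_total = segs;
    req->segs_started = 0;
    req->on_final = on_final;
    req->final_calls = 0;
    load_next_iov(req);
}

/* Invoked each time the currently posted segment has fully arrived. */
int data_arrived(request *req)
{
    return req->on_data_avail(req);
}
```

With three segments, the final handler runs exactly once, after the third `data_arrived` call. Without the switch in `load_next_iov`, it would never run.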
  4. 01 Oct, 2014 5 commits
  5. 28 Sep, 2014 2 commits
    • Fix completion on target side in Active Target synchronization. · aa36f043
      Xin Zhao authored
      
      
      For Active Target synchronization, the original implementation
      does not guarantee the completion of all ops on the target side
      when Win_wait / Win_fence returns. It is implemented using a
      counter, which is decremented when the last operation from an
      origin finishes. Win_wait / Win_fence waits until that counter
      reaches zero. The problem is that, when the last operation finishes,
      a previous GET-like operation (for example one with a large data
      volume) may not have finished yet. This breaks the semantics of
      Win_wait / Win_fence.
      
      Here we fix this by incrementing the counter whenever we meet a
      GET-like operation, and decrementing it when that operation finishes
      on the target side. This guarantees that when the counter reaches
      zero and Win_wait / Win_fence returns, all operations are completed
      on the target.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
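The counter scheme in this commit can be illustrated with a toy model. The names below (`target_win`, `pending`, and the `on_*` functions) are hypothetical; the sketch assumes the counter starts at the number of origins in the epoch, which is an assumption for the example rather than something the commit message states.

```c
#include <assert.h>

/* Toy model of the target-side completion counter. */
typedef struct {
    int pending;   /* Win_wait / Win_fence may return only when this is 0 */
} target_win;

/* Assumption: one count per origin that will access this target. */
void epoch_open(target_win *w, int n_origins)
{
    assert(n_origins >= 0);
    w->pending = n_origins;
}

/* The fix: a GET-like operation holds the counter until it finishes. */
void on_get_like_arrived(target_win *w)  { w->pending++; }
void on_get_like_finished(target_win *w) { w->pending--; }

/* Original behaviour: the last operation from an origin drops one count. */
void on_last_op_from_origin(target_win *w) { w->pending--; }

int can_return_from_wait(const target_win *w)
{
    return w->pending == 0;
}
```

If an origin's last operation finishes while an earlier large GET is still in flight, `pending` stays above zero until `on_get_like_finished` runs, so the synchronization call cannot return early.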
    • Revert "Bug-fix: waiting for ACKs for Active Target Synchronization." · 32596b62
      Xin Zhao authored
      This reverts commit 74189446.
  6. 24 Sep, 2014 1 commit
  7. 23 Sep, 2014 1 commit
    • Bug-fix: waiting for ACKs for Active Target Synchronization. · 74189446
      Xin Zhao authored
      
      
      The original implementation of FENCE and PSCW does not
      guarantee the remote completion of issued RMA operations
      when MPI_Win_complete and MPI_Win_fence return. They only
      guarantee the local completion of issued operations and
      the completion of incoming operations. This is not correct
      if we try to get updated values on the target side using synchronizations
      with MPI_MODE_NOCHECK.
      
      Here we fix it by making the runtime wait for ACKs from all
      targets before returning from MPI_Win_fence and MPI_Win_complete.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  8. 19 Sep, 2014 2 commits
  9. 03 Sep, 2014 1 commit
    • Bug-fix: correct the behavior of flush in exclusively locked epoch. · 22924f35
      Min Si authored
      
      
      FLUSH should guarantee the completion of operations on both the origin
      and the target side. However, for an exclusive lock, there is an optimization
      in MPICH which allows FLUSH to return without waiting for the
      acknowledgement of remote completion from the target side. It relies
      on the fact that no other process will access the window
      during the exclusive lock epoch.
      
      However, this optimization is not correct when two processes allocate
      windows on an overlapping SHM region. Suppose P0 and P1 (on the same node)
      allocate RMA windows using the same SHM region, and P2 (on a different node)
      locks both windows. P2 first issues a PUT and FLUSH to P0, then issues
      a GET to P1 on the same memory location as the PUT. Since FLUSH does not
      guarantee the remote completion of the PUT, the GET operation may not get the
      updated value.
      
      This patch disables the optimization for FLUSH and forces FLUSH to always
      wait for the remote completion of operations.
      Signed-off-by: Xin Zhao <xinzhao3@illinois.edu>
      Signed-off-by: Antonio J. Pena <apenya@mcs.anl.gov>
  10. 27 Aug, 2014 1 commit
  11. 30 Jul, 2014 1 commit
    • Change default values of CVARs in RMA code. · 522c2688
      Xin Zhao authored
      
      
      Change default values of MPIR_CVAR_CH3_RMA_NREQUEST_NEW_THRESHOLD,
      MPIR_CVAR_CH3_RMA_NREQUEST_VISIT_THRESHOLD and
      MPIR_CVAR_CH3_RMA_NREQUEST_TEST_THRESHOLD for better performance.
      
      These values come from running graph500 on a single node on BLUES
      and the breadboard machine, with 16 or 8 processes and problem sizes
      from 2^16 to 2^20. We set the threshold on the number of new requests
      since the last attempt to complete pending requests to 0, so that the
      issuing code always tries to complete pending requests. We also disable
      the threshold of completed requests in GC and set the threshold of
      tested requests in GC to 100, so that we have the opportunity to
      find more pending requests in GC.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  12. 18 Jul, 2014 1 commit
  13. 17 Jul, 2014 1 commit
    • Simplified RMA_Op structure. · 274a5a70
      Pavan Balaji authored
      
      
      We were duplicating information in the operation structure
      and in the packet structure when the message is actually issued.
      Since most of the information is the same anyway, this patch just
      embeds a packet structure into the operation structure.
      Signed-off-by: Xin Zhao <xinzhao3@illinois.edu>
  14. 13 Jul, 2014 2 commits
  15. 11 Jul, 2014 1 commit
  16. 08 Jul, 2014 3 commits
    • Add a memory barrier at the end of the Win_complete function. · c244ba49
      Pavan Balaji authored
      
      
      We need to add a memory barrier at the end of the Win_complete
      function, so that shared memory operations issued during the
      start/complete epoch are visible to other processes on the node.
      Signed-off-by: Xin Zhao <xinzhao3@illinois.edu>
    • Add barrier-like semantics in PSCW for shared-memory operations. · 39361532
      Pavan Balaji authored
      
      
      When a window uses direct shared-memory operations that are
      immediately issued internally, we cannot avoid synchronization during
      the start operation.  This patch synchronizes processes that reside on
      the same node during start and the processes that do not reside on the
      same node during complete.
      
      Fixes #2041.
      Signed-off-by: Xin Zhao <xinzhao3@illinois.edu>
    • Fix bug: add barrier semantic in FENCE for SHM ops. · 1c07dbaf
      Xin Zhao authored
      
      
      When SHM is allocated for an RMA window, operations are completed
      eagerly (as soon as they are posted by the user), therefore we
      need barrier semantics in the FENCE that opens an epoch to prevent
      SHM ops from happening on the target process before that target
      process starts an epoch.
      
      Note that we need a memory barrier before and after synchronization
      calls in both FENCEs that start and end an epoch to guarantee the
      ordering of load/store operations with synchronizations.
      
      See #2041.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  17. 30 Jun, 2014 6 commits
    • Add CVAR (# of tested reqs) to control when to stop in RMA GC function · 283319f5
      Xin Zhao authored
      
      
      When cleaning up completed requests, the original RMA implementation
      keeps traversing the op list until it finds a completed request. This
      may cause significant O(N) overhead if there is no completed request
      in the list. We add a CVAR to let the user cap the number of visited
      requests at a fixed value.
      
      Note that the default value is set to (-1) in order to match
      the performance of the original implementation.
      
      Note that in the garbage collection function, if the runtime finds a chain
      of completed RMA requests, it will temporarily ignore this CVAR
      and try to find as many consecutive completed requests as possible,
      until it meets an incomplete request.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add CVAR (# of completed reqs) to control when to stop in RMA GC function · dda458a1
      Xin Zhao authored
      
      
      Add a CVAR to let the user specify the threshold for the number of
      completed requests the runtime finds before it stops trying to
      find more completed requests in the garbage collection function. A
      higher value may let the runtime find more completed requests, but may
      also cause significant overhead due to visiting too many requests.
      
      Note that the default value is set to 1 in order to match
      the performance of the original implementation.
      
      Note that in the garbage collection function, if the runtime finds a chain
      of completed RMA requests, it will temporarily ignore this CVAR
      and try to find as many consecutive completed requests as possible,
      until it meets an incomplete request.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
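The interaction of the two GC thresholds (number of tested requests and number of completed requests) with the "chain of completed requests" rule can be sketched as below. This is only a model under stated assumptions: the op list is flattened into an array, the names (`request`, `gc_sketch`, `visit_limit`, `complete_limit`) are invented, and a negative limit is assumed to mean "no limit". It does not reproduce the actual rma_list_gc() code.

```c
#include <assert.h>

typedef struct {
    int completed;   /* set by the progress engine in a real runtime */
    int reclaimed;   /* set here when GC cleans the request up */
} request;

/* Walk the list once and reclaim completed requests.
 *   visit_limit    : stop after testing this many requests (< 0: no limit)
 *   complete_limit : stop after reclaiming this many requests (< 0: no limit)
 * Both limits are ignored while inside a run of consecutive completed
 * requests; they are checked again once an incomplete request ends the run.
 * Returns the number of reclaimed requests. */
int gc_sketch(request *reqs, int n, int visit_limit, int complete_limit)
{
    int visited = 0, reclaimed = 0, in_chain = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (!in_chain) {
            if (visit_limit >= 0 && visited >= visit_limit)
                break;
            if (complete_limit >= 0 && reclaimed >= complete_limit)
                break;
        }
        visited++;
        if (reqs[i].completed) {
            reqs[i].reclaimed = 1;
            reclaimed++;
            in_chain = 1;    /* keep going through the chain */
        } else {
            in_chain = 0;    /* chain ended; limits apply again */
        }
    }
    return reclaimed;
}
```

For a list whose completion flags are 0, 0, 1, 1, 0, 1, a completed-request limit of 1 with no visit limit reclaims the two adjacent completed requests and then stops, while a visit limit of 1 gives up after the first incomplete request.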
    • Simplify RMA requests completion function · 7dbdc413
      Xin Zhao authored
      
      
      Originally the rma_list_complete() function traversed the
      operation list to clean up completed requests, which is
      what rma_list_gc() does now. So we simplify
      rma_list_complete() by deleting the traversal loop
      and just invoking rma_list_gc() in
      rma_list_complete().
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Separate progress engine code from garbage collection · da7700a0
      Xin Zhao authored
      
      
      Currently the code that pokes the progress engine to complete
      requests and the code that cleans up completed requests
      are mixed together in one function, rma_list_gc(), which is not
      a clear code structure. We move the code that pokes the progress
      engine out of rma_list_gc() and encapsulate it in
      a separate function so that rma_list_gc() only does garbage
      collection work.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Rename RMA request gc and complete function · 73f6a4b3
      Xin Zhao authored
      
      
      Rename RMAListPartialComplete to rma_list_gc
      and rename RMAListComplete to rma_list_complete.
      Declare both functions as inline functions.
      Add error handling code for both functions.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Rename static functions in RMA code · 33b7d251
      Xin Zhao authored
      
      
      Static functions should not have names starting with the prefix "MPIDI_CH3I_".
      We delete that prefix from function names as well as from state names.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  18. 22 May, 2014 1 commit
    • Make handling of request cleanup more uniform · 1e171ff6
      Wesley Bland authored
      
      
      There are quite a few places where the request cleanup is done via:
      
      MPIU_Object_set_ref(req, 0);
      MPIDI_CH3_Request_destroy(req);
      
      when it should be:
      
      MPID_Request_release(req);
      
      This makes the handling more uniform so requests are cleaned up by releasing
      references rather than hitting them with the destroy hammer.
      
      Fixes #1664
      Signed-off-by: Ken Raffenetti <raffenet@mcs.anl.gov>
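The difference between forcing destruction and releasing a reference can be shown with a generic reference-counting sketch. The types and functions here (`request`, `request_add_ref`, `request_release`, `request_destroy`) are stand-ins written for this example and are not the MPICH definitions of MPID_Request_release or MPIDI_CH3_Request_destroy.

```c
#include <assert.h>

/* Generic stand-in for a reference-counted request object. */
typedef struct {
    int ref_count;
    int destroyed;   /* a real implementation would free resources instead */
} request;

void request_init(request *req)
{
    req->ref_count = 1;   /* the creator holds the first reference */
    req->destroyed = 0;
}

void request_add_ref(request *req)
{
    assert(req->ref_count > 0);
    req->ref_count++;
}

static void request_destroy(request *req)
{
    req->destroyed = 1;
}

/* Drop one reference; destroy only when nobody else holds the object. */
void request_release(request *req)
{
    assert(req->ref_count > 0);
    if (--req->ref_count == 0)
        request_destroy(req);
}
```

Setting the count to zero and destroying unconditionally would tear the object down even if another holder still had a reference; releasing lets the last holder trigger the cleanup.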
  19. 19 Dec, 2013 1 commit
  20. 17 Dec, 2013 1 commit
  21. 15 Nov, 2013 1 commit
    • Fix #1701 - cleanup code for zero-size data transfer. · dc9275be
      Xin Zhao authored
      
      
      Delete code for zero-size data transfer in packet handlers
      of Put/Accumulate/Accumulate_Immed/Get_AccumulateResp/GetResp/
      LockPutUnlock/LockAccumUnlock, because they are redundant.
      
      (Note that packet handlers of LockPutUnlock and LockAccumUnlock
      are for single operation optimization in passive RMA)
      
      Zero-size data transfer has already been handled when issuing
      RMA operations (L146, L258, L369 in src/mpid/ch3/src/ch3u_rma_ops.c
      and L50 in src/mpid/ch3/src/ch3u_rma_acc_ops.c). RMA operation
      routines will directly exit if data size is zero.
      Signed-off-by: Wesley Bland <wbland@mcs.anl.gov>