1. 03 Nov, 2014 6 commits
    • Implement GET_OP routine which guarantees to return an OP. · 5dd55154
      Xin Zhao authored
      
      
      GET_OP is a potentially blocking function that is guaranteed
      to return an RMA operation.
      
      Inside GET_OP we first call the normal OP_ALLOC function,
      which tries to get a new op from the op pools. If that fails,
      we call the nonblocking GC function to clean up completed ops
      and then call OP_ALLOC again. If we still cannot get a new
      op, we call the nonblocking FREE_OP_BEFORE_COMPLETION
      function (available only when hardware ordering is provided)
      and call OP_ALLOC once more. If that also fails, we finally
      call the blocking aggressive cleanup function, which is
      guaranteed to return a new op element.
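
      A minimal sketch of this escalation chain, with placeholder
      names standing in for the internal routines (this is not
      MPICH's actual API):

          /* Illustrative only: pool mechanics and names are placeholders. */
          #include <stddef.h>

          typedef struct { int in_use; } RMA_Op;

          #define POOL_SIZE 256
          static RMA_Op pool[POOL_SIZE];

          static RMA_Op *op_alloc(void) {          /* scan pools for a free op */
              for (int i = 0; i < POOL_SIZE; i++)
                  if (!pool[i].in_use) { pool[i].in_use = 1; return &pool[i]; }
              return NULL;
          }

          static void gc_completed_ops(void) { /* nonblocking: reclaim done ops */ }
          static void free_ops_before_completion(void) { /* needs hw ordering */ }

          static RMA_Op *aggressive_cleanup(void) { /* blocking: must succeed */
              pool[0].in_use = 0;                  /* stand-in for a real flush */
              return op_alloc();
          }

          RMA_Op *get_op(int hw_ordering)
          {
              RMA_Op *op = op_alloc();             /* 1: plain allocation      */
              if (op) return op;
              gc_completed_ops();                  /* 2: nonblocking GC, retry */
              if ((op = op_alloc())) return op;
              if (hw_ordering) {                   /* 3: ordering permitting   */
                  free_ops_before_completion();
                  if ((op = op_alloc())) return op;
              }
              return aggressive_cleanup();         /* 4: blocking, guaranteed  */
          }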
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Keep track of no. of non-empty slots on window. · f91d4633
      Xin Zhao authored
      
      
      Keep track of the number of non-empty slots on the window, so
      that when it drops to zero we know there are no operations
      left to process and can skip that window.
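
      A sketch of the bookkeeping this implies (field and function
      names are hypothetical):

          /* Hypothetical sketch: count non-empty slots per window. */
          typedef struct Win {
              int num_nonempty_slots; /* ++ when a slot gains its first op,
                                         -- when a slot drains to empty   */
          } Win;

          static void progress_window(Win *win)
          {
              if (win->num_nonempty_slots == 0)
                  return;   /* nothing pending: ignore this window */
              /* ...otherwise walk the slots and process queued ops... */
          }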
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add new RMA states on window / target and modify state checking. · f076f3fe
      Xin Zhao authored
      
      
      We define new states to indicate the current stage of RMA
      synchronization. The states cover both ACCESS epochs and
      EXPOSURE epochs, and specify whether the synchronization is
      initialized (_CALLED), ongoing (_ISSUED), or completed
      (_GRANTED). For a single lock in passive-target mode, we use
      a per-target state and set the window state to PER_TARGET.
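
      A sketch of what such state sets might look like (identifiers
      are illustrative; MPICH's actual names differ):

          /* Illustrative access/exposure state sets. */
          enum access_state {
              ACCESS_NONE,
              ACCESS_FENCE_CALLED, ACCESS_FENCE_ISSUED, ACCESS_FENCE_GRANTED,
              ACCESS_PSCW_CALLED,  ACCESS_PSCW_ISSUED,  ACCESS_PSCW_GRANTED,
              ACCESS_PER_TARGET    /* lock state tracked per target below */
          };

          enum exposure_state {
              EXPOSURE_NONE,
              EXPOSURE_FENCE_GRANTED,
              EXPOSURE_PSCW_GRANTED
          };

          enum target_lock_state { /* used when window is ACCESS_PER_TARGET */
              TARGET_LOCK_CALLED,  /* Win_lock invoked, request not sent   */
              TARGET_LOCK_ISSUED,  /* request sent, ack not yet received   */
              TARGET_LOCK_GRANTED  /* lock held; ops may be issued         */
          };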
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add RMA slots and related APIs. · 0f596c48
      Xin Zhao authored
      
      
      We allocate a fixed-size array of target slots on the window
      during window creation. The size can be configured by the
      user via a CVAR. Each slot entry contains a list of target
      elements.
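
      A sketch of the slot table (names and the hashing scheme are
      assumptions for illustration):

          /* Illustrative: a target's slot is chosen by hashing its rank;
             each slot holds a list of target elements. */
          typedef struct Target {
              int rank;
              struct Target *next;  /* targets hashed to the same slot */
          } Target;

          typedef struct Slot {
              Target *target_list;  /* NULL when the slot is empty */
          } Slot;

          typedef struct Win {
              Slot *slots;
              int   num_slots;      /* fixed at creation, set via a CVAR */
          } Win;

          static Slot *slot_for_rank(Win *win, int rank)
          {
              return &win->slots[rank % win->num_slots];
          }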
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add target element and global / local pools and related APIs. · 5dd8a0a4
      Xin Zhao authored
      
      
      Here we add a data structure to store information about an
      active target. The information includes operation lists,
      passive lock state, sync state, etc.

      The target element is created by the origin on demand, and
      can be freed once remote completion of all previous
      operations has been detected. After the RMA ending
      synchronization call, all target elements should be freed.

      As with the operation pools, we create two-level pools for
      target elements: one per-window target pool and one global
      target pool.
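
      A sketch of the target element and its two-level pool (all
      names are placeholders, not MPICH internals):

          /* Illustrative target element and pools. */
          typedef struct Target {
              int rank;
              void *pending_op_list;   /* ops queued to this target      */
              int lock_state;          /* passive-target lock state      */
              int sync_state;          /* synchronization state          */
              int from_global_pool;    /* which pool to return it to     */
              struct Target *next;     /* free-list / slot-list linkage  */
          } Target;

          static Target *win_pool;     /* per-window pool, reserved      */
          static Target *global_pool;  /* shared across all windows      */

          static Target *target_alloc(void)
          {
              Target *t = win_pool;    /* prefer the per-window pool     */
              if (t) { win_pool = t->next; t->from_global_pool = 0; return t; }
              t = global_pool;         /* fall back to the global pool   */
              if (t) { global_pool = t->next; t->from_global_pool = 1; }
              return t;                /* NULL if both pools are empty   */
          }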
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add global / local pools of RMA ops and related APIs. · fc7617f2
      Xin Zhao authored
      
      
      Instead of allocating and deallocating an RMA operation every
      time one is posted by the user, we allocate fixed-size
      operation pools beforehand and take an op element from those
      pools when an RMA op is posted.
      
      With only a local (per-window) op pool, the number of ops
      allocated can grow arbitrarily as more windows are created.
      Alternatively, if we only use a global op pool, other windows
      might use up all the operations, starving the window we are
      working on.
      
      In this patch we create two pools: a local (per-window) pool and a
      global pool.  Every window is guaranteed to have at least the number
      of operations in the local pool.  If we run out of these operations,
      we check in the global pool to see if we have any operations left.
      When an operation is released, it is added back to the same pool it
      was allocated from.
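
      A sketch of this allocation policy (structure and names are
      illustrative, not the real MPICH code):

          /* Illustrative two-level op pool. */
          typedef struct Op {
              struct Op *next;
              int from_global_pool;  /* remember the op's origin pool */
          } Op;

          static Op *local_pool;     /* per-window guaranteed minimum */
          static Op *global_pool;    /* shared overflow pool          */

          static Op *op_alloc(void)
          {
              Op *op = local_pool;
              if (op) { local_pool = op->next; op->from_global_pool = 0; return op; }
              op = global_pool;      /* local exhausted: try global   */
              if (op) { global_pool = op->next; op->from_global_pool = 1; }
              return op;
          }

          static void op_free(Op *op) /* release to the origin pool   */
          {
              if (op->from_global_pool) { op->next = global_pool; global_pool = op; }
              else                      { op->next = local_pool;  local_pool  = op; }
          }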
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  2. 01 Nov, 2014 1 commit
    • Bug-fix: always waiting for remote completion in Win_unlock. · c76aa786
      Xin Zhao authored
      
      
      The original implementation includes an optimization that
      allows Win_unlock for an exclusive lock to return without
      waiting for remote completion. This relies on the assumption
      that window memory on the target process will not be accessed
      by a third party until that target process finishes all RMA
      operations and grants the lock to other processes. However,
      this assumption does not hold if the user passes the
      MPI_MODE_NOCHECK assert. Consider the following code:
      
                P0                            P1    P2
          MPI_Win_lock(P1, NULL, exclusive);
          MPI_Put(X);
          MPI_Win_unlock(P1, exclusive);
          MPI_Send(P2);                             MPI_Recv(P0);
                                                    MPI_Win_lock(P1, MODE_NOCHECK, exclusive);
                                                    MPI_Get(X);
                                                    MPI_Win_unlock(P1, exclusive);
      
      Both P0 and P2 take an exclusive lock on P1, and P2 uses the
      MPI_MODE_NOCHECK assert because the lock should be granted to
      P2 after the synchronization between P2 and P0. However, in
      the original implementation, the GET operation on P2 might
      not see the updated value, since Win_unlock on P0 returns
      without waiting for remote completion.
      
      In this patch we remove this optimization. Since every
      Win_unlock now guarantees remote completion, the target
      process in Win_free no longer needs to do additional counting
      to detect target-side completion; it only needs to perform a
      global barrier.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  3. 01 Oct, 2014 1 commit
  4. 28 Sep, 2014 1 commit
  5. 23 Sep, 2014 1 commit
    • Bug-fix: waiting for ACKs for Active Target Synchronization. · 74189446
      Xin Zhao authored
      
      
      The original implementation of FENCE and PSCW does not
      guarantee remote completion of issued RMA operations when
      MPI_Win_fence and MPI_Win_complete return; it only guarantees
      local completion of issued operations and completion of
      incoming operations. This is not correct if we try to read
      updated values on the target side using synchronization with
      MPI_MODE_NOCHECK.
      
      Here we fix this by making the runtime wait for ACKs from all
      targets before returning from MPI_Win_fence and
      MPI_Win_complete.
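
      A sketch of the completion wait this adds (names and the
      counter mechanics are assumptions):

          /* Illustrative: block until every target has ACKed. */
          typedef struct Win {
              int outstanding_acks;  /* one per target with issued ops */
          } Win;

          static void progress_poke(Win *win)
          {
              /* stand-in for the progress engine: receipt of an ACK
                 packet would decrement outstanding_acks */
              win->outstanding_acks--;
          }

          static void wait_for_remote_completion(Win *win)
          {
              while (win->outstanding_acks > 0)
                  progress_poke(win);  /* spin in progress until done */
          }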
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  6. 26 Aug, 2014 1 commit
  7. 31 Jul, 2014 3 commits
    • Add MPI_Comm_revoke · 57f6ee88
      Wesley Bland authored
      
      
      MPI_Comm_revoke is a special function because it does not have a matching call
      on the "receiving side". This is because it has to act as an out-of-band,
      resilient broadcast algorithm. Because of this, in this commit, in addition to
      the usual functions to implement MPI communication calls (MPI/MPID/CH3/etc.),
      we add a new CH3 packet type that will handle revoking a communicator without
      involving a matching call from the MPI layer (similar to how RMA is currently
      implemented).
      
      The thing that must be handled most carefully when revoking a communicator is
      to ensure that a previously used context ID will eventually be returned to the
      pool of available context IDs and that after this occurs, no old messages will
      match the new usage of the context ID (for instance, if some messages are very
      slow and show up late). To accomplish this, revoke is implemented as an
      all-to-all algorithm. When one process calls revoke, it will send a message to
      all other processes in the communicator, which will trigger that process to
      send a message to all other processes, and so on. Once a process has already
      revoked its communicator locally, it won't send out another wave of messages.
      As each process receives the revoke messages from the other processes, it will
      track how many messages have been received. Once it has either received a
      revoke message or a message about a process failure for each other process, it
      will release its refcount on the communicator object. After the application
      has freed all of its references to the communicator (and all requests, files,
      etc. associated with it), the context ID will be returned to the available
      pool.
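
      A sketch of the receive-side bookkeeping described above
      (fields and helpers are placeholders, not MPICH internals):

          /* Illustrative revoke handling for one communicator. */
          typedef struct Comm {
              int size;              /* communicator size              */
              int revoked;           /* set once on the first revoke   */
              int revoke_msgs_seen;  /* revoke/failure notices counted */
          } Comm;

          static void send_revoke_to_all(Comm *comm); /* out-of-band pkts */
          static void comm_release(Comm *comm);       /* drop refcount;
                                                         context ID later
                                                         returns to pool */

          static void handle_revoke(Comm *comm)
          {
              if (!comm->revoked) {
                  comm->revoked = 1;
                  send_revoke_to_all(comm);  /* forward the wave once */
              }
              if (++comm->revoke_msgs_seen == comm->size - 1)
                  comm_release(comm);        /* all peers accounted for */
          }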
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Remove coll_active field in MPIDI_Comm · 5c71c3a8
      Wesley Bland authored
      
      
      The collectively active field wasn't doing anything anymore,
      so it has been removed. It was a remnant of a previous FT
      proposal.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Add MPIX_Comm_failure_ack/get_acked · 8652e0ad
      Wesley Bland authored
      
      
      This commit adds the new functions MPI(X)_COMM_FAILURE_ACK and
      MPI(X)_COMM_FAILURE_GET_ACKED. These two functions together allow the user to
      get the group of failed processes.
      
      Most of the implementation for this is pushed into the MPID layer since some
      systems won't support this (PAMI). The existing function
      MPIDI_CH3U_Check_for_failed_procs has been modified to give back the group of
      acknowledged failed processes. There is an inefficiency here in that the list
      of failed processes is retrieved from PMI and parsed every time the user calls
      both failure_ack and get_acked, but this means we don't have to try to cache
      the list that comes back from PMI (which could potentially be expensive, but
      would have some cost even in the failure-free case).
      
      This commit adds a new field to the MPID_Comm structure
      called last_ack_rank. This single integer stores the last
      acknowledged failure on this communicator and is used to
      determine when to stop parsing the list of acknowledged
      failed processes.
      
      Lastly, this commit includes a test to make sure that all of the above works
      (test/mpi/ft/failure_ack). This tests that a failure is appropriately included
      in the failed group and excluded if the failure was not previously
      acknowledged.
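
      Typical usage of the two calls together (a minimal sketch;
      error handling omitted):

          #include <mpi.h>   /* MPICH declares the MPIX_ prototypes here */

          /* Acknowledge all currently known failures, then fetch the
             group of acknowledged failed processes. */
          void report_failures(MPI_Comm comm)
          {
              MPI_Group failed;
              int nfailed;

              MPIX_Comm_failure_ack(comm);
              MPIX_Comm_failure_get_acked(comm, &failed);
              MPI_Group_size(failed, &nfailed);
              /* ...translate ranks via MPI_Group_translate_ranks... */
              MPI_Group_free(&failed);
          }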
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
  8. 22 Jul, 2014 2 commits
  9. 21 Jul, 2014 1 commit
    • Don't start enums with 0. · faa37d89
      Pavan Balaji authored
      
      
      This is to help with debugging.  Zero is too common a value, and is
      often set automatically by the system if not initialized.  Starting at
      a different value helps us catch uninitialized cases more easily.
      
      We pick "42" as our magic number as it is the answer to the ultimate
      question of life, the Universe, and everything.
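
      For example (an illustrative enum, not one from the code):

          enum example_state {
              STATE_UNSET = 42,  /* a zeroed, uninitialized value now
                                    stands out immediately in a debugger */
              STATE_READY,       /* 43 */
              STATE_DONE         /* 44 */
          };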
      Signed-off-by: Wesley Bland <wbland@anl.gov>
  10. 23 Mar, 2014 1 commit
    • Remove the use of MPIDI_TAG_UB · 055abbd3
      Wesley Bland authored
      
      
      The constant MPIDI_TAG_UB is used in only one place at the
      moment, in the initialization of ch3
      (source:src/mpid/ch3/src/mpid_init.c@4b35902a#L131). The
      problem is that the value being set (MPIR_Process.attrs.tag_ub)
      is set differently in pamid (INT_MAX). This leads to weird
      results when we reserve a bit of the tag space for failure
      propagation in non-blocking collectives (see #2008).
      
      Since this value isn't referenced anywhere else, there
      doesn't seem to be a use for it, and it's just causing
      confusion. To avoid this, we remove the constant and simply
      set MPIR_Process.attrs.tag_ub to INT_MAX in both ch3 and
      pamid.
      
      See #2009
      Signed-off-by: Pavan Balaji <balaji@mcs.anl.gov>
  11. 27 Jan, 2014 2 commits
    • Remove a comment that doesn't apply anymore. · 201b0dbf
      Wesley Bland authored
      No reviewer
    • Moves the tag reservation to MPI layer · bb755b5c
      Wesley Bland authored
      
      
      Resets MPIDI_TAG_UB back to 0x7fffffff. This value was
      changed a while back, but the change should have happened at
      the MPI layer instead of the CH3 layer. Resetting the value
      allows CH3 to use the full tag space.

      Instead, the value is now set in the MPI layer during
      initthread, so it is safe regardless of the device being
      used. This prevents the collision that was occurring on the
      pamid device, where MPIR_TAG_ERROR_BIT and
      MPIR_Process.attr.tagged_coll_mask had the same value.
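
      A hedged sketch of what reserving one high tag bit looks like
      (macro names are illustrative, not MPICH's exact definitions):

          #include <limits.h>

          /* Reserve the top usable bit for failure propagation. */
          #define TAG_ERROR_BIT   (1 << 30)            /* FT signaling   */
          #define TAG_USABLE_MASK (TAG_ERROR_BIT - 1)  /* user tag space */

          /* During initthread, the MPI layer would then advertise the
             reduced bound, so user tags never touch the reserved bit:
                 MPIR_Process.attrs.tag_ub = TAG_USABLE_MASK;            */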
      
      Fixes #2008
      Signed-off-by: Pavan Balaji <balaji@mcs.anl.gov>
  12. 31 Oct, 2013 1 commit
  13. 01 Aug, 2013 1 commit
  14. 28 Jul, 2013 1 commit
    • Add "alloc_shm" info to MPI_Win_allocate. · 384d96b7
      Xin Zhao authored
      
      
      Add "alloc_shm" to window's info arguments and initialize it to FALSE.
      In MPID_Win_allocate, if "alloc_shm" is set to true, call ALLOCATE_SHARED,
      otherwise call ALLOCATE.
      
      Free window memory only when SHM region is not allocated, therwise it is
      already freed in MPIDI_CH3I_SHM_Win_free.
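
      From the user's side, the hint is passed like any other info
      key (a minimal sketch; error handling omitted):

          #include <mpi.h>

          /* Request shared memory backing for the window allocation. */
          void allocate_shm_window(MPI_Comm comm, void **base, MPI_Win *win)
          {
              MPI_Info info;
              MPI_Info_create(&info);
              MPI_Info_set(info, "alloc_shm", "true");  /* default: false */
              MPI_Win_allocate(4096, 1, info, comm, base, win);
              MPI_Info_free(&info);
          }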
      Signed-off-by: Pavan Balaji <balaji@mcs.anl.gov>
  15. 25 Jul, 2013 2 commits
  16. 07 May, 2013 3 commits
  17. 01 Apr, 2013 1 commit
    • Add per-communicator eager threshold support. · a3c816ac
      Ralf Gunter authored
      Message transfers now respect the communicator-specific threshold.  This
      change has not been carefully checked for impact on our shared-memory
      ping-pong latency.
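
      A hedged sketch of the decision this enables (field names are
      assumptions, not the actual MPICH structures):

          #include <stddef.h>

          typedef struct Comm {
              size_t eager_max_msg_sz;  /* per-communicator threshold */
          } Comm;

          static void send_msg(Comm *comm, const void *buf, size_t len)
          {
              if (len <= comm->eager_max_msg_sz) {
                  /* eager: send payload immediately, no handshake */
              } else {
                  /* rendezvous: RTS/CTS handshake before the data */
              }
              (void)buf;
          }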
      
      Reviewed-by: goodell
  18. 21 Feb, 2013 2 commits
    • Removed unused single_op_opt field from MPID_Request · 255fb4a6
      James Dinan authored
      The single_op_opt flag in the request object was previously
      used to track whether an operation was of the lock-op-unlock
      type, for the purposes of completion. Tracking of this state
      has been merged into the packet header flags, so the
      single_op_opt flag is no longer needed.
      
      Reviewer: goodell
    • Added flags to MPID_Request · 90be9ee1
      James Dinan authored
      Added a flags field to MPID_Request that we can use to stash flags from
      suspended RMA ops and retrieve them later when we complete the operation.
      
      Reviewer: goodell
  19. 06 Feb, 2013 1 commit
    • Eliminate enqueueing of lock op in RMA ops list · fbd95593
      James Dinan authored
      Prior to this patch, a lock entry was enqueued in the RMA ops
      list when Win_lock was called. This patch adds a new state
      tracking mechanism, which we use to record the
      synchronization state with respect to each RMA target. The
      new mechanism absorbs tracking of the lock operation and the
      lock state at the target, and significantly simplifies RMA
      synchronization and ops list processing.
      
      Reviewer: goodell
  20. 11 Jan, 2013 1 commit
    • Implemented interprocess shared memory RMA ops · 58ec39c5
      James Dinan authored
      Communication operations on shared memory windows now perform
      the op directly on the shared buffer. This required the
      addition of a per-window interprocess mutex to ensure that
      atomics and accumulates are performed atomically.
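
      A sketch of an accumulate under such a mutex (names are
      placeholders; the mutex must be initialized with
      PTHREAD_PROCESS_SHARED and live in the shared segment):

          #include <pthread.h>
          #include <stddef.h>

          typedef struct Win {
              pthread_mutex_t *ipc_mutex;  /* in shared memory      */
              double *shm_base;            /* shared window buffer  */
          } Win;

          static void shm_acc_sum(Win *win, const double *src,
                                  size_t off, size_t n)
          {
              pthread_mutex_lock(win->ipc_mutex);   /* keep it atomic */
              for (size_t i = 0; i < n; i++)
                  win->shm_base[off + i] += src[i];
              pthread_mutex_unlock(win->ipc_mutex);
          }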
      
      Reviewer: buntinas
  21. 17 Dec, 2012 1 commit
  22. 27 Nov, 2012 3 commits
  23. 08 Nov, 2012 3 commits