  1. 26 Jun, 2015 1 commit
  2. 22 Jun, 2015 1 commit
  3. 14 Jun, 2015 1 commit
    • Expose AM flush ordering and issue per OP flush if unordered. · 5324a41f
      Min Si authored
      
      
      This patch includes three changes:
      (1) Added a netmod API get_ordering to allow a netmod to expose its
      network ordering. A netmod may issue some packets via multiple
      connections in parallel if those packets (such as RMA) do not require
      ordering, and thus the packets may arrive unordered. This patch sets
      the network ordering in every existing netmod (tcp|mxm|ofi|portals|llc)
      to true, since all packets are sent in order over a single connection.
      (2) Nemesis exposes the window packet orderings, such as AM flush
      ordering, at init time. It supports ordered packets only when the
      netmod supports an ordered network.
      (3) If AM flush is ordered (flush must be finished after all previous
      operations), then CH3 RMA only requests FLUSH ACK on the last operation.
      Otherwise, CH3 must request per-OP FLUSH ACK to ensure all operations
      are remotely completed.
      Signed-off-by: Xin Zhao <xinzhao3@illinois.edu>
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  4. 30 May, 2015 1 commit
    • Add extended packet header in CH3 layer used by RMA messages · 25e40e43
      Xin Zhao authored
      
      
      Here we add an extended packet header in the CH3 layer, used to
      transmit attributes that are needed only by RMA and not by two-sided
      communication. The key implementation details are listed as follows:
      
      Origin side:
      
      (1) The extended packet header is stored in the request, and the
      request is passed to the issuing function (iSendv() or
      sendNoncontig_fn()) in the lower layer. The issuing function checks
      whether the extended packet header exists in the request; if so, it
      issues that header. (The modifications in the lower layer are in the
      next commit.)
      
      (2) There is a fast path used when the origin data is contiguous,
      the target datatype is predefined, and the extended packet header is
      not used. In that case we do not need to create a request beforehand,
      but can use the iStartMsgv() issuing function, which tries to issue
      the entire message as soon as possible.
      
      Target side:
      
      (1) Two request handlers are used when the extended packet header is
      present or the target datatype is derived. The first is triggered
      when the extended packet header / target datatype info arrives, and
      the second is triggered when the actual data arrives.
      
      (2) When the target receives a stream unit piggybacked with LOCK, it
      drops the stream_offset in the extended packet header, since that
      stream unit must be the first one and stream_offset must be 0.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  5. 04 Mar, 2015 2 commits
  6. 13 Feb, 2015 1 commit
    • Rewrite code of piggybacking IMMED data with RMA packets. · de9d0f21
      Xin Zhao authored
      
      
      Originally we added "immed_data" and "immed_len" fields to RMA packets
      in order to piggyback a small amount of data on the packet header and
      reduce the number of packets (note that "immed_len" is necessary when
      the piggybacked data is not the entire data). However, those fields
      potentially increase the packet union size and hurt two-sided
      communication. This patch fixes that issue.
      
      In this patch, we remove "immed_data" and "immed_len" from the normal
      "MPIDI_CH3_Pkt_XXX_t" operation types (e.g. MPIDI_CH3_Pkt_put_t), and
      introduce a new "MPIDI_CH3_Pkt_XXX_immed_t" packet type for each
      operation (e.g. MPIDI_CH3_Pkt_put_immed_t).
      
      "MPIDI_CH3_Pkt_XXX_immed_t" is used when (1) both origin and target
      are basic datatypes, AND (2) the data to be sent fits entirely in the
      header. "MPIDI_CH3_Pkt_XXX_immed_t" therefore needs the "immed_data"
      field but can drop "immed_len". Also, since it only works with basic
      target datatypes, it can drop the "dataloop_size" field as well. All
      operations that do not satisfy (1) or (2) use the normal
      "MPIDI_CH3_Pkt_XXX_t" type.
      
      Originally we always piggybacked FOP data in the packet header, which
      made the packet size too large. In this patch we split the FOP
      operation into IMMED packets and normal packets.

      Because CAS works only with two basic-datatype, non-complex elements,
      the data amount is relatively small, so we always piggyback the data
      on the packet header and use only the "MPIDI_CH3_Pkt_XXX_immed_t"
      packet type for CAS.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  7. 03 Feb, 2015 1 commit
  8. 30 Jan, 2015 1 commit
  9. 13 Jan, 2015 1 commit
    • Move MPID_Comm_AS_enabled to MPID layer function · d1ab9e68
      Wesley Bland authored
      
      
      It was pointed out that putting this in a macro that fails silently
      when unimplemented makes things challenging for derivatives that will
      implement this function in the future. By moving this to an
      MPID-level function, it becomes more obvious that the function should
      be implemented later.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
  10. 12 Jan, 2015 1 commit
    • Change MPIDI_CH3I_Comm_AS_enabled to be MPID level · 8cbbcae4
      Wesley Bland authored
      
      
      This macro was used inside CH3 to determine whether the communicator
      could be used for anysource communication. With the rewrite of the
      anysource fault-tolerance logic, it is now necessary to use it at the
      MPI level. Because it is a macro and not a function, the macro is
      defined in mpiimpl.h as (1) and then overridden in the CH3 device.
      Future devices can also override it if desired.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
  11. 16 Dec, 2014 6 commits
  12. 13 Nov, 2014 1 commit
    • Perf-tuning: issue FLUSH, FLUSH ACK, UNLOCK ACK messages only when needed. · a9d968cc
      Xin Zhao authored
      
      
      When the operation pending list and request lists are all empty, a
      FLUSH message needs to be sent by the origin only if the origin has
      issued PUT/ACC operations since the last synchronization call;
      otherwise the origin does not need to issue a FLUSH at all and does
      not need to wait for a FLUSH ACK message.

      Similarly, the origin waits for the ACK of an UNLOCK message only if
      it has issued PUT/ACC operations since the last synchronization call.
      However, the UNLOCK message itself always needs to be sent, because
      the origin needs to unlock the target process. This patch avoids
      issuing unnecessary FLUSH / FLUSH ACK / UNLOCK ACK messages.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  13. 11 Nov, 2014 2 commits
  14. 04 Nov, 2014 1 commit
    • Implement true request-based RMA operations. · 3e005f03
      Min Si authored
      
      
      There are two requests associated with each request-based
      operation: one normal internal request (req) and one newly
      added user request (ureq). We return ureq to user when
      request-based op call returns.
      
      The ureq is initialized with a completion counter (CC) of 1 and a
      ref count of 2 (one reference held by CH3 and one by the user). If
      the corresponding op can be finished immediately in CH3, the runtime
      completes the ureq in CH3 and lets the user's MPI_Wait/Test destroy
      it. If the op cannot be finished immediately, we first increment the
      ref count to 3 (because three places now reference the ureq: the
      user, CH3, and the progress engine). The progress engine completes
      the ureq when the op is completed, then CH3 releases its reference
      during garbage collection, and finally the user's MPI_Wait/Test
      destroys the ureq.
      
      The ureq can be completed in following three ways:
      
      1. If op is issued and completed immediately in CH3
      (req is NULL), we just complete ureq before free op.
      
      2. If the op is issued but not completed, we remember the ureq
      handle in req and point the OnDataAvail / OnFinal handlers of req at
      a newly added request handler, which completes the user request. The
      handler is triggered at three places:
         2-a. when the progress engine completes a put/acc req;
         2-b. when the get/getacc handler completes a get/getacc req;
         2-c. when the progress engine completes a get/getacc req.
      
      3. If the op is not issued yet (e.g., it is waiting for the lock to
      be granted), the second way is eventually performed when the op is
      issued by the progress engine.
      Signed-off-by: Xin Zhao <xinzhao3@illinois.edu>
  15. 03 Nov, 2014 12 commits
    • Delete no longer needed code. · cc63b367
      Xin Zhao authored
      
      
      We made a huge change to the RMA infrastructure, and a lot of old
      code can be dropped, including the separate handlers for
      lock-op-unlock, the ACCUM_IMMED-specific code, the O(p) data
      structure code, the lazy-issuing code, etc.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Rewrite code of passive lock control messages. · 0542e304
      Xin Zhao authored
      
      
      1. Piggyback LOCK request with first IMMED operation.
      
      When we see an IMMED operation, we can always piggyback the LOCK
      request on that operation, saving the sync message of a separate
      LOCK request. When the packet header of that operation is received
      on the target, we try to acquire the lock and perform the operation.
      The target either piggybacks a LOCK_GRANTED message on the response
      packet (if available), or sends a separate LOCK_GRANTED message back
      to the origin.
      
      2. Rewrite the lock queue management.

      When a lock request cannot be satisfied on the target, we need to
      buffer it there. All we need to do is enqueue the packet header,
      which contains all the information we need once the lock is granted.
      When the current lock is released, the runtime goes over the lock
      queue and grants the lock to the next available request. After the
      lock is granted, the runtime simply triggers the packet handler a
      second time.
      
      3. Release the lock on the target side if UNLOCK is piggybacked.

      If there are active-message operations to be issued, we piggyback an
      UNLOCK flag on the last operation. When the target receives it, it
      releases the current lock and grants the lock to the next process.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add IMMED area in packet header. · e8d4c6d5
      Xin Zhao authored
      
      
      We add an IMMED data area (16 bytes by default) in the packet header
      which contains as much of the origin data as possible. If the origin
      can put all of the data in the packet header, it no longer needs to
      send a separate data packet. When the target receives the packet
      header, it first copies the data out of the IMMED area. If more data
      is coming, it continues to receive the following packets; if all the
      data is included in the header, receiving is done.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Rewrite all synchronization routines. · 38b20e57
      Xin Zhao authored
      
      
      We use new algorithms for the RMA synchronization functions and RMA
      epochs. The old implementation uses a lazy-issuing algorithm, which
      queues up all operations and issues them at the end. This forbids
      opportunities to use hardware RMA operations and can use up all
      memory resources when a large number of operations is queued.

      Here we use a new algorithm, which initiates the synchronization at
      the beginning and issues operations as soon as the synchronization
      has finished.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Control no. of active RMA requests in the runtime. · 257faca2
      Xin Zhao authored
      
      
      When there are too many active requests in the runtime, the internal
      memory might be used up. This patch prevents that situation by
      triggering a blocking wait loop in the operation routines when the
      number of active requests reaches a certain threshold.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Enable making progress in operation routines. · 33d96690
      Xin Zhao authored
      
      
      We no longer use the lazy-issuing model, which delays issuing all
      operations until the end; instead we issue them as early as
      possible. To achieve this, we enable making progress in the RMA
      routines, so that RMA operations can be issued as soon as the
      synchronization is finished.

      Sometimes we also need to poke the progress engine in the operation
      routines to make sure that the target side makes enough progress to
      receive packets. Here we trigger it when the number of posted
      operations reaches a certain threshold.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Implement GET_OP routine which guarantees to return an OP. · 5dd55154
      Xin Zhao authored
      
      
      GET_OP may block, but it is guaranteed to return an RMA operation.

      Inside GET_OP we first call the normal OP_ALLOC function, which
      tries to get a new OP from the OP pools; if that fails, we call the
      nonblocking GC function to clean up completed ops and call OP_ALLOC
      again; if we still cannot get a new OP, we call the nonblocking
      FREE_OP_BEFORE_COMPLETION function (if hardware ordering is
      provided) and call OP_ALLOC again; if that still fails, we finally
      call the blocking aggressive cleanup function, which is guaranteed
      to return a new OP element.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Keep track of no. of non-empty slots on window. · f91d4633
      Xin Zhao authored
      
      
      Keep track of the number of non-empty slots on the window, so that
      when the number is 0 there are no operations to be processed and we
      can skip that window.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add new RMA states on window / target and modify state checking. · f076f3fe
      Xin Zhao authored
      
      
      We define new states to indicate the current situation of RMA
      synchronization. The states cover both ACCESS states and EXPOSURE
      states, and specify whether the synchronization is initialized
      (_CALLED), ongoing (_ISSUED), or completed (_GRANTED). For a single
      lock in Passive Target, we use per-target states, and the window
      state is set to PER_TARGET.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add RMA slots and related APIs. · 0f596c48
      Xin Zhao authored
      
      
      We allocate a fixed-size array of target slots on the window during
      window creation. The size can be configured by the user via a CVAR.
      Each slot entry contains a list of target elements.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add target element and global / local pools and related APIs. · 5dd8a0a4
      Xin Zhao authored
      
      
      Here we add a data structure to store information about an active
      target. The information includes the operation lists, passive lock
      state, sync state, etc.

      A target element is created by the origin on demand, and can be
      freed after the remote completion of all previous operations is
      detected. After the RMA ending synchronization call, all target
      elements should be freed.

      Similarly to the operation pools, we create two-level target pools
      for target elements: one per-window target pool and one global
      target pool.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add global / local pools of RMA ops and related APIs. · fc7617f2
      Xin Zhao authored
      
      
      Instead of allocating / deallocating an RMA operation every time an
      RMA op is posted by the user, we allocate fixed-size operation pools
      beforehand and take an op element from those pools when an RMA op is
      posted.
      
      With only a local (per-window) op pool, the number of ops
      allocated can increase arbitrarily if many windows are created.
      Alternatively, if we only use a global op pool, other windows
      might use up all operations thus starving the window we are
      working on.
      
      In this patch we create two pools: a local (per-window) pool and a
      global pool.  Every window is guaranteed to have at least the number
      of operations in the local pool.  If we run out of these operations,
      we check in the global pool to see if we have any operations left.
      When an operation is released, it is added back to the same pool it
      was allocated from.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  16. 01 Nov, 2014 1 commit
    • Bug-fix: always waiting for remote completion in Win_unlock. · c76aa786
      Xin Zhao authored
      
      
      The original implementation includes an optimization which allows
      Win_unlock for an exclusive lock to return without waiting for
      remote completion. This relies on the assumption that the window
      memory on the target process will not be accessed by a third party
      until the target process finishes all RMA operations and grants the
      lock to other processes. However, this assumption does not hold if
      the user uses the assert MPI_MODE_NOCHECK. Consider the following
      code:
      
                P0                              P1           P2
          MPI_Win_lock(P1, NULL, exclusive);
          MPI_Put(X);
          MPI_Win_unlock(P1, exclusive);
          MPI_Send (P2);                                MPI_Recv(P0);
                                                        MPI_Win_lock(P1, MODE_NOCHECK, exclusive);
                                                        MPI_Get(X);
                                                        MPI_Win_unlock(P1, exclusive);
      
      Both P0 and P2 request an exclusive lock on P1, and P2 uses the
      assert MPI_MODE_NOCHECK because the lock should be granted to P2
      after the synchronization between P2 and P0. However, in the
      original implementation, the GET operation on P2 might not see the
      updated value, since Win_unlock on P0 returns without waiting for
      remote completion.
      
      In this patch we delete this optimization. In Win_free, since every
      Win_unlock now guarantees remote completion, the target process no
      longer needs to do additional counting work to detect target-side
      completion, but only needs to participate in a global barrier.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  17. 01 Oct, 2014 1 commit
  18. 28 Sep, 2014 1 commit
  19. 23 Sep, 2014 1 commit
    • Bug-fix: waiting for ACKs for Active Target Synchronization. · 74189446
      Xin Zhao authored
      
      
      The original implementation of FENCE and PSCW does not guarantee the
      remote completion of issued RMA operations when MPI_Win_complete and
      MPI_Win_fence return. It guarantees only the local completion of
      issued operations and the completion of incoming operations. This is
      not correct if we try to read updated values on the target side
      using synchronizations with MPI_MODE_NOCHECK.
      
      Here we modify it by making runtime wait for ACKs from all
      targets before returning from MPI_Win_fence and MPI_Win_complete.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  20. 26 Aug, 2014 1 commit
  21. 31 Jul, 2014 2 commits
    • Add MPI_Comm_revoke · 57f6ee88
      Wesley Bland authored
      
      
      MPI_Comm_revoke is a special function because it does not have a matching call
      on the "receiving side". This is because it has to act as an out-of-band,
      resilient broadcast algorithm. Because of this, in this commit, in addition to
      the usual functions to implement MPI communication calls (MPI/MPID/CH3/etc.),
      we add a new CH3 packet type that will handle revoking a communicator without
      involving a matching call from the MPI layer (similar to how RMA is currently
      implemented).
      
      The thing that must be handled most carefully when revoking a communicator is
      to ensure that a previously used context ID will eventually be returned to the
      pool of available context IDs and that after this occurs, no old messages will
      match the new usage of the context ID (for instance, if some messages are very
      slow and show up late). To accomplish this, revoke is implemented as an
      all-to-all algorithm. When one process calls revoke, it sends a
      message to all other processes in the communicator, each of which in
      turn sends a message to all other processes, and so on. Once a
      process has already revoked its communicator locally, it does not
      send out another wave of messages.
      As each process receives the revoke messages from the other processes, it will
      track how many messages have been received. Once it has either received a
      revoke message or a message about a process failure for each other process, it
      will release its refcount on the communicator object. After the application
      has freed all of its references to the communicator (and all requests, files,
      etc. associated with it), the context ID will be returned to the available
      pool.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Remove coll_active field in MPIDI_Comm · 5c71c3a8
      Wesley Bland authored
      
      
      The collectively-active field wasn't doing anything anymore, so it
      has been removed. It was a remnant of a previous FT proposal.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>