1. 12 Jun, 2015 3 commits
  2. 30 May, 2015 2 commits
    • Add extended packet header in CH3 layer used by RMA messages · 25e40e43
      Xin Zhao authored and Pavan Balaji committed
      This commit adds an extended packet header in the CH3 layer to
      transmit attributes that are needed only in RMA and not in
      two-sided communication. The key implementation details are
      as follows:
      Origin side:
      (1) The extended packet header is stored in the request, and
      the request is passed to the issuing function (iSendv() or
      sendNoncontig_fn()) in the lower layer. The issuing function
      checks whether the request carries an extended packet header
      and, if so, issues that header. (The lower-layer modifications
      are in the next commit.)
      (2) There is a fast path used when (origin data is contiguous &&
      target data is predefined && extended packet header is not used).
      In that case we do not need to create a request beforehand
      and can use the iStartMsgv() issuing function, which tries to
      issue the entire message as soon as possible.
      Target side:
      (1) Two request handlers are used when the extended packet header
      is used or the target datatype is derived. The first handler is
      triggered when the extended packet header / target datatype info
      has arrived, and the second is triggered when the actual data
      has arrived.
      (2) When the target receives a stream unit that is piggybacked with
      LOCK, it drops the stream_offset in the extended packet header,
      because that stream unit must be the first one and its
      stream_offset must be 0.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
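The fast-path condition in origin-side item (2) of the commit above can be read as a simple predicate. The following is a minimal sketch, not the MPICH code: the struct `rma_op_info`, its fields, and `can_use_fast_path()` are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>

/* Hypothetical summary of one RMA operation; field names are illustrative
 * and do not come from the MPICH source. */
typedef struct {
    bool origin_is_contig;      /* origin data is contiguous */
    bool target_is_predefined;  /* target datatype is predefined */
    bool has_ext_hdr;           /* an extended packet header is required */
} rma_op_info;

/* Returns true when the message could be issued directly (the
 * iStartMsgv()-style path) without creating a request beforehand. */
static bool can_use_fast_path(const rma_op_info *op)
{
    return op->origin_is_contig && op->target_is_predefined && !op->has_ext_hdr;
}
```

Any operation that fails one of the three conditions would take the request-based path described in item (1).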
    • Revert "Move 'stream_offset' out of RMA packet struct." · 6f62c424
      Xin Zhao authored and Pavan Balaji committed
      This reverts commit 19f29078
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  3. 20 Apr, 2015 1 commit
    • Move 'stream_offset' out of RMA packet struct. · 19f29078
      Xin Zhao authored
      'stream_offset' specifies the starting position (on the target
      window) of the current streaming unit in ACC-like operations.
      It was originally placed in the RMA packet struct, which
      potentially increases the CH3 packet size.
      This patch moves 'stream_offset' out of the RMA packet as
      follows: 1. when the target data is a basic datatype, we use
      'stream_offset' and the starting address of the entire
      operation to calculate the starting address of the current
      streaming unit, and rewrite 'addr' in the RMA packet with that
      value; 2. when the target data is a derived datatype, we cannot
      do the same, because the target needs to know both the starting
      address of the entire operation and the starting address of the
      current streaming unit. Therefore, we send 'stream_offset'
      separately to the target side.
      Signed-off-by: Min Si <msi@il.is.s.u-tokyo.ac.jp>
      Signed-off-by: Antonio J. Pena <apenya@mcs.anl.gov>
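For the basic-datatype case (item 1 in the commit above), the address rewrite amounts to one pointer addition. This is a sketch under the assumption that 'stream_offset' is a byte offset; the helper name `stream_unit_addr()` is hypothetical.

```c
#include <stddef.h>

/* Basic target datatype: fold the stream offset into the address that is
 * written to the packet's 'addr' field, so the offset itself need not be
 * sent. Assumes stream_offset is expressed in bytes. */
static void *stream_unit_addr(void *op_base_addr, size_t stream_offset)
{
    return (char *) op_base_addr + stream_offset;
}
```

For a derived target datatype this shortcut does not apply, because the target needs both the base address and the offset, which is why the commit sends 'stream_offset' separately in that case.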
  4. 04 Mar, 2015 8 commits
    • Rename eltype, n_elements and element_size to better names. · 98c76f78
      Xin Zhao authored and Pavan Balaji committed
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add stream_offset to ACC-related packets and request struct. · d8eb8de2
      Xin Zhao authored and Pavan Balaji committed
      Add a stream_offset field to ACC-related packets and the request
      struct to record the current stream unit's starting position
      within the entire target data.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Cutting ACC and GACC messages. · 382b04c4
      Xin Zhao authored and Pavan Balaji committed
      In this patch, we define the size of a streaming unit to be the
      same as the SRBuf size (256 * 1024 bytes) and cut ACC/GACC
      packets according to this size. A streaming unit always contains
      whole basic-type elements and never a partial one.
      Note that we also increment the reference count of the derived
      datatype, since multiple streaming units within one RMA
      operation refer to it.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
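The arithmetic behind cutting a message into streaming units can be sketched as follows. Only the 256 * 1024 byte unit size comes from the commit message; the function names are hypothetical, and the sketch assumes the basic type size is non-zero and no larger than one unit.

```c
#include <stddef.h>

#define STREAM_UNIT_BYTES (256 * 1024)  /* SRBuf size stated in the commit */

/* Largest number of whole basic elements that fit in one streaming unit,
 * so that no unit ever carries a partial element. */
static size_t elems_per_unit(size_t basic_type_size)
{
    return STREAM_UNIT_BYTES / basic_type_size;
}

/* Number of streaming units needed for total_elems elements (rounds up). */
static size_t num_stream_units(size_t total_elems, size_t basic_type_size)
{
    size_t per_unit = elems_per_unit(basic_type_size);
    return (total_elems + per_unit - 1) / per_unit;
}
```

With a 12-byte basic type, for example, a unit holds 21845 elements (262140 bytes) and the last 4 bytes of the unit stay unused.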
    • Code refactoring: set ureq to NULL when creating op. · 9fa6582a
      Xin Zhao authored and Pavan Balaji committed
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Bug-fix: make RMA work correctly with pair basic type. · ce8bc310
      Xin Zhao authored and Pavan Balaji committed
      The original RMA implementation does not consider pair basic
      types (e.g. MPI_FLOAT_INT, MPI_DOUBLE_INT). It works correctly
      only with builtin datatypes (e.g. MPI_INT, MPI_FLOAT).
      This patch makes RMA work correctly with pair basic types.
      The bug has two parts: (1) when performing the ACC computation,
      the original implementation uses 'eltype' in the datatype
      structure, which is set when all basic elements in the datatype
      have the same builtin datatype. When the basic elements have
      different builtin datatypes, as in pair datatypes, 'eltype' is
      set to MPI_DATATYPE_NULL, so the ACC computation cannot work
      with pair types; (2) for all basic-type data, the original
      implementation assumes the data is contiguous and issues it
      unpacked, with a length equal to the data size
      (count*type_size). This is incorrect for pair datatypes,
      because most pair datatypes are non-contiguous
      (type_extent != type_size).
      The previous patch already made 'eltype' store the basic type
      instead of the builtin type. This patch fixes the bug by
      (1) modifying the ACC computation to treat 'eltype' as a basic
      type; (2) using the noncontig API for non-contiguous basic-type
      data, so that it is issued packed.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
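The gap between type_extent and type_size for pair types can be illustrated with a plain C struct. This is only an analogy, assuming that a pair type such as MPI_DOUBLE_INT is laid out like the corresponding C struct; the names below are hypothetical and the exact numbers depend on the platform ABI.

```c
#include <stddef.h>

/* Layout analogous to MPI_DOUBLE_INT: a double followed by an int. */
typedef struct {
    double value;
    int index;
} double_int_pair;

/* "Size" in the MPI sense: bytes of actual data, without padding. */
static size_t pair_data_size(void)
{
    return sizeof(double) + sizeof(int);
}

/* "Extent" in the MPI sense: stride from one element to the next,
 * which includes any trailing padding the compiler inserts. */
static size_t pair_extent(void)
{
    return sizeof(double_int_pair);
}
```

On common 64-bit ABIs the data size is 12 bytes while the extent is 16, so sending count*type_size raw bytes would truncate the buffer and interleave padding with data. Packing, as the noncontig path does, avoids that.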
    • Simplify code: not using flag MPIDI_CH3_PKT_FLAG_RMA_IMMED_RESP for GACC/FOP. · 344bf958
      Xin Zhao authored
      The flag MPIDI_CH3_PKT_FLAG_RMA_IMMED_RESP tells the target
      whether the response packet of the current GET, GACC or FOP
      should use the IMMED packet type. We use the IMMED packet type
      only when the origin/target/result datatypes are all basic types.
      Since the target does not know the origin/result datatypes, the
      origin process needs to set a flag to inform the target.
      However, this usage is redundant for GACC and FOP packets. When
      we use the IMMED packet type for GACC/FOP packets, the
      origin/target/result datatypes must be basic types, so the
      response packets must use the IMMED packet type as well, and
      MPIDI_CH3_PKT_FLAG_RMA_IMMED_RESP and its related code are not
      necessary. In short, the flag MPIDI_CH3_PKT_FLAG_RMA_IMMED_RESP
      is useful only for the GET operation.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Store window basic attributes into a struct on window. · 9404e953
      Xin Zhao authored
      In this patch, we gather the basic window attributes of other
      processes (base_addr, size, disp_unit, win_handle) in a struct
      called "basic_info_table", so that they are stored in one
      contiguous memory region.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
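One way to picture such a table is an array of small structs indexed by rank. The sketch below is an assumption about the layout, not the MPICH definition: the type `win_basic_info`, the field types, and `target_addr()` are hypothetical; only the four field names come from the commit message.

```c
#include <stddef.h>

/* One entry per process in the window's communicator. Field types are
 * guesses made for illustration. */
typedef struct {
    void *base_addr;   /* start of that process's window memory */
    size_t size;       /* window size in bytes */
    int disp_unit;     /* displacement unit in bytes */
    int win_handle;    /* window handle on that process */
} win_basic_info;

/* Address of a target location: base + disp_unit * displacement. */
static void *target_addr(const win_basic_info *table, int rank, size_t disp)
{
    const win_basic_info *info = &table[rank];
    return (char *) info->base_addr + (size_t) info->disp_unit * disp;
}
```

Keeping all four attributes in one entry means a single lookup touches one contiguous record instead of four separate per-rank arrays.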
  5. 03 Mar, 2015 1 commit
  6. 13 Feb, 2015 4 commits
    • Remove source_win_handle from GET-like RMA packets. · 80a71e11
      Xin Zhao authored
      For GET-like RMA packets and response packets (GACC,
      we originally carried source_win_handle in the packet struct
      in order to locate the window handle on the origin side in
      the packet handler of response packets. However, this is
      not necessary, because source_win_handle can be stored
      in the request on the origin side. This patch deletes
      source_win_handle from those packets to reduce the size
      of the packet union.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Rewrite code of piggybacking IMMED data with RMA packets. · de9d0f21
      Xin Zhao authored
      Originally we added "immed_data" and "immed_len" fields to RMA
      packets in order to piggyback a small amount of data with the
      packet header and reduce the number of packets (note that
      "immed_len" is necessary when the piggybacked data is not the
      entire data). However, those fields potentially increase the
      packet union size and slow down two-sided communication. This
      patch fixes this issue.
      In this patch, we remove "immed_data" and "immed_len" from the normal
      "MPIDI_CH3_Pkt_XXX_t" operation type (e.g. MPIDI_CH3_Pkt_put_t) and
      introduce a new "MPIDI_CH3_Pkt_XXX_immed_t" packet type for each
      operation (e.g. MPIDI_CH3_Pkt_put_immed_t).
      "MPIDI_CH3_Pkt_XXX_immed_t" is used when (1) both origin and target
      are basic datatypes, AND (2) the data to be sent fits entirely
      into the header. As a result, "MPIDI_CH3_Pkt_XXX_immed_t" needs the
      "immed_data" field but can drop the "immed_len" field. Also, since it
      only works with a basic target datatype, it can drop the
      "dataloop_size" field as well. All operations that do not satisfy
      both (1) and (2) use the normal "MPIDI_CH3_Pkt_XXX_t" type.
      Originally we always piggybacked FOP data into the packet header,
      which made the packet size too large. In this patch we split the
      FOP operation into IMMED packets and normal packets.
      Because CAS only works with 2 basic datatype and non-complex
      elements, the amount of data is relatively small, so we always
      piggyback the data with the packet header and use only the
      "MPIDI_CH3_Pkt_XXX_immed_t" packet type for CAS.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
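The choice between the two packet types described in the commit above reduces to conditions (1) and (2). A minimal sketch follows; `pkt_kind`, `choose_pkt_kind()` and `IMMED_CAPACITY` are hypothetical names, and the 16-byte capacity is an assumption borrowed from the default mentioned in the earlier "Add IMMED area in packet header" commit.

```c
#include <stdbool.h>
#include <stddef.h>

#define IMMED_CAPACITY 16  /* assumed header capacity in bytes */

typedef enum { PKT_NORMAL, PKT_IMMED } pkt_kind;

/* IMMED is chosen only when both datatypes are basic AND the whole
 * payload fits in the header; everything else uses the normal packet. */
static pkt_kind choose_pkt_kind(bool origin_is_basic, bool target_is_basic,
                                size_t data_len)
{
    if (origin_is_basic && target_is_basic && data_len <= IMMED_CAPACITY)
        return PKT_IMMED;
    return PKT_NORMAL;
}
```

Because an IMMED packet always carries the entire payload, the receiver can derive the length from count and datatype, which is why the "immed_len" field can be dropped.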
    • Code-refactoring for RMA operations routines. · 3a017faa
      Xin Zhao authored
      This patch only refactors the RMA operation routines to make the
      code structure clearer. It does not change any functionality.
      After the refactoring, each operation routine handles non-SHM
      operations in the following order:
      (1) allocate a new op struct;
      (2) fill the fields in the op struct, except for its packet struct;
      (3) initialize the packet struct in the op struct and fill its fields;
      (4) enqueue the op to the data structure on the window.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Remove lock_type and origin_rank areas from RMA packet. · 81e2b274
      Xin Zhao authored
      Originally we added lock_type and origin_rank fields
      to the RMA packet in order to piggyback passive lock requests
      with RMA operations. However, those fields potentially
      enlarge the packet union size, and they are not actually
      necessary and can be avoided completely.
      "Lock_type" records which type of lock (shared or exclusive)
      the origin wants to acquire on the target. To remove it from
      the RMA packet, we use flags (which already exist in the RMA
      packet) to carry this information.
      "Origin_rank" records which origin has sent a lock request
      to the target, so that when the lock is later granted to this
      origin, the target can send an ack to it. The target does not
      actually need to store origin_rank; it can store only
      origin_vc, which is known from the progress engine on the
      target side. Therefore, we can completely remove origin_rank
      from the RMA packet.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
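Encoding the lock type in an existing flags word, as the commit above describes, might look like the sketch below. The bit assignments and all names are hypothetical; the actual MPICH flag names and values are not given in the commit message.

```c
/* Hypothetical bit assignments inside the packet's existing flags word. */
enum {
    FLAG_LOCK_SHARED    = 1 << 0,
    FLAG_LOCK_EXCLUSIVE = 1 << 1
};

typedef enum { LOCK_NONE, LOCK_SHARED, LOCK_EXCLUSIVE } lock_kind;

/* Recover the requested lock type on the target from the flags alone,
 * so no separate lock_type field is needed in the packet. */
static lock_kind lock_kind_from_flags(unsigned flags)
{
    if (flags & FLAG_LOCK_EXCLUSIVE)
        return LOCK_EXCLUSIVE;
    if (flags & FLAG_LOCK_SHARED)
        return LOCK_SHARED;
    return LOCK_NONE;
}
```

Since the flags word is already part of every RMA packet, this carries the lock type without growing the packet union.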
  7. 16 Dec, 2014 5 commits
    • Use int instead of size_t in RMA pkt header. · 3a05784f
      Xin Zhao authored
      Use int instead of size_t in RMA pkt header to reduce
      packet size.
      No reviewer.
    • Bug-fix: add IMMED area in GET/GACC response packets · 87acbbbe
      Xin Zhao authored
      In this patch we allow GET/GACC response packets to
      piggyback some IMMED data, just as we did
      for PUT/GACC/FOP/CAS packets.
      No reviewer.
    • Perf-optimize: support piggybacking LOCK on large RMA operations. · 4739df59
      Xin Zhao authored
      Originally we only allowed a LOCK request to be piggybacked
      with small RMA operations (all data fits in the packet
      header). This adds communication overhead for larger
      operations, since the origin side needs to wait for the LOCK
      ACK before it can transmit data to the target.
      In this patch we add support for piggybacking LOCK with
      RMA operations of arbitrary size. Note that (1) this
      only works with basic datatypes; (2) if the LOCK cannot
      be satisfied, we temporarily buffer the operation on
      the target side.
      No reviewer.
    • Bug-fix: correctly modify win_ptr->accumulated_ops_cnt · 7b1a5e2d
      Xin Zhao authored
      accumulated_ops_cnt tracks the number of accumulated
      posted RMA operations between two synchronization calls,
      so that we can decide when to poke the progress engine based
      on the current value of this counter.
      Here we initialize it to zero in the BEGINNING synchronization
      calls (Win_fence, Win_start, first Win_lock, Win_lock_all),
      and correctly decrement it in the ENDING synchronization calls
      (Win_fence, Win_complete, Win_unlock, Win_unlock_all,
      Win_flush, Win_flush_local, Win_flush_all, Win_flush_local_all).
      We also use a per-target counter to track the single-target case.
      No reviewer.
  8. 13 Nov, 2014 1 commit
  9. 11 Nov, 2014 1 commit
  10. 04 Nov, 2014 1 commit
    • Implement true request-based RMA operations. · 3e005f03
      Min Si authored
      Two requests are associated with each request-based
      operation: one normal internal request (req) and one newly
      added user request (ureq). We return ureq to the user when
      the request-based op call returns.
      The ureq is initialized with its completion counter (CC) set
      to 1 and its ref count set to 2 (one reference held by CH3 and
      another by the user). If the corresponding op can be
      finished immediately in CH3, the runtime completes ureq
      in CH3 and lets the user's MPI_Wait/Test destroy ureq. If the
      corresponding op cannot be finished immediately, we
      first increment the ref count to 3 (because now three
      places need to reference ureq: user, CH3, and the
      progress engine). The progress engine completes ureq when
      the op is completed, then CH3 releases its reference during
      garbage collection, and finally the user's MPI_Wait/Test
      destroys ureq.
      The ureq can be completed in the following three ways:
      1. If the op is issued and completed immediately in CH3
      (req is NULL), we just complete ureq before freeing the op.
      2. If the op is issued but not completed, we remember the ureq
      handler in req and set the OnDataAvail / OnFinal handlers
      in req to a newly added request handler, which will complete
      the user request. The handler is triggered in three places:
         2-a. when the progress engine completes a put/acc req;
         2-b. when the get/getacc handler completes a get/getacc req;
         2-c. when the progress engine completes a get/getacc req;
      3. If the op is not issued (i.e., it waits for the lock to be
      granted), the 2nd way is eventually performed when the op is
      issued by the progress engine.
      Signed-off-by: Xin Zhao <xinzhao3@illinois.edu>
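The reference-count lifecycle of ureq in the deferred case can be traced with a toy model. This is a sketch, not the MPICH request code: `user_req` and the four helpers are hypothetical, and it assumes the progress engine drops its own reference after completing ureq, which the commit message implies but does not state.

```c
/* Toy model of a user request: completion counter plus reference count. */
typedef struct {
    int cc;         /* completion counter: 1 = pending, 0 = complete */
    int ref_count;  /* number of holders that still reference the request */
} user_req;

/* Initial state from the commit: CC = 1, refs = 2 (CH3 and the user). */
static void ureq_init(user_req *u) { u->cc = 1; u->ref_count = 2; }

/* Deferred case: the progress engine becomes a third holder. */
static void ureq_add_progress_ref(user_req *u) { u->ref_count++; }

/* Mark the request complete. */
static void ureq_complete(user_req *u) { u->cc = 0; }

/* Drop one reference; returns 1 when the last reference is gone and the
 * request may be destroyed. */
static int ureq_release(user_req *u) { return --u->ref_count == 0; }
```

In the deferred path the count goes 2, 3, then back down through the progress engine, CH3 garbage collection, and finally the user's MPI_Wait/Test, which is the release that destroys the request.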
  11. 03 Nov, 2014 13 commits
    • add original RMA PVARs back. · ed20cd37
      Xin Zhao authored
      Add some original RMA PVARs back to the new
      RMA infrastructure, including timing of packet
      handlers, op allocation and setting, window
      creation, etc.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Delete no longer needed code. · cc63b367
      Xin Zhao authored
      We made a huge change to the RMA infrastructure, and
      a lot of old code can be dropped, including separate
      handlers for lock-op-unlock, ACCUM_IMMED specific
      code, O(p) data structure code, code of lazy issuing,
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Rewrite code of passive lock control messages. · 0542e304
      Xin Zhao authored
      1. Piggyback LOCK request with first IMMED operation.
      When we see an IMMED operation, we can always piggyback the
      LOCK request with that operation, which saves the separate
      sync message of a standalone LOCK request. When the packet
      header of that operation is received on the target, we try to
      acquire the lock and perform the operation. The target
      either piggybacks the LOCK_GRANTED message with the response
      packet (if available) or sends a standalone LOCK_GRANTED
      message back to the origin.
      2. Rewrite the lock queue management code.
      When the lock request cannot be satisfied on the target,
      we need to buffer that lock request on the target. All we
      need to do is enqueue the packet header, which contains
      all the information we need after the lock is granted. When
      the current lock is released, the runtime goes
      over the lock queue and grants the lock to the next
      available request. After the lock is granted, the runtime
      just triggers the packet handler a second time.
      3. Release lock on target side if piggybacking with UNLOCK.
      If there are active-message operations to be issued,
      we piggyback an UNLOCK flag with the last operation.
      When the target receives it, it releases the current
      lock and grants the lock to the next process.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add IMMED area in packet header. · e8d4c6d5
      Xin Zhao authored
      We add an IMMED data area (16 bytes by default) to the
      packet header, which contains as much origin
      data as possible. If the origin can put all data in the
      packet header, it no longer needs to send a
      separate data packet. When the target receives the
      packet header, it first copies the data out of
      the IMMED data area. If more data is still
      coming, it continues to receive the following
      packets; if all data is included in the header,
      receiving is done.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
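Filling the IMMED area on the origin side can be sketched as "copy what fits, report what is left". The 16-byte default comes from the commit message; the `pkt_hdr` layout and `pack_immed()` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <string.h>

#define IMMED_BYTES 16  /* default size stated in the commit message */

/* Hypothetical packet header with an IMMED data area. */
typedef struct {
    char immed_data[IMMED_BYTES];
    size_t immed_len;  /* bytes of immed_data actually used */
} pkt_hdr;

/* Copies as much of src as fits into the header and returns the number
 * of bytes that still have to travel in a separate data packet. */
static size_t pack_immed(pkt_hdr *hdr, const void *src, size_t len)
{
    size_t n = len < IMMED_BYTES ? len : IMMED_BYTES;
    memcpy(hdr->immed_data, src, n);
    hdr->immed_len = n;
    return len - n;
}
```

A return value of 0 corresponds to the case where no separate data packet is needed and the target is done once it has copied the IMMED area out of the header.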
    • code refactoring in operation routines. · d129eed3
      Xin Zhao authored
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Split shared RMA packet structures. · c0094faa
      Xin Zhao authored
      Previously several RMA packet types shared the same structure,
      which is misleading when coding. Here we make different
      RMA packet types use different packet data structures.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Control no. of active RMA requests in the runtime. · 257faca2
      Xin Zhao authored
      When there are too many active requests in the runtime,
      internal memory might be exhausted. This patch
      prevents that situation by triggering a blocking
      wait loop in operation routines when the number of active
      requests reaches a certain threshold value.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Enable making progress in operation routines. · 33d96690
      Xin Zhao authored
      We no longer use the lazy-issuing model, which delays
      issuing all operations until the end; instead we issue them
      as early as possible. To achieve this, we enable
      making progress in RMA routines, so that RMA operations
      can be issued as soon as synchronization is finished.
      Sometimes we also need to poke the progress engine in
      operation routines to make sure that the target side
      makes enough progress on receiving packets. Here
      we trigger it when the number of posted operations reaches
      a certain threshold value.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Implement GET_OP routine which guarantees to return an OP. · 5dd55154
      Xin Zhao authored
      GET_OP may be a blocking function; it guarantees
      to return an RMA operation.
      Inside GET_OP we first call the normal OP_ALLOC function,
      which tries to get a new OP from the OP pools; if that fails,
      we call the nonblocking GC function to clean up completed ops
      and then call OP_ALLOC again; if we still cannot get a
      new OP, we call the nonblocking FREE_OP_BEFORE_COMPLETION
      function if hardware ordering is provided and then call
      OP_ALLOC again; if that still fails, we finally call the blocking
      aggressive cleanup function, which is guaranteed to
      return a new OP element.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
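The escalation order inside GET_OP can be modelled with a toy pool. Everything below is a simplified stand-in, not the MPICH routines: the `op_pool` counters and the helper functions are hypothetical, and the blocking cleanup is reduced to "one op becomes free".

```c
#include <stdbool.h>

/* Toy pool: counters stand in for the real op lists. */
typedef struct {
    int free_ops;       /* ops available for allocation */
    int completed_ops;  /* completed ops not yet garbage-collected */
    int issued_ops;     /* issued ops that are not yet complete */
    bool hw_ordering;   /* hardware ordering is provided */
} op_pool;

static bool op_alloc(op_pool *p)
{
    if (p->free_ops > 0) { p->free_ops--; return true; }
    return false;
}

/* Nonblocking GC: reclaim ops that have already completed. */
static void gc_nonblocking(op_pool *p)
{
    p->free_ops += p->completed_ops;
    p->completed_ops = 0;
}

/* Reclaim issued ops early; only valid with hardware ordering. */
static void free_before_completion(op_pool *p)
{
    p->free_ops += p->issued_ops;
    p->issued_ops = 0;
}

/* Stand-in for the blocking aggressive cleanup: always frees one op. */
static void cleanup_blocking(op_pool *p) { p->free_ops += 1; }

/* Returns the step (1..4) at which an op was obtained. */
static int get_op(op_pool *p)
{
    if (op_alloc(p)) return 1;
    gc_nonblocking(p);
    if (op_alloc(p)) return 2;
    if (p->hw_ordering) {
        free_before_completion(p);
        if (op_alloc(p)) return 3;
    }
    cleanup_blocking(p);
    (void) op_alloc(p);
    return 4;
}
```

The cheap nonblocking steps come first, so the blocking cleanup is only reached when nothing else can free an op.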
    • Add blocking ops / targets aggressively cleanup functions. · 41a365ec
      Xin Zhao authored
      When we run out of resources for operations and targets,
      we need to make the runtime complete some operations
      so that it can free some resources.
      For RMA operations, we do this by performing an internal
      FLUSH_LOCAL for one target and waiting for operation
      resources; for RMA targets, we do this by performing an
      internal FLUSH operation for one target and waiting for
      target resources.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add new RMA states on window / target and modify state checking. · f076f3fe
      Xin Zhao authored
      We define new states to indicate the current status of
      RMA synchronization. The states cover both ACCESS states
      and EXPOSURE states, and specify whether the synchronization
      is initialized (_CALLED), ongoing (_ISSUED) or completed
      (_GRANTED). For a single lock in Passive Target, we use a
      per-target state, and the window state is set to PER_TARGET.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add a flag in op struct to indicate derived datatype. · 7eac974f
      Xin Zhao authored
      Add a flag is_dt to the op structure, which is set when any
      buffer involved in the RMA operation contains derived
      datatype data. It is convenient for us to enqueue
      issued but not completed operations to the DT specific
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add routine to enqueue op to RMA slots. · 079a516b
      Xin Zhao authored
      Given an RMA op, find the correct slot and target, and
      enqueue the op to the pending op list in that target object.
      If the target does not exist, create one in that slot.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
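The find-or-create step in the last commit above can be sketched with a small array of slots, each holding a linked list of targets. This is an illustration only: the type names, `NUM_SLOTS`, and the `rank % NUM_SLOTS` slot selection are assumptions, not details taken from the commit. Ranks are assumed to be non-negative.

```c
#include <stdlib.h>

#define NUM_SLOTS 4  /* assumed slot count, for illustration */

typedef struct rma_op { struct rma_op *next; } rma_op;

typedef struct rma_target {
    int rank;
    rma_op *pending_head, *pending_tail;  /* pending op list (FIFO) */
    struct rma_target *next;              /* next target in the same slot */
} rma_target;

typedef struct { rma_target *targets; } rma_slot;

/* Look up the target object for 'rank' in its slot; create it if missing.
 * Returns NULL only when allocation fails. */
static rma_target *find_or_create_target(rma_slot *slots, int rank)
{
    rma_slot *slot = &slots[rank % NUM_SLOTS];
    rma_target *t;
    for (t = slot->targets; t != NULL; t = t->next)
        if (t->rank == rank)
            return t;
    t = calloc(1, sizeof(*t));
    if (t == NULL)
        return NULL;
    t->rank = rank;
    t->next = slot->targets;
    slot->targets = t;
    return t;
}

/* Append 'op' to the pending list of its target; returns 0 on success. */
static int enqueue_op(rma_slot *slots, int rank, rma_op *op)
{
    rma_target *t = find_or_create_target(slots, rank);
    if (t == NULL)
        return -1;
    op->next = NULL;
    if (t->pending_tail != NULL)
        t->pending_tail->next = op;
    else
        t->pending_head = op;
    t->pending_tail = op;
    return 0;
}
```

Target objects are created lazily, so memory grows with the number of targets a process actually communicates with rather than with the total number of processes.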