1. 04 Mar, 2015 2 commits
  2. 03 Mar, 2015 1 commit
  3. 16 Dec, 2014 3 commits
    • Xin Zhao's avatar
      Re-organize progress engine functions. · 1962d3b1
      Xin Zhao authored
      Rewrite progress engine functions as following:
      
      Basic functions:
      
      (1) check_target_state: check to see if we can switch target state,
          issue synchronization messages if needed.
      (2) issue_ops_target: issue al pending operations to this target.
      (3) check_window_state: check to see if we can switch window state.
      (4) issue_ops_win: issue all pending operations on this window.
          Currently it internally calls check_target_state and
          issue_ops_target, it should be optimized in future.
      
      Progress making functions:
      
      (1) Make_progress_target: make progress on one target, which
          internally call check_target_state and issue_ops_target.
      (2) Make_progress_win: make progress on all targets on one window,
          which internally call check_window_state and issue_ops_win.
      (3) Make_progress_global: make progress on all windows, which
          internally call make_progress_win.
      
      No reviewer.
      1962d3b1
    • Xin Zhao's avatar
      Modify struct name: replace "struct XXX" with "XXX_t" · 7c533ef3
      Xin Zhao authored
      No reviewer.
      7c533ef3
    • Xin Zhao's avatar
      Bug-fix: correctly modify win_ptr->accumulated_ops_cnt · 7b1a5e2d
      Xin Zhao authored
      accumulated_ops_cnt is used to track no. of accumulated
      posted RMA operations between two synchronization calls,
      so that we can decide when to poke progress engine based
      on the current value of this counter.
      
      Here we initialize it to zero in the BEGINNING synchronization
      calls (Win_fence, Win_start, first Win_lock, Win_lock_all),
      and correctly decrement it in the ENDING synchronization calls
      (Win_fence, Win_complete, Win_unlock, Win_unlock_all,
      Win_flush, Win_flush_local, Win_flush_all, Win_flush_local_all).
      We also use a per-target counter to track single target case.
      
      No reviewer.
      7b1a5e2d
  4. 13 Nov, 2014 1 commit
    • Xin Zhao's avatar
      Perf-tuning: issue FLUSH, FLUSH ACK, UNLOCK ACK messages only when needed. · a9d968cc
      Xin Zhao authored
      
      
      When operation pending list and request lists are all empty, FLUSH message
      needs to be sent by origin only when origin issued PUT/ACC operations since
      the last synchronization calls, otherwise origin does not need to issue FLUSH
      at all and does not need to wait for FLUSH ACK message.
      
      Similiarly, origin waits for ACK of UNLOCK message only when origin issued
      PUT/ACC operations since the last synchronization calls. However, UNLOCK
      message always needs to be sent out because origin needs to unlock the
      target process. This patch avoids issuing unnecessary
      FLUSH / FLUSH ACK / UNLOCK ACK messages.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      a9d968cc
  5. 11 Nov, 2014 2 commits
  6. 07 Nov, 2014 1 commit
    • Xin Zhao's avatar
      Move definition of global window counters to nemesis / sock. · a4223bc3
      Xin Zhao authored
      
      
      num_active_issued_win and num_passive_win are counters of
      windows in active ISSUED mode and in passive mode.
      It is modified in CH3 and is used in progress engine of
      nemesis / sock to skip windows that do not need to make
      progress on. Here we define them in mpidi_ch3_pre.h in
      nemesis / sock so that they can be exposed to upper layers.
      Signed-off-by: default avatarMin Si <msi@il.is.s.u-tokyo.ac.jp>
      a4223bc3
  7. 04 Nov, 2014 1 commit
    • Min Si's avatar
      Implement true request-based RMA operations. · 3e005f03
      Min Si authored
      
      
      There are two requests associated with each request-based
      operation: one normal internal request (req) and one newly
      added user request (ureq). We return ureq to user when
      request-based op call returns.
      
      The ureq is initialized with completion counter (CC) to 1
      and ref count to 2 (one is referenced by CH3 and another
      is referenced by user). If the corresponding op can be
      finished immediately in CH3, the runtime will complete ureq
      in CH3, and let user's MPI_Wait/Test to destroy ureq. If
      corresponding op cannot be finished immediately, we will
      first increment ref count to 3 (because now there are
      three places needed to reference ureq: user, CH3,
      progress engine). Progress engine will complete ureq when
      op is completed, then CH3 will release its reference during
      garbage collection, finally user's MPI_Wait/Test will
      destroy ureq.
      
      The ureq can be completed in following three ways:
      
      1. If op is issued and completed immediately in CH3
      (req is NULL), we just complete ureq before free op.
      
      2. If op is issued but not completed, we remember the ureq
      handler in req and specify OnDataAvail / OnFinal handlers
      in req to a newly added request handler, which will complete
      user reqeust. The handler is triggered at three places:
         2-a. when progress engine completes a put/acc req;
         2-b. when get/getacc handler completes a get/getacc req;
         2-c. when progress engine completes a get/getacc req;
      
      3. If op is not issued (i.e., wait for lock granted), the 2nd
      way will be eventually performed when such op is issued by
      progress engine.
      Signed-off-by: default avatarXin Zhao <xinzhao3@illinois.edu>
      3e005f03
  8. 03 Nov, 2014 19 commits
    • Xin Zhao's avatar
      add original RMA PVARs back. · ed20cd37
      Xin Zhao authored
      
      
      Add some original RMA PVARs back to the new
      RMA infrastructure, including timing of packet
      handlers, op allocation and setting, window
      creation, etc.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      ed20cd37
    • Xin Zhao's avatar
      Delete no longer needed code. · cc63b367
      Xin Zhao authored
      
      
      We made a huge change to RMA infrastructure and
      a lot of old code can be droped, including separate
      handlers for lock-op-unlock, ACCUM_IMMED specific
      code, O(p) data structure code, code of lazy issuing,
      etc.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      cc63b367
    • Xin Zhao's avatar
      Rewrite code of passive lock control messages. · 0542e304
      Xin Zhao authored
      
      
      1. Piggyback LOCK request with first IMMED operation.
      
      When we see an IMMED operation, we can always piggyback
      LOCK request with that operation to reduce one sync
      message of single LOCK request. When packet header of
      that operation is received on target, we will try to
      acquire the lock and perform that operation. The target
      either piggybacks LOCK_GRANTED message with the response
      packet (if available), or sends a single LOCK_GRANTED
      message back to origin.
      
      2. Rewrite code of manage lock queue.
      
      When the lock request cannot be satisfied on target,
      we need to buffer that lock request on target. All we
      need to do is enqueuing the packet header, which contains
      all information we need after lock is granted. When
      the current lock is released, the runtime will goes
      over the lock queue and grant the lock to the next
      available request. After lock is granted, the runtime
      just trigger the packet handler for the second time.
      
      3. Release lock on target side if piggybacking with UNLOCK.
      
      If there are active-message operations to be issued,
      we piggyback a UNLOCK flag with the last operation.
      When the target recieves it, it will release the current
      lock and grant the lock to the next process.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      0542e304
    • Xin Zhao's avatar
      Decrement Active Target counter at target side. · b73778ea
      Xin Zhao authored
      
      
      During PSCW, when there are active-message operations
      to be issued in Win_complete, we piggback a AT_COMPLETE
      flag with it so that when target receives it, it can
      decrement a counter on target side and detect completion
      when target counter reaches zero.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      b73778ea
    • Xin Zhao's avatar
      Rewrite all synchronization routines. · 38b20e57
      Xin Zhao authored
      
      
      We use new algorithms for RMA synchronization
      functions and RMA epochs. The old implementation
      uses a lazy-issuing algorithm, which queues up
      all operations and issues them at end. This
      forbid opportunites to do hardware RMA operations
      and can use up all memory resources when we
      queue up large number of operations.
      
      Here we use a new algorithm, which will initialize
      the synchonization at beginning, and issue operations
      as soon as the synchronization is finished.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      38b20e57
    • Xin Zhao's avatar
      Control no. of active RMA requests in the runtime. · 257faca2
      Xin Zhao authored
      
      
      When there are too many active requests in the runtime,
      the internal memory might be used up. This patch
      prevents such situation by triggering blocking
      wait loop in operation routines when no. of active
      requests reaches certain threshold value.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      257faca2
    • Xin Zhao's avatar
      Implement GET_OP routine which guarantees to return an OP. · 5dd55154
      Xin Zhao authored
      
      
      GET_OP function may be a blocking function which guarantees
      to return an RMA operation.
      
      Inside GET_OP we first call the normal OP_ALLOC function
      which will try to get a new OP from OP pools; if failed,
      we call nonblocking GC function to cleanup completed ops
      and then call OP_ALLOC again; if we still cannot get a
      new OP, we call nonblocking FREE_OP_BEFORE_COMPLETION
      function if hardware ordering is provided and then call
      OP_ALLOC again; if still failed, finally we call blocking
      aggressive cleanup function, which will guarantee to
      return a new OP element.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      5dd55154
    • Xin Zhao's avatar
      Free incomplete ops when FLUSH ordering is provided. · 7c1e12f0
      Xin Zhao authored
      
      
      When FLUSH sync is issued and remote completion
      ordering between the last FLUSH message and all
      previous ops is provided by curent hardware, we
      no longer need to maintain incomplete operations
      but only need to wait for the ACK of current
      FLUSH. Therefore we can free those operation
      resources without blocking waiting.
      
      Not that if we do this, we temporarily lose the
      opportunity to do a real FLUSH_LOCAl until the
      current FLUSH ACK is received.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      7c1e12f0
    • Xin Zhao's avatar
      Add blocking ops / targets aggressively cleanup functions. · 41a365ec
      Xin Zhao authored
      
      
      When we run out of resources for operations and targets,
      we need to make the runtime to complete some operations
      so that it can free some resources.
      
      For RMA operations, we implement by doing an internal
      FLUSH_LOCAL for one target and waiting for operation
      resources; for RMA targets, we implement by doing an
      internal FLUSH operation for one target and wait for
      target resources.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      41a365ec
    • Xin Zhao's avatar
      Add nonblocking progress making functions. · ab058906
      Xin Zhao authored
      
      
      Progress making functions check if current
      synchronization is finished, change synchronization
      state if possible, and issue pending operations
      on window as many as possible.
      
      There are three granularity of progress making functions:
      per-target, per-window and per-process. Per-target
      routine is used in RMA routine functions (PUT/GET/ACC...)
      and single passive lock (Win_unlock, Win_flush, Win_flush_local);
      per-window routine is used in window-wide synchronization
      calls (Win_fence, Win_complete, Win_unlock_all,
      Win_flush_all, Win_flush_local_all), and per-process
      routine is used in progress engine.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      ab058906
    • Xin Zhao's avatar
      Add nonblocking ops / targets GC functions. · ebee0b71
      Xin Zhao authored
      
      
      Here we implement garbage collection functions for
      both operations and targets. There are two level of GC
      functions: per-target and per-window. Per-target functions
      are used in single passive lock ending calls: Win_unlock;
      per-window functions are used in window-wide ending
      calls: Win_fence, Win_complete, Win_unlock_all.
      
      Garbage collection functions for RMA ops go over all
      incomplete operation lists in target element and free
      completed operations. It also returns flags indicating
      local completion and remote completion.
      
      Garbage collection functions for RMA targets go over
      all targets and free those targets that have compeleted empty
      operation lists.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      ebee0b71
    • Xin Zhao's avatar
      Keep track of no. of non-empty slots on window. · f91d4633
      Xin Zhao authored
      
      
      Keep track of no. of non-empty slots on window so that
      when number is 0, there are no operations needed to
      be processed and we can ignore that window.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      f91d4633
    • Xin Zhao's avatar
      Add new RMA states on window / target and modify state checking. · f076f3fe
      Xin Zhao authored
      
      
      We define new states to indicate the current situation of
      RMA synchronization. The states contain both ACCESS states
      and EXPOPSURE states, and specify if the synchronization
      is initialized (_CALLED), on-going (_ISSUED) and completed
      (_GRANTED). For single lock in Passive Target, we use
      per-target state whereas the window state is set to PER_TARGET.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      f076f3fe
    • Xin Zhao's avatar
      Add a flag in op struct to indicate derived datatype. · 7eac974f
      Xin Zhao authored
      
      
      Add flag is_dt in op structure which is set when any
      buffers involved in RMA operations contains derived
      datatype data. It is convenient for us to enqueue
      issued but not completed operation to the DT specific
      list.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      7eac974f
    • Xin Zhao's avatar
      Add routine to enqueue op to RMA slots. · 079a516b
      Xin Zhao authored
      
      
      Given an RMA op, finding the correct slot and target,
      enqueue op to the pending op list in that target object.
      If the target is not existed, create one in that slot.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      079a516b
    • Xin Zhao's avatar
      Add RMA slots and related APIs. · 0f596c48
      Xin Zhao authored
      
      
      We allocate a fixed size of targets array on window
      during window creation. The size can be configured
      by the user via CVAR. Each slot entry contains a list
      of target elements.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      0f596c48
    • Xin Zhao's avatar
      Add target element and global / local pools and related APIs. · 5dd8a0a4
      Xin Zhao authored
      
      
      Here we add a data structure to store information of active target.
      The information includes operation lists, pasive lock state,
      sync state, etc.
      
      The target element is created by origin on-demand, and can
      be freed after the remote completion of all previous oeprations
      is detected. After RMA ending synchrnization calls, all
      target elements should be freed.
      
      Similiarly with operation pools, we create two-level target
      pools for target elements: one pre-window target pool and
      one global target pool.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      5dd8a0a4
    • Xin Zhao's avatar
      Add global / local pools of RMA ops and related APIs. · fc7617f2
      Xin Zhao authored and Pavan Balaji's avatar Pavan Balaji committed
      
      
      Instead of allocating / deallocating RMA operations whenever
      an RMA op is posted by user, we allocate fixed size operation
      pools beforehand and take the op element from those pools
      when an RMA op is posted.
      
      With only a local (per-window) op pool, the number of ops
      allocated can increase arbitrarily if many windows are created.
      Alternatively, if we only use a global op pool, other windows
      might use up all operations thus starving the window we are
      working on.
      
      In this patch we create two pools: a local (per-window) pool and a
      global pool.  Every window is guaranteed to have at least the number
      of operations in the local pool.  If we run out of these operations,
      we check in the global pool to see if we have any operations left.
      When an operation is released, it is added back to the same pool it
      was allocated from.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      fc7617f2
    • Xin Zhao's avatar
      Code refactoring to clean up the RMA code. · 61f952c7
      Xin Zhao authored and Pavan Balaji's avatar Pavan Balaji committed
      
      
      Split RMA functionality into smaller files, and move functions
      to where they belong based on the file names.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      61f952c7