1. 03 Nov, 2014 40 commits
    • Xin Zhao's avatar
      add original RMA PVARs back. · ed20cd37
      Xin Zhao authored
      
      
      Add some original RMA PVARs back to the new
      RMA infrastructure, including timing of packet
      handlers, op allocation and setting, window
      creation, etc.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      ed20cd37
    • Pavan Balaji's avatar
      Remove namespacing for static functions and types. · b682ec0e
      Pavan Balaji authored
      
      
      Names of static functions and types do not need
      namespacing. Here we remove the MPIDI_CH3I_ prefix from
      those functions and types.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      b682ec0e
    • Xin Zhao's avatar
      Delete no longer needed code. · cc63b367
      Xin Zhao authored
      
      
      We made a huge change to RMA infrastructure and
      a lot of old code can be dropped, including separate
      handlers for lock-op-unlock, ACCUM_IMMED specific
      code, O(p) data structure code, code of lazy issuing,
      etc.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      cc63b367
    • Xin Zhao's avatar
      Rewrite code of passive lock control messages. · 0542e304
      Xin Zhao authored
      
      
      1. Piggyback LOCK request with first IMMED operation.
      
      When we see an IMMED operation, we can always piggyback
      the LOCK request on that operation, which saves the sync
      message of a standalone LOCK request. When the packet header of
      that operation is received on the target, we try to
      acquire the lock and perform the operation. The target
      either piggybacks the LOCK_GRANTED message on the response
      packet (if available), or sends a standalone LOCK_GRANTED
      message back to the origin.
      
      2. Rewrite the lock queue management code.
      
      When the lock request cannot be satisfied on the target,
      we need to buffer that lock request there. All we
      need to do is enqueue the packet header, which contains
      all the information we need once the lock is granted. When
      the current lock is released, the runtime goes
      over the lock queue and grants the lock to the next
      available request. After the lock is granted, the runtime
      simply triggers the packet handler a second time.
      
      3. Release lock on target side if piggybacking with UNLOCK.
      
      If there are active-message operations to be issued,
      we piggyback an UNLOCK flag on the last operation.
      When the target receives it, it releases the current
      lock and grants the lock to the next process.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      0542e304
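The lock-queue behaviour described in this commit can be sketched as follows. This is a minimal illustration, not the MPICH code: every type and function name here (`pkt_header`, `lock_entry`, `pkt_handler`, `release_lock`) is hypothetical, and the queue is drained in strict FIFO order, which is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* All names are hypothetical; this is not the MPICH data layout. */
enum lock_type { LOCK_NONE = 0, LOCK_SHARED, LOCK_EXCLUSIVE };

struct pkt_header {               /* stand-in for a buffered RMA packet header */
    int origin_rank;
    enum lock_type lock;          /* lock type piggybacked on this packet */
};

struct lock_entry {
    struct pkt_header hdr;
    struct lock_entry *next;
};

struct window {
    enum lock_type held;          /* lock currently held on the target */
    int shared_holders;
    struct lock_entry *head, *tail;   /* FIFO of buffered packet headers */
    int handled;                  /* number of operations actually performed */
};

static int try_acquire(struct window *w, enum lock_type req)
{
    if (w->held == LOCK_NONE) {
        w->held = req;
        w->shared_holders = (req == LOCK_SHARED) ? 1 : 0;
        return 1;
    }
    if (w->held == LOCK_SHARED && req == LOCK_SHARED) {
        w->shared_holders++;
        return 1;
    }
    return 0;
}

/* Returns 1 if the lock was acquired and the op performed,
 * 0 if the header was buffered, -1 on allocation failure. */
static int pkt_handler(struct window *w, const struct pkt_header *hdr)
{
    if (try_acquire(w, hdr->lock)) {
        w->handled++;             /* the piggybacked operation would run here */
        return 1;
    }
    struct lock_entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->hdr = *hdr;                /* only the header needs to be kept */
    e->next = NULL;
    if (w->tail) w->tail->next = e; else w->head = e;
    w->tail = e;
    return 0;
}

/* Release one holder; when the lock becomes free, grant it to queued
 * requests in order and run their handlers "a second time". */
static void release_lock(struct window *w)
{
    if (w->held == LOCK_SHARED && --w->shared_holders > 0)
        return;
    w->held = LOCK_NONE;
    w->shared_holders = 0;
    while (w->head != NULL) {
        struct lock_entry *e = w->head;
        if (!try_acquire(w, e->hdr.lock))
            break;
        w->head = e->next;
        if (w->head == NULL)
            w->tail = NULL;
        w->handled++;             /* second handler invocation performs the op */
        free(e);
    }
}
```

In this sketch an exclusive holder followed by two shared requests results in both shared requests being granted together once the exclusive lock is released.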
    • Xin Zhao's avatar
      Reset the start of the enum to 0. · 7fbe72dd
      Xin Zhao authored
      
      
      We must set the initial value of the enum to zero because some places
      determine the number of packet types from the value of the ending type.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      7fbe72dd
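A minimal sketch of why the first enumerator has to be zero when a trailing sentinel is used as a count. The enumerator names and the handler table are hypothetical stand-ins, not the actual MPICH packet types.

```c
#include <assert.h>

/* Hypothetical packet types; the real enum has many more entries. */
enum pkt_type {
    PKT_EAGER_SEND = 0,   /* first value pinned to 0 */
    PKT_PUT,
    PKT_GET,
    PKT_END_ALL           /* sentinel: equals the number of types above */
};

typedef int (*pkt_handler_fn)(void);

/* A table sized by the sentinel is only correct if numbering starts at 0. */
static pkt_handler_fn handlers[PKT_END_ALL];

static int num_pkt_types(void)
{
    return (int) PKT_END_ALL;
}

/* Bounds check of the kind that relies on the sentinel value. */
static int pkt_type_is_valid(int type)
{
    return type >= 0 && type < (int) PKT_END_ALL;
}
```

If the first enumerator started at 1 instead, `PKT_END_ALL` would be 4 and both the table size and the bounds check would be off by one.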
    • Xin Zhao's avatar
      Rearrange enum of pkt types. · be3e5bdd
      Xin Zhao authored
      
      
      Rearrange the ordering of packet types so that all RMA issuing types
      can be placed together. This is convenient when we check if currently
      involved packets are all RMA packets.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      be3e5bdd
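Keeping the RMA issuing types contiguous lets the "is this an RMA packet" check be a simple range comparison. The sketch below assumes a hypothetical set of enumerators; the real ordering and names in MPICH may differ.

```c
#include <assert.h>

/* Hypothetical ordering: RMA issuing types form one contiguous run. */
enum pkt_type {
    PKT_EAGER_SEND = 0,
    PKT_PUT,              /* first RMA issuing type */
    PKT_GET,
    PKT_ACCUMULATE,
    PKT_GET_ACCUM,
    PKT_FOP,
    PKT_CAS,              /* last RMA issuing type */
    PKT_FLUSH_ACK,
    PKT_END_ALL
};

/* One range test replaces a chain of equality comparisons. */
static int is_rma_issuing_pkt(enum pkt_type t)
{
    return t >= PKT_PUT && t <= PKT_CAS;
}
```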
    • Xin Zhao's avatar
      Simplify PktHandler_FOP and PktHandler_FOPResp. · a42b916d
      Xin Zhao authored
      
      
      For a FOP operation, all data fits into the packet
      header, so on the origin side we do not need to send separate
      data packets, and on the target side we do not need a request
      handler; only the packet handler is needed. Similarly, for the FOP
      response packet, we can receive all data in the FOP resp packet
      handler. This patch deletes the request handler on the target
      side and simplifies the packet handlers on the target and origin sides.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      a42b916d
    • Xin Zhao's avatar
      Simplify issuing functions at origin side. · 52c2fc11
      Xin Zhao authored
      
      
      Here we extract the common code of different
      issuing functions at origin side and simplify
      those issuing functions.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      52c2fc11
    • Xin Zhao's avatar
      Add IMMED area in packet header. · e8d4c6d5
      Xin Zhao authored
      
      
      We add an IMMED data area (16 bytes by default) to the
      packet header, which holds as much origin
      data as possible. If the origin can put all data in the
      packet header, it no longer needs to send a
      separate data packet. When the target receives the
      packet header, it first copies data out of
      the IMMED data area. If there is still more
      data coming, it continues to receive the following
      packets; if all data is included in the header,
      receiving is done.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      e8d4c6d5
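The IMMED area can be pictured with the following sketch. The 16-byte size comes from the commit message; the header layout and the `pack_immed` / `unpack_immed` helpers are hypothetical and only illustrate the "copy what fits, send the rest separately" idea.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IMMED_LEN 16              /* default size named in the commit message */

struct put_pkt {                  /* hypothetical header layout */
    size_t total_len;             /* total origin data length in bytes */
    size_t immed_len;             /* bytes carried inside the header */
    char immed_data[IMMED_LEN];
};

/* Origin side: place as much data as fits in the header.
 * Returns the number of bytes that still need a separate data packet. */
static size_t pack_immed(struct put_pkt *pkt, const void *data, size_t len)
{
    size_t n = len < IMMED_LEN ? len : IMMED_LEN;
    pkt->total_len = len;
    pkt->immed_len = n;
    memcpy(pkt->immed_data, data, n);
    return len - n;
}

/* Target side: copy the header-resident bytes out first.
 * Returns the number of bytes still expected in following packets. */
static size_t unpack_immed(const struct put_pkt *pkt, void *dest)
{
    memcpy(dest, pkt->immed_data, pkt->immed_len);
    return pkt->total_len - pkt->immed_len;
}
```

A return value of zero on either side corresponds to the case where no separate data packet is needed.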
    • Xin Zhao's avatar
      Add useful pkt wrappers. · 1c638a12
      Xin Zhao authored
      
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      1c638a12
    • Xin Zhao's avatar
      code refactoring in operation routines. · d129eed3
      Xin Zhao authored
      
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      d129eed3
    • Xin Zhao's avatar
      Decrement Active Target counter at target side. · b73778ea
      Xin Zhao authored
      
      
      During PSCW, when there are active-message operations
      to be issued in Win_complete, we piggyback an AT_COMPLETE
      flag with it so that when the target receives it, it can
      decrement a counter on the target side and detect completion
      when that counter reaches zero.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      b73778ea
    • Xin Zhao's avatar
      Detect remote completion by FLUSH / FLUSH_ACK messages. · 6578785d
      Xin Zhao authored
      
      
      When the origin wants to do a FLUSH sync, if there are
      active-message operations that are going to be issued,
      we piggyback the FLUSH message on the last operation;
      if there are no such operations, we just send a single FLUSH packet.
      
      If the last operation is a write op (PUT, ACC) or only
      a single FLUSH packet is sent, the target sends back
      a single FLUSH_ACK packet after receiving it;
      if the last operation contains a read action (GET, GACC, FOP,
      CAS), the target piggybacks a FLUSH_ACK flag on the
      response packet after receiving it.
      
      After the origin receives the FLUSH_ACK packet or a response packet
      with the FLUSH_ACK flag, it decrements the counter that
      tracks the number of outgoing sync messages (FLUSH / UNLOCK).
      When that counter reaches zero, the origin knows that remote
      completion is achieved.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      6578785d
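The counting scheme above can be sketched roughly as follows. The names (`origin_state`, `origin_flush`, `target_ack_is_piggybacked`, `origin_on_ack`) are hypothetical, and the sketch models only the decisions and the counter, not any actual message traffic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operation kinds; OP_NONE means "no pending operation". */
enum rma_op { OP_NONE, OP_PUT, OP_ACC, OP_GET, OP_GACC, OP_FOP, OP_CAS };

struct origin_state {
    int outstanding_acks;         /* outgoing sync messages not yet acknowledged */
};

/* Operations with a read action produce a response packet. */
static bool op_has_response(enum rma_op op)
{
    return op == OP_GET || op == OP_GACC || op == OP_FOP || op == OP_CAS;
}

/* Origin: start a FLUSH. Returns true if the FLUSH rides on the last
 * operation, false if a standalone FLUSH packet has to be sent. */
static bool origin_flush(struct origin_state *o, enum rma_op last_op)
{
    o->outstanding_acks++;
    return last_op != OP_NONE;
}

/* Target: true if the ACK is a flag on the response packet,
 * false if a standalone FLUSH_ACK packet is sent back. */
static bool target_ack_is_piggybacked(enum rma_op last_op)
{
    return op_has_response(last_op);
}

/* Origin: an ACK (packet or flag) arrived. Returns true once
 * remote completion has been reached. */
static bool origin_on_ack(struct origin_state *o)
{
    assert(o->outstanding_acks > 0);
    return --o->outstanding_acks == 0;
}
```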
    • Xin Zhao's avatar
      Separate request handler of PUT, ACC, GACC and rename them. · fe15ea26
      Xin Zhao authored
      
      
      Separate final request handler of PUT, ACC, GACC into three.
      Separate derived DT request handler of ACC and GACC into two.
      
      Renaming request handlers as follows:
      
      (1) Normal request handler: it is triggered on target side
          when all data from origin is received.
      
          It includes:
      
          ReqHandler_PutRecvComplete --- for PUT
          ReqHandler_AccumRecvComplete --- for ACC
          ReqHandler_GaccumRecvComplete --- for GACC
      
      (2) Derived DT request handler: it is triggered on target
          side when all derived DT info is received.
      
          It includes:
      
          ReqHandler_PutDerivedDTRecvComplete --- for PUT
          ReqHandler_AccumDerivedDTRecvComplete --- for ACC
          ReqHandler_GaccumDerivedDTRecvComplete --- for GACC
      
      (3) Response request handler: it is triggered on target
          side when the send-back process has finished in GET-like
          operations.
      
          It includes:
      
          ReqHandler_GetSendComplete --- for GET
          ReqHandler_GaccumLikeSendComplete --- for GACC, FOP, CAS
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      fe15ea26
    • Xin Zhao's avatar
      Split shared RMA packet structures. · c0094faa
      Xin Zhao authored
      
      
      Previously several RMA packet types shared the same structure,
      which made the code misleading. Here we make different
      RMA packet types use different packet data structures.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      c0094faa
    • Xin Zhao's avatar
      bfbb1048
    • Xin Zhao's avatar
      Rewrite all synchronization routines. · 38b20e57
      Xin Zhao authored
      
      
      We use new algorithms for RMA synchronization
      functions and RMA epochs. The old implementation
      used a lazy-issuing algorithm, which queues up
      all operations and issues them at the end. This
      rules out opportunities to use hardware RMA operations
      and can use up all memory resources when we
      queue up a large number of operations.
      
      Here we use a new algorithm, which initiates
      the synchronization at the beginning and issues operations
      as soon as the synchronization is finished.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      38b20e57
    • Xin Zhao's avatar
      Control no. of active RMA requests in the runtime. · 257faca2
      Xin Zhao authored
      
      
      When there are too many active requests in the runtime,
      the internal memory might be used up. This patch
      prevents that situation by entering a blocking
      wait loop in operation routines when the number of active
      requests reaches a certain threshold value.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      257faca2
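A rough sketch of the throttling idea, assuming a hypothetical threshold and a stand-in `progress_poke` that completes one request per call. The real runtime drives its progress engine here, and its threshold value is not given in the commit message.

```c
#include <assert.h>

#define ACTIVE_REQ_THRESHOLD 4    /* hypothetical value */

struct runtime {
    int active_reqs;              /* requests issued but not yet completed */
    int progress_calls;           /* how often we had to wait */
};

/* Stand-in for the progress engine: completes one request per call. */
static void progress_poke(struct runtime *rt)
{
    if (rt->active_reqs > 0)
        rt->active_reqs--;
    rt->progress_calls++;
}

/* Operation routine: wait until we are below the threshold, then issue. */
static void issue_op(struct runtime *rt)
{
    while (rt->active_reqs >= ACTIVE_REQ_THRESHOLD)
        progress_poke(rt);
    rt->active_reqs++;
}
```

With this loop the number of active requests never exceeds the threshold, regardless of how many operations the caller posts.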
    • Xin Zhao's avatar
      Enable making progress in operation routines. · 33d96690
      Xin Zhao authored
      
      
      We no longer use the lazy-issuing model, which delays
      issuing all operations until the end; instead we issue them
      as early as possible. To achieve this, we enable
      making progress in RMA routines, so that RMA operations
      can be issued as soon as synchronization is finished.
      
      Sometimes we also need to poke the progress engine in
      operation routines to make sure that the target side
      makes enough progress on receiving packets. Here
      we trigger it when the number of posted operations reaches
      a certain threshold value.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      33d96690
    • Xin Zhao's avatar
      Implement GET_OP routine which guarantees to return an OP. · 5dd55154
      Xin Zhao authored
      
      
      GET_OP may be a blocking function; it is guaranteed
      to return an RMA operation.
      
      Inside GET_OP we first call the normal OP_ALLOC function,
      which tries to get a new OP from the OP pools; if that fails,
      we call the nonblocking GC function to clean up completed ops
      and then call OP_ALLOC again; if we still cannot get a
      new OP, we call the nonblocking FREE_OP_BEFORE_COMPLETION
      function if hardware ordering is provided and then call
      OP_ALLOC again; if that still fails, we finally call the blocking
      aggressive cleanup function, which is guaranteed to
      return a new OP element.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      5dd55154
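The fallback chain in GET_OP can be sketched as below. This is an illustration under simplifying assumptions: the pool is a tiny fixed array, the lower-case helper names are hypothetical, and the blocking cleanup is modelled by marking one op complete rather than actually waiting on the network.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 2               /* hypothetical, tiny on purpose */

struct op { bool in_use; bool completed; };

struct op_pool {
    struct op ops[POOL_SIZE];
    bool hw_ordering;             /* does the hardware provide FLUSH ordering? */
    int blocking_waits;           /* how often the blocking path was taken */
};

static struct op *op_alloc(struct op_pool *p)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!p->ops[i].in_use) {
            p->ops[i].in_use = true;
            p->ops[i].completed = false;
            return &p->ops[i];
        }
    }
    return NULL;
}

/* Nonblocking GC: release ops that have already completed. */
static void gc_completed(struct op_pool *p)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        if (p->ops[i].in_use && p->ops[i].completed)
            p->ops[i].in_use = false;
}

/* Nonblocking: with hardware ordering, incomplete ops may be released. */
static void free_before_completion(struct op_pool *p)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        p->ops[i].in_use = false;
}

/* Blocking cleanup, modelled as "wait until one op completes, then GC". */
static void aggressive_cleanup(struct op_pool *p)
{
    p->blocking_waits++;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (p->ops[i].in_use) {
            p->ops[i].completed = true;
            break;
        }
    }
    gc_completed(p);
}

static struct op *get_op(struct op_pool *p)
{
    struct op *op = op_alloc(p);
    if (op != NULL)
        return op;
    gc_completed(p);
    if ((op = op_alloc(p)) != NULL)
        return op;
    if (p->hw_ordering) {
        free_before_completion(p);
        if ((op = op_alloc(p)) != NULL)
            return op;
    }
    aggressive_cleanup(p);
    op = op_alloc(p);
    assert(op != NULL);           /* the blocking path always frees one op */
    return op;
}
```

The ordering matters: the cheap nonblocking steps are tried first, and the blocking path is taken only when nothing else can free an element.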
    • Xin Zhao's avatar
      Free incomplete ops when FLUSH ordering is provided. · 7c1e12f0
      Xin Zhao authored
      
      
      When a FLUSH sync is issued and remote completion
      ordering between the last FLUSH message and all
      previous ops is provided by the current hardware, we
      no longer need to maintain incomplete operations
      but only need to wait for the ACK of the current
      FLUSH. Therefore we can free those operation
      resources without a blocking wait.
      
      Note that if we do this, we temporarily lose the
      opportunity to do a real FLUSH_LOCAL until the
      current FLUSH ACK is received.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      7c1e12f0
    • Xin Zhao's avatar
      Add blocking ops / targets aggressive cleanup functions. · 41a365ec
      Xin Zhao authored
      
      
      When we run out of resources for operations and targets,
      we need to make the runtime complete some operations
      so that it can free some resources.
      
      For RMA operations, we implement this by doing an internal
      FLUSH_LOCAL for one target and waiting for operation
      resources; for RMA targets, we implement this by doing an
      internal FLUSH operation for one target and waiting for
      target resources.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      41a365ec
    • Xin Zhao's avatar
      Add nonblocking progress making functions. · ab058906
      Xin Zhao authored
      
      
      Progress making functions check if the current
      synchronization is finished, change the synchronization
      state if possible, and issue as many pending operations
      on the window as possible.
      
      There are three granularities of progress making functions:
      per-target, per-window and per-process. The per-target
      routine is used in RMA operation routines (PUT/GET/ACC...)
      and single passive lock calls (Win_unlock, Win_flush, Win_flush_local);
      the per-window routine is used in window-wide synchronization
      calls (Win_fence, Win_complete, Win_unlock_all,
      Win_flush_all, Win_flush_local_all); and the per-process
      routine is used in the progress engine.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      ab058906
    • Xin Zhao's avatar
      Add nonblocking ops / targets GC functions. · ebee0b71
      Xin Zhao authored
      
      
      Here we implement garbage collection functions for
      both operations and targets. There are two levels of GC
      functions: per-target and per-window. Per-target functions
      are used in single passive lock ending calls: Win_unlock;
      per-window functions are used in window-wide ending
      calls: Win_fence, Win_complete, Win_unlock_all.
      
      Garbage collection functions for RMA ops go over all
      incomplete operation lists in a target element and free
      completed operations. They also return flags indicating
      local completion and remote completion.
      
      Garbage collection functions for RMA targets go over
      all targets and free those targets that have completed, empty
      operation lists.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      ebee0b71
    • Xin Zhao's avatar
      Keep track of no. of non-empty slots on window. · f91d4633
      Xin Zhao authored
      
      
      Keep track of the number of non-empty slots on a window so that
      when the number is 0, there are no operations that need to
      be processed and we can ignore that window.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      f91d4633
    • Xin Zhao's avatar
      Add new RMA states on window / target and modify state checking. · f076f3fe
      Xin Zhao authored
      
      
      We define new states to indicate the current situation of
      RMA synchronization. The states contain both ACCESS states
      and EXPOSURE states, and specify whether the synchronization
      is initialized (_CALLED), on-going (_ISSUED) or completed
      (_GRANTED). For a single lock in Passive Target, we use a
      per-target state whereas the window state is set to PER_TARGET.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      f076f3fe
    • Xin Zhao's avatar
      Add a flag in op struct to indicate derived datatype. · 7eac974f
      Xin Zhao authored
      
      
      Add a flag is_dt to the op structure, which is set when any
      buffer involved in the RMA operation contains derived
      datatype data. This makes it convenient to enqueue
      issued but not yet completed operations to the DT-specific
      list.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      7eac974f
    • Xin Zhao's avatar
      Add global window list. · 1d873639
      Xin Zhao authored
      
      
      Add a list of created windows on this process,
      so that we can make progress on all windows in
      the progress engine.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      1d873639
    • Xin Zhao's avatar
      Add routine to enqueue op to RMA slots. · 079a516b
      Xin Zhao authored
      
      
      Given an RMA op, find the correct slot and target and
      enqueue the op to the pending op list in that target object.
      If the target does not exist, create one in that slot.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      079a516b
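A minimal sketch of the enqueue routine, under stated assumptions: the slot is chosen as `rank % NUM_SLOTS`, which is a guess at a plausible mapping rather than something the commit message specifies, and all structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_SLOTS 4               /* hypothetical slot count */

struct rma_op {
    int target_rank;              /* assumed non-negative */
    struct rma_op *next;
};

struct target {
    int rank;
    struct rma_op *pending_head, *pending_tail;   /* pending op list */
    struct target *next;          /* next target in the same slot */
};

struct window {
    struct target *slots[NUM_SLOTS];
};

/* Look the target up in its slot; create it on demand if missing. */
static struct target *find_or_create_target(struct window *w, int rank)
{
    struct target **slot = &w->slots[rank % NUM_SLOTS];
    for (struct target *t = *slot; t != NULL; t = t->next)
        if (t->rank == rank)
            return t;
    struct target *t = calloc(1, sizeof *t);
    if (t == NULL)
        return NULL;
    t->rank = rank;
    t->next = *slot;
    *slot = t;
    return t;
}

/* Append the op to the pending list of its target. Returns 0 on success. */
static int enqueue_op(struct window *w, struct rma_op *op)
{
    struct target *t = find_or_create_target(w, op->target_rank);
    if (t == NULL)
        return -1;
    op->next = NULL;
    if (t->pending_tail) t->pending_tail->next = op; else t->pending_head = op;
    t->pending_tail = op;
    return 0;
}
```

Two ranks that map to the same slot share the slot's target list but keep separate pending op lists.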
    • Xin Zhao's avatar
      Add RMA slots and related APIs. · 0f596c48
      Xin Zhao authored
      
      
      We allocate a fixed-size targets array on the window
      during window creation. The size can be configured
      by the user via a CVAR. Each slot entry contains a list
      of target elements.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      0f596c48
    • Xin Zhao's avatar
      Add target element and global / local pools and related APIs. · 5dd8a0a4
      Xin Zhao authored
      
      
      Here we add a data structure to store information of an active target.
      The information includes operation lists, passive lock state,
      sync state, etc.
      
      The target element is created by the origin on demand, and can
      be freed after the remote completion of all previous operations
      is detected. After RMA ending synchronization calls, all
      target elements should be freed.
      
      Similarly to operation pools, we create two-level target
      pools for target elements: one per-window target pool and
      one global target pool.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      5dd8a0a4
    • Pavan Balaji's avatar
      Initial draft of flow-control in the portals4 netmod. · f4253c38
      Pavan Balaji authored and Kenneth Raffenetti committed
      
      
      Portals4 by itself does not provide any flow-control.  This needs to
      be managed by an upper-layer, such as MPICH.  Before this patch we
      were relying on a bunch of unexpected buffers that were posted to the
      portals library to manage unexpected messages.  However, since portals
      asynchronously pulls out messages from the network, if the application
      is delayed, it might result in the unexpected buffers being filled up
      and the portal disabled.  This would cause MPICH to abort.
      
      In this patch, we implement an initial version of flow-control that
      allows us to reenable the portal when it gets disabled.  All this is
      done in the context of the "rportals" wrappers that are implemented in
      the rptl.* files.  We create an extra control portal that is only used
      by rportals.  When the primary data portal gets disabled, the target
      sends PAUSE messages to all other processes.  Once each process
      confirms that it has no outstanding packets on the wire (i.e., all
      packets have either been ACKed or NACKed), it sends a PAUSE-ACK
      message.  When the target receives PAUSE-ACK messages from all
      processes (thus confirming that the network traffic to itself has been
      quiesced), it reenables the portal and sends an UNPAUSE message to all
      processes.
      
      This patch still does not deal with origin-side resource exhaustion.
      This can happen, for example, if we run out of space on the event
      queue on the origin side.
      Signed-off-by: Ken Raffenetti <raffenet@mcs.anl.gov>
      f4253c38
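The PAUSE / PAUSE-ACK / UNPAUSE handshake can be modelled as two small state machines. This sketch does not use the Portals4 API at all; the structures and functions are hypothetical and only mirror the message flow described in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

enum ctl_msg { MSG_NONE, MSG_PAUSE, MSG_PAUSE_ACK, MSG_UNPAUSE };

struct rptl_target {              /* hypothetical target-side state */
    int npeers;                   /* number of other processes */
    int acks_pending;
    bool portal_enabled;
};

struct rptl_origin {              /* hypothetical origin-side state */
    bool paused;
    bool pause_ack_owed;
    int outstanding;              /* packets neither ACKed nor NACKed yet */
};

/* Target: the data portal got disabled; returns the message to send
 * to every other process. */
static enum ctl_msg target_on_disabled(struct rptl_target *t)
{
    t->portal_enabled = false;
    t->acks_pending = t->npeers;
    return MSG_PAUSE;
}

/* Target: one PAUSE-ACK arrived; re-enable once all peers have answered. */
static enum ctl_msg target_on_pause_ack(struct rptl_target *t)
{
    if (--t->acks_pending > 0)
        return MSG_NONE;
    t->portal_enabled = true;
    return MSG_UNPAUSE;
}

/* Origin: stop sending; acknowledge only when nothing is on the wire. */
static enum ctl_msg origin_on_pause(struct rptl_origin *o)
{
    o->paused = true;
    if (o->outstanding == 0)
        return MSG_PAUSE_ACK;
    o->pause_ack_owed = true;
    return MSG_NONE;
}

/* Origin: one outstanding packet was ACKed or NACKed. */
static enum ctl_msg origin_on_packet_done(struct rptl_origin *o)
{
    o->outstanding--;
    if (o->pause_ack_owed && o->outstanding == 0) {
        o->pause_ack_owed = false;
        return MSG_PAUSE_ACK;
    }
    return MSG_NONE;
}

static void origin_on_unpause(struct rptl_origin *o)
{
    o->paused = false;
}
```

The key property is that the portal is re-enabled only after every peer has confirmed that its traffic to the target is quiesced.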
    • Huiwei Lu's avatar
      Fixes configure.ac when no fortran is found · 28f6a689
      Huiwei Lu authored
      Fixes the case when configured with default settings but with no Fortran
      installed. It should give an error of 'No Fortran 77/90 compiler found'
      but does not.
      
      This patch is related to [d4e30cc0], when configure was changed to
      support '--disable-fc'.
      Signed-off-by: Antonio J. Pena <apenya@mcs.anl.gov>
      28f6a689
    • Xin Zhao's avatar
      Add global / local pools of RMA ops and related APIs. · fc7617f2
      Xin Zhao authored and Pavan Balaji committed
      
      
      Instead of allocating / deallocating RMA operations whenever
      an RMA op is posted by user, we allocate fixed size operation
      pools beforehand and take the op element from those pools
      when an RMA op is posted.
      
      With only a local (per-window) op pool, the number of ops
      allocated can increase arbitrarily if many windows are created.
      Alternatively, if we only use a global op pool, other windows
      might use up all operations thus starving the window we are
      working on.
      
      In this patch we create two pools: a local (per-window) pool and a
      global pool.  Every window is guaranteed to have at least the number
      of operations in the local pool.  If we run out of these operations,
      we check in the global pool to see if we have any operations left.
      When an operation is released, it is added back to the same pool it
      was allocated from.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      fc7617f2
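The two-level pool scheme can be sketched as follows. Pool sizes, structure names and the `from_global` tag are hypothetical; the sketch only shows the "local first, then global, release to the pool of origin" policy described in the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_POOL_SIZE 2         /* hypothetical sizes */
#define GLOBAL_POOL_SIZE 2

struct op {
    struct op *next;
    int from_global;              /* remembers which pool the op belongs to */
};

struct op_pool { struct op *free_list; };

static struct op global_storage[GLOBAL_POOL_SIZE];
static struct op_pool global_pool;

struct window {
    struct op local_storage[LOCAL_POOL_SIZE];
    struct op_pool local_pool;
};

/* Thread all elements of a preallocated array onto the free list. */
static void pool_init(struct op_pool *p, struct op *storage, size_t n,
                      int is_global)
{
    p->free_list = NULL;
    for (size_t i = 0; i < n; i++) {
        storage[i].from_global = is_global;
        storage[i].next = p->free_list;
        p->free_list = &storage[i];
    }
}

static struct op *pool_pop(struct op_pool *p)
{
    struct op *o = p->free_list;
    if (o != NULL)
        p->free_list = o->next;
    return o;
}

/* Try the window's own pool first, then fall back to the global pool. */
static struct op *op_alloc(struct window *w)
{
    struct op *o = pool_pop(&w->local_pool);
    return o != NULL ? o : pool_pop(&global_pool);
}

/* Return the op to the pool it was allocated from. */
static void op_free(struct window *w, struct op *o)
{
    struct op_pool *p = o->from_global ? &global_pool : &w->local_pool;
    o->next = p->free_list;
    p->free_list = o;
}
```

Even when one window drains the global pool, another window can still allocate from its own local pool, which is the starvation guarantee the commit describes.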
    • Xin Zhao's avatar
      Embedding packet structure into RMA operation structure. · b1685139
      Xin Zhao authored and Pavan Balaji committed
      
      
      We were duplicating information in the operation structure and in the
      packet structure when the message is actually issued.  Since most of
      the information is the same anyway, this patch just embeds a packet
      structure into the operation structure, so that we eliminate the unnecessary
      copy.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      b1685139
    • Xin Zhao's avatar
      Rename ACK packets in RMA. · ba1a400c
      Xin Zhao authored and Pavan Balaji committed
      
      
      The packet type MPIDI_CH3_PKT_PT_RMA_DONE is used for ACK
      of FLUSH / UNLOCK packets. Here we rename it to
      MPIDI_CH3_PKT_FLUSH_ACK and modify the related functions
      and data structures.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      ba1a400c
    • Xin Zhao's avatar
      Avoid using VC in RMA lock queue structure. · 0eaf344b
      Xin Zhao authored and Pavan Balaji committed
      
      
      We were adding an unnecessary dependency on VC structure
      declarations in the mpidpkt.h file. The required information
      in RMA lock queue is only the rank, but not actual VC.
      Here we replace VC with rank.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      0eaf344b
    • Xin Zhao's avatar
      Inline static functions to remove compiler warnings. · cb04acb3
      Xin Zhao authored and Pavan Balaji committed
      
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      cb04acb3
    • Xin Zhao's avatar
      Code refactoring to clean up the RMA code. · 61f952c7
      Xin Zhao authored and Pavan Balaji committed
      
      
      Split RMA functionality into smaller files, and move functions
      to where they belong based on the file names.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      61f952c7
    • Xin Zhao's avatar
      Temporarily remove all RMA PVARs. · 5c513032
      Xin Zhao authored and Pavan Balaji committed
      
      
      Because we are going to rewrite the RMA infrastructure
      and many PVARs will no longer be used, here we temporarily
      remove all PVARs and will add needed PVARs back after new
      implementation is done.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
      5c513032