1. 03 Nov, 2014 10 commits
    • Pavan Balaji's avatar
      Initial draft of flow-control in the portals4 netmod. · f4253c38
      Pavan Balaji authored and Kenneth Raffenetti's avatar Kenneth Raffenetti committed
      
      
      Portals4 by itself does not provide any flow-control.  This needs to
      be managed by an upper-layer, such as MPICH.  Before this patch we
      were relying on a bunch of unexpected buffers that were posted to the
      portals library to manage unexpected messages.  However, since portals
      asynchronously pulls out messages from the network, if the application
      is delayed, it might result in the unexpected buffers being filled out
      and the portal disabled.  This would cause MPICH to abort.
      
      In this patch, we implement an initial version of flow-control that
      allows us to reenable the portal when it gets disabled.  All this is
      done in the context of the "rportals" wrappers that are implemented in
      the rptl.* files.  We create an extra control portal that is only used
      by rportals.  When the primary data portal gets disabled, the target
      sends PAUSE messages to all other processes.  Once each process
      confirms that it has no outstanding packets on the wire (i.e., all
      packets have either been ACKed or NACKed), it sends a PAUSE-ACK
      message.  When the target receives PAUSE-ACK messages from all
      processes (thus confirming that the network traffic to itself has been
      quiesced), it reenables the portal and sends an UNPAUSE message to all
      processes.
      
      This patch still does not deal with origin-side resource exhaustion.
      This can happen, for example, if we run out of space on the event
      queue on the origin side.
      Signed-off-by: Kenneth Raffenetti's avatarKen Raffenetti <raffenet@mcs.anl.gov>
      f4253c38
    • Xin Zhao's avatar
      Add global / local pools of RMA ops and related APIs. · fc7617f2
      Xin Zhao authored and Pavan Balaji's avatar Pavan Balaji committed
      
      
      Instead of allocating / deallocating RMA operations whenever
      an RMA op is posted by user, we allocate fixed size operation
      pools beforehand and take the op element from those pools
      when an RMA op is posted.
      
      With only a local (per-window) op pool, the number of ops
      allocated can increase arbitrarily if many windows are created.
      Alternatively, if we only use a global op pool, other windows
      might use up all operations thus starving the window we are
      working on.
      
      In this patch we create two pools: a local (per-window) pool and a
      global pool.  Every window is guaranteed to have at least the number
      of operations in the local pool.  If we run out of these operations,
      we check in the global pool to see if we have any operations left.
      When an operation is released, it is added back to the same pool it
      was allocated from.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      fc7617f2
    • Xin Zhao's avatar
      Embedding packet structure into RMA operation structure. · b1685139
      Xin Zhao authored and Pavan Balaji's avatar Pavan Balaji committed
      
      
      We were duplicating information in the operation structure and in the
      packet structure when the message is actually issued.  Since most of
      the information is the same anyway, this patch just embeds a packet
      structure into the operation structure, so that we eliminate unnessary
      copy.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      b1685139
    • Xin Zhao's avatar
      Rename ACK packets in RMA. · ba1a400c
      Xin Zhao authored and Pavan Balaji's avatar Pavan Balaji committed
      
      
      The packet type MPIDI_CH3_PKT_PT_RMA_DONE is used for ACK
      of FLUSH / UNLOCK packets. Here we rename it to
      MPIDI_CH3_PKT_FLUSH_ACK and modify the related functions
      and data structures.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      ba1a400c
    • Xin Zhao's avatar
      Avoid using VC in RMA lock queue structure. · 0eaf344b
      Xin Zhao authored and Pavan Balaji's avatar Pavan Balaji committed
      
      
      We were adding an unnecessary dependency on VC structure
      declarations in the mpidpkt.h file. The required information
      in RMA lock queue is only the rank, but not actual VC.
      Here we replace VC with rank.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      0eaf344b
    • Xin Zhao's avatar
      Inline static functions to remove compiler warnings. · cb04acb3
      Xin Zhao authored and Pavan Balaji's avatar Pavan Balaji committed
      
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      cb04acb3
    • Xin Zhao's avatar
      Code refactoring to clean up the RMA code. · 61f952c7
      Xin Zhao authored and Pavan Balaji's avatar Pavan Balaji committed
      
      
      Split RMA functionality into smaller files, and move functions
      to where they belong based on the file names.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      61f952c7
    • Xin Zhao's avatar
      Temporarily remove all RMA PVARs. · 5c513032
      Xin Zhao authored and Pavan Balaji's avatar Pavan Balaji committed
      
      
      Because we are going to rewrite the RMA infrastructure
      and many PVARs will no longer be used, here we temporarily
      remove all PVARs and will add needed PVARs back after new
      implementation is done.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      5c513032
    • Wesley Bland's avatar
      Add CVAR for initial size of RTS queue · 0086b7bf
      Wesley Bland authored
      
      
      Rather than having a static value for the initial size of the RTS queue,
      have a CVAR to define it.
      Signed-off-by: default avatarHuiwei Lu <huiweilu@mcs.anl.gov>
      0086b7bf
    • Wesley Bland's avatar
      Add a warning if the RTS queue is overflowed · 01766815
      Wesley Bland authored
      
      
      When the RTS queue fills up the first time, print out a warning to let
      the user know that they've done it and FT won't be provided anymore.
      Signed-off-by: default avatarHuiwei Lu <huiweilu@mcs.anl.gov>
      01766815
  2. 01 Nov, 2014 3 commits
    • Xin Zhao's avatar
      Bug-fix: avoid free NULL pointer in RMA. · 72a1e6f8
      Xin Zhao authored
      
      
      req->dev.user_buf points to the data sent from origin process
      to target process, and for FOP sometimes it points to the IMMED
      area in packet header when data can be fit in packet header.
      In such case, we should not free req->dev.user_buf in final
      request handler since that data area will be freed by the
      runtime when packet header is freed.
      
      In this patch we initialize user_buf to NULL when creating the
      request, and set it to NULL when FOP is completed, and avoid free
      a NULL pointer in final request handler.
      Signed-off-by: default avatarMin Si <msi@il.is.s.u-tokyo.ac.jp>
      72a1e6f8
    • Igor Ivanov's avatar
    • Xin Zhao's avatar
      Bug-fix: always waiting for remote completion in Win_unlock. · c76aa786
      Xin Zhao authored and Pavan Balaji's avatar Pavan Balaji committed
      
      
      The original implementation includes an optimization which
      allows Win_unlock for exclusive lock to return without
      waiting for remote completion. This relys on the
      assumption that window memory on target process will not
      be accessed by a third party until that target process
      finishes all RMA operations and grants the lock to other
      processes. However, this assumption is not correct if user
      uses assert MPI_MODE_NOCHECK. Consider the following code:
      
                P0                              P1           P2
          MPI_Win_lock(P1, NULL, exclusive);
          MPI_Put(X);
          MPI_Win_unlock(P1, exclusive);
          MPI_Send (P2);                                MPI_Recv(P0);
                                                        MPI_Win_lock(P1, MODE_NOCHECK, exclusive);
                                                        MPI_Get(X);
                                                        MPI_Win_unlock(P1, exclusive);
      
      Both P0 and P2 issue exclusive lock to P1, and P2 uses assert
      MPI_MODE_NOCHECK because the lock should be granted to P2 after
      synchronization between P2 and P0. However, in the original
      implementation, GET operation on P2 might not get the updated
      value since Win_unlock on P0 return without waiting for remote
      completion.
      
      In this patch we delete this optimization. In Win_free, since every
      Win_unlock guarantees the remote completion, target process no
      longer needs to do additional counting works to detect target-side
      completion, but only needs to do a global barrier.
      Signed-off-by: Pavan Balaji's avatarPavan Balaji <balaji@anl.gov>
      c76aa786
  3. 31 Oct, 2014 2 commits
  4. 30 Oct, 2014 3 commits
    • Xin Zhao's avatar
      Clean up white-space and code format in RMA code. · fe283e91
      Xin Zhao authored
      No reviewer.
      fe283e91
    • Min Si's avatar
      Bug-fix: trigger final req handler for receiving derived datatype. · 920661c3
      Min Si authored
      
      
      There are two request handlers used when receiving data:
      (1) OnDataAvail, which is triggered when data is arrived;
      (2) OnFinal, which is triggered when receiving data is finished;
      
      When receiving large derived datatype, the receiving iov can be divided
      into multiple iovs. The OnDataAvail handler is set to iov load function
      when still waiting for remaining data. However, such handler should be
      set to OnFinal when starting receiving the last iov.
      
      The original code does not set OnDataAvail handler to OnFinal at end.
      This patch fixes this bug.
      
      Note that this bug only appears in RMA calls, because only the RMA
      packet handers need to specify OnFinal.
      
      Resolve #2189.
      Signed-off-by: default avatarXin Zhao <xinzhao3@illinois.edu>
      920661c3
    • Pavan Balaji's avatar
      Disable hugepage support by default. · 669f4286
      Pavan Balaji authored
      
      
      This patch is a workaround for an issue with older HPC-X machines.
      Once we are comfortable upgrading to the latest HPC-X version, the
      default value of the CVAR should be changed to true.
      Signed-off-by: default avatarXin Zhao <xinzhao3@illinois.edu>
      669f4286
  5. 29 Oct, 2014 5 commits
  6. 28 Oct, 2014 3 commits
    • Xin Zhao's avatar
    • Paul Coffman's avatar
      Assign large blocks first in ADIOI_GPFS_Calc_file_domains · c16466e3
      Paul Coffman authored and Rob Latham's avatar Rob Latham committed
      
      
      For files that are less than the size of a gpfs block there seems to be
      an issue if successive MPI_File_write_at_all are called with proceeding
      offsets.  Given the simple case of 2 aggs, the 2nd agg/fd will be utilized,
      however the initial offset into the 2nd agg is distorted on the 2nd call
      to MPI_File_write_at_all because of the negative size of the 1st agg/fd
      because the offset info the 2nd agg/fd is influenced by the size of the
      first.  Simple solution is to reverse the default large block assignment so
      in the case where only 1 agg/fd will be used it will be the first.  By chance
      in the 2 agg situation this is what the GPFSMPIO_BALANCECONTIG
      optimization does and it does not have this problem.
      Signed-off-by: Rob Latham's avatarRob Latham <robl@mcs.anl.gov>
      c16466e3
    • Paul Coffman's avatar
      MP_IOTASKLIST error checking · 976272a7
      Paul Coffman authored and Rob Latham's avatar Rob Latham committed
      
      
      PE users may manually specify the MP_IOTASKLIST for explicit aggregator
      selection.  Code needed to be added to verify that the user
      specification of aggregators were all valid.
      
      Do our best to maintain the old PE behavior of using as much of the
      correctly specified MP_IOTASKLIST as possible and issuing what it
      labeled error messages but were really warnings about the incorrect
      portions and functionally just ignoring it, unless none of it was usable
      in which case it fell back on the default.
      Signed-off-by: Rob Latham's avatarRob Latham <robl@mcs.anl.gov>
      976272a7
  7. 27 Oct, 2014 1 commit
  8. 26 Oct, 2014 1 commit
  9. 25 Oct, 2014 2 commits
  10. 24 Oct, 2014 2 commits
  11. 23 Oct, 2014 4 commits
  12. 22 Oct, 2014 3 commits
  13. 21 Oct, 2014 1 commit