1. 31 Jul, 2014 32 commits
    • Add some basic resilience to allred_group · 2bdc3116
      Wesley Bland authored
      
      
      If a process is dead, collectives still do all of the communications to
      prevent a deadlock. However, if we just skip the part where the data is
      updated in the allreduce_group function, we can let it be slightly more
      resilient to failures and possibly even produce a correct answer in the
      presence of a failure.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Change MPID_Comm_valid_ptr to optionally ignore revoke · 05cb62bd
      Wesley Bland authored
      
      
      Adds a second parameter to MPID_Comm_valid_ptr that controls whether the
      macro ignores the revoke flag.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Add MPIX_Comm_agree · 1f0ee136
      Wesley Bland authored
      
      
      Adds function implementing an agreement algorithm for the user. This function
      lets the user manually perform an agreement as well as detect unacknowledged
      failures.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Add MPIX_Comm_shrink functionality · 5be10ce9
      Wesley Bland authored
      
      
      This adds a new function MPIX_COMM_SHRINK. This is a communicator creation
      function that creates a new communicator based on a previous communicator, but
      excluding any failed processes.
      
      As part of the operation, the shrink call needs to perform an agreement to
      determine the group of failed processes. This is done using the algorithm
      published by Hursey et al. in their EuroMPI '12 paper.
      
      The list of failed processes is collected using a bit array. This happens via
      a few new functions in the CH3 layer to create and send a bit array to the
      master process and receive an updated bit array. Obviously, this is not a very
      scalable implementation yet, but something better can easily be plugged in
      here to replace the naïve implementation. This is also a use case for an
      MPI_Recv_reduce for future reference.
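
The bit array bookkeeping described above can be illustrated with a minimal sketch. This is not the CH3 code: the names `bitarray_set`, `bitarray_test`, and `bitarray_merge` are hypothetical, and the merge step only assumes that the master process combines the arrays it receives with a bitwise OR.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bits per storage word, and number of words needed for nprocs ranks. */
#define WORD_BITS ((size_t) (sizeof(unsigned int) * CHAR_BIT))
#define BITARRAY_WORDS(nprocs) (((size_t) (nprocs) + WORD_BITS - 1) / WORD_BITS)

/* Hypothetical helper: mark 'rank' as failed in the local bit array. */
void bitarray_set(unsigned int *bits, int rank)
{
    assert(bits != NULL && rank >= 0);
    bits[(size_t) rank / WORD_BITS] |= 1u << ((size_t) rank % WORD_BITS);
}

/* Hypothetical helper: return 1 if 'rank' is marked as failed, else 0. */
int bitarray_test(const unsigned int *bits, int rank)
{
    assert(bits != NULL && rank >= 0);
    return (int) ((bits[(size_t) rank / WORD_BITS] >> ((size_t) rank % WORD_BITS)) & 1u);
}

/* Hypothetical helper: fold a received array into the master's copy (bitwise OR). */
void bitarray_merge(unsigned int *dst, const unsigned int *src, size_t nwords)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < nwords; i++)
        dst[i] |= src[i];
}
```

Under these assumptions, the master would call `bitarray_merge` once per contribution and then send the merged array back, which costs one full array per process and is why the approach does not scale.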
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Add check for revoked communicator · ee5173e3
      Wesley Bland authored
      
      
      Piggybacking on the MPID_Comm_valid_ptr check in the HAVE_ERROR_CHECKING
      block, this checks to see if the communicator has been revoked and
      returns MPIX_ERR_REVOKED if so.
      
      This probably should move out of the HAVE_ERROR_CHECKING section since it
      requires the user to have this turned on. If the user leaves it off, they'll
      never be notified. However, if this moves out of the HAVE_ERROR_CHECKING
      section, it will probably have performance implications.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Add MPI_Comm_revoke · 57f6ee88
      Wesley Bland authored
      
      
      MPI_Comm_revoke is a special function because it does not have a matching call
      on the "receiving side". This is because it has to act as an out-of-band,
      resilient broadcast algorithm. Because of this, in this commit, in addition to
      the usual functions to implement MPI communication calls (MPI/MPID/CH3/etc.),
      we add a new CH3 packet type that will handle revoking a communicator without
      involving a matching call from the MPI layer (similar to how RMA is currently
      implemented).
      
      The thing that must be handled most carefully when revoking a communicator is
      to ensure that a previously used context ID will eventually be returned to the
      pool of available context IDs and that after this occurs, no old messages will
      match the new usage of the context ID (for instance, if some messages are very
      slow and show up late). To accomplish this, revoke is implemented as an
      all-to-all algorithm. When one process calls revoke, it will send a message to
      all other processes in the communicator, which will trigger that process to
      send a message to all other processes, and so on. Once a process has already
      revoked its communicator locally, it won't send out another wave of messages.
      As each process receives the revoke messages from the other processes, it will
      track how many messages have been received. Once it has either received a
      revoke message or a message about a process failure for each other process, it
      will release its refcount on the communicator object. After the application
      has freed all of its references to the communicator (and all requests, files,
      etc. associated with it), the context ID will be returned to the available
      pool.
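
The message flow described above can be modeled with a small single-process simulation. This is only a sketch under stated assumptions, not the CH3 packet handler: the `revoke_sim` structure and both functions are hypothetical, messages are delivered instantly, and nothing is sent to processes already marked as failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PROCS 16

/* Hypothetical model of one communicator's revoke state across all ranks. */
typedef struct {
    int  nprocs;
    bool failed[MAX_PROCS];   /* rank is known to have failed */
    bool revoked[MAX_PROCS];  /* rank has revoked the communicator locally */
    int  heard[MAX_PROCS];    /* revoke messages received by this rank */
} revoke_sim;

/* Flood the revoke starting at 'initiator'. Returns the number of messages sent. */
int simulate_revoke(revoke_sim *sim, int initiator)
{
    assert(sim != NULL && sim->nprocs > 0 && sim->nprocs <= MAX_PROCS);
    assert(initiator >= 0 && initiator < sim->nprocs && !sim->failed[initiator]);

    int queue[MAX_PROCS];     /* ranks that still have to send their wave */
    int head = 0, tail = 0, sent = 0;

    sim->revoked[initiator] = true;
    queue[tail++] = initiator;

    while (head < tail) {
        int src = queue[head++];
        for (int dst = 0; dst < sim->nprocs; dst++) {
            if (dst == src || sim->failed[dst])
                continue;
            sent++;
            sim->heard[dst]++;
            /* Only the first revoke message triggers another wave. */
            if (!sim->revoked[dst]) {
                sim->revoked[dst] = true;
                queue[tail++] = dst;
            }
        }
    }
    return sent;
}

/* A rank may drop its reference once every other rank has either
 * sent it a revoke message or is known to have failed. */
bool can_release(const revoke_sim *sim, int rank)
{
    int accounted = sim->heard[rank];
    for (int p = 0; p < sim->nprocs; p++)
        if (p != rank && sim->failed[p])
            accounted++;
    return accounted == sim->nprocs - 1;
}
```

In this model every live process sends exactly one wave, so the message count grows quadratically with the number of live processes.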
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Add test for anysource handling · 628d2daf
      Wesley Bland authored
      
      
      This test ensures that MPI_ANY_SOURCE receives are handled correctly after a
      failure occurs. It tests both that failures are returned when they should be
      (unacknowledged failures) and not returned when they shouldn't (acknowledged
      failures).
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Remove coll_active field in MPIDI_Comm · 5c71c3a8
      Wesley Bland authored
      
      
      The collectively active field wasn't doing anything anymore so it's been
      removed. This was a remnant from a previous FT proposal.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Fix bug where ANY_SOURCE recv could complete when it shouldn't · 39b95805
      Wesley Bland authored
      
      
      There was a case where an MPI_ANY_SOURCE recv call could complete successfully
      if there was already a message waiting in the unexpected receive queue when
      the call to a receive function was processed, even if any_source operations
      had already been disabled on the communicator because of an unacknowledged
      failure.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Add MPIX_Comm_failure_ack/get_acked · 8652e0ad
      Wesley Bland authored
      
      
      This commit adds the new functions MPI(X)_COMM_FAILURE_ACK and
      MPI(X)_COMM_FAILURE_GET_ACKED. These two functions together allow the user to
      get the group of failed processes.
      
      Most of the implementation for this is pushed into the MPID layer since some
      systems won't support this (PAMI). The existing function
      MPIDI_CH3U_Check_for_failed_procs has been modified to give back the group of
      acknowledged failed processes. There is an inefficiency here in that the list
      of failed processes is retrieved from PMI and parsed every time the user calls
      both failure_ack and get_acked, but this means we don't have to try to cache
      the list that comes back from PMI (which could potentially be expensive, but
      would have some cost even in the failure-free case).
      
      This commit adds a field to the MPID_Comm structure. The new field is
      called last_ack_rank. This is a single integer that stores the last
      acknowledged failure for this communicator which is used to determine when to
      stop parsing when getting back the list of acknowledged failed processes.
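
One way to picture how a single integer can bound the acknowledged group is sketched here. It assumes the failure list is held in arrival order as an array of ranks. The function `acked_prefix_len` is hypothetical and is not the MPICH implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given failures in the order they were reported,
 * return how many leading entries are acknowledged, i.e. the length of
 * the prefix that ends at last_ack_rank. Returns 0 if last_ack_rank does
 * not appear in the list (nothing acknowledged yet). */
int acked_prefix_len(const int *failed, int n, int last_ack_rank)
{
    assert(failed != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        if (failed[i] == last_ack_rank)
            return i + 1;   /* stop parsing at the last acknowledged failure */
    }
    return 0;
}
```

With the list 3,1,4,5,10,7 and a last acknowledged rank of 4, the acknowledged group in this sketch would be {3, 1, 4}, and the later failures stay unacknowledged.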
      
      Lastly, this commit includes a test to make sure that all of the above works
      (test/mpi/ft/failure_ack). This tests that a failure is appropriately included
      in the failed group and excluded if the failure was not previously
      acknowledged.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Add MPIDI_CH3U_Get_failed_group · 665ced28
      Wesley Bland authored
      
      
      This function will take a last_failed value and generate an MPID_Group. If the
      value is MPI_PROC_NULL, then it will parse the entire list. This function is
      exposed by MPID so this can be used by any functions that need the list of
      failed processes.
      
      This change necessitated changing the way the list of failed processes is
      retrieved from PMI. Rather than allocating a char array on demand every time
      we get the list from PMI, this string is allocated at init time and freed at
      finalize time now. This means that we can cache the value to be used later for
      things like just querying the list of processes that we already know have
      failed, rather than also getting the new list (which is important for the
      failure_ack/get_acked semantics).
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Don't compress and order list of failed procs in PMI · 782d036c
      Wesley Bland authored
      
      
      Previously, PMI provided a list of failed processes as a sorted list via a
      string. This meant the list could look something like this:
      
      1,3-5,7,10
      
      However, in the new fault tolerance specification, the function
      MPI_COMM_FAILURE_ACK needs to be able to determine the local order of the
      failures to more efficiently acknowledge them without creating a list per
      communicator. This requires that PMI not sort or compress the failure
      notification. So now, the previous string could look like this:
      
      3,1,4,5,10,7
      
      Obviously, this is less efficient if there are lots of failures. Hopefully,
      this is something that can be fixed in future versions of PMI.
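
To make the two string formats concrete, here is a minimal parsing sketch. The function `parse_failed_list` is hypothetical and not part of PMI or MPICH. It accepts both the old compressed form and the new uncompressed form, and it preserves the order in which ranks appear in the string.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: expand a failed-process string such as
 * "1,3-5,7,10" or "3,1,4,5,10,7" into an array of ranks, keeping the
 * order of the string. Returns the number of ranks written, or -1 on
 * malformed input or if 'out' is too small. */
int parse_failed_list(const char *s, int *out, int max)
{
    assert(s != NULL && out != NULL);
    int n = 0;
    while (*s != '\0') {
        char *end;
        long lo = strtol(s, &end, 10);
        if (end == s)
            return -1;                 /* no digits where a rank was expected */
        long hi = lo;
        s = end;
        if (*s == '-') {               /* compressed range "lo-hi" */
            s++;
            hi = strtol(s, &end, 10);
            if (end == s || hi < lo)
                return -1;
            s = end;
        }
        for (long r = lo; r <= hi; r++) {
            if (n >= max)
                return -1;
            out[n++] = (int) r;
        }
        if (*s == ',')
            s++;
        else if (*s != '\0')
            return -1;
    }
    return n;
}
```

Both example strings expand to the same six ranks, but only the second preserves the order in which the failures were reported, which is what the acknowledgement logic relies on.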
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Return MPIX_ERR_PROC_FAILED_PENDING when appropriate · 3325b6f7
      Wesley Bland authored
      
      
      The MPI_Waitall and MPI_Testall functions should return
      MPIX_ERR_PROC_FAILED_PENDING when a process failure prevents the operations
      from completing.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Add MPIX_ERR_PROC_FAILED_PENDING · ed98c983
      Wesley Bland authored
      
      
      This is a new error code required by the ULFM proposal. This code replaces
      MPI_ERR_PENDING in cases where the failure that would have otherwise caused
      MPI_ERR_PENDING is related to process failure (MPIX_ERR_PROC_FAILED). In that
      case, we return MPIX_ERR_PROC_FAILED_PENDING instead.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Rename error code to MPIX_ERR_PROC_FAILED · 6ce71547
      Wesley Bland authored
      
      
      Previously, MPICH was using MPIX_ERR_FAIL_STOP as the generic error code for
      process failures. The ULFM document specifies the error code to be
      MPIX_ERR_PROC_FAILED.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Destroy request object before setting it to NULL. · c83eddd9
      Wesley Bland authored
      
      
      When an isendv fails in MPIDI_CH3_EagerSyncNoncontigSend, the request is set
      to NULL before returning it back to the caller. Unfortunately, the request is
      not allocated inside this function so if we pass back NULL, we lose the handle
      on the request. If we're going to return NULL, we need to make sure the
      request is destroyed before giving it back.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Introduce MPICH_ERR_LAST_MPIX · b68657dc
      Wesley Bland authored
      
      
      There were a few places in MPICH where the error class was being checked
      against MPICH_ERR_LAST_CLASS and being flagged as invalid if it was too large.
      This is incorrect now that we have a new space for MPIX error codes. Add
      MPICH_ERR_LAST_MPIX as a way of keeping track of what the actual last valid
      error class is.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Increase number of connection command buffer · edd6daa5
      Masamichi Takagi authored and Pavan Balaji committed
      
      
      Increase the number of connection command buffers to 64 to avoid
      deadlock in command buffer allocation. This is a temporary
      workaround.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Fix build warnings in netmod-IB · ed414032
      Norio Yamaguchi authored and Pavan Balaji committed
      
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Fix netmod-IB to pass RMA-test · 19c00389
      Norio Yamaguchi authored and Pavan Balaji committed
      
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Fix the managment of memory area for small size · 76e70960
      Norio Yamaguchi authored and Pavan Balaji committed
      
      
      Instead of reusing a memory pool only when all of its elements are freed,
      reuse an element as soon as it is freed.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add queue to store outstanding_tx_empty command · 51f6709f
      Norio Yamaguchi authored and Pavan Balaji committed
      
      
      When the value of outstanding_connection_tx is not 0, it may not be
      possible to send the command reporting outstanding_tx_empty. Therefore,
      create a queue, enqueue the command, and send it later.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Improve the method of IB-dereg_mr · 56a0b445
      Norio Yamaguchi authored and Pavan Balaji committed
      
      
      Create a queue to store memory regions that are candidates for
      deregistration. When ibv_post_send fails with ENOMEM, dequeue some
      regions from the queue and release them.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Fix wrong type casting · a499ad05
      Norio Yamaguchi authored and Pavan Balaji committed
      
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Fix the buffer address for send/recv · fded59ae
      Norio Yamaguchi authored and Pavan Balaji committed
      
      
      The datatype being sent or received may be contiguous and still
      have a nonzero lower bound.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Fix data size in a header · 3690597f
      Norio Yamaguchi authored and Pavan Balaji committed
      
      
      It's not necessary to include the size of ib_netmod_trailer_t in data size.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Fix clz after calculating power of two · 586e7122
      Norio Yamaguchi authored and Pavan Balaji committed
      
      
      The value of clz has to be decremented after calculating power of two.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Fix transmission processing of large-message · df39ada6
      Norio Yamaguchi authored and Pavan Balaji committed
      
      
      IB can transmit at most 'max_msg_sz' at one time. Fragmentation is
      required when transmitting a message that exceeds this size.
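
The fragmentation arithmetic can be sketched as follows. These helpers are hypothetical and are not the netmod-IB code. They only show how a message longer than a given `max_msg_sz` splits into full-size fragments plus a shorter final fragment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of fragments needed for a message of
 * 'len' bytes when at most 'max_msg_sz' bytes fit in one transmission.
 * A zero-length message yields zero fragments in this sketch. */
size_t fragment_count(size_t len, size_t max_msg_sz)
{
    assert(max_msg_sz > 0);
    return len / max_msg_sz + (len % max_msg_sz != 0 ? 1 : 0);
}

/* Hypothetical helper: size in bytes of fragment 'i' (0-based). */
size_t fragment_size(size_t len, size_t max_msg_sz, size_t i)
{
    assert(i < fragment_count(len, max_msg_sz));
    size_t remaining = len - i * max_msg_sz;
    return remaining < max_msg_sz ? remaining : max_msg_sz;
}
```

For example, a 10-byte message with a 4-byte limit would go out as fragments of 4, 4, and 2 bytes.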
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Fix some conditional judgments in ib_drain_scq · 2f25f427
      Norio Yamaguchi authored and Pavan Balaji committed
      
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Delete some unnecessary increments · d5c2a5da
      Norio Yamaguchi authored and Pavan Balaji committed
      
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add a command to notify of outstanding_tx_empty · 5bfff7d3
      Norio Yamaguchi authored and Pavan Balaji committed
      
      
      Add a command for reporting that outstanding_tx is empty, and confirm
      that outstanding_tx is empty on both sides before closing a connection.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  2. 30 Jul, 2014 3 commits
    • Change default values of CVARs in RMA code. · 522c2688
      Xin Zhao authored
      
      
      Change default values of MPIR_CVAR_CH3_RMA_NREQUEST_NEW_THRESHOLD,
      MPIR_CVAR_CH3_RMA_NREQUEST_VISIT_THRESHOLD and
      MPIR_CVAR_CH3_RMA_NREQUEST_TEST_THRESHOLD for better performance.
      
      These values come from running graph500 on a single node of the BLUES
      and breadboard machines, with 8 or 16 processes and problem sizes from
      2^16 to 2^20. We set the threshold on the number of new requests since
      the last attempt to complete pending requests to 0, so that the issuing
      code always tries to complete pending requests. We also disable the
      threshold of completed requests in GC and set the threshold of tested
      requests in GC to 100, so that GC has the opportunity to find more
      pending requests.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Remove old FT functions · b76ebc6f
      Wesley Bland authored
      
      
      A few MPIX functions pertaining to fault tolerance were added to MPICH a while
      back. These functions are no longer applicable given the new ULFM
      implementation so they are being removed here.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Fixed compile warnings in MPI_File_f2c with --disable-romio · 1d3684ec
      Junchao Zhang authored
      
      
      MPI_File_f2c() is expected to return a value of type MPI_File. However,
      the old code returned MPI_File*, so we add a type cast here. Whether the
      returned value is correct is not important since romio is disabled and we
      just want to make the compiler happy.
      
      Fixes #2140
      Signed-off-by: Antonio J. Pena <apenya@mcs.anl.gov>
  3. 29 Jul, 2014 4 commits
    • Move MPIX functions to the end of mpi.h · 3e5395d4
      Wesley Bland authored
      
      
      Moving the MPIX functions to the end of the mpi.h.in file will help with ABI
      compatibility.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Move MPIX_ERR_FAIL_STOP to the MPIX errcode section · 339b9cc6
      Wesley Bland authored
      
      
      MPIX error codes now have their own section. Move MPIX_ERR_FAIL_STOP to that
      section with a new value.
      
      This does not break ABI compatibility because this error code was prefixed MPIX
      and therefore is not available in all implementations.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Reorganize mpi.h for MPIX err codes · 8189d9cf
      Wesley Bland authored
      
      
      The mpi.h header already had one MPIX error code in the middle of the regular
      MPI error codes. This is bad for ABI compatibility since if that error code
      ever needs to change, it might cause problems. To avoid this, we'll now have a
      new value called MPICH_ERR_FIRST_MPIX which is bigger than MPICH_ERR_LAST_CLASS.
      All MPIX error classes will be based on this value. If anything above that
      value ever changes, it's not a problem because it's MPIX and not part of the
      ABI agreement.
      
      There is a gap between MPICH_ERR_LAST_CLASS and MPICH_ERR_FIRST_MPIX because
      sock is currently using these values for some of its internal error codes.
      Someday in the future, we could consider removing this gap if sock goes away.
      
      This commit also does some minor reordering of the error codes within the file
      (not their values) for readability reasons.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Fixed datatype displacement use in portals4 netmod · 494f597b
      Antonio Pena Monferrer authored
      
      
      The current implementation was not taking into account the datatype offset.
      These changes mainly add the "dt_true_lb" field from the MPIDI_Datatype_get_info
      function to the user-specified buffer pointer.
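
The pointer adjustment described above amounts to a byte-offset addition. The sketch below is not the portals4 netmod code: the helper name `data_start` is hypothetical, and `true_lb` stands in for the "dt_true_lb" value mentioned in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the address where the data actually
 * begins, given the user-specified buffer pointer and the datatype's
 * true lower bound in bytes. */
const void *data_start(const void *user_buf, ptrdiff_t true_lb)
{
    assert(user_buf != NULL);
    /* Offsets are in bytes, so do the arithmetic on a char pointer. */
    return (const char *) user_buf + true_lb;
}
```

Passing the unadjusted user pointer is only correct when the true lower bound is zero, which is the case the earlier code implicitly assumed.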
      Signed-off-by: Ken Raffenetti <raffenet@mcs.anl.gov>
  4. 28 Jul, 2014 1 commit