1. 26 Nov, 2014 1 commit
  2. 13 Nov, 2014 2 commits
    • Add test for revoke+shrink · a81ec2ac
      Wesley Bland authored
      This tests the behavior after a failure when using revoke+shrink. Right
      now this test still fails so it is marked as xfail.
      
      See #2198
      
      No reviewer
      a81ec2ac
    • Cleanup FT tests · f0f2c00a
      Wesley Bland authored
      
      
      Some of the FT tests were not correctly setting their error handlers to
      MPI_ERRORS_RETURN. While this doesn't seem to have caused problems, it's
      safer to do so.
      
      This commit also cleans up some unused variables, reorders communicator
      creation, and correctly frees some variables to avoid some debugging
      output.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
      f0f2c00a
  3. 12 Nov, 2014 3 commits
  4. 07 Nov, 2014 1 commit
  5. 06 Nov, 2014 3 commits
  6. 31 Oct, 2014 1 commit
  7. 23 Oct, 2014 2 commits
  8. 22 Oct, 2014 1 commit
    • Fix typo in bcast macro · e49213d6
      Wesley Bland authored
      
      
      The macro that called the bcast function left out an underscore in the
      mpi_errno return value. This caused the test to always return MPI_ERR_OTHER
      instead of the value being returned by the underlying bcast function.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
      e49213d6
  9. 20 Oct, 2014 2 commits
  10. 05 Oct, 2014 1 commit
  11. 03 Oct, 2014 1 commit
  12. 02 Sep, 2014 1 commit
    • Mark shrink test as xfail. · 17356453
      Wesley Bland authored
      All of the FT tests will stay xfail for a while until I can figure out what's
      causing all of the nasty debug output.
      
      No reviewer
      17356453
  13. 25 Aug, 2014 1 commit
    • Correctly report the error class in receive queue · 14fd9c43
      Wesley Bland authored
      
      
      The receive queue reported errors related to process failure in ad hoc ways
      that did not match how the error codes are supposed to be returned. This
      patch sets the correct error class in the correct place, so
      dequeue_and_set_error no longer needs extra logic to set the class itself.
      
      This seems to get a couple of the tests to pass in non-debug mode.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
      14fd9c43
  14. 13 Aug, 2014 1 commit
  15. 06 Aug, 2014 1 commit
  16. 31 Jul, 2014 7 commits
    • Mark new tests as xfail · 29d4c54f
      Wesley Bland authored
      
      
      The new tests don't pass yet due to some corner cases. However, we need to go
      ahead and push this into master, so they'll be xfail for now. This will get
      picked up as part of #1945.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
      29d4c54f
    • Add MPIX_Comm_agree · 1f0ee136
      Wesley Bland authored
      
      
      Adds a function implementing an agreement algorithm for the user. This
      function lets the user manually perform an agreement as well as detect
      unacknowledged failures.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
      1f0ee136
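The behavior described in this commit can be modeled without MPI. The following is a minimal sketch, not the MPICH implementation: it assumes the flag semantics from the ULFM proposal (the agreed value is the bitwise AND of the flags contributed by live processes) and reports an error when a failed process has not been acknowledged. The names `agree_proc`, `sim_agree`, `SIM_SUCCESS`, and `SIM_ERR_PROC_FAILED` are hypothetical, and tracking acknowledgement per failed process is a simplification of the caller's view.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for MPI_SUCCESS and MPIX_ERR_PROC_FAILED. */
enum { SIM_SUCCESS = 0, SIM_ERR_PROC_FAILED = 1 };

/* Simplified view of one process as seen by the caller (assumption). */
typedef struct {
    bool alive;          /* false once the process has failed */
    bool failure_acked;  /* true if the caller acknowledged this failure */
    int  flag;           /* value this process contributes to the agreement */
} agree_proc;

/* Combine the flags of all live processes with a bitwise AND and report
 * whether any failed process is still unacknowledged. */
int sim_agree(const agree_proc *procs, size_t n, int *flag_out)
{
    int agreed = ~0;          /* identity element for bitwise AND */
    int rc = SIM_SUCCESS;

    for (size_t i = 0; i < n; i++) {
        if (procs[i].alive) {
            agreed &= procs[i].flag;
        } else if (!procs[i].failure_acked) {
            rc = SIM_ERR_PROC_FAILED;   /* unacknowledged failure detected */
        }
    }
    *flag_out = agreed;
    return rc;
}
```

In this model the agreed value is still produced when an unacknowledged failure is present; only the return code changes, which is how a caller could use the call to detect such failures.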
    • Add MPIX_Comm_shrink functionality · 5be10ce9
      Wesley Bland authored
      
      
      This adds a new function MPIX_COMM_SHRINK. This is a communicator creation
      function that creates a new communicator based on a previous communicator, but
      excluding any failed processes.
      
      As part of the operation, the shrink call needs to perform an agreement to
      determine the group of failed processes. This is done using the algorithm
      published by Hursey et al. in their EuroMPI '12 paper.
      
      The list of failed processes is collected using a bit array. This happens via
      a few new functions in the CH3 layer to create and send a bit array to the
      master process and receive an updated bit array. This is not a very
      scalable implementation yet, but something better can easily be plugged in
      here to replace the naïve implementation. This is also a use case for an
      MPI_Recv_reduce for future reference.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
      5be10ce9
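The bit array collection described above can be illustrated with a small, MPI-free sketch. It is not the CH3 code: the helper names (`bit_set`, `bit_test`, `sim_merge`, `sim_shrink_ranks`) are hypothetical, the merge at the master is assumed to be a bitwise OR of the per-process arrays, and surviving processes are assumed to keep their relative rank order in the new communicator.

```c
#include <stdint.h>

/* Number of 32-bit words needed to hold one bit per process. */
#define SIM_WORDS(n) (((n) + 31u) / 32u)

/* Mark a rank as failed in the bit array. */
static void bit_set(uint32_t *bits, unsigned rank)
{
    bits[rank / 32] |= 1u << (rank % 32);
}

/* Return 1 if the rank is marked as failed, 0 otherwise. */
static int bit_test(const uint32_t *bits, unsigned rank)
{
    return (int)((bits[rank / 32] >> (rank % 32)) & 1u);
}

/* Master side: fold one process's view of failures into the global view. */
void sim_merge(uint32_t *dst, const uint32_t *src, unsigned nprocs)
{
    for (unsigned w = 0; w < SIM_WORDS(nprocs); w++)
        dst[w] |= src[w];
}

/* Assign ranks in the shrunken communicator: failed processes get -1,
 * survivors are numbered consecutively. Returns the new size. */
int sim_shrink_ranks(const uint32_t *failed, unsigned nprocs, int *new_rank)
{
    int next = 0;
    for (unsigned r = 0; r < nprocs; r++)
        new_rank[r] = bit_test(failed, r) ? -1 : next++;
    return next;
}
```

Every process sending a full array to a single master is what makes this approach hard to scale, which matches the commit's own caveat.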
    • Add MPI_Comm_revoke · 57f6ee88
      Wesley Bland authored
      
      
      MPI_Comm_revoke is a special function because it does not have a matching call
      on the "receiving side". This is because it has to act as an out-of-band,
      resilient broadcast algorithm. Because of this, in this commit, in addition to
      the usual functions to implement MPI communication calls (MPI/MPID/CH3/etc.),
      we add a new CH3 packet type that will handle revoking a communicator without
      involving a matching call from the MPI layer (similar to how RMA is currently
      implemented).
      
      The thing that must be handled most carefully when revoking a communicator is
      to ensure that a previously used context ID will eventually be returned to the
      pool of available context IDs and that after this occurs, no old messages will
      match the new usage of the context ID (for instance, if some messages are very
      slow and show up late). To accomplish this, revoke is implemented as an
      all-to-all algorithm. When one process calls revoke, it sends a message to
      every other process in the communicator. Each recipient that has not yet
      revoked the communicator does so and sends its own message to all other
      processes. A process that has already revoked its communicator locally
      does not send out another wave of messages. Each process tracks how many
      revoke messages it has received. Once it has received, for every other
      process, either a revoke message or a notification that the process
      failed, it releases its refcount on the communicator object. After the application
      has freed all of its references to the communicator (and all requests, files,
      etc. associated with it), the context ID will be returned to the available
      pool.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
      57f6ee88
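The all-to-all revoke propagation described in this commit can be sketched as a single-threaded simulation. This is an illustration only, not the CH3 packet handler: the types and functions (`revoke_proc`, `revoke_msg`, `sim_revoke`) are hypothetical, messages are delivered through a simple FIFO queue, and every live process is assumed to already know which peers have failed.

```c
#include <stdbool.h>

#define SIM_MAX_PROCS 16

typedef struct {
    bool failed;        /* process is dead; messages to it are lost */
    bool revoked;       /* communicator revoked locally */
    int  heard_from;    /* peers accounted for: revoke message or failure notice */
    bool ref_released;  /* refcount on the communicator object dropped */
} revoke_proc;

typedef struct { int src, dst; } revoke_msg;

/* Release the reference once every other process is accounted for. */
static void maybe_release(revoke_proc *q, int n)
{
    if (q->heard_from == n - 1)
        q->ref_released = true;
}

/* Simulate one revoke started by `initiator`.
 * Returns the number of revoke messages sent, or -1 on bad input. */
int sim_revoke(revoke_proc *p, int n, int initiator)
{
    revoke_msg queue[SIM_MAX_PROCS * SIM_MAX_PROCS];
    int head = 0, tail = 0, nfailed = 0;

    if (n < 1 || n > SIM_MAX_PROCS || initiator < 0 || initiator >= n ||
        p[initiator].failed)
        return -1;

    for (int i = 0; i < n; i++)
        if (p[i].failed) nfailed++;
    for (int i = 0; i < n; i++) {
        p[i].revoked = false;
        p[i].ref_released = false;
        /* Failure notices count toward the total (assumed already known). */
        p[i].heard_from = p[i].failed ? 0 : nfailed;
    }

    /* The initiator revokes locally and sends the first wave. */
    p[initiator].revoked = true;
    maybe_release(&p[initiator], n);
    for (int d = 0; d < n; d++)
        if (d != initiator) queue[tail++] = (revoke_msg){initiator, d};

    while (head < tail) {
        revoke_msg m = queue[head++];
        revoke_proc *q = &p[m.dst];
        if (q->failed) continue;
        q->heard_from++;
        if (!q->revoked) {           /* first revoke seen: send one wave */
            q->revoked = true;
            for (int d = 0; d < n; d++)
                if (d != m.dst) queue[tail++] = (revoke_msg){m.dst, d};
        }
        maybe_release(q, n);
    }
    return tail;
}
```

With four processes, one of them failed, and rank 0 as the initiator, each of the three live processes sends one wave of three messages, for nine messages in total. Releasing the reference in this model corresponds only to the library dropping its own refcount; per the commit message, the context ID returns to the pool only after the application has also freed its references.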
    • Add test for anysource handling · 628d2daf
      Wesley Bland authored
      
      
      This test ensures that MPI_ANY_SOURCE receives are handled correctly after a
      failure occurs. It tests both that failures are returned when they should be
      (unacknowledged failures) and not returned when they shouldn't (acknowledged
      failures).
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
      628d2daf
    • Add MPIX_Comm_failure_ack/get_acked · 8652e0ad
      Wesley Bland authored
      
      
      This commit adds the new functions MPI(X)_COMM_FAILURE_ACK and
      MPI(X)_COMM_FAILURE_GET_ACKED. These two functions together allow the user to
      get the group of failed processes.
      
      Most of the implementation for this is pushed into the MPID layer since some
      systems won't support this (PAMI). The existing function
      MPIDI_CH3U_Check_for_failed_procs has been modified to give back the group of
      acknowledged failed processes. There is an inefficiency here in that the list
      of failed processes is retrieved from PMI and parsed every time the user calls
      both failure_ack and get_acked, but this means we don't have to try to cache
      the list that comes back from PMI (which could potentially be expensive, but
      would have some cost even in the failure-free case).
      
      This commit adds a field to the MPID_Comm structure called last_ack_rank.
      This is a single integer that stores the last acknowledged failure for this
      communicator. It is used to determine when to stop parsing when getting
      back the list of acknowledged failed processes.
      
      Lastly, this commit includes a test to make sure that all of the above works
      (test/mpi/ft/failure_ack). This tests that a failure is appropriately included
      in the failed group and excluded if the failure was not previously
      acknowledged.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
      8652e0ad
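The role of last_ack_rank can be illustrated with a short sketch. It rests on assumptions that the commit message does not confirm: the failed-process list is taken to be a comma-separated string of ranks in detection order (for example "3,1,5"), and the function names `sim_failure_ack` and `sim_get_acked` are hypothetical stand-ins rather than MPICH functions.

```c
#include <stdlib.h>

/* Model of failure_ack: scan the current list and return the last rank in
 * it, which the caller would store as last_ack_rank. Returns -1 if the
 * list is empty. */
int sim_failure_ack(const char *failed_list)
{
    int last = -1;
    const char *s = failed_list;

    while (*s != '\0') {
        char *end;
        long r = strtol(s, &end, 10);
        if (end == s) break;              /* not a number: stop parsing */
        last = (int)r;
        s = (*end == ',') ? end + 1 : end;
    }
    return last;
}

/* Model of get_acked: re-parse the (possibly longer) list and stop once
 * last_ack_rank has been collected. Returns the number of acknowledged
 * ranks written to `acked`. */
int sim_get_acked(const char *failed_list, int last_ack_rank,
                  int *acked, int max)
{
    int count = 0;
    const char *s = failed_list;

    if (last_ack_rank < 0) return 0;      /* nothing acknowledged yet */
    while (*s != '\0' && count < max) {
        char *end;
        long r = strtol(s, &end, 10);
        if (end == s) break;
        acked[count++] = (int)r;
        if ((int)r == last_ack_rank) break;   /* later entries are unacknowledged */
        s = (*end == ',') ? end + 1 : end;
    }
    return count;
}
```

If the list was "3,1" at acknowledgement time and later grows to "3,1,5", the second function returns only ranks 3 and 1, so the failure of rank 5 stays out of the acknowledged group until the next acknowledgement. Storing one integer avoids caching the whole list, at the price of re-parsing it on each call.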
    • Rename error code to MPIX_ERR_PROC_FAILED · 6ce71547
      Wesley Bland authored
      
      
      Previously, MPICH was using MPIX_ERR_FAIL_STOP as the generic error code for
      process failures. The ULFM document specifies the error code to be
      MPIX_ERR_PROC_FAILED.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
      6ce71547
  17. 26 Apr, 2014 1 commit
    • Mark ft/isendalive as xfail · 00d075e2
      Wesley Bland authored
      This test is behaving badly now as well. Given the overhaul of the ft system
      that's coming, we're marking these tests as xfail now instead of putting a lot
      of work into fixing them. They'll get cleaned up when the ULFM changes go in.
      
      No reviewer
      00d075e2
  18. 30 Jan, 2014 1 commit
  19. 10 Jan, 2014 1 commit
  20. 09 Jan, 2014 1 commit
  21. 11 Nov, 2013 1 commit
  22. 02 Oct, 2013 3 commits
  23. 27 Sep, 2013 2 commits
  24. 20 Aug, 2013 1 commit