Wesley Bland authored
This commit adds the new functions MPI(X)_COMM_FAILURE_ACK and MPI(X)_COMM_FAILURE_GET_ACKED. Together, these two functions allow the user to get the group of failed processes.

Most of the implementation is pushed into the MPID layer, since some systems (PAMI) won't support this. The existing function MPIDI_CH3U_Check_for_failed_procs has been modified to give back the group of acknowledged failed processes.

There is an inefficiency here: the list of failed processes is retrieved from PMI and parsed on every call to failure_ack and to get_acked. This could potentially be expensive, but it means we don't have to try to cache the list that comes back from PMI, which would have some cost even in the failure-free case.

This commit adds a field called last_ack_rank to the MPID_Comm structure. It is a single integer that stores the last acknowledged failure for this communicator, and it is used to determine when to stop parsing when getting back the list of acknowledged failed processes.

Lastly, this commit includes a test to make sure that all of the above works (test/mpi/ft/failure_ack). It checks that a failure is included in the failed group once it has been acknowledged and excluded if it was not previously acknowledged.

Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
8652e0ad
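
The role of last_ack_rank described in the commit message can be illustrated with a small standalone sketch. This is not the MPICH code: it assumes the failed-process list arrives as a comma-separated string of ranks in the order the failures were detected, and the names comm_sketch, parse_failed, sketch_failure_ack, and sketch_get_acked are hypothetical. Only the field name last_ack_rank comes from the commit message.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-communicator state. Only the field
 * name last_ack_rank is taken from the commit message. */
typedef struct {
    int last_ack_rank;   /* -1: no failure has been acknowledged yet */
} comm_sketch;

/* Parse a comma-separated rank list (assumed format) into out[].
 * If stop_rank >= 0, parsing stops right after that rank is stored.
 * Returns the number of ranks written to out[]. */
static int parse_failed(const char *list, int stop_rank, int *out, int max_out)
{
    assert(list != NULL && out != NULL);
    int n = 0;
    const char *p = list;
    while (*p != '\0' && n < max_out) {
        char *end;
        long rank = strtol(p, &end, 10);
        if (end == p)
            break;                        /* no digits here: stop parsing */
        out[n++] = (int) rank;
        if (stop_rank >= 0 && rank == stop_rank)
            break;                        /* reached the last acknowledged failure */
        p = (*end == ',') ? end + 1 : end;
    }
    return n;
}

/* Acknowledge every failure currently in the list by remembering the
 * last rank it contains. An empty list leaves the state unchanged. */
static void sketch_failure_ack(comm_sketch *comm, const char *list)
{
    const char *p = list;
    while (*p != '\0') {
        char *end;
        long rank = strtol(p, &end, 10);
        if (end == p)
            break;
        comm->last_ack_rank = (int) rank;
        p = (*end == ',') ? end + 1 : end;
    }
}

/* Collect only the failures acknowledged so far: re-parse the current
 * list and stop at last_ack_rank. Returns 0 if nothing was acknowledged. */
static int sketch_get_acked(const comm_sketch *comm, const char *list,
                            int *out, int max_out)
{
    if (comm->last_ack_rank < 0)
        return 0;
    return parse_failed(list, comm->last_ack_rank, out, max_out);
}
```

In this sketch, if the list is "3,1" when sketch_failure_ack is called and has grown to "3,1,5" by the time sketch_get_acked is called, only ranks 3 and 1 are returned, because rank 5 failed after the acknowledgment. The list is re-parsed on every call and never cached, which mirrors the trade-off noted in the commit message.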