    Add MPIX_Comm_failure_ack/get_acked · 8652e0ad
    Wesley Bland authored
    
    
    This commit adds the new functions MPI(X)_COMM_FAILURE_ACK and
    MPI(X)_COMM_FAILURE_GET_ACKED. These two functions together allow the user to
    get the group of failed processes.
    
    Most of the implementation for this is pushed into the MPID layer since some
    systems won't support this (PAMI). The existing function
    MPIDI_CH3U_Check_for_failed_procs has been modified to give back the group of
    acknowledged failed processes. There is an inefficiency here in that the list
    of failed processes is retrieved from PMI and parsed every time the user calls
    either failure_ack or get_acked, but this means we don't have to try to cache
    the list that comes back from PMI (caching could potentially be expensive and
    would have some cost even in the failure-free case).
    
    This commit adds a field to the MPID_Comm structure called last_ack_rank.
    This is a single integer that stores the rank of the last acknowledged
    failure for this communicator, which is used to determine when to stop
    parsing when getting back the list of acknowledged failed processes.
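
    The role of last_ack_rank can be illustrated with a small standalone model.
    This is only a sketch and not the code in this commit: model_comm,
    parse_failed, model_failure_ack and model_get_acked are hypothetical names,
    and it assumes the failed-process list arrives as a comma-separated string
    of ranks to which new failures are only ever appended. The real format of
    the list returned by PMI may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the single field added to MPID_Comm. */
typedef struct {
    int last_ack_rank;          /* last acknowledged failed rank, -1 if none */
} model_comm;

/* Parse a comma-separated rank list (assumed format) into out[].
 * Returns the number of ranks stored, at most max. */
static int parse_failed(const char *list, int *out, int max)
{
    int n = 0;
    const char *p = list;

    assert(list != NULL && out != NULL);
    while (*p != '\0' && n < max) {
        char *end;
        long r = strtol(p, &end, 10);
        if (end == p)
            break;              /* no digits found: stop parsing */
        out[n++] = (int) r;
        p = (*end == ',') ? end + 1 : end;
    }
    return n;
}

/* Model of failure_ack: fetch the list again and remember its last entry. */
static void model_failure_ack(model_comm *comm, const char *failed_list)
{
    int ranks[64];
    int n = parse_failed(failed_list, ranks, 64);

    if (n > 0)
        comm->last_ack_rank = ranks[n - 1];
}

/* Model of get_acked: fetch the list again and copy entries until
 * last_ack_rank has been copied. Failures listed after it were never
 * acknowledged and are left out. Returns the number of acked ranks. */
static int model_get_acked(const model_comm *comm, const char *failed_list,
                           int *acked, int max)
{
    int ranks[64];
    int n, i, count = 0;

    if (comm->last_ack_rank < 0)
        return 0;               /* nothing has been acknowledged yet */

    n = parse_failed(failed_list, ranks, 64);
    for (i = 0; i < n && count < max; i++) {
        acked[count++] = ranks[i];
        if (ranks[i] == comm->last_ack_rank)
            break;              /* stop parsing at the last acked failure */
    }
    return count;
}
```

    In this model, acknowledging when the list is "3" and then querying after
    the list has grown to "3,1" yields only rank 3. Rank 1 appears in the
    result only after a second acknowledgement. Both calls re-parse the list,
    which mirrors the inefficiency described above.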
    
    Lastly, this commit includes a test to make sure that all of the above works
    (test/mpi/ft/failure_ack). This tests that a failure is appropriately included
    in the failed group and excluded if the failure was not previously
    acknowledged.
    Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
mpidpre.h 18 KB