- 31 Jul, 2014 32 commits
-
-
Wesley Bland authored
If a process is dead, collectives still do all of the communications to prevent a deadlock. However, if we just skip the part where the data is updated in the allreduce_group function, we can let it be slightly more resilient to failures and possibly even produce a correct answer in the presence of a failure. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
Adds a second parameter to MPID_Comm_valid_ptr that controls whether the macro ignores the revoke flag. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
Adds function implementing an agreement algorithm for the user. This function lets the user manually perform an agreement as well as detect unacknowledged failures. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
This adds a new function MPIX_COMM_SHRINK. This is a communicator creation function that creates a new communicator based on a previous communicator, but excluding any failed processes. As part of the operation, the shrink call needs to perform an agreement to determine the group of failed processes. This is done using the algorithm published by Hursey et al. in their EuroMPI '12 paper. The list of failed processes is collected using a bit array. This happens via a few new functions in the CH3 layer to create and send a bitarray to the master process and receive an updated bitarray. Obviously, this is not a very scalable implementation yet, but something better can easily be plugged in here to replace the naïve implementation. This is also a use case for an MPI_Recv_reduce for future reference. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
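The bit-array collection described above can be sketched as follows. This is only an illustration of the idea, not the CH3 code: the names bitarray_words, bitarray_mark_failed, bitarray_is_failed and bitarray_merge are hypothetical, and the merge step assumes the master simply ORs each received array into its own view.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bits held by one array word (an assumption of this sketch). */
#define WORD_BITS ((int) (sizeof(unsigned int) * CHAR_BIT))

/* Number of words needed to hold one bit per process. */
static size_t bitarray_words(int nprocs)
{
    assert(nprocs > 0);
    return (size_t) ((nprocs + WORD_BITS - 1) / WORD_BITS);
}

/* Record that the process with the given rank is known to have failed. */
static void bitarray_mark_failed(unsigned int *bits, int rank)
{
    assert(rank >= 0);
    bits[rank / WORD_BITS] |= 1u << (rank % WORD_BITS);
}

/* Returns 1 if the rank is marked as failed, 0 otherwise. */
static int bitarray_is_failed(const unsigned int *bits, int rank)
{
    assert(rank >= 0);
    return (int) ((bits[rank / WORD_BITS] >> (rank % WORD_BITS)) & 1u);
}

/* Merge a remote view into the local one: a process counts as failed
 * if either side has seen it fail. */
static void bitarray_merge(unsigned int *dst, const unsigned int *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] |= src[i];
}
```

In this sketch the master would call bitarray_merge once per received array and then send the merged result back, which is the receive-then-reduce pattern the commit message mentions as a use case for MPI_Recv_reduce.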
-
Wesley Bland authored
Piggybacking on the MPID_Comm_valid_ptr check in the HAVE_ERROR_CHECKING block, this checks to see if the communicator has been revoked and returns MPIX_ERR_REVOKED if so. This probably should move out of the HAVE_ERROR_CHECKING section since it requires the user to have this turned on. If the user leaves it off, they'll never be notified. However, if this moves out of the HAVE_ERROR_CHECKING section, it will probably have performance implications. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
MPI_Comm_revoke is a special function because it does not have a matching call on the "receiving side". This is because it has to act as an out-of-band, resilient broadcast algorithm. Because of this, in this commit, in addition to the usual functions to implement MPI communication calls (MPI/MPID/CH3/etc.), we add a new CH3 packet type that will handle revoking a communicator without involving a matching call from the MPI layer (similar to how RMA is currently implemented). The thing that must be handled most carefully when revoking a communicator is to ensure that a previously used context ID will eventually be returned to the pool of available context IDs and that after this occurs, no old messages will match the new usage of the context ID (for instance, if some messages are very slow and show up late). To accomplish this, revoke is implemented as an all-to-all algorithm. When one process calls revoke, it will send a message to all other processes in the communicator, which will trigger that process to send a message to all other processes, and so on. Once a process has already revoked its communicator locally, it won't send out another wave of messages. As each process receives the revoke messages from the other processes, it will track how many messages have been received. Once it has either received a revoke message or a message about a process failure for each other process, it will release its refcount on the communicator object. After the application has freed all of its references to the communicator (and all requests, files, etc. associated with it), the context ID will be returned to the available pool. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
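The all-to-all revoke propagation described above can be simulated in a few lines. This is a single-process sketch under simplifying assumptions: no process fails, messages are delivered in FIFO order, and all names (revoke_sim, revoke_local, simulate_revoke, can_release) are hypothetical rather than taken from the CH3 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PROCS 16

typedef struct {
    bool revoked[MAX_PROCS];            /* has this rank revoked locally? */
    int  notices[MAX_PROCS];            /* revoke messages received so far */
    int  q_dst[MAX_PROCS * MAX_PROCS];  /* destinations of in-flight messages */
    int  q_head, q_tail;
} revoke_sim;

/* Revoke locally and send one message to every other process.
 * A process that has already revoked does not send another wave. */
static void revoke_local(revoke_sim *s, int nprocs, int rank)
{
    if (s->revoked[rank])
        return;
    s->revoked[rank] = true;
    for (int peer = 0; peer < nprocs; peer++)
        if (peer != rank)
            s->q_dst[s->q_tail++] = peer;
}

/* Run the protocol until no messages remain; returns the total number
 * of revoke messages that were sent. */
static int simulate_revoke(revoke_sim *s, int nprocs, int initiator)
{
    assert(nprocs >= 1 && nprocs <= MAX_PROCS);
    assert(initiator >= 0 && initiator < nprocs);
    memset(s, 0, sizeof(*s));
    revoke_local(s, nprocs, initiator);
    while (s->q_head < s->q_tail) {
        int dst = s->q_dst[s->q_head++];
        s->notices[dst]++;             /* track how many peers have reported */
        revoke_local(s, nprocs, dst);  /* first receipt triggers a new wave */
    }
    return s->q_tail;
}

/* A rank may release its refcount once it has heard from every other
 * rank (failure notices would also count; they are not modeled here). */
static bool can_release(const revoke_sim *s, int nprocs, int rank)
{
    return s->notices[rank] == nprocs - 1;
}
```

With n processes and no failures, every process sends exactly n-1 messages, so the simulation ends with n*(n-1) messages in total and every rank able to release its reference.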
-
Wesley Bland authored
This test ensures that MPI_ANY_SOURCE receives are handled correctly after a failure occurs. It tests both that failures are returned when they should be (unacknowledged failures) and not returned when they shouldn't (acknowledged failures). Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
The collectively active field wasn't doing anything anymore so it's been removed. This was a remnant from a previous FT proposal. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
There was a case where an MPI_ANY_SOURCE recv call could complete successfully if there was already a message waiting in the unexpected receive queue when the call to a receive function was processed, even if any_source operations had already been disabled on the communicator because of an unacknowledged failure. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
This commit adds the new functions MPI(X)_COMM_FAILURE_ACK and MPI(X)_COMM_FAILURE_GET_ACKED. These two functions together allow the user to get the group of failed processes. Most of the implementation for this is pushed into the MPID layer since some systems won't support this (PAMI). The existing function MPIDI_CH3U_Check_for_failed_procs has been modified to give back the group of acknowledged failed processes. There is an inefficiency here in that the list of failed processes is retrieved from PMI and parsed every time the user calls both failure_ack and get_acked, but this means we don't have to try to cache the list that comes back from PMI (which could potentially be expensive, but would have some cost even in the failure-free case). This commit adds a field to the MPID_Comm structure called last_ack_rank. This is a single integer that stores the last acknowledged failure for this communicator, which is used to determine when to stop parsing when getting back the list of acknowledged failed processes. Lastly, this commit includes a test to make sure that all of the above works (test/mpi/ft/failure_ack). This tests that a failure is appropriately included in the failed group and excluded if the failure was not previously acknowledged. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
This function will take a last_failed value and generate an MPID_Group. If the value is MPI_PROC_NULL, then it will parse the entire list. This function is exposed by MPID so this can be used by any functions that need the list of failed processes. This change necessitated changing the way the list of failed processes is retrieved from PMI. Rather than allocating a char array on demand every time we get the list from PMI, this string is allocated at init time and freed at finalize time now. This means that we can cache the value to be used later for things like just querying the list of processes that we already know have failed, rather than also getting the new list (which is important for the failure_ack/get_acked semantics). Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
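A minimal sketch of the parsing rule described above, assuming the cached PMI string is a comma-separated list of ranks in the order the failures were reported. The function name parse_failed_ranks and the constant SKETCH_PROC_NULL are hypothetical; the real code builds an MPID_Group rather than filling an int array.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for MPI_PROC_NULL; the real value is
 * implementation-defined. */
#define SKETCH_PROC_NULL (-1)

/*
 * Copy ranks from a comma-separated list into out[], in list order.
 * Parsing stops once last_failed has been copied; passing
 * SKETCH_PROC_NULL parses the entire list.
 * Returns the number of ranks written to out[].
 */
static int parse_failed_ranks(const char *list, int last_failed,
                              int *out, int max_out)
{
    int count = 0;
    const char *p = list;

    assert(list != NULL && out != NULL);

    while (*p != '\0' && count < max_out) {
        char *end;
        long rank = strtol(p, &end, 10);
        if (end == p)
            break;                      /* no digits here: stop parsing */
        out[count++] = (int) rank;
        if (last_failed != SKETCH_PROC_NULL && rank == last_failed)
            break;                      /* reached the requested stop rank */
        p = (*end == ',') ? end + 1 : end;
    }
    return count;
}
```

Because the list is scanned in arrival order, a single stored rank is enough to mark where to stop, so no per-communicator copy of the list is needed.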
-
Wesley Bland authored
Previously, PMI provided a list of failed processes as a sorted list via a string. This meant the list could look something like this: 1,3-5,7,10 However, in the new fault tolerance specification, the function MPI_COMM_FAILURE_ACK needs to be able to determine the local order of the failures to more efficiently acknowledge them without creating a list per communicator. This requires that PMI not sort or compress the failure notification. So now, the previous string could look like this: 3,1,4,5,10,7 Obviously, this is less efficient if there are lots of failures. Hopefully, this is something that can be fixed in future versions of PMI. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
The MPI_Waitall and MPI_Testall functions should return MPIX_ERR_PROC_FAILED_PENDING when a process failure prevents the operations from completing. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
This is a new error code required by the ULFM proposal. This code replaces MPI_ERR_PENDING in cases where the failure that would have otherwise caused MPI_ERR_PENDING is related to process failure (MPIX_ERR_PROC_FAILED). In that case, we return MPIX_ERR_PROC_FAILED_PENDING instead. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
Previously, MPICH was using MPIX_ERR_FAIL_STOP as the generic error code for process failures. The ULFM document specifies the error code to be MPIX_ERR_PROC_FAILED. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
When an isendv fails in MPIDI_CH3_EagerSyncNoncontigSend, the request is set to NULL before returning it back to the caller. Unfortunately, the request is not allocated inside this function so if we pass back NULL, we lose the handle on the request. If we're going to return NULL, we need to make sure the request is destroyed before giving it back. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
There were a few places in MPICH where the error class was being checked against MPICH_ERR_LAST_CLASS and being flagged as invalid if it was too large. This is incorrect now that we have a new space for MPIX error codes. Add MPICH_ERR_LAST_MPIX as a way of keeping track of what the actual last valid error class is. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
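The change to the validity check can be illustrated as follows. The numeric values and the SKETCH_ names are made up for this example and do not reflect the real constants in mpi.h; only the shape of the comparison is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values, chosen only to show a gap between the regular
 * error classes and the MPIX range. */
#define SKETCH_ERR_LAST_CLASS  61
#define SKETCH_ERR_FIRST_MPIX 100
#define SKETCH_ERR_LAST_MPIX  103

/* Old check: anything above the last regular class is rejected,
 * which wrongly rejects MPIX error classes. */
static bool class_valid_old(int errclass)
{
    return errclass >= 0 && errclass <= SKETCH_ERR_LAST_CLASS;
}

/* New check: compare against the last valid MPIX class instead. */
static bool class_valid_new(int errclass)
{
    return errclass >= 0 && errclass <= SKETCH_ERR_LAST_MPIX;
}
```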
-
Increase the number of connection command buffers to 64 to avoid deadlock in command buffer allocation. This is a temporary workaround. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Instead of reusing a memory pool only when all elements of the pool are freed, reuse an element as soon as it is freed. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
When the value of outstanding_connection_tx is not 0, it may not be possible to send a command reporting outstanding_tx_empty. Therefore, create a queue, enqueue the command, and send it later. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Create a queue to store memory regions that are candidates for deregistration. When ibv_post_send fails with ENOMEM, dequeue some regions from the queue and release them. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
The data to transmit or receive may be contiguous and still have a nonzero lower bound. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
It's not necessary to include the size of ib_netmod_trailer_t in data size. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
The value of clz has to be decremented after calculating the power of two. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
IB can transmit up to 'max_msg_sz' at one time. Fragmentation is required when transmitting a message that exceeds that size. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
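The fragmentation arithmetic implied above can be sketched as follows. The helper names fragment_count and fragment_len are hypothetical, and the sketch assumes sizes are byte counts and that every fragment except the last is exactly max_msg_sz long; it does not show how the netmod actually posts the sends.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fragments needed to send msg_len bytes when at most
 * max_msg_sz bytes can go out at once. A zero-length message yields
 * zero fragments in this sketch. */
static size_t fragment_count(size_t msg_len, size_t max_msg_sz)
{
    assert(max_msg_sz > 0);
    return (msg_len + max_msg_sz - 1) / max_msg_sz;
}

/* Length of fragment number index (0-based): full-sized except
 * possibly the last one. */
static size_t fragment_len(size_t msg_len, size_t max_msg_sz, size_t index)
{
    size_t offset = index * max_msg_sz;
    assert(max_msg_sz > 0 && offset < msg_len);
    size_t remaining = msg_len - offset;
    return remaining < max_msg_sz ? remaining : max_msg_sz;
}
```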
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Add a command for reporting that outstanding_tx is empty, and confirm that outstanding_tx is empty on both sides before closing a connection. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
- 30 Jul, 2014 3 commits
-
-
Xin Zhao authored
Change the default values of MPIR_CVAR_CH3_RMA_NREQUEST_NEW_THRESHOLD, MPIR_CVAR_CH3_RMA_NREQUEST_VISIT_THRESHOLD and MPIR_CVAR_CH3_RMA_NREQUEST_TEST_THRESHOLD for better performance. These values come from running graph500 on a single node on BLUES and the breadboard machine, with 16 or 8 processes and problem sizes from 2^16 to 2^20. We set the number of new requests since the last attempt to complete pending requests to 0, so that the issuing code will always try to complete pending requests. We also disable the threshold of completed requests in GC and set the threshold of tested requests in GC to 100, so that we have the opportunity to find more pending requests in GC. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Wesley Bland authored
A few MPIX functions pertaining to fault tolerance were added to MPICH a while back. These functions are no longer applicable given the new ULFM implementation so they are being removed here. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Junchao Zhang authored
MPI_File_f2c() is expected to return a value of type MPI_File. However, the old code returned MPI_File*. We add a type cast here. Whether the returned value is correct is not important since romio is disabled and we just want to make the compiler happy. Fixes #2140 Signed-off-by:
Antonio J. Pena <apenya@mcs.anl.gov>
-
- 29 Jul, 2014 4 commits
-
-
Wesley Bland authored
Moving the MPIX functions to the end of the mpi.h.in file will help with ABI compatibility. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Wesley Bland authored
MPIX error codes now have their own section. Move MPIX_ERR_FAIL_STOP to that section with a new value. This does not break ABI compatibility because this error code was prefixed MPIX and therefore is not available in all implementations. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Wesley Bland authored
The mpi.h header already had one MPIX error code in the middle of the regular MPI error codes. This is bad for ABI compatibility since if that error code ever needs to change, it might cause problems. To avoid this, we'll now have a new value called MPICH_ERR_FIRST_MPIX which is bigger than MPICH_ERR_LAST_CLASS. All MPIX error classes will be based on this value. If anything above that value ever changes, it's not a problem because it's MPIX and not part of the ABI agreement. There is a gap between MPICH_ERR_LAST_CLASS and MPICH_ERR_FIRST_MPIX because sock is currently using these values for some of its internal error codes. Someday in the future, we could consider removing this gap if sock goes away. This commit also does some minor reordering of the error codes within the file (not their values) for readability reasons. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Antonio Pena Monferrer authored
The current implementation was not taking into account the datatype offset. These changes mainly add the "dt_true_lb" field from the MPIDI_Datatype_get_info function to the user-specified buffer pointer. Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
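The offset fix described above amounts to the following pointer adjustment. This is a standalone sketch: pack_contig is a hypothetical helper, and true_lb stands in for the "dt_true_lb" value that the commit obtains from MPIDI_Datatype_get_info.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy data_sz bytes of a contiguous datatype out of the user buffer.
 * The data starts true_lb bytes past the user-specified pointer, so
 * the lower bound has to be added before copying. */
static void pack_contig(void *dst, const void *user_buf,
                        ptrdiff_t true_lb, size_t data_sz)
{
    assert(dst != NULL && user_buf != NULL);
    memcpy(dst, (const char *) user_buf + true_lb, data_sz);
}
```

Without the added offset, the copy would start at the user pointer itself and pick up bytes that are not part of the datatype.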
-
- 28 Jul, 2014 1 commit
-
-
Rob Latham authored
Large numbers of I/O aggregators would result in a value longer than MPI_MAX_INFO_VAL. A guard was in place, but my logic for limiting the length of the array was incorrect. Signed-off-by:
Wesley Bland <wbland@anl.gov>
-