- 31 Jul, 2014 26 commits
-
-
Wesley Bland authored
MPI_Comm_revoke is a special function because it does not have a matching call on the "receiving side". This is because it has to act as an out-of-band, resilient broadcast algorithm. Because of this, in this commit, in addition to the usual functions to implement MPI communication calls (MPI/MPID/CH3/etc.), we add a new CH3 packet type that will handle revoking a communicator without involving a matching call from the MPI layer (similar to how RMA is currently implemented). The thing that must be handled most carefully when revoking a communicator is to ensure that a previously used context ID will eventually be returned to the pool of available context IDs and that after this occurs, no old messages will match the new usage of the context ID (for instance, if some messages are very slow and show up late). To accomplish this, revoke is implemented as an all-to-all algorithm. When one process calls revoke, it will send a message to all other processes in the communicator, which will trigger that process to send a message to all other processes, and so on. Once a process has already revoked its communicator locally, it won't send out another wave of messages. As each process receives the revoke messages from the other processes, it will track how many messages have been received. Once it has either received a revoke message or a message about a process failure for each other process, it will release its refcount on the communicator object. After the application has freed all of its references to the communicator (and all requests, files, etc. associated with it), the context ID will be returned to the available pool. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
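The counting rule described in this commit message can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions, not the CH3 implementation: `simulate_revoke`, `proc_t`, `msg_t`, and `MAX_PROCS` are hypothetical names, failures are assumed to have happened before the revoke starts, and a failure notice is modeled as a simple decrement of the per-process counter.

```c
#include <assert.h>

#define MAX_PROCS 8    /* hypothetical limit for this simulation */

typedef struct {
    int revoked;       /* has this process revoked locally? */
    int remaining;     /* peers still owing a revoke message or failure notice */
} proc_t;

typedef struct { int src, dst; } msg_t;

/* Simulate the revoke flood among n processes. failed[i] != 0 marks
 * process i as dead before the revoke starts. Returns how many live
 * processes end up having heard from (or about) every other process,
 * i.e. how many could release their reference on the communicator. */
int simulate_revoke(int n, int initiator, const int *failed)
{
    proc_t procs[MAX_PROCS];
    msg_t queue[MAX_PROCS * MAX_PROCS];
    int head = 0, tail = 0, released = 0;

    assert(n > 0 && n <= MAX_PROCS && !failed[initiator]);

    for (int i = 0; i < n; i++) {
        procs[i].revoked = 0;
        procs[i].remaining = n - 1;
        for (int j = 0; j < n; j++)
            if (j != i && failed[j])
                procs[i].remaining--;    /* failure notice stands in for peer j */
    }

    /* The initiator revokes locally and sends the first wave. */
    procs[initiator].revoked = 1;
    for (int j = 0; j < n; j++) {
        if (j != initiator) {
            queue[tail].src = initiator;
            queue[tail].dst = j;
            tail++;
        }
    }

    while (head < tail) {
        msg_t m = queue[head++];
        if (failed[m.dst])
            continue;                    /* a message to a dead process is lost */
        procs[m.dst].remaining--;
        if (!procs[m.dst].revoked) {     /* first revoke seen: forward exactly once */
            procs[m.dst].revoked = 1;
            for (int j = 0; j < n; j++) {
                if (j != m.dst) {
                    queue[tail].src = m.dst;
                    queue[tail].dst = j;
                    tail++;
                }
            }
        }
    }

    for (int i = 0; i < n; i++)
        if (!failed[i] && procs[i].remaining == 0)
            released++;
    return released;
}
```

Each live process sends exactly one wave of n - 1 messages, so the queue never holds more than n * (n - 1) entries.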
-
Wesley Bland authored
This test ensures that MPI_ANY_SOURCE receives are handled correctly after a failure occurs. It tests both that failures are returned when they should be (unacknowledged failures) and not returned when they shouldn't be (acknowledged failures). Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
The collectively active field wasn't doing anything anymore so it's been removed. This was a remnant from a previous FT proposal. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
There was a case where an MPI_ANY_SOURCE recv call could complete successfully if there was already a message waiting in the unexpected receive queue when the call to a receive function was processed, even if any_source operations had already been disabled on the communicator because of an unacknowledged failure. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
This commit adds the new functions MPI(X)_COMM_FAILURE_ACK and MPI(X)_COMM_FAILURE_GET_ACKED. These two functions together allow the user to get the group of failed processes. Most of the implementation for this is pushed into the MPID layer since some systems won't support this (PAMI). The existing function MPIDI_CH3U_Check_for_failed_procs has been modified to give back the group of acknowledged failed processes. There is an inefficiency here in that the list of failed processes is retrieved from PMI and parsed every time the user calls both failure_ack and get_acked, but this means we don't have to try to cache the list that comes back from PMI (which could potentially be expensive, but would have some cost even in the failure-free case). This commit adds a field to the MPID_Comm structure called last_ack_rank. This is a single integer that stores the last acknowledged failure for this communicator, which is used to determine when to stop parsing when getting back the list of acknowledged failed processes. Lastly, this commit includes a test to make sure that all of the above works (test/mpi/ft/failure_ack). This tests that a failure is appropriately included in the failed group and excluded if the failure was not previously acknowledged. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
This function will take a last_failed value and generate an MPID_Group. If the value is MPI_PROC_NULL, then it will parse the entire list. This function is exposed by MPID so it can be used by any function that needs the list of failed processes. This change necessitated changing the way the list of failed processes is retrieved from PMI. Rather than allocating a char array on demand every time we get the list from PMI, this string is now allocated at init time and freed at finalize time. This means that we can cache the value to be used later for things like just querying the list of processes that we already know have failed, rather than also getting the new list (which is important for the failure_ack/get_acked semantics). Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
Previously, PMI provided a list of failed processes as a sorted list via a string. This meant the list could look something like this: 1,3-5,7,10. However, in the new fault tolerance specification, the function MPI_COMM_FAILURE_ACK needs to be able to determine the local order of the failures to more efficiently acknowledge them without creating a list per communicator. This requires that PMI not sort or compress the failure notification. So now, the previous string could look like this: 3,1,4,5,10,7. Obviously, this is less efficient if there are lots of failures. Hopefully, this is something that can be fixed in future versions of PMI. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
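A minimal sketch of why the unsorted format helps: with the list kept in arrival order, a single stored rank is enough to mark where the acknowledged failures end. The helper name `parse_failed_until` is hypothetical and is not an MPICH or PMI function, and a negative `last_ack_rank` is used here as a stand-in for MPI_PROC_NULL.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: walk an unsorted, uncompressed failure list such
 * as "3,1,4,5,10,7" in arrival order and copy ranks into out[] until
 * last_ack_rank has been copied. A negative last_ack_rank (standing in
 * for MPI_PROC_NULL) copies the whole list. Returns the number of ranks
 * written to out[]. */
int parse_failed_until(const char *list, int last_ack_rank, int *out, int max_out)
{
    int n = 0;
    const char *p = list;

    assert(list != NULL && out != NULL);

    while (*p != '\0' && n < max_out) {
        char *end;
        long rank = strtol(p, &end, 10);
        if (end == p)
            break;                  /* no digits found: stop on malformed input */
        out[n++] = (int) rank;
        if (last_ack_rank >= 0 && rank == last_ack_rank)
            break;                  /* reached the last acknowledged failure */
        p = (*end == ',') ? end + 1 : end;
    }
    return n;
}
```

With the list "3,1,4,5,10,7" and a last acknowledged rank of 4, the helper yields 3, 1, 4 and stops; the sorted and compressed form "1,3-5,7,10" carries no such ordering information.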
-
Wesley Bland authored
The MPI_Waitall and MPI_Testall functions should return MPIX_ERR_PROC_FAILED_PENDING when a process failure prevents the operations from completing. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
This is a new error code required by the ULFM proposal. This code replaces MPI_ERR_PENDING in cases where the failure that would have otherwise caused MPI_ERR_PENDING is related to process failure (MPIX_ERR_PROC_FAILED). In that case, we return MPIX_ERR_PROC_FAILED_PENDING instead. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
Previously, MPICH was using MPIX_ERR_FAIL_STOP as the generic error code for process failures. The ULFM document specifies the error code to be MPIX_ERR_PROC_FAILED. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
When an isendv fails in MPIDI_CH3_EagerSyncNoncontigSend, the request is set to NULL before returning it back to the caller. Unfortunately, the request is not allocated inside this function so if we pass back NULL, we lose the handle on the request. If we're going to return NULL, we need to make sure the request is destroyed before giving it back. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Wesley Bland authored
There were a few places in MPICH where the error class was being checked against MPICH_ERR_LAST_CLASS and being flagged as invalid if it was too large. This is incorrect now that we have a new space for MPIX error codes. Add MPICH_ERR_LAST_MPIX as a way of keeping track of what the actual last valid error class is. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Increase the number of connection command buffers to 64 to avoid deadlock in command buffer allocation. This is a temporary workaround. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Instead of reusing a memory pool only when all elements of the pool are freed, reuse an element as soon as it is freed. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
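One common way to get per-element reuse is an intrusive free list. The sketch below shows that general technique under assumed names (`pool_t`, `elem_t`, `pool_alloc`, `pool_free`, `POOL_ELEMS`); the commit message does not say how the netmod actually implements it.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_ELEMS 4              /* hypothetical pool size, for illustration */

typedef struct elem {
    struct elem *next_free;       /* link used only while the element is free */
    int payload;
} elem_t;

typedef struct {
    elem_t storage[POOL_ELEMS];
    elem_t *free_list;            /* singly linked list of freed elements */
    int next_unused;              /* index of the first never-used element */
} pool_t;

void pool_init(pool_t *p)
{
    p->free_list = NULL;
    p->next_unused = 0;
}

elem_t *pool_alloc(pool_t *p)
{
    if (p->free_list != NULL) {   /* prefer an element that was freed earlier */
        elem_t *e = p->free_list;
        p->free_list = e->next_free;
        return e;
    }
    if (p->next_unused < POOL_ELEMS)
        return &p->storage[p->next_unused++];
    return NULL;                  /* pool exhausted */
}

void pool_free(pool_t *p, elem_t *e)
{
    assert(p != NULL && e != NULL);
    e->next_free = p->free_list;  /* the element is reusable immediately */
    p->free_list = e;
}
```

A freed element goes back to the head of the list, so the next allocation can return it without waiting for the rest of the pool to drain.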
-
When the value of outstanding_connection_tx is not 0, it may not be possible to send the command that reports outstanding_tx_empty. Therefore, create a queue, enqueue the command, and send it later. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Create a queue to store memory regions that are candidates for deregistration. When ibv_post_send fails with ENOMEM, dequeue some regions from the queue and release them. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
The data format to transmit or receive may be contiguous and have a nonzero lower bound. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
It's not necessary to include the size of ib_netmod_trailer_t in data size. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
The value of clz has to be decremented after calculating power of two. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
IB can transmit a message of at most 'max_msg_sz' at one time. Fragmentation is required when transmitting a message that exceeds this size. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
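The fragmentation arithmetic can be sketched as follows. The function names `num_fragments` and `fragment_len` are hypothetical, and sizes are assumed to be byte counts; the commit message does not show the netmod's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fragments needed to send len bytes when a single transfer
 * carries at most max_msg_sz bytes (ceiling division, written so that
 * the intermediate value cannot overflow). */
size_t num_fragments(size_t len, size_t max_msg_sz)
{
    assert(max_msg_sz > 0);
    return len / max_msg_sz + (len % max_msg_sz != 0);
}

/* Length of fragment number idx (0-based): every fragment is full-sized
 * except possibly the last one. */
size_t fragment_len(size_t len, size_t max_msg_sz, size_t idx)
{
    size_t offset = idx * max_msg_sz;
    size_t remaining;

    assert(max_msg_sz > 0 && offset < len);
    remaining = len - offset;
    return remaining < max_msg_sz ? remaining : max_msg_sz;
}
```

For example, a 10-byte message with a 4-byte limit splits into fragments of 4, 4, and 2 bytes.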
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Add a command for reporting that outstanding_tx is empty, and confirm that outstanding_tx is empty on both sides before closing a connection. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
- 30 Jul, 2014 3 commits
-
-
Xin Zhao authored
Change default values of MPIR_CVAR_CH3_RMA_NREQUEST_NEW_THRESHOLD, MPIR_CVAR_CH3_RMA_NREQUEST_VISIT_THRESHOLD and MPIR_CVAR_CH3_RMA_NREQUEST_TEST_THRESHOLD for better performance. These values come from running graph500 on a single node on the BLUES and breadboard machines, with 8 or 16 processes and problem sizes from 2^16 to 2^20. We set the threshold on the number of new requests since the last attempt to complete pending requests to 0, so that the issuing code will always try to complete pending requests. We also disable the threshold of completed requests in GC and set the threshold of tested requests in GC to 100, so that GC has the opportunity to find more pending requests. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Wesley Bland authored
A few MPIX functions pertaining to fault tolerance were added to MPICH a while back. These functions are no longer applicable given the new ULFM implementation so they are being removed here. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Junchao Zhang authored
MPI_File_f2c() is expected to return a value of type MPI_File. However, the old code returned MPI_File*. We add a type cast here. Whether the returned value is correct is not important since romio is disabled and we just want to make the compiler happy. Fixes #2140 Signed-off-by:
Antonio J. Pena <apenya@mcs.anl.gov>
-
- 29 Jul, 2014 4 commits
-
-
Wesley Bland authored
Moving the MPIX functions to the end of the mpi.h.in file will help with ABI compatibility. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Wesley Bland authored
MPIX error codes now have their own section. Move MPIX_ERR_FAIL_STOP to that section with a new value. This does not break ABI compatibility because this error code was prefixed MPIX and therefore is not available in all implementations. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Wesley Bland authored
The mpi.h header already had one MPIX error code in the middle of the regular MPI error codes. This is bad for ABI compatibility since if that error code ever needs to change, it might cause problems. To avoid this, we'll now have a new value called MPICH_ERR_FIRST_MPIX which is bigger than MPICH_ERR_LAST_CLASS. All MPIX error classes will be based on this value. If anything above that value ever changes, it's not a problem because it's MPIX and not part of the ABI agreement. There is a gap between MPICH_ERR_LAST_CLASS and MPICH_ERR_FIRST_MPIX because sock is currently using these values for some of its internal error codes. Someday in the future, we could consider removing this gap if sock goes away. This commit also does some minor reordering of the error codes within the file (not their values) for readability reasons. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
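The layout described here, together with the MPICH_ERR_LAST_MPIX bound introduced by a later commit in this log, can be sketched as below. All `EX_` names and numeric values are placeholders chosen for illustration; the real values live in mpi.h and are not given in these commit messages. How values inside the gap are treated is not covered by this sketch.

```c
#include <assert.h>

/* Placeholder values for illustration only. */
#define EX_ERR_LAST_CLASS  60     /* last regular MPI error class */
#define EX_ERR_FIRST_MPIX  100    /* base for all MPIX error classes */

/* MPIX classes are defined as offsets from the base, so they can change
 * without disturbing the regular error classes below the gap. */
int ex_mpix_class(int offset)
{
    assert(offset > 0);
    return EX_ERR_FIRST_MPIX + offset;
}

#define EX_ERR_PROC_FAILED          (EX_ERR_FIRST_MPIX + 1)
#define EX_ERR_PROC_FAILED_PENDING  (EX_ERR_FIRST_MPIX + 2)
#define EX_ERR_LAST_MPIX            EX_ERR_PROC_FAILED_PENDING

/* Upper-bound check against the last regular class: rejects MPIX classes. */
int ex_old_class_check(int c)
{
    return c >= 0 && c <= EX_ERR_LAST_CLASS;
}

/* Upper-bound check against the last MPIX class: accepts them. */
int ex_new_class_check(int c)
{
    return c >= 0 && c <= EX_ERR_LAST_MPIX;
}
```

The old check illustrates why comparing against the last regular class flags every MPIX class as invalid once those classes sit above the gap.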
-
Antonio Pena Monferrer authored
The current implementation was not taking into account the datatype offset. These changes mainly add the "dt_true_lb" field from the MPIDI_Datatype_get_info function to the user-specified buffer pointer. Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
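The fix amounts to simple pointer arithmetic, sketched here under the assumption that the true lower bound is a signed byte offset. The helper name `data_start` is hypothetical; only the "dt_true_lb" field name comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Address of the first byte of actual data: the user-specified buffer
 * pointer shifted by the datatype's true lower bound, which may be
 * positive or negative. */
const char *data_start(const void *user_buf, ptrdiff_t dt_true_lb)
{
    assert(user_buf != NULL);
    return (const char *) user_buf + dt_true_lb;
}
```

Passing the unshifted pointer is only correct when the lower bound happens to be zero.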
-
- 28 Jul, 2014 1 commit
-
-
Rob Latham authored
Large numbers of I/O aggregators would result in a value longer than MPI_MAX_INFO_VAL. A guard was in place but my logic for limiting the length of the array was incorrect. Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
- 23 Jul, 2014 1 commit
-
-
Pavan Balaji authored
No reviewer.
-
- 22 Jul, 2014 5 commits
-
-
Pavan Balaji authored
This reverts commit 9443bde4.
-
- Added cancel_recv and cancel_send netmod calls under ENABLE_COMM_OVERRIDES - Extended MPIDI_CH3I_comm structure with netmode_comm field (this field can store netmod context information related to the communicator; for example, mxm stores its mxm_mq_h value) Change-Id: If89860d44840313bce6f7403190faec302c1bafc Signed-off-by:
Igor Ivanov <Igor.Ivanov@itseez.com>
-
MXM is a MellanoX Messaging library which provides best of breed performance and scalability for HPC applications. Change-Id: Ic40e6ec49571f42506ca5707c770025a5509d565 Signed-off-by:
Igor Ivanov <Igor.Ivanov@itseez.com>
-
Kenneth Raffenetti authored
When making progress on a large send, the source will see events of type PTL_EVENT_GET or PTL_EVENT_ACK, indicating completion of operations at the target. This assertion incorrectly failed on PTL_EVENT_GET, probably from a copy/paste error from the above "if" statement. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
Kenneth Raffenetti authored
Source and destination were in the wrong order. In large contiguous messages, the first chunk of target buffer was not updated. Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-