- 12 Nov, 2014 13 commits
-
-
Full redesign, mainly of the functions in ptl_nm.c and the communications involving the "control" portal. Still some problems with flow control. Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
-
Min Si authored
Timeouts are reported on some overloaded machines under the 10-minute time limit. Signed-off-by:
Xin Zhao <xinzhao3@illinois.edu>
-
Huiwei Lu authored
Free the group and communicator created in the test so it does not complain when memory debug is on. Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
Huiwei Lu authored
Fixes #1945 Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
Huiwei Lu authored
Similar to d086ac27, check the state of a VC to see if it is valid before creating a group, request or communicator in MPID_Recv. Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
Huiwei Lu authored
MPID_Send should first check the state of a VC to see if it is valid before creating a group, request or communicator. In the case of fault tolerance, if the VC has already been revoked or marked as terminated (e.g., in test/mpi/ft/senddead), the send operation involved should exit without creating any request, group or communicator objects. Signed-off-by:
Wesley Bland <wbland@anl.gov>
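The early-exit pattern this commit describes can be sketched as follows. This is only an illustration: the types and names (`sketch_vc_state_t`, `sketch_vc_t`, `sketch_send`, `requests_allocated`, the `SKETCH_*` return codes) are hypothetical stand-ins, not the MPICH definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical VC states; the real CH3 state set is assumed to be larger. */
typedef enum { SKETCH_VC_ACTIVE, SKETCH_VC_REVOKED, SKETCH_VC_TERMINATED } sketch_vc_state_t;

typedef struct {
    sketch_vc_state_t state;
} sketch_vc_t;

/* Hypothetical return codes for this sketch only. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_VC_INVALID = 1 };

/* Counts "request objects" created, so the early exit can be observed. */
static int requests_allocated = 0;

/* Check the VC first; only create a request when the VC is usable. */
static int sketch_send(const sketch_vc_t *vc)
{
    assert(vc != NULL);
    if (vc->state != SKETCH_VC_ACTIVE) {
        /* Exit before any request, group or communicator is created. */
        return SKETCH_ERR_VC_INVALID;
    }
    requests_allocated++;   /* stands in for request creation */
    return SKETCH_SUCCESS;
}
```

Because the state check happens before any allocation, the failure path has nothing to free, which is what keeps the memory debugger quiet in tests such as senddead.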
-
Wesley Bland authored
The collective FT tests now pass with debug output turned off. See #1945 Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
The MPI collectives get and set the errflag used by the collective helper functions (MPIC_*). The possible values of the errflag changed, so the collective functions need to appropriately set this value using either MPIR_ERR_NONE (MPI_SUCCESS), MPIR_ERR_PROC_FAILED (MPIX_ERR_PROC_FAILED), or MPIR_ERR_OTHER (MPI_ERR_OTHER). This should now allow collectives to correctly report process failures when they occur, fixing the FT tests that use collectives (see #1945). Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
The errflag value being used in the MPIC helper functions only propagated whether or not an error occurred. It did not contain any information about what kind of error occurred, which made returning the correct error code after a process failure impossible. This patch converts the binary value to an enum with three options: MPIR_ERR_NONE, MPIR_ERR_PROC_FAILED, and MPIR_ERR_OTHER. The original FALSE and TRUE values map to MPIR_ERR_NONE and MPIR_ERR_OTHER, respectively. MPIR_ERR_PROC_FAILED indicates that the error occurred because of a process failure. It uses the new bit set aside from the tag space to track such information between processes. This change required modifying many function signatures and type declarations to use the new enum type, but these changes are not very intrusive and should not be a problem going forward. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
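A minimal sketch of the boolean-to-enum conversion described above. The enumerator names come from the commit message. The type name `sketch_errflag_t`, the numeric values, and both helper functions are assumptions made for illustration. In particular, the "first error wins" merge policy is a guess and is not stated in the commit.

```c
#include <assert.h>

/* Enumerator names are from the commit message; the numeric values
 * are placeholders, not the MPICH definitions. */
typedef enum {
    MPIR_ERR_NONE = 0,
    MPIR_ERR_PROC_FAILED = 1,
    MPIR_ERR_OTHER = 2
} sketch_errflag_t;

/* Old boolean flag to new enum: FALSE meant "no error",
 * TRUE meant "some error of unknown kind". */
static sketch_errflag_t sketch_from_bool(int old_flag)
{
    assert(old_flag == 0 || old_flag == 1);
    return old_flag ? MPIR_ERR_OTHER : MPIR_ERR_NONE;
}

/* Hypothetical merge policy: keep the first error that was recorded. */
static sketch_errflag_t sketch_merge(sketch_errflag_t current,
                                     sketch_errflag_t incoming)
{
    if (current != MPIR_ERR_NONE) {
        return current;
    }
    return incoming;
}
```

With three values instead of two, a caller can distinguish a process failure from any other error when choosing the error code to return.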
-
Wesley Bland authored
We need to take another bit from the tag space to specify the difference between a generic failure and a process failure. This patch modifies the macros to handle this situation. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
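One way to picture reserving a second tag bit is the sketch below. The bit positions, macro names and helper functions are all hypothetical. The commit does not say which bits the real macros use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: two high bits of the tag are reserved. */
#define SKETCH_ERROR_BIT      (1u << 30)  /* some error occurred */
#define SKETCH_PROC_FAIL_BIT  (1u << 29)  /* the error was a process failure */
#define SKETCH_TAG_MASK       (~(SKETCH_ERROR_BIT | SKETCH_PROC_FAIL_BIT))

/* Mark a tag as carrying a generic error. */
static uint32_t sketch_tag_set_error(uint32_t tag)
{
    assert((tag & ~SKETCH_TAG_MASK) == 0);  /* user tag must not use reserved bits */
    return tag | SKETCH_ERROR_BIT;
}

/* Mark a tag as carrying a process-failure error. */
static uint32_t sketch_tag_set_proc_failure(uint32_t tag)
{
    assert((tag & ~SKETCH_TAG_MASK) == 0);
    return tag | SKETCH_ERROR_BIT | SKETCH_PROC_FAIL_BIT;
}

static int sketch_tag_has_error(uint32_t tag)
{
    return (tag & SKETCH_ERROR_BIT) != 0;
}

static int sketch_tag_is_proc_failure(uint32_t tag)
{
    return (tag & SKETCH_PROC_FAIL_BIT) != 0;
}

/* Strip the reserved bits to recover the tag used for matching. */
static uint32_t sketch_tag_user_part(uint32_t tag)
{
    return tag & SKETCH_TAG_MASK;
}
```

The cost of this approach is that each reserved bit halves the range of tag values left for matching.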
-
Antonio Pena Monferrer authored
These are meant to hit the >1GB message size and hence test the large message case in Portals4. Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
Antonio Pena Monferrer authored
Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
Kenneth Raffenetti authored
All MPI_Sends in the Portals4 netmod will cause some or all of the data to be sent eagerly to the receiver. Canceling a send means having to find the data in the unexpected message queue and removing it in order to preserve matching. Because the message queues exist at the netmod level, it needs its own cancel protocol. The protocol is modeled on a similar case in CH3, but with its own method for searching the unexpected queue. Custom netmod packet handlers are used to receive and process the control messages. Known Issue: Because we are using different PTs for the send and cancel message, it is possible the cancel request could arrive before the message being canceled. Signed-off-by:
Antonio Pena Monferrer <apenya@mcs.anl.gov>
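The receiver-side step of such a cancel protocol, searching the unexpected queue and unlinking the matching entry, might look roughly like this. The structure, field names and function are hypothetical. The real netmod queue and its matching criteria are not shown in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unexpected-queue entry, identified by sender and send id. */
typedef struct sketch_msg {
    int sender;
    int send_id;
    struct sketch_msg *next;
} sketch_msg_t;

/* Remove the entry matching (sender, send_id) from the queue.
 * Returns 1 if it was found and removed (the cancel succeeds),
 * 0 if it was not found (already matched, or not yet arrived). */
static int sketch_cancel(sketch_msg_t **head, int sender, int send_id)
{
    assert(head != NULL);
    for (sketch_msg_t **link = head; *link != NULL; link = &(*link)->next) {
        sketch_msg_t *m = *link;
        if (m->sender == sender && m->send_id == send_id) {
            *link = m->next;   /* unlink without disturbing queue order */
            m->next = NULL;
            return 1;
        }
    }
    return 0;
}
```

In this sketch the "not found" result covers two different situations. The known issue noted in the commit is one of them: a cancel request that overtakes its message looks the same as a message that has already been matched.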
-
- 11 Nov, 2014 11 commits
-
-
Min Si authored
We should never change the ADI, which is exposed to the MPI layer, for the sake of the CH3 internal implementation. However, commit 3e005f03 changed the ADI of put/get/accumulate/get_accumulate in order to reuse the routines of normal RMA operations in request-based operations. This patch defines new CH3 internal functions for put/get/accumulate/get_accumulate to be reused by both normal and request-based operations, and reverts the ADI change in commit 3e005f03. Signed-off-by:
Xin Zhao <xinzhao3@illinois.edu> Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Signed-off-by:
Rob Latham <robl@mcs.anl.gov>
-
We already moved all functions from src/mpid/ch3/src/ch3u_rma_acc_ops.c to src/mpid/ch3/src/ch3u_rma_ops.c and removed the former from Makefile.mk; here we just delete the file. Signed-off-by:
Rob Latham <robl@mcs.anl.gov>
-
We already use window states to specify the current state in an RMA epoch, so the epoch states are no longer used. Here we delete those states. Signed-off-by:
Rob Latham <robl@mcs.anl.gov>
-
Signed-off-by:
Rob Latham <robl@mcs.anl.gov>
-
For lock type, we only need one internal value to specify cases when currently there is no passive lock issued from origin side or there is no passive lock imposed on target side. If there are passive locks, we directly use MPI_LOCK_SHARED and MPI_LOCK_EXCLUSIVE to indicate the lock type. This patch deletes redundant enum for lock types and just defines MPID_LOCK_NONE. Signed-off-by:
Rob Latham <robl@mcs.anl.gov>
-
It is helpful for us to find variables that are not initialized or wrongly initialized. Signed-off-by:
Rob Latham <robl@mcs.anl.gov>
-
MPIDI_RMA_NONE is the initial value of window state and should not be used with sync flag. The initial value of sync flag should be set to MPIDI_RMA_SYNC_NONE. Signed-off-by:
Rob Latham <robl@mcs.anl.gov>
-
Instead of overriding malloc functions, set some hook functions only when using netmod-IB.
-
Norio Yamaguchi authored
Corresponding to the implementations of RMA in the upper layer.
-
Norio Yamaguchi authored
-
- 10 Nov, 2014 3 commits
-
-
Xin Zhao authored
We mistakenly deleted the CVAR category 'CH3' when deleting unnecessary CVARs in RMA; here we add it back. Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Junchao Zhang authored
Without it, the code is broken in Intel's MPI build. Signed-off-by:
Antonio J. Pena <apenya@mcs.anl.gov>
-
Wesley Bland authored
No reviewer
-
- 08 Nov, 2014 2 commits
-
-
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
If --enable_strictmpi is passed to configure, we need to skip non-MPI-standard tests. Here is how to do that. Suppose you have an MPIX test foobar:
1) In Makefile.am, to skip building foobar, add
if BUILD_MPIX_TESTS
noinst_PROGRAMS += foobar
endif
Note: There is no tab indentation before noinst_PROGRAMS.
2) In testlist.in (please convert testlist to testlist.in if necessary), to skip running foobar, add
@mpix@ foobar 2
Signed-off-by:
Pavan Balaji <balaji@anl.gov>
-
- 07 Nov, 2014 6 commits
-
-
The expressions are wrong, e.g., [test "X$f77dir" = "f77"] should be [test "$f77dir" = "f77"]. Also, these vars are not used. So we just remove them. Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
-
Xin Zhao authored
Signed-off-by:
Min Si <msi@il.is.s.u-tokyo.ac.jp>
-
Xin Zhao authored
num_active_issued_win and num_passive_win are counters of windows in active ISSUED mode and in passive mode. They are modified in CH3 and are used in the progress engine of nemesis / sock to skip windows that do not need progress. Here we define them in mpidi_ch3_pre.h in nemesis / sock so that they can be exposed to upper layers. Signed-off-by:
Min Si <msi@il.is.s.u-tokyo.ac.jp>
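How a progress engine could use these counters to avoid needless work can be sketched as below. The two counter names come from the commit message. `sketch_make_rma_progress`, `sketch_progress_poke` and `rma_progress_calls` are hypothetical, and the sketch simplifies by gating the whole RMA progress call on the counters.

```c
#include <assert.h>

/* Counter names are from the commit message; in this sketch they are
 * plain globals updated by the (not shown) upper layer. */
static int num_active_issued_win = 0;
static int num_passive_win = 0;

/* Observes how often RMA progress was actually attempted. */
static int rma_progress_calls = 0;

/* Hypothetical stand-in for the routine that walks the windows. */
static void sketch_make_rma_progress(void)
{
    rma_progress_calls++;
}

/* One iteration of a progress loop: skip RMA work when no window
 * is in active ISSUED mode or in passive mode. */
static void sketch_progress_poke(void)
{
    assert(num_active_issued_win >= 0 && num_passive_win >= 0);
    if (num_active_issued_win > 0 || num_passive_win > 0) {
        sketch_make_rma_progress();
    }
}
```

Checking two integers is far cheaper than walking a window list, so the common case with no outstanding RMA work stays fast.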
-
Wesley Bland authored
Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
Some of the code to do the matching for requests in the posted queue was missing. This caused local collectives to hang if the communicator had been revoked. See #1945 Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
Set the counter for processes to be revoked before sending out the revoke notifications. Clean up some unused code. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
- 06 Nov, 2014 5 commits
-
-
Huiwei Lu authored
When the 'env' parameter is parsed the first time, an extra space is added at the front. When the script kicks off each test, it cannot interpret the value with this extra space and complains in the output: "not in a=b form". Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
Huiwei Lu authored
MPIR_CVAR_ENABLE_FT is added to enable/disable fault-tolerance-related code. For performance reasons, FT is disabled by default. This also changes the FT-related LMT RTS code to use this CVAR. Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
Huiwei Lu authored
For fault tolerance use, an RTS queue was added in [81b3911a] to track shm LMT RTS messages. However, the queue is global and static, which may not be scalable. This patch moves the RTS queue into struct MPIDI_CH3I_VC, making it VC specific as the lmt_queue is. It also changes the queue to use GENERIC_Q and the 'dev.next' field so it does not need to malloc additional space. Signed-off-by:
Wesley Bland <wbland@anl.gov>
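The idea of a per-VC queue linked through a field already present in each request, so that enqueueing needs no extra allocation, can be sketched as follows. All names here (`sketch_req_t`, `sketch_queue_t`, `sketch_lmt_vc_t` and the two functions) are hypothetical. This is not the GENERIC_Q macro set, whose definition is not shown in the commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request; 'next' plays the role of the 'dev.next' field. */
typedef struct sketch_req {
    int id;
    struct sketch_req *next;
} sketch_req_t;

typedef struct {
    sketch_req_t *head;
    sketch_req_t *tail;
} sketch_queue_t;

/* Hypothetical VC: each VC owns its own RTS queue instead of sharing
 * one global static queue. */
typedef struct {
    sketch_queue_t rts_queue;
} sketch_lmt_vc_t;

/* Append a request; links through the request itself, so no malloc. */
static void sketch_rts_enqueue(sketch_lmt_vc_t *vc, sketch_req_t *req)
{
    assert(vc != NULL && req != NULL);
    req->next = NULL;
    if (vc->rts_queue.tail != NULL) {
        vc->rts_queue.tail->next = req;
    } else {
        vc->rts_queue.head = req;
    }
    vc->rts_queue.tail = req;
}

/* Remove and return the oldest request, or NULL if the queue is empty. */
static sketch_req_t *sketch_rts_dequeue(sketch_lmt_vc_t *vc)
{
    assert(vc != NULL);
    sketch_req_t *req = vc->rts_queue.head;
    if (req != NULL) {
        vc->rts_queue.head = req->next;
        if (vc->rts_queue.head == NULL) {
            vc->rts_queue.tail = NULL;
        }
        req->next = NULL;
    }
    return req;
}
```

One consequence of the intrusive design is that a request can sit on only one such queue at a time, since it has a single link field.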
-
Kenneth Raffenetti authored
A recent testsuite update unveiled an issue when unpacking a large noncontiguous message. We need to ignore any previous segment manipulation when unpacking the beginning of the message. Signed-off-by:
Antonio J. Pena <apenya@mcs.anl.gov>
-
Xin Zhao authored
Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
-