- 30 Jan, 2015 2 commits
-
-
Wesley Bland authored
The MPIC helper functions have been using MPI_Comm and MPI_Request objects instead of their MPID_* counterparts. This leads to a bunch of unnecessary conversions back and forth between the two types of objects and makes the work incompatible with other parts of the codebase (non-blocking collectives for instance). This patch converts all of the MPIC_* functions to use MPID_Comm and MPID_Request and changes all of the collective calls to use them now too. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
The collective helper functions generally have an errflag that is used when a failure is detected to allow the collective to continue while also communicating that a failure occurred. That flag is now included as a parameter for MPIC_Wait. The rest of this commit is the refactoring necessary in the rest of the helper functions to support the change. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
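The errflag idea described above can be sketched in isolation. This is a minimal illustration, not MPICH code: `sketch_mpic_wait`, `sketch_collective`, and the `SKETCH_*` codes are hypothetical names, and the real MPIC_Wait signature is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not MPICH's real codes or types. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_PROC_FAILED = 1 };

/* Waits on one step of a collective. A failed step sets *errflag
 * instead of aborting, so the caller can finish the remaining steps. */
static int sketch_mpic_wait(int step_result, int *errflag)
{
    assert(errflag != NULL);
    if (step_result != SKETCH_SUCCESS)
        *errflag = 1;
    return SKETCH_SUCCESS;
}

/* Runs every step, then reports whether any of them failed. */
static int sketch_collective(const int *step_results, size_t nsteps)
{
    int errflag = 0;
    for (size_t i = 0; i < nsteps; i++)
        (void) sketch_mpic_wait(step_results[i], &errflag);
    return errflag ? SKETCH_ERR_PROC_FAILED : SKETCH_SUCCESS;
}
```

Passing the flag by pointer lets each helper record a failure without changing its own return path, which is why the flag has to be threaded through every helper that can observe one.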
-
- 27 Jan, 2015 2 commits
-
-
Jithin Jose authored
Signed-off-by:
Charles J Archer <charles.j.archer@intel.com>
-
Kenneth Raffenetti authored
The tag for send was ignored and recvtag incorrectly used in its place. Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
- 23 Jan, 2015 3 commits
-
-
HDF5 folks reported a bug with ROMIO and one of their slightly strange (but 100% legal) datatypes. git-bisect points to the "promote size of length" change. Seems that MPICH does not like struct datatypes with zero-count elements? Further investigation required. This change (construct a simpler datatype in more cases) is sufficient to help HDF5 move forward. See #2221 Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
-
Many places where a 64-bit value is stored in a 32-bit variable. Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
-
  - Bump up subtypes from 3 to 6. The limit is arbitrary. I am trying to figure out a type with 4 sub-types.
  - Split up indexed/hindexed lists onto separate lines. MPICH debug output format adds its own newlines, but we have to clean out MPICH's extra debug output anyway: joining a few lines isn't that much more work.
  - Output a name of the digraph that graphviz can actually parse.
Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
-
- 22 Jan, 2015 2 commits
-
-
Huiwei Lu authored
When a process fails, the fault tolerance scheme takes a different path for handling MPI object reference counts than the existing one. Some reference counts were not set properly in the FT path, so when configured with --enable-g=all, some FT tests reported leaked context IDs and dirty COMM, GROUP, and REQUEST objects at exit. This patch fixes ft/shrink and ft/agree with "--enable-g=all". Stack-allocated request, communicator, and group objects will be freed by FT. Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
Wesley Bland authored
MPIX_Comm_agree should not return errors if the failed processes have all been acknowledged. Previously, it was returning errors unnecessarily, but this makes sure that the errcode is MPI_SUCCESS when appropriate. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
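The rule described above (no error once every failed process has been acknowledged) can be sketched with bitmasks. This is a toy model under stated assumptions: ranks are limited to 64, the acknowledged set is assumed to be a subset of the failed set, and `sketch_agree_errcode` and the `SKETCH_*` codes are hypothetical, not MPICH's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for MPI_SUCCESS and MPIX_ERR_PROC_FAILED. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_PROC_FAILED = 1 };

/* Bit i of each mask refers to rank i (at most 64 ranks in this sketch). */
static int sketch_agree_errcode(uint64_t failed, uint64_t acknowledged)
{
    /* Assumption: only ranks that actually failed can be acknowledged. */
    assert((acknowledged & ~failed) == 0);

    uint64_t unacknowledged = failed & ~acknowledged;
    return unacknowledged ? SKETCH_ERR_PROC_FAILED : SKETCH_SUCCESS;
}
```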
-
- 16 Jan, 2015 1 commit
-
-
Su Huang authored
The segfault was caused by the library trying to free an already freed mpid_statp structure. The structure is freed right after the status information is printed. To fix the problem, the mpid_statp is set to NULL after the free is done. (ibm) D202018 Signed-off-by:
Sameh Sharkawi <sssharka@us.ibm.com>
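The fix pattern described above, clearing the pointer right after the free, can be shown in a few lines. The struct and function names below are hypothetical; only the name `mpid_statp` comes from the commit message, and its real type is not shown here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the statistics structure. */
struct sketch_stat { long msgs_sent; };

static struct sketch_stat *sketch_statp = NULL;

static void sketch_stat_init(void)
{
    sketch_statp = calloc(1, sizeof *sketch_statp);
    assert(sketch_statp != NULL);
}

/* Safe to call more than once: after the first call the pointer is
 * NULL, and free(NULL) is defined to do nothing. */
static void sketch_stat_release(void)
{
    free(sketch_statp);
    sketch_statp = NULL;
}
```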
-
- 15 Jan, 2015 1 commit
-
-
For some reason, there was no MPIR_Testall_impl as there is with many of the other MPI_* functions. This causes a linking problem when weak symbols are disabled and another MPI function needs to call MPI_*. This patch moves most of the MPI_Testall code into MPIR_Testall_impl and has MPI_Waitall call that function instead of MPI_Testall. Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
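The layering described above can be sketched without MPI: one internal function holds the logic, and both public entry points call it rather than calling each other. All names here are hypothetical, requests are modeled as an array of completion flags, and "progress" is simulated by completing one pending request per pass.

```c
#include <assert.h>
#include <stddef.h>

/* Internal implementation: sets *flag to 1 only if every request is done. */
static int sketch_testall_impl(const int *done, size_t n, int *flag)
{
    assert(flag != NULL);
    *flag = 1;
    for (size_t i = 0; i < n; i++) {
        if (!done[i]) {
            *flag = 0;
            break;
        }
    }
    return 0;
}

/* Public test-style wrapper: a thin layer over the implementation. */
static int sketch_testall(const int *done, size_t n, int *flag)
{
    return sketch_testall_impl(done, n, flag);
}

/* Public wait-style wrapper: calls the implementation directly, never
 * the public wrapper, so it does not depend on that symbol at link time. */
static int sketch_waitall(int *done, size_t n)
{
    int flag = 0;
    for (;;) {
        sketch_testall_impl(done, n, &flag);
        if (flag)
            return 0;
        /* Simulated progress: complete the first pending request. */
        for (size_t i = 0; i < n; i++) {
            if (!done[i]) {
                done[i] = 1;
                break;
            }
        }
    }
}
```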
-
- 14 Jan, 2015 3 commits
-
-
Rob Latham authored
A user on the OpenMPI list wanted to create a file with a 259-character name. The shared file pointer name construction used the magic value '256' to build the full path to the hidden shared file pointer file. PATH_MAX already exists for this purpose, so use it. While there, found a few spots checking/setting PATH_MAX, so do that in one place. Closes #2212 Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
-
Rob Latham authored
Right now there's only one error condition: file name too long. This change checks return codes of ADIOI_Strncpy and informs caller. Otherwise, really long names result in buffer overruns. See #2212 Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
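A bounded copy that reports truncation to its caller, sized by PATH_MAX rather than a magic number, might look like the sketch below. `sketch_strncpy` and `sketch_copy_path` are hypothetical helpers written for illustration; they do not reproduce ADIOI_Strncpy, and the 4096 fallback is an assumption for platforms where `<limits.h>` does not define PATH_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#ifndef PATH_MAX
#define PATH_MAX 4096   /* fallback assumption, see lead-in */
#endif

/* Copies src into dest, which holds n bytes. Returns 0 on success and
 * 1 if src (plus its terminating NUL) does not fit; dest is untouched
 * in that case. */
static int sketch_strncpy(char *dest, const char *src, size_t n)
{
    assert(dest != NULL && src != NULL);
    size_t len = strlen(src);
    if (len >= n)
        return 1;
    memcpy(dest, src, len + 1);
    return 0;
}

/* Caller-side pattern: size the buffer with PATH_MAX and propagate
 * the error instead of ignoring it. */
static int sketch_copy_path(char out[PATH_MAX], const char *name)
{
    return sketch_strncpy(out, name, PATH_MAX);
}
```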
-
Charles J Archer authored
Compile-time fix required for OFI threading model. No semantic changes. Signed-off-by:
Yohann Burette <yohann.burette@intel.com>
-
- 13 Jan, 2015 2 commits
-
-
Wesley Bland authored
There was an accidental ADI breakage earlier when MPI level codes would query into the dev part of the MPID request object. This commit removes that breakage by adding a new macro into the mpiimpl.h file to portably check whether a request is anysource. For now, in pamid, this macro always evaluates to 0. This can easily be fixed by overwriting it in the pamid code, but since pamid doesn't support FT, it won't have any functional change either. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
It was pointed out that putting this in a macro that fails silently when unimplemented makes things challenging for derivatives that will implement this function in the future. By moving this to an MPID level function, it becomes more obvious that the function should be implemented later. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
- 12 Jan, 2015 9 commits
-
-
Wesley Bland authored
This macro was used inside CH3 to determine if the communicator could be used for anysource communication. With the rewrite of the anysource fault tolerance logic, it is now necessary to use it at the MPI level. Because it is a macro and not a function, the macro is defined in mpiimpl.h as (1) and then overwritten in the ch3 device. Future devices can also overwrite it if desired. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
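The default-then-override arrangement described above can be sketched in a single file. The macro name `SKETCH_COMM_AS_ENABLED`, the struct, and its field are hypothetical; in the real tree the default and the override would live in separate headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical communicator with a per-communicator anysource switch. */
struct sketch_comm { int anysource_enabled; };

/* Generic layer: the default says anysource is always usable. */
#define SKETCH_COMM_AS_ENABLED(comm) (1)

/* Device layer: replaces the default with a real check. */
#undef SKETCH_COMM_AS_ENABLED
#define SKETCH_COMM_AS_ENABLED(comm) ((comm)->anysource_enabled)

/* MPI-level code uses the macro without knowing which definition won. */
static int sketch_can_post_anysource(const struct sketch_comm *comm)
{
    assert(comm != NULL);
    return SKETCH_COMM_AS_ENABLED(comm);
}
```

A device that never overrides the macro silently gets the constant, which is the drawback the 13 Jan follow-up commit addresses by moving to an MPID level function.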
-
Wesley Bland authored
If a blocking recv function (MPI_Recv and MPI_Sendrecv) includes an MPI_ANY_SOURCE and there is a failure, handle it by cleaning up the request and returning MPIX_ERR_PROC_FAILED. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
Test for the specific error code so it doesn't accidentally catch MPI_ERR_OTHER. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
If the first argument is NULL, don't try to set it to MPI_REQUEST_NULL. For blocking functions that want to complete the MPID_Request object, this allows them to reuse the code. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
If a wait operation involves an anysource request, we first need to check that anysource operations haven't been disabled. If they have been, convert the wait* function to a test* function to prevent deadlocking inside the progress engine. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
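A toy model of the wait-to-test conversion follows. Everything in it is hypothetical: the request type, `sketch_poll`, `sketch_wait_or_test`, and the `SKETCH_*` codes are stand-ins, and completion is simulated with a countdown so that the blocking path terminates here, whereas a real blocking wait on a failed sender would not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_PENDING = 1 };

struct sketch_req {
    int is_anysource;
    int polls_left;     /* simulated: polls needed before completion */
};

/* One non-blocking pass of the progress engine; returns 1 when done. */
static int sketch_poll(struct sketch_req *req)
{
    if (req->polls_left > 0)
        req->polls_left--;
    return req->polls_left == 0;
}

static int sketch_wait_or_test(struct sketch_req *req, int anysource_enabled)
{
    assert(req != NULL);
    if (req->is_anysource && !anysource_enabled) {
        /* Test-style path: poll once and return instead of blocking. */
        return sketch_poll(req) ? SKETCH_SUCCESS : SKETCH_ERR_PENDING;
    }
    /* Wait-style path: block until the request completes. */
    while (!sketch_poll(req))
        ;
    return SKETCH_SUCCESS;
}
```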
-
Wesley Bland authored
If a failure is detected, even if no request is actually complete, the completion counter will now be incremented as a way to give control back to the MPI layer and let it decide whether or not to continue. This gives the request completion functions a chance to see if they're waiting on an MPI_ANY_SOURCE request and, if so, to return an error indicating that the completion function has an MPIX_ERR_PROC_FAILED_PENDING failure that the user needs to acknowledge. All of these functions should go into the progress engine at least once to ensure that, even if they will be returning an error, they give MPI a way to make progress and potentially still complete the request objects even if the user never acknowledges the failure. A follow-on commit will add the functionality to keep the progress engine from getting stuck if a failure is discovered before entering the completion function. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
The existing way that we handle non-blocking requests involving wildcard receive operations is incorrect. We're cancelling request operations and trying to recreate them later. In the meantime, it's messing with matching and makes it possible (likely?) that some messages that arrive will never be matched. A new way of handling this is coming next. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Wesley Bland authored
If we had a failure that caused a request to be pending, we were freeing the request before calling the error handler. That caused segfaults. Now we switch the ordering of the two to avoid that. This also moves the assignment of the status_ptr to be a little earlier to avoid another segfault. Signed-off-by:
Huiwei Lu <huiweilu@mcs.anl.gov>
-
Kenneth Raffenetti authored
CH3 ensures that self communication does not go through the netmod, so there is no need for a process to pause/unpause itself. Signed-off-by:
Antonio J. Pena <apenya@mcs.anl.gov>
-
- 08 Jan, 2015 2 commits
-
-
Su Huang authored
Signed-off-by:
Sameh Sharkawi <sssharka@us.ibm.com>
-
OpenMPI uses 'make dist', but MPICH does not. Some recently added (internal) header files were not listed in ROMIO's noinst declaration. Note: RobL combined and edited these OpenMPI patches into this patch: e0927895db8d, 84c41429e9ac. Signed-off-by:
Rob Latham <robl@mcs.anl.gov>
-
- 07 Jan, 2015 2 commits
-
-
Kenneth Raffenetti authored
Adding FCMODOUTFLAG directly to AM_FCFLAGS could cause conflicts with certain libtool flags (-module) during linking. This change allows us to set FCMODOUTFLAG during module creation, but not have it present during linking. Refs #2024 Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
Kenneth Raffenetti authored
Recent versions of ifort on darwin will drop flags intended for the linker unless they are prefixed with "-Wl,". Jeff Hammond checked with the Intel compiler folks, and they confirmed that "-Wl," has been supported since the initial ifort release on OSX (9.1). Closes #2024 Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
- 06 Jan, 2015 1 commit
-
-
Kenneth Raffenetti authored
Previous re-organization of the library symbols resulted in a situation where Fortran programs could no longer be profiled using tools written in C. Functions in libmpifort directly called the PMPI_* versions in libmpi. Now we always call the MPI_* versions from libmpifort. In the case where we are building a separate profiling library, we use a new preprocessor flag to ensure we call PMPI_* from inside libpmpi. Additional bug fix: always define mpi_conversion_fn_null_, since there is no PMPI version. Fixes #2209 Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-
- 05 Jan, 2015 6 commits
-
-
William Gropp authored
Adds a way to pass a timelimit argument to the run command, as long as the timelimit is in seconds. This is enough for some of the MPICH versions of mpiexec and for recent versions of the Cray aprun command. Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
William Gropp authored
Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
William Gropp authored
Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
William Gropp authored
Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
William Gropp authored
Signed-off-by:
Wesley Bland <wbland@anl.gov>
-
Instead of using its own versioning system that wasn't getting updated with any regularity, now the test suite will use the same versioning scheme as mainline MPICH. This is consistent with other parts of MPICH that get distributed separately (MPL, ROMIO, Hydra). Signed-off-by:
Ken Raffenetti <raffenet@mcs.anl.gov>
-
- 04 Jan, 2015 2 commits
-
-
Squashes a warning when using the embedded versions of OPA and MPL. Signed-off-by:
Sangmin Seo <sseo@anl.gov>
-
We were incorrectly adding the build directories for mpl and opa to external_ldflags in Makefile.am, causing them to be listed in the installed libmpi.la libtool file. If a linker does not handle this potentially nonexistent build directory gracefully, it could cause an issue. Since the mpl and opa libraries are now embedded in libmpi by default, we simply eliminate the flags unless we are using pre-built, external libraries. Fixes #2208 Thanks to Markus Geimer for the bug report and suggested solution. Signed-off-by:
Sangmin Seo <sseo@anl.gov>
-
- 19 Dec, 2014 1 commit
-
-
Currently in MPI_File_close there is a barrier in place whenever the ADIO_SHARED_FP feature is enabled AND the ADIO_UNLINK_AFTER_CLOSE feature is disabled, right before the code that closes the shared file pointer and potentially unlinks the shared file itself. PE testing on GPFS revealed a situation using the non-collective MPI_File_read_shared/MPI_File_write_shared where, based on this implementation, all tasks needed to wait for all other tasks to complete processing before unlinking the shared file pointer, or the open of the shared file pointer could fail. This situation is illustrated with the simplest example of 2 tasks that each call: MPI_File_open, MPI_File_set_view, MPI_File_read_shared, MPI_File_close.
Both tasks call MPI_File_read_shared at the same time, which first calls ADIO_Get_shared_fp, which opens the shared file pointer file in create mode. Only 1 task can actually create the file, so there is a race to see which task gets it done first. If task 0 creates it, it wins the race and goes on to use it, reads the file, and then calls MPI_File_close, which unlinks the shared file pointer first and then closes the output file. Meanwhile, task 1 lost the race to create the file and is in error; the error handling in GPFS goes into effect and task 1 now just tries to open the file that task 0 created. The problem is that this error handling took longer than task 0 took to read and close the output file, so when task 0 does the close it is the only process with a link, since task 1 is still in the create-file error handling code, and GPFS therefore goes ahead and deletes the shared file pointer. When the error handling code for task 1 completes and it tries to do the open, the file is no longer there, so the open fails, as does the subsequent read of the shared file pointer.
Currently GPFS has the ADIO_UNLINK_AFTER_CLOSE feature enabled, so the fix is to remove the additional condition that ADIO_UNLINK_AFTER_CLOSE be disabled for the barrier in the close to be done. Presumably this could be an issue for any parallel file system, so this change is being made in the common code. See ticket #2214 Signed-off-by:
Paul Coffman <pkcoff@us.ibm.com> Signed-off-by:
Rob Latham <robl@mcs.anl.gov>
-
- 18 Dec, 2014 1 commit
-
-
Kenneth Raffenetti authored
Signed-off-by:
Junchao Zhang <jczhang@mcs.anl.gov>
-