1. 26 Feb, 2015 2 commits
    • Fix async progress problem in NBC I/O. · 6523ad97
      Sangmin Seo authored
      When the async progress thread blocked the progress engine and yielded
      control, if a thread started waiting inside a wait routine, e.g.,
      ADIOI_GEN_iwc_wait_fn, of NBC I/O implementation, a deadlock happened.
      The thread waiting continuously called MPI_Test to make progress, but
      the progress engine did not make progress because it was blocked due to
      the async progress thread.  The async progress thread tried to acquire
      the lock, but the waiting thread did not release the lock because it
      did not finish the wait routine.  Thus, it was a deadlock. This patch
      fixes this deadlock problem by forcing the waiting thread to yield if
      the progress engine has been blocked by another thread.
      Fixes #2202
      Signed-off-by: Rob Latham <robl@mcs.anl.gov>
    • Remove unused variable in alltoallw · 03b12134
      Wesley Bland authored and Rob Latham committed
      Signed-off-by: Rob Latham <robl@mcs.anl.gov>
  2. 25 Feb, 2015 2 commits
  3. 23 Feb, 2015 1 commit
  4. 18 Feb, 2015 1 commit
  5. 13 Feb, 2015 1 commit
  6. 05 Feb, 2015 1 commit
  7. 04 Feb, 2015 1 commit
  8. 03 Feb, 2015 1 commit
    • Changes the sticky lb/ub fields in resized types to 0, since the lb/ub · ebbaad2b
      Rajeev Thakur authored and Rob Latham committed
      set by type_create_resized are not sticky.
      Changes darray and subarray types to use type_create_resized instead
      of type_struct with explicit lb/ub, because explicit MPI_LB/MPI_UB
      have been removed from MPI in MPI-3 and they also cause other problems
      because they were defined to be sticky in MPI-1.
      Fixes type_create_struct, which was incorrectly setting lb and ub to
      true_lb and true_ub in the non-sticky case.
      Closes #2218
      Closes #2220
      Closes #2224
      Signed-off-by: Rob Latham <robl@mcs.anl.gov>
  9. 31 Jan, 2015 1 commit
  10. 30 Jan, 2015 4 commits
    • Correctly return errcode for NBC · 15262441
      Wesley Bland authored
      The error code set in the status was being ignored for NBC and one-sided
      requests (which wasn't right anyway so it didn't matter). This grabs the
      error code from the status now.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
    • Add MPIC_Issend · 6556f351
      Wesley Bland authored
      Part of converting the NBC code to use the MPIC_* functions requires
      an MPIC_Issend function to exist. This adds it.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
    • Refactor MPIC functions to use the MPID objects · 54362c00
      Wesley Bland authored
      The MPIC helper functions have been using MPI_Comm and MPI_Request
      objects instead of their MPID_* counterparts. This leads to a bunch of
      unnecessary conversions back and forth between the two types of objects
      and makes the work incompatible with other parts of the codebase
      (non-blocking collectives for instance).
      This patch converts all of the MPIC_* functions to use MPID_Comm and
      MPID_Request and changes all of the collective calls to use them now.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
    • Refactor MPIC_Wait to return an errflag · c04f0851
      Wesley Bland authored
      The collective helper functions generally have an errflag that is used
      when a failure is detected to allow the collective to continue while
      also communicating that a failure occurred. That flag is now included
      as a parameter for MPIC_Wait.
      The rest of this commit is the refactoring necessary in the rest of the
      helper functions to support the change.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
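The errflag convention described in the "Refactor MPIC_Wait to return an errflag" commit can be illustrated with a small standalone sketch. It is not MPICH code: `toy_step_wait` and `toy_collective` are hypothetical names, and the integer step results stand in for real communication. The point is only that a failure is recorded in a flag while the remaining steps still run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a wait helper: returns 0 on success and a
 * nonzero code on failure, and records any failure in *errflag. */
static int toy_step_wait(int step_result, bool *errflag)
{
    assert(errflag != NULL);
    if (step_result != 0)
        *errflag = true;   /* remember that something failed */
    return step_result;
}

/* Hypothetical collective: runs every step even after a failure, so peers
 * are not left waiting, then reports whether any step failed. */
static int toy_collective(const int *step_results, size_t nsteps,
                          size_t *steps_run)
{
    bool errflag = false;

    *steps_run = 0;
    for (size_t i = 0; i < nsteps; i++) {
        (void) toy_step_wait(step_results[i], &errflag);
        (*steps_run)++;
    }
    return errflag ? 1 : 0;
}
```

In this sketch the caller learns about the failure from the return value, while `steps_run` shows that the loop was not cut short.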
  11. 27 Jan, 2015 1 commit
  12. 23 Jan, 2015 2 commits
  13. 22 Jan, 2015 2 commits
    • FT: Fixes ref counts in shrink and agree · 93e816cc
      Huiwei Lu authored
      When a process fails, the fault tolerance scheme takes a different path
      to deal with MPI object reference counts than the existing one. Some
      reference counts were not properly set in the FT path, so when
      configured with --enable-g=all, some ft tests show leaked context IDs
      and dirty COMM, GROUP and REQUEST objects on exit.
      This patch fixes ft/shrink and ft/agree with "--enable-g=all". Stack
      allocated objects of requests, communicators and groups will be freed by
      Signed-off-by: Wesley Bland <wbland@anl.gov>
    • Fix for MPIX_COMM_AGREE to not return incorrect errors · a3dd5f40
      Wesley Bland authored
      MPIX_Comm_agree should not return errors if the failed processes have
      all been acknowledged. Previously, it was returning errors
      unnecessarily, but this makes sure that the errcode is MPI_SUCCESS when
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
  14. 15 Jan, 2015 1 commit
    • Refactor MPI_Testall to have an MPIR version · 4be54219
      Wesley Bland authored and Kenneth Raffenetti committed
      For some reason, there was no MPIR_Testall_impl as there is with many of
      the other MPI_* functions. This causes a linking problem when weak
      symbols are disabled and another MPI function needs to call MPI_*.
      This patch moves most of the MPI_Testall code into MPIR_Testall_impl and
      has MPI_Waitall call that function instead of MPI_Testall.
      Signed-off-by: Ken Raffenetti <raffenet@mcs.anl.gov>
  15. 14 Jan, 2015 2 commits
    • use PATH_MAX instead of magic number · a30a4721
      Rob Latham authored
      A user on the OpenMPI list wanted to create a file with a 259-character
      name.  The shared file pointer name construction used the magic value
      '256' to build the full path to the hidden shared file pointer file.
      PATH_MAX already exists for this purpose, so use it.
      While there, found a few spots checking/setting PATH_MAX, so do that in
      one place.
      Closes #2212
      Signed-off-by: Ken Raffenetti <raffenet@mcs.anl.gov>
    • make ADIOI_Shfp_fname report errors · ed39c901
      Rob Latham authored
      Right now there's only one error condition: file name too long.  This
      change checks return codes of ADIOI_Strncpy and informs caller.
      Otherwise, really long names result in buffer overruns.
      See #2212
      Signed-off-by: Ken Raffenetti <raffenet@mcs.anl.gov>
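The two shared-file-pointer commits above (sizing the buffer with PATH_MAX and reporting an over-long name to the caller) can be sketched together as follows. This is an illustration under assumptions, not the ROMIO implementation: `build_hidden_name` and the `"<dir>/.<base>.shfp"` layout are hypothetical, and `snprintf` is used here in place of ADIOI_Strncpy.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#ifndef PATH_MAX
#define PATH_MAX 4096   /* fallback; some platforms do not define PATH_MAX */
#endif

/* Hypothetical helper: writes "<dir>/.<base>.shfp" into out (outlen bytes).
 * Returns 0 on success, -1 if the full name would not fit. */
static int build_hidden_name(char *out, size_t outlen,
                             const char *dir, const char *base)
{
    assert(out != NULL && outlen > 0);

    int n = snprintf(out, outlen, "%s/.%s.shfp", dir, base);
    if (n < 0 || (size_t) n >= outlen)
        return -1;   /* encoding error or truncation: inform the caller */
    return 0;
}
```

A caller would typically declare `char name[PATH_MAX];` and treat a return of -1 as "file name too long" rather than continuing with a truncated path.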
  16. 13 Jan, 2015 1 commit
    • Remove ADI breakage introduced earlier · 6f646ca0
      Wesley Bland authored
      There was an accidental ADI breakage earlier when MPI level codes would
      query into the dev part of the MPID request object. This commit removes
      that breakage by adding a new macro into the mpiimpl.h file to portably
      check whether a request is anysource. For now, in pamid, this macro
      always evaluates to 0. This can easily be fixed by overwriting it in the
      pamid code, but since pamid doesn't support FT, it won't have any
      functional change either.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
  17. 12 Jan, 2015 7 commits
    • Change MPIDI_CH3I_Comm_AS_enabled to be MPID level · 8cbbcae4
      Wesley Bland authored
      This macro was used inside CH3 to determine if the communicator could be
      used for anysource communication. With the rewrite of the anysource
      fault tolerance logic, it is now necessary to use it at the MPI level.
      Because it is a macro and not a function, the macro is defined in
      mpiimpl.h as (1) and then overwritten in the ch3 device. Future devices
      can also overwrite it if desired.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
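The "default macro in the core header, overwritten by the device" arrangement described in the MPIDI_CH3I_Comm_AS_enabled commit can be sketched in one file. The names `TOY_COMM_AS_ENABLED`, `struct toy_comm` and `toy_can_post_anysource` are hypothetical, and the two "headers" are simulated by comments.

```c
#include <assert.h>
#include <stddef.h>

/* Core header (sketch): by default every communicator allows anysource. */
#ifndef TOY_COMM_AS_ENABLED
#define TOY_COMM_AS_ENABLED(comm) (1)
#endif

struct toy_comm {
    int anysource_enabled;   /* cleared by the device after a failure */
};

/* Device header (sketch): replace the default with a real per-comm check. */
#undef TOY_COMM_AS_ENABLED
#define TOY_COMM_AS_ENABLED(comm) ((comm)->anysource_enabled)

/* Upper-layer code only ever uses the macro, never the device field. */
static int toy_can_post_anysource(const struct toy_comm *comm)
{
    assert(comm != NULL);
    return TOY_COMM_AS_ENABLED(comm);
}
```

A device that does not redefine the macro keeps the constant default, which matches the behaviour the commit describes for devices without fault tolerance support.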
    • Handle anysource in blocking recv functions · 9f2db553
      Wesley Bland authored
      If a blocking recv function (MPI_Recv and MPI_Sendrecv) includes an
      MPI_ANY_SOURCE and there is a failure, handle it by cleaning up the
      request and returning MPIX_ERR_PROC_FAILED.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
    • Allow MPIR_Request_complete to take a NULL request · f6cdb3c8
      Wesley Bland authored
      If the first argument is NULL, don't try to set it to MPI_REQUEST_NULL.
      For blocking functions that want to complete the MPID_Request object,
      this allows them to reuse the code.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
    • Handle anysource in the wait* functions · 0418d495
      Wesley Bland authored
      If a wait operation involves an anysource, we need to first check to
      make sure that they haven't been disabled. If they have been, convert
      the wait* function to a test* function to prevent deadlocking inside the
      progress engine.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
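The wait-to-test conversion from the "Handle anysource in the wait* functions" commit can be modelled with a toy request that completes after a fixed number of polls. Everything here is hypothetical (`struct toy_req`, `toy_test`, `toy_wait`); a real progress engine is far more involved. The sketch only shows the control flow: when anysource is disabled, poll once and return instead of blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_req {
    bool anysource;        /* posted with a wildcard source */
    int  polls_remaining;  /* request completes when this reaches 0 */
};

/* One nonblocking progress poll; returns true if the request completed. */
static bool toy_test(struct toy_req *req)
{
    assert(req != NULL);
    if (req->polls_remaining > 0)
        req->polls_remaining--;
    return req->polls_remaining == 0;
}

/* Returns 0 when the request completed, 1 when it is still pending and the
 * caller has to deal with the failure before waiting again. */
static int toy_wait(struct toy_req *req, bool anysource_enabled)
{
    if (req->anysource && !anysource_enabled)
        return toy_test(req) ? 0 : 1;   /* poll once, never block */

    while (!toy_test(req))
        ;                               /* ordinary blocking wait */
    return 0;
}
```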
    • Break out of progress for anysource failures · 50d85e51
      Wesley Bland authored
      If a failure is detected, even if no request is actually complete, the
      completion counter will be incremented now as a way to give control back
      to the MPI layer to let it decide whether or not to continue.
      This gives the request completion functions a chance to see if they're
      waiting on an MPI_ANY_SOURCE request and if so, to return an error
      indicating that the completion function has a
      MPIX_ERR_PROC_FAILED_PENDING failure that the user needs to acknowledge.
      All of these functions should go into the progress engine at least once
      as a way to ensure that even if they will be returning an error, they'll
      at least give MPI a way to make progress and potentially still complete
      the request objects even if the user never acknowledges the failure.
      A follow on commit will add the functionality to keep the progress
      engine from getting stuck if a failure is discovered before entering the
      completion function.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
    • Strip out pending ANY_SOURCE request handling · 7a785c84
      Wesley Bland authored
      The existing way that we handle non-blocking requests involving wildcard
      receive operations is incorrect. We're cancelling request operations and
      trying to recreate them later. In the meantime, it's messing with
      matching and makes it possible (likely?) that some messages that arrive
      will never be matched. A new way of handling this is coming next.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
    • Don't free a request if it is still pending · a96ac72e
      Wesley Bland authored
      If we had a failure that caused a request to be pending, we were freeing
      the request before calling the error handler. That caused segfaults. Now
      we switch the ordering of the two to avoid that.
      This also moves the assignment of the status_ptr to be a little earlier
      to avoid another segfault.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
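The ordering fix in the "Don't free a request if it still pending" commit (call the error handler first, free the request afterwards) can be shown with a minimal sketch. `struct toy_request`, `toy_errhandler` and `toy_complete` are hypothetical and much simpler than the real request objects.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_request {
    int pending_error;   /* error recorded while the request was pending */
};

/* Hypothetical error handler: it inspects the request it is given, so the
 * request must still be valid when this runs. */
static int toy_errhandler(const struct toy_request *req)
{
    return req->pending_error;
}

/* Invoke the handler while the request is alive, then release it.
 * Returns the error the handler observed. */
static int toy_complete(struct toy_request **reqp)
{
    assert(reqp != NULL && *reqp != NULL);

    int err = toy_errhandler(*reqp);   /* 1. handler sees a valid request */
    free(*reqp);                       /* 2. only now release the memory  */
    *reqp = NULL;                      /* 3. leave no dangling pointer    */
    return err;
}
```

Swapping steps 1 and 2 would hand freed memory to the handler, which is the kind of use-after-free the commit message reports as a segfault.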
  18. 08 Jan, 2015 1 commit
  19. 19 Dec, 2014 1 commit
    • barrier in close whenever shared files supported · ef1cf141
      Paul Coffman authored and Rob Latham committed
      Currently in the MPI_File_close there is a barrier in place whenever the
      ADIO_SHARED_FP feature is enabled AND the ADIO_UNLINK_AFTER_CLOSE
      feature is disabled right before the code to close the shared file
      pointer and potentially unlink the shared file itself.  PE testing on
      GPFS revealed a situation using the non-collective
      where based on this implementation all tasks needed to wait for all
      other tasks to complete processing before unlinking the shared file
      pointer or the open of the shared file pointer could fail.  This
      situation is illustrated as follows with the simplest example of 2 tasks
      that do this:
      Both tasks call MPI_File_read_shared at the same time. Each first calls
      ADIO_Get_shared_fp, which opens the shared file pointer file in create
      mode. Only one task can actually create the file, so there is a race to
      create it first. If task 0 wins, it goes on to use the shared file
      pointer, reads the file, and then calls MPI_File_close, which unlinks
      the shared file pointer first and then closes the output file.
      Meanwhile, task 1 has lost the race and gets an error; the GPFS error
      handling takes effect and task 1 then simply tries to open the file that
      task 0 created. The problem is that this error handling took longer
      than task 0 took to read and close the output file. When task 0 does
      the close, it is the only process with a link, since task 1 is still in
      the create-file error handling code, so GPFS goes ahead and deletes the
      shared file pointer. When the error handling code for task 1 completes
      and it tries to do the open, the file is no longer there, so the open
      fails, as does the subsequent read of the shared file pointer.
      Currently GPFS has the ADIO_UNLINK_AFTER_CLOSE feature enabled, so the
      fix for this is to remove the additional condition of
      ADIO_UNLINK_AFTER_CLOSE being disabled for the barrier in the close to
      be done. Presumably this could be an issue for any parallel file system,
      so this change is being done in the common code.
      See ticket #2214
      Signed-off-by: Paul Coffman <pkcoff@us.ibm.com>
      Signed-off-by: Rob Latham <robl@mcs.anl.gov>
  20. 08 Dec, 2014 1 commit
  21. 05 Dec, 2014 1 commit
  22. 03 Dec, 2014 2 commits
    • Fix typo in error code man page · 8672503d
      Wesley Bland authored
      No reviewer
    • Fix error class bug in MPI_Error_add_code · 422b06d2
      James Dinan authored
      During error code creation, the error class was erroneously modified by
      applying ERROR_DYN_MASK.  The dynamic bit is already set for
      user-defined error classes, so this bug had no effect in any existing
      MPICH test.  However, when a predefined error class was passed during
      error code creation, it would be incorrectly marked as dynamic,
      resulting in an invalid result when the error class of a returned error
      code was queried via MPI_Error_class.
      Signed-off-by: Wesley Bland <wbland@anl.gov>
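The effect of the error class bug described above can be reproduced with a toy bit layout. The masks and shift below are assumptions made for this sketch and are not MPICH's real values; `toy_add_code_buggy`, `toy_add_code_fixed` and `toy_error_class` are hypothetical names. The sketch shows why user-defined classes hid the bug while predefined classes exposed it.

```c
#include <assert.h>

/* Hypothetical bit layout, for this sketch only. */
#define TOY_CLASS_MASK 0x0000007f   /* low bits hold the error class      */
#define TOY_DYN_MASK   0x40000000   /* set only for user-defined classes  */
#define TOY_CODE_SHIFT 8            /* per-class code index lives above   */

/* Buggy variant: always forces the dynamic bit onto the class. */
static int toy_add_code_buggy(int errclass, int index)
{
    assert(index >= 0 && index < 1024);
    return errclass | TOY_DYN_MASK | (index << TOY_CODE_SHIFT);
}

/* Fixed variant: leaves the class bits exactly as the caller passed them. */
static int toy_add_code_fixed(int errclass, int index)
{
    assert(index >= 0 && index < 1024);
    return errclass | (index << TOY_CODE_SHIFT);
}

/* Recover the class (including its dynamic bit) from an error code. */
static int toy_error_class(int errcode)
{
    return errcode & (TOY_CLASS_MASK | TOY_DYN_MASK);
}
```

With a user-defined class the dynamic bit is already present, so both variants agree; with a predefined class only the fixed variant round-trips.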
  23. 28 Nov, 2014 3 commits