1. 08 Feb, 2015 4 commits
    • Bug-fix: guarantee atomicity for FOP and GACC. · bad898f9
      Xin Zhao authored
      
      
      FOP, CAS and GACC are atomic "read-modify-write" operations,
      which means that when the target window is defined on a SHM region,
      we need an inter-process lock to guarantee the atomicity of the
      entire "read+OP". The current implementation is correct for
      SHM-based RMA operations but not for AM-based RMA operations:
      for SHM-based operations it protects the entire "read+OP",
      but for AM-based operations it protects only the "OP" part.
      
      This patch fixes this issue by protecting the memory copy into the
      temporary buffer and the computation together for AM-based operations.
      
      Fix ticket 2226
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
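The point of the fix above can be illustrated with a minimal sketch, which is not the MPICH code. It assumes a hypothetical `win_lock` spinlock standing in for the inter-process window lock and a hypothetical `fetch_and_add` helper standing in for an FOP with a sum operation. What matters is that the copy into the temporary buffer and the computation sit in the same critical section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the inter-process window lock. A real SHM window
 * would keep its lock inside the shared region; a C11 spinlock is used here
 * only to keep the sketch self-contained. */
static atomic_flag win_lock = ATOMIC_FLAG_INIT;

static void win_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&win_lock, memory_order_acquire)) {
        /* spin until the current holder releases the lock */
    }
}

static void win_lock_release(void)
{
    atomic_flag_clear_explicit(&win_lock, memory_order_release);
}

/* Fetch-and-add (an FOP with OP = sum). The copy of the old value into the
 * temporary buffer and the computation share one critical section, so no
 * other updater can modify *target between the "read" and the "OP". */
int fetch_and_add(int *target, int operand)
{
    int tmp;

    assert(target != NULL);
    win_lock_acquire();
    memcpy(&tmp, target, sizeof tmp);   /* "read": copy to temporary buffer */
    *target = tmp + operand;            /* "OP": compute and store */
    win_lock_release();

    return tmp;                         /* old value goes back to the origin */
}
```

If the lock were taken only around the second statement, two concurrent callers could both copy the same old value and one update would be lost, which is the AM-path problem the commit message describes.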
    • Bug-fix: making processes with SHM and without SHM win work correctly. · 8c5cb1e6
      Xin Zhao authored
      
      
      In commit 7d71278, if node_comm is NULL (the calling process is the
      only one on its node), we call allocate_no_shm() in CH3 to allocate the
      window. If node_comm is not NULL (more than one process is on the same
      node), we call allocate_shm() in Nemesis to allocate a SHM window.
      However, the amount of information exchanged in MPI_Allgather differs
      between allocate_no_shm() and allocate_shm(), which leads to incorrect
      execution when SHM and non-SHM windows coexist. This patch fixes this
      issue.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Delete unnecessary code in SHM allocate / free. · 346050ea
      Xin Zhao authored
      We allocate / free SHM regions only when node_comm exists,
      which means there is more than one process on the same
      node. When node_comm is NULL (the calling process is the only
      one on its node), we call the default allocate / free functions
      in CH3. (Please refer to commit f02eed5b.)
      
      Here we delete unnecessary code dealing with node_comm being
      NULL in SHM allocate / free functions.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  2. 05 Feb, 2015 1 commit
  3. 04 Feb, 2015 3 commits
  4. 03 Feb, 2015 2 commits
  5. 02 Feb, 2015 1 commit
    • Adds mpiimpl.h include to sock.c · 4d676d9b
      Wesley Bland authored
      This include was present but commented out. Normally, it wasn't needed,
      but to pick up the definition of MPIX_ERR_PROC_FAILED correctly, it
      needs to be there.
      
      No reviewer
  6. 31 Jan, 2015 1 commit
  7. 30 Jan, 2015 8 commits
  8. 27 Jan, 2015 2 commits
  9. 23 Jan, 2015 3 commits
  10. 22 Jan, 2015 2 commits
    • FT: Fixes ref counts in shrink and agree · 93e816cc
      Huiwei Lu authored
      
      
      When a process fails, the fault tolerance scheme takes a different path
      to deal with MPI object reference counts than the existing one. Some
      reference counts were not properly set in the FT path, so when configured
      with --enable-g=all, some ft tests report leaked context ids and dirty
      COMM, GROUP and REQUEST objects at exit.
      
      This patch fixes ft/shrink and ft/agree with "--enable-g=all". Stack
      allocated request, communicator and group objects will be freed by
      FT.
      Signed-off-by: Wesley Bland <wbland@anl.gov>
    • Fix for MPIX_COMM_AGREE to not return incorrect errors · a3dd5f40
      Wesley Bland authored
      
      
      MPIX_Comm_agree should not return errors if the failed processes have
      all been acknowledged. Previously, it returned errors unnecessarily;
      this change makes sure that the errcode is MPI_SUCCESS when
      appropriate.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
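The rule stated above can be sketched as a small predicate. This is only an illustration: the bitmask representation (one bit per rank, at most 32 ranks), the `agree_errcode` name, and the `RC_*` codes are all assumptions and do not come from the MPICH source.

```c
#include <stdint.h>

/* Hypothetical return codes standing in for MPI_SUCCESS and
 * MPIX_ERR_PROC_FAILED. */
enum { RC_SUCCESS = 0, RC_PROC_FAILED = 1 };

/* Bit i of failed_mask is set if rank i is known to have failed; bit i of
 * acked_mask is set if the user has acknowledged that failure. An error is
 * reported only when some failure is still unacknowledged. */
int agree_errcode(uint32_t failed_mask, uint32_t acked_mask)
{
    uint32_t unacked = failed_mask & ~acked_mask;

    return unacked == 0 ? RC_SUCCESS : RC_PROC_FAILED;
}
```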
  11. 16 Jan, 2015 1 commit
  12. 15 Jan, 2015 1 commit
    • Refactor MPI_Testall to have an MPIR version · 4be54219
      Wesley Bland authored and Kenneth Raffenetti committed
      
      
      For some reason, there was no MPIR_Testall_impl as there is for many of
      the other MPI_* functions. This causes a linking problem when weak
      symbols are disabled and another MPI function needs to call an MPI_*
      function.
      
      This patch moves most of the MPI_Testall code into MPIR_Testall_impl and
      has MPI_Waitall call that function instead of MPI_Testall.
      Signed-off-by: Ken Raffenetti <raffenet@mcs.anl.gov>
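The shape of this refactoring can be sketched as follows. All names here (`request_t`, `testall_impl`, `api_testall`, `api_waitall_poll`) are hypothetical stand-ins and not the MPICH signatures. The sketch only shows the pattern: the logic lives in an internal function, and each public entry point calls that function rather than another public symbol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request type: a request is done once `complete` is set. */
typedef struct {
    bool complete;
} request_t;

/* Internal implementation (the role MPIR_Testall_impl plays): reports
 * through *flag whether every request has completed. */
static int testall_impl(size_t count, const request_t reqs[], bool *flag)
{
    assert(flag != NULL);
    *flag = true;
    for (size_t i = 0; i < count; i++) {
        if (!reqs[i].complete) {
            *flag = false;
            break;
        }
    }
    return 0;
}

/* Public entry point: a thin wrapper around the internal function. */
int api_testall(size_t count, const request_t reqs[], bool *flag)
{
    return testall_impl(count, reqs, flag);
}

/* A second public entry point that needs the same check. It calls the
 * internal function, never api_testall, so it does not depend on how the
 * public symbols are exported. A real wait would loop over the progress
 * engine; this sketch performs a single polling step. */
bool api_waitall_poll(size_t count, const request_t reqs[])
{
    bool done;

    testall_impl(count, reqs, &done);
    return done;
}
```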
  13. 14 Jan, 2015 3 commits
  14. 13 Jan, 2015 2 commits
    • Remove ADI breakage introduced earlier · 6f646ca0
      Wesley Bland authored
      
      
      There was an accidental ADI breakage earlier when MPI-level code would
      query into the dev part of the MPID request object. This commit removes
      that breakage by adding a new macro to mpiimpl.h to portably
      check whether a request is anysource. For now, in pamid, this macro
      always evaluates to 0. This can easily be fixed by overriding it in the
      pamid code, but since pamid doesn't support FT, doing so would not cause
      any functional change.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
    • Move MPID_Comm_AS_enabled to MPID layer function · d1ab9e68
      Wesley Bland authored
      
      
      It was pointed out that putting this in a macro that fails silently
      when unimplemented makes things challenging for derivatives that
      will implement this function in the future. Moving it to an MPID-level
      function makes it more obvious that the function should be
      implemented later.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
  15. 12 Jan, 2015 6 commits
    • Change MPIDI_CH3I_Comm_AS_enabled to be MPID level · 8cbbcae4
      Wesley Bland authored
      
      
      This macro was used inside CH3 to determine if the communicator could be
      used for anysource communication. With the rewrite of the anysource
      fault tolerance logic, it is now necessary to use it at the MPI level.
      Because it is a macro and not a function, the macro is defined in
      mpiimpl.h as (1) and then overwritten in the ch3 device. Future devices
      can also overwrite it if desired.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
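The default-then-override arrangement described above might look roughly like the sketch below. The `comm_t` type, its `anysource_enabled` field, the `COMM_AS_ENABLED` macro name, and `can_post_anysource` are assumptions made for illustration. In a real tree the two macro definitions would live in separate headers (generic layer and device) rather than in one file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical communicator with a device-maintained flag. */
typedef struct {
    bool anysource_enabled;
} comm_t;

/* Generic layer: default definition, anysource is always allowed. */
#define COMM_AS_ENABLED(comm) (1)

/* Device layer: replace the default with a device-specific check. */
#undef COMM_AS_ENABLED
#define COMM_AS_ENABLED(comm) ((comm)->anysource_enabled)

/* Upper-layer code uses the macro without knowing which definition won. */
bool can_post_anysource(const comm_t *comm)
{
    assert(comm != NULL);
    return COMM_AS_ENABLED(comm);
}
```

A device that never redefines the macro silently keeps the `(1)` default, which is the drawback raised in the later commit that turns this check into a function.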
    • Handle anysource in blocking recv functions · 9f2db553
      Wesley Bland authored
      
      
      If a blocking recv function (MPI_Recv or MPI_Sendrecv) uses
      MPI_ANY_SOURCE and there is a failure, handle it by cleaning up the
      request and returning MPIX_ERR_PROC_FAILED.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
    • Make anysource test more accurate · 648cd48c
      Wesley Bland authored
      
      
      Test for the specific error code so it doesn't accidentally catch
      MPI_ERR_OTHER.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
    • Allow MPIR_Request_complete to take a NULL request · f6cdb3c8
      Wesley Bland authored
      
      
      If the first argument is NULL, don't try to set it to MPI_REQUEST_NULL.
      For blocking functions that want to complete the MPID_Request object,
      this allows them to reuse the code.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
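A minimal sketch of the NULL-tolerant pattern described above follows. It is not the actual MPIR_Request_complete, whose signature is not shown here. The `handle_t` type, `HANDLE_NULL`, `request_obj_t`, and its `released` field are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int handle_t;        /* hypothetical user-visible request handle */
#define HANDLE_NULL (-1)     /* hypothetical stand-in for MPI_REQUEST_NULL */

/* Hypothetical internal request object. */
typedef struct {
    bool released;
} request_obj_t;

/* Complete a request. Nonblocking callers pass the user's handle so it can
 * be reset; blocking callers have no user handle and pass NULL. */
void request_complete(handle_t *handle, request_obj_t *obj)
{
    assert(obj != NULL);
    if (handle != NULL)
        *handle = HANDLE_NULL;   /* only touch the handle when there is one */
    obj->released = true;        /* shared cleanup runs in both cases */
}
```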
    • Handle anysource in the wait* functions · 0418d495
      Wesley Bland authored
      
      
      If a wait operation involves an anysource request, we first need to
      check that anysource operations haven't been disabled. If they have
      been, convert the wait* function to a test* function to prevent
      deadlocking inside the progress engine.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
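The wait-to-test conversion can be sketched as below. Everything in the block is an assumption made for illustration: the `request_t` layout, the `progress_fn` callback standing in for the progress engine, the two sample progress functions, and the `RC_*` return codes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes. */
enum { RC_SUCCESS = 0, RC_PENDING = 1 };

typedef struct {
    bool complete;    /* set once the request has finished */
    bool anysource;   /* request was posted with a wildcard source */
} request_t;

/* Hypothetical progress hook: one pass of the progress engine. */
typedef void (*progress_fn)(request_t *req);

/* Sample hooks: one that never completes anything (the only possible
 * sender has failed) and one that delivers the message. */
void progress_stalled(request_t *req) { (void)req; }
void progress_delivers(request_t *req) { req->complete = true; }

int wait_request(request_t *req, bool anysource_enabled, progress_fn progress)
{
    assert(req != NULL && progress != NULL);

    if (req->anysource && !anysource_enabled) {
        /* Test semantics: a single progress pass, then report back
         * instead of blocking on a request that may never complete. */
        progress(req);
        return req->complete ? RC_SUCCESS : RC_PENDING;
    }

    /* Ordinary wait semantics: block until the request completes. */
    while (!req->complete)
        progress(req);
    return RC_SUCCESS;
}
```

With `progress_stalled` and anysource disabled, the call returns `RC_PENDING` after one pass. Taking the blocking branch with the same hook would spin forever, which is the deadlock the commit avoids.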
    • Break out of progress for anysource failures · 50d85e51
      Wesley Bland authored
      
      
      If a failure is detected, even if no request is actually complete, the
      completion counter will be incremented now as a way to give control back
      to the MPI layer to let it decide whether or not to continue.
      
      This gives the request completion functions a chance to see if they're
      waiting on an MPI_ANY_SOURCE request and, if so, to return an
      MPIX_ERR_PROC_FAILED_PENDING error indicating a failure that the
      user needs to acknowledge.
      
      All of these functions should go into the progress engine at least once
      as a way to ensure that even if they will be returning an error, they'll
      at least give MPI a way to make progress and potentially still complete
      the request objects even if the user never acknowledges the failure.
      
      A follow-on commit will add the functionality to keep the progress
      engine from getting stuck if a failure is discovered before entering the
      completion function.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
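The completion-counter mechanism described above can be illustrated with a toy progress loop. The `event_t` values, the `progress_t` structure, and both function names are hypothetical. The sketch replaces the real progress engine with a fixed array of events, one per pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one progress pass. */
typedef enum { EV_NONE, EV_REQUEST_DONE, EV_PROC_FAILED } event_t;

typedef struct {
    unsigned completion_count;
} progress_t;

/* One progress pass: the counter is bumped when a request completes and
 * also when a process failure is detected, even though no request has
 * completed in the second case. */
static void progress_pass(progress_t *p, event_t ev)
{
    if (ev == EV_REQUEST_DONE || ev == EV_PROC_FAILED)
        p->completion_count++;
}

/* Blocking loop: keep making progress until the counter changes, then
 * hand control back to the caller so it can inspect its requests.
 * Returns the number of passes consumed. */
size_t progress_wait(progress_t *p, const event_t events[], size_t n)
{
    assert(p != NULL);

    unsigned start = p->completion_count;
    size_t i = 0;

    while (i < n && p->completion_count == start)
        progress_pass(p, events[i++]);
    return i;
}
```

Because a failure bumps the counter, the loop returns after the pass that detects the failure. The caller can then check whether it is waiting on a wildcard receive and decide whether to report an error or re-enter the loop.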