1. 12 Nov, 2014 1 commit
  2. 31 Jul, 2014 3 commits
    • Add MPIX_Comm_agree · 1f0ee136
      Wesley Bland authored
      
      
      Adds a function that implements an agreement algorithm for the user. It lets
      the user perform an agreement manually and detect unacknowledged failures.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
    • Add MPIX_Comm_shrink functionality · 5be10ce9
      Wesley Bland authored
      
      
      This adds a new function, MPIX_COMM_SHRINK, a communicator creation function
      that creates a new communicator from an existing one, excluding any failed
      processes.
      
      As part of the operation, the shrink call needs to perform an agreement to
      determine the group of failed processes. This is done using the algorithm
      published by Hursey et al. in their EuroMPI '12 paper.
      
      The list of failed processes is collected using a bit array. A few new
      functions in the CH3 layer create a bit array, send it to the master process,
      and receive an updated bit array in return. This implementation is not very
      scalable yet, but a better one can be plugged in later to replace the naïve
      version. This is also a use case for an MPI_Recv_reduce, for future reference.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
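
      The bit-array collection described in this commit can be sketched roughly as
      follows. This is a minimal, MPI-free illustration, not the CH3 code from the
      commit: every name here (bitarray_create, bitarray_set, bitarray_test,
      bitarray_merge, bitarray_count) is hypothetical, and the "master" step is
      assumed to be a plain bitwise OR of the arrays it receives.

      ```c
      #include <assert.h>
      #include <limits.h>
      #include <stdlib.h>

      /* Hypothetical helpers; one bit per rank, set when that rank has failed. */
      #define WORD_BITS (sizeof(unsigned int) * CHAR_BIT)

      /* Number of words needed to hold one bit for each of nprocs ranks. */
      static size_t bitarray_words(int nprocs) {
          assert(nprocs > 0);
          return ((size_t)nprocs + WORD_BITS - 1) / WORD_BITS;
      }

      /* Allocate a zeroed bit array (no failures known yet). May return NULL. */
      static unsigned int *bitarray_create(int nprocs) {
          return calloc(bitarray_words(nprocs), sizeof(unsigned int));
      }

      /* Mark `rank` as failed. */
      static void bitarray_set(unsigned int *bits, int rank) {
          assert(rank >= 0);
          bits[(size_t)rank / WORD_BITS] |= 1u << ((size_t)rank % WORD_BITS);
      }

      /* Return 1 if `rank` is marked as failed, 0 otherwise. */
      static int bitarray_test(const unsigned int *bits, int rank) {
          assert(rank >= 0);
          return (int)((bits[(size_t)rank / WORD_BITS] >> ((size_t)rank % WORD_BITS)) & 1u);
      }

      /* Assumed master-side step: OR a received array into the accumulated one. */
      static void bitarray_merge(unsigned int *acc, const unsigned int *recvd,
                                 int nprocs) {
          size_t n = bitarray_words(nprocs);
          for (size_t i = 0; i < n; i++) {
              acc[i] |= recvd[i];
          }
      }

      /* Count how many ranks are marked as failed. */
      static int bitarray_count(const unsigned int *bits, int nprocs) {
          int count = 0;
          for (int r = 0; r < nprocs; r++) {
              count += bitarray_test(bits, r);
          }
          return count;
      }
      ```

      In this sketch each process would fill in its own array with bitarray_set,
      the master would fold the arrays together with bitarray_merge, and the merged
      array would then be sent back so every process sees the same failed group.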
    • Add MPI_Comm_revoke · 57f6ee88
      Wesley Bland authored
      
      
      MPI_Comm_revoke is a special function because it does not have a matching call
      on the "receiving side": it has to act as an out-of-band, resilient broadcast.
      For that reason, in addition to the usual functions that implement MPI
      communication calls (MPI/MPID/CH3/etc.), this commit adds a new CH3 packet type
      that handles revoking a communicator without involving a matching call from
      the MPI layer (similar to how RMA is currently implemented).
      
      The part of revoking a communicator that needs the most care is ensuring that
      a previously used context ID is eventually returned to the pool of available
      context IDs and that, once it is, no old messages can match the new use of
      that context ID (for instance, messages that are very slow and show up late).
      To accomplish this, revoke is implemented as an all-to-all algorithm. When one
      process calls revoke, it sends a message to every other process in the
      communicator, and each recipient in turn sends a message to every other
      process, and so on. A process that has already revoked its communicator
      locally does not send out another wave of messages. Each process tracks the
      revoke messages it receives from the others. Once it has received either a
      revoke message or a process failure notification for every other process, it
      releases its refcount on the communicator object. After the application has
      freed all of its references to the communicator (and all requests, files,
      etc. associated with it), the context ID is returned to the available pool.
      Signed-off-by: Junchao Zhang <jczhang@mcs.anl.gov>
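
      The per-process bookkeeping described in this commit might look something like
      the sketch below. It is an MPI-free model of a single process's view, not the
      CH3 packet handler itself: the revoke_state struct, its fields, MAX_PROCS, and
      the function names are all hypothetical, and the sketch assumes the revoke
      logic holds exactly one extra reference on the communicator until every peer
      is accounted for.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      #define MAX_PROCS 64 /* arbitrary bound for this sketch */

      /* Hypothetical revoke bookkeeping for one communicator on one process. */
      typedef struct {
          int nprocs;                 /* size of the communicator */
          int self;                   /* this process's rank */
          bool revoked;               /* already revoked locally? */
          bool holding_ref;           /* revoke logic still holds its reference? */
          bool heard_from[MAX_PROCS]; /* revoke or failure notice seen per peer */
          int waiting_for;            /* peers not yet accounted for */
          int refcount;               /* references on the communicator object */
      } revoke_state;

      static void revoke_state_init(revoke_state *s, int nprocs, int self) {
          assert(nprocs > 0 && nprocs <= MAX_PROCS && self >= 0 && self < nprocs);
          s->nprocs = nprocs;
          s->self = self;
          s->revoked = false;
          s->holding_ref = false;
          for (int i = 0; i < MAX_PROCS; i++) s->heard_from[i] = false;
          s->waiting_for = nprocs - 1;
          s->refcount = 1; /* the application's own reference */
      }

      /* Drop the revoke reference once every other process is accounted for. */
      static void revoke_maybe_release(revoke_state *s) {
          if (s->revoked && s->holding_ref && s->waiting_for == 0) {
              s->refcount--;
              s->holding_ref = false;
          }
      }

      /* Revoke locally. Returns true only the first time, which is when the
       * caller would send a revoke message to every other process. */
      static bool revoke_local(revoke_state *s) {
          if (s->revoked) return false; /* no second wave of messages */
          s->revoked = true;
          s->holding_ref = true;
          s->refcount++;
          revoke_maybe_release(s);
          return true;
      }

      /* Record a revoke message (is_revoke_msg) or a failure notice from `peer`.
       * Returns true if this process must now start its own wave. */
      static bool revoke_on_event(revoke_state *s, int peer, bool is_revoke_msg) {
          assert(peer >= 0 && peer < s->nprocs && peer != s->self);
          bool send_wave = is_revoke_msg ? revoke_local(s) : false;
          if (!s->heard_from[peer]) {
              s->heard_from[peer] = true;
              s->waiting_for--;
          }
          revoke_maybe_release(s);
          return send_wave;
      }
      ```

      Under these assumptions, the context ID would only become reusable after the
      revoke reference is released and the application has also dropped its own
      references, which is what keeps late messages from matching a reused ID.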
  3. 10 Oct, 2012 1 commit
  4. 17 Jul, 2012 1 commit
  5. 24 Apr, 2012 2 commits
  6. 28 Oct, 2010 1 commit
  7. 09 Oct, 2008 1 commit
  8. 02 Nov, 2007 1 commit