1. 04 Mar, 2015 10 commits
    • Change name from data_size to buf_size. · 45cdb282
      Xin Zhao authored and Pavan Balaji committed

      When a lock cannot be granted immediately, we queue the
      lock request and the op data in a lock entry queue. The
      lock entry struct uses 'data_size' to record the size of
      the buffer that stores the data. Because the buffer size
      is not necessarily type_size*count but may be
      type_extent*count, we rename 'data_size' to 'buf_size'.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Bug-fix: make RMA work correctly with pair basic type. · ce8bc310
      Xin Zhao authored and Pavan Balaji committed

      The original implementation of RMA does not consider pair basic
      types (e.g. MPI_FLOAT_INT, MPI_DOUBLE_INT). It only
      works correctly with builtin datatypes (e.g. MPI_INT, MPI_FLOAT).
      This patch makes RMA work correctly with pair basic types.

      The bug has two parts: (1) when performing the ACC computation, the
      original implementation uses 'eltype' in the datatype structure, which
      is set only when all basic elements in the datatype have the same
      builtin datatype. When basic elements have different builtin datatypes,
      as in pair datatypes, 'eltype' is set to MPI_DATATYPE_NULL, so the
      ACC computation cannot work with pair types; (2) for all basic-type
      data, the original implementation assumes that the data is contiguous
      and issues it unpacked, with a length equal to the data size
      (count*type_size). This is incorrect for pair datatypes, because most
      pair datatypes are non-contiguous (type_extent != type_size).

      In the previous patch, we already made 'eltype' store the basic
      type instead of the builtin type. This patch fixes the bug by
      (1) modifying the ACC computation to treat 'eltype' as a basic
      type; and (2) using the noncontig API for non-contiguous basic-type
      data, so that it is issued in a packed manner.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
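The size/extent distinction behind this fix can be illustrated without MPI. The following is a minimal sketch, assuming that the extent of a pair type matches the size of the equivalent C struct; the struct and function names are hypothetical and are not taken from MPICH. On many 64-bit platforms the double/int pair has 12 bytes of data inside a 16-byte extent, while the float/int pair is typically contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the memory layouts of two pair types. */
struct float_int  { float  val; int idx; };
struct double_int { double val; int idx; };

/* "type_size": bytes of real data in one element, padding excluded. */
size_t float_int_size(void)  { return sizeof(float)  + sizeof(int); }
size_t double_int_size(void) { return sizeof(double) + sizeof(int); }

/* "type_extent": bytes one element spans in memory, padding included. */
size_t float_int_extent(void)  { return sizeof(struct float_int); }
size_t double_int_extent(void) { return sizeof(struct double_int); }

/* A type is contiguous only when its data fills its whole extent. */
int is_contiguous(size_t type_size, size_t type_extent)
{
    assert(type_size <= type_extent);
    return type_size == type_extent;
}

/* Bytes that must be sent for `count` elements once they are packed. */
size_t packed_bytes(size_t count, size_t type_size)
{
    return count * type_size;
}

/* Bytes the same elements span in the user buffer. */
size_t buffer_span(size_t count, size_t type_extent)
{
    return count * type_extent;
}
```

Sending `buffer_span` bytes as if they were `packed_bytes` of data (or the reverse) is the kind of mismatch the commit message describes; packing first removes the padding.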
    • 7899a602
    • Use function hook instead of function pointer for win_free. · 42b5fcf1
      Xin Zhao authored

      The original implementation of win_free is not correct. The
      problem is as follows:

      It uses a function pointer that is initially set to the CH3
      implementation and can be overridden by the channel layer if
      the channel provides a specific implementation. The CH3
      win_free implementation first checks that all RMA
      communication has finished and the epoch states are reset, then
      performs a global barrier, then frees the window resources
      that were allocated in CH3, and finally returns. The Nemesis
      win_free implementation directly frees the window resources
      that were allocated in Nemesis and calls the CH3 win_free last.
      This is wrong because the window resources are freed before
      checking whether the RMA communication has completed.

      To fix this issue, we add a function hook that lets the channel
      layer free its own resources; the hook is called from
      the CH3 win_free.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
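The ordering problem and its fix can be sketched as follows. This is an illustrative model only: all names (`win_hooks_t`, `ch3_win_free`, `nemesis_win_free_hook`, the event trace) are hypothetical, and placing the hook call after the barrier is an assumption, since the commit message only says the hook is called from the CH3 win_free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event trace so the call order can be inspected. */
typedef enum {
    EV_CHECK_RMA_COMPLETE,
    EV_BARRIER,
    EV_CHANNEL_FREE,
    EV_CH3_FREE
} event_t;

#define MAX_EVENTS 8
event_t events[MAX_EVENTS];
int num_events = 0;

void record(event_t ev)
{
    assert(num_events < MAX_EVENTS);
    events[num_events++] = ev;
}

/* Hypothetical hook table; a channel fills in the hooks it implements. */
typedef struct {
    int (*win_free)(void *win);
} win_hooks_t;

win_hooks_t win_hooks = { NULL };

/* Channel-side hook: frees only the resources the channel owns. */
int nemesis_win_free_hook(void *win)
{
    (void) win;
    record(EV_CHANNEL_FREE);
    return 0;
}

/* CH3-side win_free: owns the ordering and calls the hook when it is safe. */
int ch3_win_free(void *win)
{
    record(EV_CHECK_RMA_COMPLETE);   /* all RMA done, epoch state reset */
    record(EV_BARRIER);              /* global synchronization */
    if (win_hooks.win_free != NULL) {
        int err = win_hooks.win_free(win);
        if (err != 0)
            return err;
    }
    record(EV_CH3_FREE);             /* CH3-owned resources */
    return 0;
}
```

Because the channel no longer replaces the whole function, it cannot free its resources ahead of the completion check; it only supplies the step that CH3 invokes at a point of its choosing.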
    • Allow the channel layer to implement Win_gather_info function. · 9dbcae0c
      Xin Zhao authored

      In this patch, we first add a function pointer for Win_gather_info
      in CH3 so that different channel layers can implement their own
      version of the Win_gather_info function. The function pointer is
      initially set to the default implementation in the CH3 layer. If the
      channel layer provides an implementation of Win_gather_info, it
      overrides the function pointer.

      Second, we provide an implementation of Win_gather_info in the
      Nemesis layer. This implementation allocates basic_info_table[]
      in the SHM region, so that processes on the same node can share the
      same basic_info_table[].
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
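A minimal sketch of the override pattern described above, using hypothetical names throughout (`win_gather_info_fn`, `channel_init`, `create_window`, and the `table_location_t` return value are illustrative and do not reflect the real MPICH signatures):

```c
#include <assert.h>
#include <stddef.h>

/* Where the (hypothetical) info table ends up being allocated. */
typedef enum { TABLE_PRIVATE, TABLE_SHM } table_location_t;

typedef table_location_t (*win_gather_info_fn)(void);

/* Default CH3 implementation: table lives in process-private memory. */
table_location_t ch3_win_gather_info(void)
{
    return TABLE_PRIVATE;
}

/* Channel implementation: table lives in a shared-memory region. */
table_location_t nemesis_win_gather_info(void)
{
    return TABLE_SHM;
}

/* The pointer starts out at the default implementation. */
win_gather_info_fn win_gather_info = ch3_win_gather_info;

/* A channel that has its own version overrides the pointer at init time. */
void channel_init(int has_own_impl)
{
    if (has_own_impl)
        win_gather_info = nemesis_win_gather_info;
}

/* Window creation always goes through the pointer. */
table_location_t create_window(void)
{
    assert(win_gather_info != NULL);
    return win_gather_info();
}
```

Unlike the win_free case, a full override is reasonable here because the channel version replaces the default behavior rather than adding a step that must run in a particular order.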
    • Add a function hook to initialize window attributes in channel layer. · 7c1a8fb1
      Xin Zhao authored

      Some window attributes in the channel layer need to be
      initialized during window creation. In this
      patch, we first add a win_hooks table that contains pointers
      to the channel's implementations of the function hooks. Second,
      we add a function hook 'win_init' that allows the channel layer to
      initialize its own attributes. The hook is called from the
      CH3 win_init function.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Reduce size of shm_base_addrs[] from comm_size to node_size. · eddd8b91
      Xin Zhao authored

      For a given process, shm_base_addrs[] stores the base
      addresses (in the address space of this process) of the SHM windows
      on other processes. Its original size is comm_size. However,
      the maximum number of SHM windows that this process can access
      is node_size rather than comm_size, so memory is wasted
      because most slots in the array are NULL. In this patch
      we reduce the size of shm_base_addrs[] from comm_size to node_size.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
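Shrinking the array means lookups must translate a communicator rank into a node-local rank first. The sketch below shows one way this could look; the `win_t` layout, the `node_rank_of` mapping (with -1 for processes on other nodes), and the function names are assumptions made for illustration, not the MPICH data structures.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int comm_size;            /* processes in the communicator */
    int node_size;            /* processes on this node */
    const int *node_rank_of;  /* comm rank -> node-local rank, -1 if remote */
    void **shm_base_addrs;    /* node_size entries instead of comm_size */
} win_t;

/* Allocate one slot per node-local process; returns 0 on success. */
int win_alloc_shm_table(win_t *win)
{
    assert(win->node_size > 0 && win->node_size <= win->comm_size);
    win->shm_base_addrs = calloc((size_t) win->node_size, sizeof(void *));
    return win->shm_base_addrs != NULL ? 0 : -1;
}

/* Look up the SHM base address of a process given its communicator rank. */
void *shm_base_of(const win_t *win, int comm_rank)
{
    assert(comm_rank >= 0 && comm_rank < win->comm_size);
    int local = win->node_rank_of[comm_rank];
    if (local < 0)
        return NULL;          /* not on this node: no SHM window to access */
    return win->shm_base_addrs[local];
}
```

With 6 processes in the communicator and 2 on the node, the table holds 2 pointers rather than 6, and remote ranks are rejected by the mapping instead of by NULL slots.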
    • Store window basic attributes into a struct on window. · 9404e953
      Xin Zhao authored

      In this patch, we gather window basic attributes of other
      processes (base_addr, size, disp_unit, win_handle) using a
      struct called "basic_info_table". By doing this, we can use
      one contiguous memory region to store them.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Change name of MPIDI_CH3U_Win_create_gather to MPIDI_CH3U_Win_gather_info. · 131e06ef
      Xin Zhao authored

      Function MPIDI_CH3U_Win_create_gather exchanges the window
      information among processes. It does not create a new window.
      Here we change the function name to a more suitable one.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add CH3 APIs and macros to allow channel to implement Alloc_mem/Free_mem. · 03d4c77b
      Xin Zhao authored

      Originally, MPIDI_Alloc_mem(size, info) and MPIDI_Free_mem(base_ptr)
      in the CH3 layer are implemented by calling MPIU_Malloc(size) and
      MPIU_Free(base_ptr) internally. This prevents a channel from providing
      a hardware-specific implementation of Alloc_mem and Free_mem,
      which is necessary when registering memory for RDMA operations.

      This patch defines new APIs, MPIDI_CH3I_Alloc_mem(size, info)
      and MPIDI_CH3I_Free_mem(base_ptr), to allow channels to implement
      their own memory allocators. If the channel does not have its own
      implementation, MPICH will fall back to the default implementation
      in the CH3 layer, which uses MPIU_Malloc and MPIU_Free.

      Thanks to Steffen Christgau <christgau@cs.uni-potsdam.de> for
      this contribution.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
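One plausible shape for the macro-based fallback is sketched below. The macro `CHANNEL_HAS_ALLOC_MEM`, the `CH3I_*` macros, and all function names are hypothetical; the real MPICH macros and build logic are not shown in the commit message, and plain `malloc`/`free` stand in for MPIU_Malloc/MPIU_Free.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts calls into the channel allocator, for illustration only. */
int channel_allocs = 0;

/* Hypothetical channel allocator, e.g. one that also registers the
 * memory for RDMA. Here it just forwards to malloc. */
void *channel_alloc_mem(size_t size)
{
    channel_allocs++;
    return malloc(size);
}

void channel_free_mem(void *ptr)
{
    free(ptr);
}

/* A channel opts in by defining CHANNEL_HAS_ALLOC_MEM (hypothetical);
 * otherwise the default allocator is used. */
#ifdef CHANNEL_HAS_ALLOC_MEM
#define CH3I_ALLOC_MEM(size) channel_alloc_mem(size)
#define CH3I_FREE_MEM(ptr)   channel_free_mem(ptr)
#else
#define CH3I_ALLOC_MEM(size) malloc(size)
#define CH3I_FREE_MEM(ptr)   free(ptr)
#endif

/* Device-level entry points stay the same for every channel. */
void *device_alloc_mem(size_t size)
{
    assert(size > 0);
    return CH3I_ALLOC_MEM(size);
}

void device_free_mem(void *ptr)
{
    CH3I_FREE_MEM(ptr);
}
```

Compiled without the opt-in macro, `device_alloc_mem` never touches the channel allocator, which mirrors the fallback behavior the commit describes.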
  2. 03 Mar, 2015 1 commit
  3. 13 Feb, 2015 13 commits
    • Delete comments that no longer make sense. · 21126e9e
      Xin Zhao authored

      The comments are no longer relevant to
      the new RMA infrastructure.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Delete unnecessary code. · e3ccad1f
      Xin Zhao authored

      Here req->dev.user_count is used when receiving FOP/CAS response
      data on the origin in PktHandler_FOPResp and PktHandler_CASResp. Since
      the count is always 1, we do not set rma_op->result_count and
      instead set req->dev.user_count to 1 directly in the packet handlers.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Simplify code of issuing RMA packets. · e3fc7e70
      Xin Zhao authored

      When issuing RMA packets, we do not need to
      store target_win_handle in the request on the
      origin side; we only need to store source_win_handle,
      because when the response data comes back, only
      source_win_handle is used on the origin
      side. This patch simplifies the code accordingly.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Remove source_win_handle from GET-like RMA packets. · 80a71e11
      Xin Zhao authored

      For GET-like RMA packets and response packets (GACC,
      GET, FOP, CAS, GACC_RESP, GET_RESP, FOP_RESP, CAS_RESP),
      we originally carried source_win_handle in the packet struct
      in order to locate the window handle on the origin side in the
      packet handler of response packets. However, this is
      not necessary because source_win_handle can be stored
      in the request on the origin side. This patch deletes
      source_win_handle from those packets to reduce the size
      of the packet union.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Use memcpy for structure assignment. · 59afc29c
      Xin Zhao authored

      In this patch we replace "=" with the memcpy function
      when assigning the content of one struct to another.
      Using "=" in this case is not compatible with the LLVM
      compiler.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Change argument of function finish_op_on_target. · 1b30ab19
      Xin Zhao authored

      In this patch, we replace one argument of function
      finish_op_on_target, "packet(op) type", with "has_response_data".
      finish_op_on_target does not care which specific
      packet(op) type it is processing; it only cares
      whether the current op has response data (like GET/GACC).
      Changing the argument in this way simplifies the
      code by avoiding a lookup of the packet(op) type every time
      before calling finish_op_on_target.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Rewrite code of piggybacking IMMED data with RMA packets. · de9d0f21
      Xin Zhao authored

      Originally we added "immed_data" and "immed_len" areas to RMA packets
      in order to piggyback a small amount of data with the packet header and
      reduce the number of packets (note that "immed_len" is necessary when
      the piggybacked data is not the entire data). However, those areas
      potentially increase the packet union size and slow down two-sided
      communication. This patch fixes this issue.

      In this patch, we remove "immed_data" and "immed_len" from the normal
      "MPIDI_CH3_Pkt_XXX_t" operation type (e.g. MPIDI_CH3_Pkt_put_t), and
      we introduce a new "MPIDI_CH3_Pkt_XXX_immed_t" packet type for each
      operation (e.g. MPIDI_CH3_Pkt_put_immed_t).

      "MPIDI_CH3_Pkt_XXX_immed_t" is used when (1) both origin and target
      are basic datatypes, AND (2) the data to be sent fits entirely
      into the header. As a result, "MPIDI_CH3_Pkt_XXX_immed_t" needs the
      "immed_data" area but can drop the "immed_len" area. Also, since it only
      works with a basic target datatype, it can drop the "dataloop_size" area
      as well. All operations that do not satisfy both (1) and (2) use the
      normal "MPIDI_CH3_Pkt_XXX_t" type.

      Originally we always piggybacked FOP data into the packet header,
      which made the packet size too large. In this patch we split the
      FOP operation into IMMED packets and normal packets.

      Because CAS only works with 2 basic datatypes and non-complex
      elements, the amount of data is relatively small, so we always piggyback
      the data with the packet header and only use the "MPIDI_CH3_Pkt_XXX_immed_t"
      packet type for CAS.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
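The selection rule for IMMED packets can be sketched as a small predicate. Everything here is illustrative: `IMMED_DATA_BYTES`, the simplified packet structs, and the function names are assumptions, and the real packet layouts carry more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capacity of the immediate-data area in the header. */
#define IMMED_DATA_BYTES 16

typedef enum { PKT_PUT, PKT_PUT_IMMED } pkt_type_t;

/* Normal packet: no immediate data, but may describe a derived target type. */
typedef struct {
    pkt_type_t type;
    int count;
    int dataloop_size;
} pkt_put_t;

/* IMMED packet: carries all the data, so no immed_len or dataloop_size. */
typedef struct {
    pkt_type_t type;
    int count;
    char immed_data[IMMED_DATA_BYTES];
} pkt_put_immed_t;

/* Conditions (1) and (2) from the commit message. */
pkt_type_t choose_put_pkt(int origin_is_basic, int target_is_basic,
                          size_t data_bytes)
{
    if (origin_is_basic && target_is_basic && data_bytes <= IMMED_DATA_BYTES)
        return PKT_PUT_IMMED;
    return PKT_PUT;
}

/* Copy the whole payload into the header of an IMMED packet. */
void fill_put_immed(pkt_put_immed_t *pkt, const void *data, int count,
                    size_t data_bytes)
{
    assert(data_bytes <= IMMED_DATA_BYTES);
    pkt->type = PKT_PUT_IMMED;
    pkt->count = count;
    memcpy(pkt->immed_data, data, data_bytes);
}
```

Because the IMMED variant always holds the entire payload, the receiver can derive the length from the count and the basic datatype, which is why a separate length field becomes unnecessary.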
    • Remove lock_type and origin_rank areas from RMA packet. · 81e2b274
      Xin Zhao authored

      Originally we added lock_type and origin_rank areas
      to the RMA packet in order to piggyback passive lock requests
      with RMA operations. However, those areas potentially
      enlarge the packet union size, and they are actually
      not necessary and can be completely avoided.

      "Lock_type" is used to remember what type of lock (shared or
      exclusive) the origin wants to acquire on the target. To remove
      it from the RMA packet, we use the flags (which already exist in
      the RMA packet) to remember this information.

      "Origin_rank" is used to remember which origin has sent a lock
      request to the target, so that when the lock is granted to this
      origin later, the target can send an ack to that origin. In fact,
      the target does not need to store origin_rank; it can store only
      origin_vc, which is known from the progress engine on the target side.
      Therefore, we can completely remove origin_rank from the RMA packet.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
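Encoding the lock type in existing flag bits can be sketched as below. The flag names and bit positions are hypothetical and do not correspond to the actual MPICH flag definitions.

```c
#include <assert.h>

/* Hypothetical packet flag bits; one field carries several facts. */
enum {
    PKT_FLAG_NONE           = 0,
    PKT_FLAG_LOCK_SHARED    = 1 << 0,
    PKT_FLAG_LOCK_EXCLUSIVE = 1 << 1,
    PKT_FLAG_UNLOCK         = 1 << 2
};

typedef enum { LOCK_NONE, LOCK_SHARED, LOCK_EXCLUSIVE } lock_type_t;

/* Origin side: piggyback a lock request by setting a flag bit. */
int add_lock_request(int flags, lock_type_t lock)
{
    /* A packet carries at most one lock request. */
    assert(!(flags & (PKT_FLAG_LOCK_SHARED | PKT_FLAG_LOCK_EXCLUSIVE)));
    if (lock == LOCK_SHARED)
        return flags | PKT_FLAG_LOCK_SHARED;
    if (lock == LOCK_EXCLUSIVE)
        return flags | PKT_FLAG_LOCK_EXCLUSIVE;
    return flags;
}

/* Target side: recover the requested lock type from the flags. */
lock_type_t lock_type_from_flags(int flags)
{
    if (flags & PKT_FLAG_LOCK_EXCLUSIVE)
        return LOCK_EXCLUSIVE;
    if (flags & PKT_FLAG_LOCK_SHARED)
        return LOCK_SHARED;
    return LOCK_NONE;
}
```

Since the flags field is already part of every RMA packet, this costs no extra bytes, whereas a dedicated lock_type member would enlarge each packet struct that carried it.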
    • Add comments about RMA packet wrappers. · d46b848a
      Xin Zhao authored

      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Modify packet wrappers to make them complete. · 064e60ce
      Xin Zhao authored

      Some packet wrappers did not include all packet types;
      this patch adds the missing packet types to those wrappers.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Re-apply modifications on mpidpkt.h. · fa958833
      Xin Zhao authored

      This patch re-applies the modifications to mpidpkt.h that were
      temporarily reverted in bb3f9623.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Revert "Code-refactor: arrange RMA pkt structure." · 2cbc9180
      Xin Zhao authored

      This reverts commit 389aab16.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Temporarily revert commits for src/mpid/ch3/include/mpidpkt.h · bb3f9623
      Xin Zhao authored

      We are going to revert commit 389aab16 because it re-ordered
      the attributes in the RMA packet structs in mpidpkt.h and messed up
      the alignments.

      This commit temporarily reverts the following commits, reverting
      only their modifications to mpidpkt.h made after commit 389aab16:

      e36203c3, 45afd1fd, 3a05784f, 87acbbbe, b155e7e0

      We will re-apply those modifications after we revert 389aab16.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
  4. 03 Feb, 2015 1 commit
  5. 30 Jan, 2015 1 commit
  6. 13 Jan, 2015 1 commit
    • Move MPID_Comm_AS_enabled to MPID layer function · d1ab9e68
      Wesley Bland authored

      It was pointed out that putting this in a macro and failing silently
      when unimplemented makes things challenging for derivatives that
      will implement this function in the future. By moving this to an MPID
      level function, it becomes more obvious that the function should be
      implemented later.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
  7. 12 Jan, 2015 2 commits
    • Change MPIDI_CH3I_Comm_AS_enabled to be MPID level · 8cbbcae4
      Wesley Bland authored

      This macro was used inside CH3 to determine if the communicator could be
      used for anysource communication. With the rewrite of the anysource
      fault tolerance logic, it is now necessary to use it at the MPI level.
      Because it is a macro and not a function, the macro is defined in
      mpiimpl.h as (1) and then overwritten in the ch3 device. Future devices
      can also overwrite it if desired.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
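The "default macro, overwritten by the device" arrangement might look roughly like the sketch below. The macro name `COMM_AS_ENABLED`, the `comm_t` struct, and the use of `#undef` are assumptions for illustration; the commit message does not show how the real macro is redefined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical communicator with a per-communicator anysource flag. */
typedef struct {
    int anysource_enabled;
} comm_t;

/* Generic layer default: every communicator may be used for anysource. */
#define COMM_AS_ENABLED(comm) (1)

/* Device layer: replace the default with a real per-communicator check. */
#undef COMM_AS_ENABLED
#define COMM_AS_ENABLED(comm) ((comm)->anysource_enabled)

/* MPI-level code uses the macro without knowing which version is active. */
int can_post_anysource(const comm_t *comm)
{
    assert(comm != NULL);
    return COMM_AS_ENABLED(comm) ? 1 : 0;
}
```

A device that never redefines the macro silently gets the constant 1, which is the kind of silent default that the later commit d1ab9e68 replaces with an MPID-level function.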
    • Strip out pending ANY_SOURCE request handling · 7a785c84
      Wesley Bland authored

      The existing way that we handle non-blocking requests involving wildcard
      receive operations is incorrect. We're cancelling request operations and
      trying to recreate them later. In the meantime, it's messing with
      matching and makes it possible (likely?) that some messages that arrive
      will never be matched. A new way of handling this is coming next.
      Signed-off-by: Huiwei Lu <huiweilu@mcs.anl.gov>
  8. 16 Dec, 2014 11 commits