1. 30 May, 2015 7 commits
  2. 20 Apr, 2015 2 commits
    • Set size of IMMED data in RMA packets to 8 bytes. · de0412c2
      Xin Zhao authored

      Originally, the IMMED data in RMA packets was 16 bytes,
      which made the CH3 packet 56 bytes. This patch reduces
      the IMMED data in RMA packets to 8 bytes, so that the
      CH3 packet shrinks to 48 bytes, the same size as in
      mpich-3.1.4 (the old RMA infrastructure).
      Signed-off-by: Min Si <msi@il.is.s.u-tokyo.ac.jp>
      Signed-off-by: Antonio J. Pena <apenya@mcs.anl.gov>
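The size arithmetic in this commit can be illustrated with a minimal sketch. The structs below are hypothetical stand-ins, not the real CH3 packet layout: they assume 40 bytes of non-IMMED fields, which is what the 56-byte and 48-byte totals imply.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the non-IMMED fields of an RMA packet:
 * five 8-byte fields (40 bytes). Not the real CH3 packet layout. */
#define PKT_HEADER_FIELDS 5

struct rma_pkt_immed16 {
    uint64_t header[PKT_HEADER_FIELDS];
    char immed_data[16];        /* IMMED size before the patch */
};

struct rma_pkt_immed8 {
    uint64_t header[PKT_HEADER_FIELDS];
    char immed_data[8];         /* IMMED size after the patch */
};

/* Packet size with 16 bytes of IMMED data. */
size_t pkt_size_before(void) { return sizeof(struct rma_pkt_immed16); }

/* Packet size with 8 bytes of IMMED data. */
size_t pkt_size_after(void) { return sizeof(struct rma_pkt_immed8); }
```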
    • Move 'stream_offset' out of RMA packet struct. · 19f29078
      Xin Zhao authored

      'stream_offset' specifies the starting position (on the
      target window) of the current streaming unit in ACC-like
      operations. It was originally stored in the RMA packet
      struct, which could increase the size of the CH3 packet.

      This patch moves 'stream_offset' out of the RMA packet
      as follows:
      1. When the target data is a basic datatype, we use
      'stream_offset' and the starting address of the entire
      operation to calculate the starting address of the current
      streaming unit, and overwrite 'addr' in the RMA packet with
      that value.
      2. When the target data is a derived datatype, we cannot do
      the same, because the target needs to know both the starting
      address of the entire operation and the starting address of
      the current streaming unit. Therefore, we send
      'stream_offset' to the target separately.
      Signed-off-by: Min Si <msi@il.is.s.u-tokyo.ac.jp>
      Signed-off-by: Antonio J. Pena <apenya@mcs.anl.gov>
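The basic-datatype case described above can be sketched as a small address calculation. This is an illustration only: the function name is hypothetical, and it assumes that stream_offset counts packed bytes and is a multiple of the type size, which the commit message does not state.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: stream_offset counts packed bytes and is a multiple of
 * type_size; type_extent is the spacing between consecutive elements
 * on the target window. The function name is hypothetical. */
uintptr_t stream_unit_addr(uintptr_t base_addr, size_t stream_offset,
                           size_t type_size, size_t type_extent)
{
    /* Number of whole basic elements that precede this streaming unit. */
    size_t elems_before = stream_offset / type_size;
    return base_addr + elems_before * type_extent;
}
```

For a contiguous basic type (type_size == type_extent) this reduces to base_addr + stream_offset.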
  3. 09 Mar, 2015 1 commit
  4. 04 Mar, 2015 30 commits
    • Rename predefined_type / predef_type to basic_type. · 04deb880
      Xin Zhao authored

      In the MPI standard, predefined datatypes are called basic
      types. It is better for the names in the code to match the
      standard.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • 98c76f78
    • Modify SHM ACC/GACC to avoid allocate large buffer. · 7c890ab2
      Xin Zhao authored

      The original implementation of ACC/GACC on SHM first
      allocates a temporary buffer with the same data layout
      as the target data, copies the entire origin data into
      that buffer, and then performs the ACC computation
      between the temporary buffer and the target buffer.
      The temporary buffer can consume a potentially large
      amount of memory.

      This patch fixes the issue as follows: (1) the SHM ACC/GACC
      routines call do_accumulate_op() directly, which requires
      the origin data to be 'in a packed manner'; (2) if the
      origin data is a basic type, we perform do_accumulate_op()
      directly between the origin buffer and the target buffer;
      if the origin data is a derived type, we stream it by
      copying part of the origin data into a packed streaming
      buffer and performing do_accumulate_op() between the
      streaming buffer and the target buffer each time.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
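The derived-type path can be sketched as a loop that packs a bounded chunk of the origin data and accumulates it before packing the next one. Everything here is a simplified assumption: accumulate_sum() merely stands in for do_accumulate_op(), the derived layout is reduced to a fixed stride, and the streaming buffer is tiny for illustration.

```c
#include <stddef.h>

#define STREAM_ELEMS 4  /* tiny streaming buffer, for illustration only */

/* Stand-in for do_accumulate_op(): sum packed origin data into target. */
static void accumulate_sum(const int *packed, size_t count, int *target)
{
    for (size_t i = 0; i < count; i++)
        target[i] += packed[i];
}

/* Origin holds `count` ints spaced `stride` elements apart (a simple
 * derived layout); target is contiguous. Memory use is bounded by the
 * streaming buffer rather than by the size of the whole operation. */
void shm_acc_streamed(const int *origin, size_t stride, size_t count,
                      int *target)
{
    int stream_buf[STREAM_ELEMS];
    size_t done = 0;

    while (done < count) {
        size_t n = count - done;
        if (n > STREAM_ELEMS)
            n = STREAM_ELEMS;
        /* Pack the next chunk of origin data into the streaming buffer. */
        for (size_t i = 0; i < n; i++)
            stream_buf[i] = origin[(done + i) * stride];
        accumulate_sum(stream_buf, n, target + done);
        done += n;
    }
}
```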
    • Allocate buffer with stream size for ACC/GACC data piggybacked with LOCK. · 002ce8c8
      Xin Zhao authored

      For queued ACC/GACC data piggybacked with LOCK, we do not
      need to allocate a buffer for the entire operation; a
      buffer of stream unit size is sufficient. This patch makes
      that change.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Modify do_accumulate_op to allow for packed basic type data as input. · 0d5146ba
      Xin Zhao authored

      Originally, do_accumulate_op() performed the ACC computation
      on the target between data from the origin side and data in
      the target window. It required the target side to first
      unpack the received origin data into the same data layout as
      the target data before calling the function, which could
      consume a potentially large amount of memory.

      This patch changes do_accumulate_op() in the following
      ways:

      (1) It requires the origin data passed to the function to
      be "in a packed manner", meaning that all basic type elements
      in the origin data appear to be placed one after another.
      Note that the origin data is not necessarily contiguous, since
      a basic type may itself be non-contiguous. If the basic type
      is contiguous, then the origin data must be contiguous.

      (2) It adds a new function argument, stream_offset, which
      specifies a starting location in the target data. This allows
      the origin data to be applied to a stream-size portion of the
      target data.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
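A minimal sketch of the stream_offset argument, restricted to a sum over contiguous ints. The function name and the interpretation of stream_offset as a byte offset into a contiguous target are assumptions made for illustration; the real do_accumulate_op() handles arbitrary operations and datatypes.

```c
#include <stddef.h>

/* Hypothetical simplification of do_accumulate_op() for a sum on ints.
 * Assumption: stream_offset is a byte offset into a contiguous target
 * and is a multiple of sizeof(int). */
void accumulate_sum_at(const int *packed_origin, size_t count,
                       int *target_base, size_t stream_offset)
{
    /* Start of the portion of the target covered by this stream unit. */
    int *target = (int *)((char *)target_base + stream_offset);

    for (size_t i = 0; i < count; i++)
        target[i] += packed_origin[i];
}
```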
    • Bug-fix: add FOP req types. · c9435750
      Xin Zhao authored

      This patch adds req types for the FOP operation and calls the
      FOP req handler after the SRBuf is unpacked.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Correct the name of RMA requests types. · f75eb4eb
      Xin Zhao authored

      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add stream_offset to ACC-related packets and request struct. · d8eb8de2
      Xin Zhao authored

      Add a stream_offset field to ACC-related packets and the
      request struct to record the starting position of the current
      stream unit within the entire target data.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add counter in op struct to remember number of stream units issued. · c986b927
      Xin Zhao authored

      Add a counter to the op struct to record the number of stream
      units that have already been issued. For example, once the
      first stream unit piggybacked with LOCK has been issued, we
      temporarily stop issuing the following units. After the origin
      receives the ACK from the target, it can continue issuing the
      remaining units. This counter helps avoid issuing the first
      unit again.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
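The role of the counter can be sketched with a small state machine. The struct and its field names are hypothetical and much simpler than the real op struct; the actual send is elided.

```c
#include <stdbool.h>

/* Hypothetical, simplified op struct; field names are illustrative. */
struct rma_op {
    int total_stream_units;
    int issued_stream_units;   /* units already sent */
    bool piggyback_lock;       /* first unit carries a LOCK request */
    bool lock_acked;           /* ACK for the LOCK has been received */
};

/* Issues as many units as currently allowed and returns how many were
 * issued by this call. Resuming from issued_stream_units ensures the
 * first unit is not issued again after the ACK arrives. */
int issue_stream_units(struct rma_op *op)
{
    int issued_now = 0;

    while (op->issued_stream_units < op->total_stream_units) {
        /* With a piggybacked LOCK, hold back after the first unit
         * until the ACK arrives. */
        if (op->piggyback_lock && !op->lock_acked &&
            op->issued_stream_units >= 1)
            break;
        /* ... send unit number op->issued_stream_units here ... */
        op->issued_stream_units++;
        issued_now++;
    }
    return issued_now;
}
```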
    • Reset flags for stream unit within one RMA operation. · 421f4359
      Xin Zhao authored

      For all stream units within one RMA operation, we only
      need to piggyback the flags for the first operation on the
      first stream unit, and the flags for the last operation on
      the last stream unit.

      Note that for operations piggybacked with the LOCK flag, we
      should issue only the first stream unit and wait for the
      ACK from the target to decide whether to continue issuing
      the following units or to re-transmit the first unit.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
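The flag placement can be sketched as a function of the unit index. The flag names below are hypothetical placeholders for "flags that belong on the first operation" and "flags that belong on the last operation"; they are not MPICH's actual flag set.

```c
/* Hypothetical flag bits; FLAG_LOCK stands for a flag carried by the
 * first operation, FLAG_UNLOCK for one carried by the last. */
enum {
    FLAG_NONE   = 0,
    FLAG_LOCK   = 1 << 0,
    FLAG_UNLOCK = 1 << 1
};

/* Returns the flags to piggyback on stream unit `unit_idx` out of
 * `unit_count` units, given the flags attached to the whole operation. */
int stream_unit_flags(int op_flags, int unit_idx, int unit_count)
{
    int flags = FLAG_NONE;

    if (unit_idx == 0)
        flags |= op_flags & FLAG_LOCK;      /* first unit only */
    if (unit_idx == unit_count - 1)
        flags |= op_flags & FLAG_UNLOCK;    /* last unit only */
    return flags;
}
```

When an operation consists of a single unit, that unit is both first and last and carries both kinds of flags.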
    • 0641e2f1
    • Cutting ACC and GACC messages. · 382b04c4
      Xin Zhao authored

      In this patch, we define the streaming unit size to be the
      same as the SRBuf size (256 * 1024 bytes) and cut the ACC/GACC
      packet accordingly. A streaming unit always contains complete
      basic type elements and never a partial one.

      Note that we also increment the reference count of the
      derived datatype, since multiple streaming units within
      one RMA operation refer to it.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
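The cutting rule can be sketched as integer arithmetic: a unit holds as many whole basic elements as fit in 256 * 1024 bytes, and the operation needs enough units to cover all elements. The function names are hypothetical, and the sketch assumes a nonzero type size no larger than the streaming unit.

```c
#include <stddef.h>

#define STREAM_UNIT_SIZE (256 * 1024)   /* same as the SRBuf size */

/* Whole basic elements that fit in one streaming unit. Assumes
 * 0 < type_size <= STREAM_UNIT_SIZE. */
size_t elems_per_stream_unit(size_t type_size)
{
    return STREAM_UNIT_SIZE / type_size;
}

/* Number of streaming units needed for elem_count basic elements. */
size_t num_stream_units(size_t elem_count, size_t type_size)
{
    size_t per_unit = elems_per_stream_unit(type_size);
    return (elem_count + per_unit - 1) / per_unit;   /* round up */
}
```

With a 12-byte basic type, for example, each unit carries 21845 elements (262140 bytes) and the last 4 bytes of the unit stay unused rather than holding a partial element.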
    • Split issue_from_origin_buffer into normal and stream version. · d3cbeab3
      Xin Zhao authored

      The stream version of issue_from_origin_buffer is used in ACC/GACC
      operations. It allows the caller to stream the data by passing
      stream_offset and stream_size to the function.

      The normal version of issue_from_origin_buffer is used in other
      RMA operations. It issues all the data as a whole.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Make create_datatype be able to create more general new datatype. · ca223da0
      Xin Zhao authored

      The original implementation of create_datatype can only generate
      a new datatype that describes 'dtype_info + dataloop + one data
      layout'. It does not support generating 'dtype_info + dataloop +
      multiple data layouts'. This patch extends create_datatype to
      support that case.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Use a request array in RMA operation. · ab8386e7
      Xin Zhao authored

      Because one RMA operation may be cut into multiple packets,
      and each packet needs a request object to track its completion,
      we use a request array instead of a single request in the
      RMA operation structure.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Increment active_req_cnt when issuing the packet. · 1a3e661f
      Xin Zhao authored

      Increment active_req_cnt when a packet is actually issued
      rather than when the operation is issued, since one
      operation may be cut into multiple packets.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Return request pointer from issue_from_origin_buffer function. · 6c81f6cd
      Xin Zhao authored

      In the original implementation, issue_from_origin_buffer
      is used to issue one RMA packet. Since each RMA operation
      had only one packet, the function attached the resulting
      request pointer directly to the RMA operation structure. Now
      that one RMA operation is going to be cut into multiple stream
      packets, this function will be used to issue each streamed
      packet, and each RMA operation may have multiple requests.
      Therefore, we make this function return the request pointer
      and let the caller store the request in the request array of
      the op structure.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • 9fa6582a
    • Code refactoring: setting ureq after issuing op in a function. · a36fd9dd
      Xin Zhao authored

      After (1) issuing an op (no LOCK flag), or (2) issuing an op
      (with LOCK flag) and receiving an ACK that the LOCK is granted
      or queued, we should mark the user request (ureq) as completed.
      This patch wraps the work of setting ureq into a function
      and calls that function after (1) or (2) happens.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Modify location of setting next_op_to_issue and sync_flag to NONE · fd92b7bc
      Xin Zhao authored

      After we issue an op, we set next_op_to_issue to the next op,
      and if the next op is NULL, we set sync_flag to NONE. When we
      receive a lock ACK saying that the lock request was discarded,
      we set next_op_to_issue back to the current op and reset
      sync_flag from NONE to the corresponding flag, since we need
      to re-transmit the current op.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Change name from data_size to buf_size. · 45cdb282
      Xin Zhao authored

      When the lock is not satisfied, we queue up
      the lock request and op data in a lock entry
      queue. The lock entry struct uses 'data_size'
      to record the size of the buffer that stores the
      data. Since the buffer size is not necessarily
      type_size*count but may be type_extent*count,
      we rename 'data_size' to 'buf_size'.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Bug-fix: make RMA work correctly with pair basic type. · ce8bc310
      Xin Zhao authored

      The original implementation of RMA does not consider pair basic
      types (e.g. MPI_FLOAT_INT, MPI_DOUBLE_INT). It only
      works correctly with builtin datatypes (e.g. MPI_INT, MPI_FLOAT).
      This patch makes RMA work correctly with pair basic types.

      The bug has two parts: (1) when performing the ACC computation,
      the original implementation uses 'eltype' in the datatype
      structure, which is set only when all basic elements in the
      datatype have the same builtin datatype. When the basic elements
      have different builtin datatypes, as in pair datatypes, 'eltype'
      is set to MPI_DATATYPE_NULL, so the ACC computation cannot work
      with pair types; (2) for all basic type data, the original
      implementation assumes the data is contiguous and issues it in
      an unpacked manner with a length of the data size
      (count*type_size). This is incorrect for pair datatypes, because
      most pair datatypes are non-contiguous (type_extent != type_size).

      A previous patch already made 'eltype' store the basic type
      instead of the builtin type. This patch fixes the bug by
      (1) modifying the ACC computation to treat 'eltype' as a basic
      type; (2) using the noncontig API for non-contiguous basic type
      data, so that it is issued in a packed manner.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
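The gap between type_size and type_extent for a pair type can be illustrated with a C struct that mirrors MPI_DOUBLE_INT. The struct and pack_double_int() are hypothetical illustrations and not the noncontig API; on common 64-bit platforms the struct typically occupies 16 bytes while its members total 12.

```c
#include <stddef.h>
#include <string.h>

/* C analogue of MPI_DOUBLE_INT; the compiler may add trailing padding,
 * so sizeof(struct double_int) plays the role of type_extent. */
struct double_int {
    double val;
    int idx;
};

/* Plays the role of type_size: bytes of actual data per element. */
#define PAIR_PACKED_SIZE (sizeof(double) + sizeof(int))

/* Pack `count` pairs into `out` with no padding between elements and
 * return the number of bytes written (count * PAIR_PACKED_SIZE). */
size_t pack_double_int(const struct double_int *in, size_t count, char *out)
{
    size_t pos = 0;

    for (size_t i = 0; i < count; i++) {
        memcpy(out + pos, &in[i].val, sizeof(double));
        pos += sizeof(double);
        memcpy(out + pos, &in[i].idx, sizeof(int));
        pos += sizeof(int);
    }
    return pos;
}
```

Sending count*type_size bytes straight from the unpacked array would include padding and cut off trailing elements, which is the second part of the bug described above.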
    • 7899a602
    • Use function hook instead of function pointer for win_free. · 42b5fcf1
      Xin Zhao authored

      The original implementation of win_free is not correct. The
      problem is as follows:

      It uses a function pointer that is initially set to the CH3
      implementation and can be overridden by the channel layer if
      the channel provides a specific implementation. The CH3
      win_free implementation first checks that all RMA
      communication is finished and the epoch states are reset, then
      performs a global barrier, then frees the window resources
      that are allocated in CH3, and finally returns. The Nemesis
      win_free implementation directly frees the window resources
      that are allocated in Nemesis and calls the CH3 win_free last.
      This makes no sense, because we free the window resources before
      checking whether the RMA communication is completed.

      To fix this issue, we add a function hook for the channel layer
      to free its own resources; the function hook is called from
      the CH3 win_free.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
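The ordering fix can be sketched with a hook that the generic win_free calls only after its completion check. All types and names below are hypothetical simplifications, and the exact point at which the real CH3 win_free invokes the hook is an assumption.

```c
#include <stddef.h>

/* Hypothetical, simplified window; names are illustrative only. */
struct win {
    int pending_ops;      /* outstanding RMA operations */
    int channel_freed;    /* set when channel resources are released */
    int ch3_freed;        /* set when CH3 resources are released */
};

/* Table of optional channel hooks; NULL means "no hook installed". */
struct win_hooks {
    int (*win_free)(struct win *w);
};

static struct win_hooks hooks = { NULL };

void set_win_free_hook(int (*hook)(struct win *w)) { hooks.win_free = hook; }

/* Generic win_free: check completion first, then let the channel free
 * its own resources through the hook, then free the CH3 resources. */
int ch3_win_free(struct win *w)
{
    if (w->pending_ops != 0)
        return -1;                      /* RMA communication not complete */
    /* a global barrier would go here */
    if (hooks.win_free != NULL) {
        int rc = hooks.win_free(w);
        if (rc != 0)
            return rc;
    }
    w->ch3_freed = 1;
    return 0;
}

/* Example channel hook. */
int example_channel_win_free(struct win *w)
{
    w->channel_freed = 1;
    return 0;
}
```

Because the channel never replaces ch3_win_free itself, it cannot release its resources ahead of the completion check.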
    • Allow the channel layer to implement Win_gather_info function. · 9dbcae0c
      Xin Zhao authored

      In this patch, we first add a function pointer for Win_gather_info
      in CH3 to allow different channel layers to implement their own
      version of the Win_gather_info function. The function pointer is
      initially set to the default implementation in the CH3 layer. If
      the channel layer provides an implementation of Win_gather_info,
      it overrides the function pointer.

      Secondly, we provide an implementation of Win_gather_info in the
      Nemesis layer. In this implementation, we allocate basic_info_table[]
      in the SHM region, so that processes on the same node can share the
      same basic_info_table[].
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add a function hook to initialize window attributes in channel layer. · 7c1a8fb1
      Xin Zhao authored

      Some window attributes in the channel layer need to be
      initialized during window creation. In this patch, we
      first add a win_hooks table that contains pointers
      to the channel's implementation of the function hooks. Secondly,
      we add a function hook 'win_init' to allow the channel layer to
      initialize its own attributes. The hook is called from the
      CH3 win_init function.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Reduce size of shm_base_addrs[] from comm_size to node_size. · eddd8b91
      Xin Zhao authored

      For a given process, shm_base_addrs[] stores the base
      addresses (in the address space of this process) of the SHM
      windows on other processes. Its original size is comm_size.
      However, the maximum number of SHM windows that this process
      can access is node_size rather than comm_size, so memory is
      wasted because most slots in the array are NULL. In this patch
      we reduce the size of shm_base_addrs[] from comm_size to node_size.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Store window basic attributes into a struct on window. · 9404e953
      Xin Zhao authored

      In this patch, we gather the basic window attributes of other
      processes (base_addr, size, disp_unit, win_handle) in a
      struct called "basic_info_table". By doing this, we can use
      one contiguous memory region to store them.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Change name of MPIDI_CH3U_Win_create_gather to MPIDI_CH3U_Win_gather_info. · 131e06ef
      Xin Zhao authored

      The function MPIDI_CH3U_Win_create_gather exchanges window
      information among processes. It does not create a new window.
      Here we change the function name to a more suitable one.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
    • Add CH3 APIs and macros to allow channel to implement Alloc_mem/Free_mem. · 03d4c77b
      Xin Zhao authored

      Originally, MPIDI_Alloc_mem(size, info) and MPIDI_Free_mem(base_ptr)
      in the CH3 layer are implemented by calling MPIU_Malloc(size) and
      MPIU_Free(base_ptr) internally. This prevents channels for the
      underlying hardware from providing a specific implementation of
      Alloc_mem and Free_mem, which is necessary when registering memory
      for RDMA operations.

      This patch defines new APIs, MPIDI_CH3I_Alloc_mem(size, info)
      and MPIDI_CH3I_Free_mem(base_ptr), to allow channels to implement
      their own memory allocators. If the channel does not have its own
      implementation, MPICH will fall back to the default implementation
      in the CH3 layer, which uses MPIU_Malloc and MPIU_Free.

      Thanks to Steffen Christgau <christgau@cs.uni-potsdam.de> for
      this contribution.
      Signed-off-by: Pavan Balaji <balaji@anl.gov>
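The fallback behaviour can be sketched as follows. This is not the MPICH mechanism: the commit title mentions macros, which suggests a compile-time selection, whereas this sketch uses runtime function pointers for brevity. All names are hypothetical, and malloc/free stand in for MPIU_Malloc/MPIU_Free.

```c
#include <stdlib.h>

/* Hypothetical channel overrides; NULL means the channel provides none. */
typedef void *(*alloc_mem_fn)(size_t size);
typedef void (*free_mem_fn)(void *ptr);

static alloc_mem_fn channel_alloc_mem = NULL;
static free_mem_fn channel_free_mem = NULL;

void register_channel_allocator(alloc_mem_fn a, free_mem_fn f)
{
    channel_alloc_mem = a;
    channel_free_mem = f;
}

/* Use the channel allocator when present, else the default one. */
void *device_alloc_mem(size_t size)
{
    if (channel_alloc_mem != NULL)
        return channel_alloc_mem(size);
    return malloc(size);            /* stands in for MPIU_Malloc */
}

void device_free_mem(void *ptr)
{
    if (channel_free_mem != NULL)
        channel_free_mem(ptr);
    else
        free(ptr);                  /* stands in for MPIU_Free */
}

/* Example channel allocator that counts its allocations; a real one
 * might register the memory for RDMA here. */
static int channel_allocs = 0;

void *example_channel_alloc(size_t size)
{
    channel_allocs++;
    return malloc(size);
}

void example_channel_free(void *ptr) { free(ptr); }

int example_channel_alloc_count(void) { return channel_allocs; }
```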