Perf-optimize: support piggybacking LOCK on large RMA operations.
Previously, a LOCK request could be piggybacked only on small RMA operations (those whose data fits entirely in the packet header). This added communication overhead for larger operations, because the origin side had to wait for the LOCK ACK before it could transmit data to the target. This patch adds support for piggybacking LOCK on RMA operations of arbitrary size. Note that (1) this works only with basic datatypes; (2) if the LOCK cannot be satisfied, the operation is temporarily buffered on the target side. No reviewer.
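The target-side behavior described in (2) can be illustrated with a minimal sketch. This is not the code from the patch: `target_t`, `handle_put_with_lock`, and `release_lock` are hypothetical names, the window is a fixed 64-byte array, and only a single buffered operation is supported, all of which are simplifying assumptions made for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical target-side state; not the actual MPICH structures. */
typedef struct {
    int    lock_held;     /* 1 while some origin holds the window lock */
    char   window[64];    /* target window memory (fixed size for the sketch) */
    char  *pending_data;  /* payload of the buffered operation, or NULL */
    size_t pending_len;   /* length of the buffered payload in bytes */
    size_t pending_off;   /* window offset the buffered payload targets */
} target_t;

/* Handle a PUT that carries a piggybacked LOCK request.
 * Returns 1 if the lock was granted and the data applied at once,
 * 0 if the operation was buffered, -1 on error. */
int handle_put_with_lock(target_t *t, size_t off, const char *data, size_t len)
{
    if (off > sizeof t->window || len > sizeof t->window - off)
        return -1;                       /* payload does not fit in the window */

    if (!t->lock_held) {
        t->lock_held = 1;                /* grant the lock ... */
        memcpy(t->window + off, data, len); /* ... and apply the data directly */
        return 1;
    }

    if (t->pending_data != NULL)
        return -1;                       /* sketch keeps only one pending op */

    /* Lock is busy: copy the payload aside until the lock is released. */
    t->pending_data = malloc(len ? len : 1);
    if (t->pending_data == NULL)
        return -1;
    memcpy(t->pending_data, data, len);
    t->pending_len = len;
    t->pending_off = off;
    return 0;
}

/* Release the lock. If an operation is buffered, hand the lock to it and
 * apply its data. Returns 1 if a buffered operation was applied, else 0. */
int release_lock(target_t *t)
{
    t->lock_held = 0;
    if (t->pending_data == NULL)
        return 0;

    t->lock_held = 1;                    /* lock passes to the buffered op */
    memcpy(t->window + t->pending_off, t->pending_data, t->pending_len);
    free(t->pending_data);
    t->pending_data = NULL;
    t->pending_len = 0;
    return 1;
}
```

In this sketch the origin never waits for a separate LOCK ACK before sending its data: the payload arrives together with the lock request, and the target either applies it immediately or holds a copy until the lock becomes available.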
Showing with 835 additions and 157 deletions