Handle caching is incompatible with shared memory routing
Note: This issue only applies to situations where automatic shared memory routing is used.
tl;dr: Margo's handle cache is populated only with handles of the user-defined RPC protocol. When
margo_create() is called with an na+sm address, the attempt to reuse a cached handle in
margo_handle_cache_get() always fails because
HG_Reset() requires a handle of the same NA class as the address (here na+sm), which margo's handle cache never provides. Therefore, local communication never benefits from margo's handle cache.
For the explanation below, we use ofi+tcp as an example of the main RPC protocol with which the margo client was initialized. In addition,
auto_sm is enabled. The margo server is started on the same machine and accepts ofi+tcp connections as well as, implicitly, na+sm connections because
auto_sm is enabled there as well. Since this example uses only one node, na+sm is always selected automatically.
When margo_create() is called, margo takes a handle from the handle cache. That handle was initialized in
margo_handle_cache_init() with ofi+tcp. The handle is then passed to
HG_Reset() together with an address that was set to na+sm earlier by
margo_addr_lookup() due to automatic shared memory routing in mercury. As a result,
HG_Reset() is called with an ofi+tcp handle and an na+sm address. Because
HG_Reset() can only reuse a handle when its address type does not change, the reset fails with the following mercury error:
# HG -- Error -- /home/evie/adafs/git/mercury/src/mercury_core.c:4570
 # HG_Core_reset(): Cannot reset handle to a different address NA class
# HG -- Error -- /home/evie/adafs/git/mercury/src/mercury.c:1966
 # HG_Reset(): Could not reset core HG handle
Margo then puts the same handle back into its cache and calls
HG_Create() to manually create a shared memory handle. When this handle is later destroyed in
margo_destroy(), it is discarded instead of being returned to the cache because it was created manually.
This cycle repeats on every call: no handle in margo's cache can ever be used for na+sm communication, and the error message above is printed each time.
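The cycle can be reproduced in a small toy model. The sketch below does not use margo or mercury. All types and functions in it (handle_cache, toy_reset, toy_create, toy_destroy) are hypothetical stand-ins that only mimic the behaviour described above: a cache filled with handles of one NA class, a reset that refuses a different class, and a destroy that discards manually created handles.

```c
#include <assert.h>

/* Toy model: every name here is a hypothetical stand-in,
 * not part of the margo or mercury API. */
enum na_class { NA_OFI_TCP, NA_SM };

struct handle {
    enum na_class cls;   /* NA class the handle was created for */
    int from_cache;      /* 1 if the handle belongs to the cache */
};

#define CACHE_SIZE 4

struct handle_cache {
    struct handle slots[CACHE_SIZE];
    int free_count;          /* handles currently available */
    unsigned hits, misses;   /* reuse statistics */
};

/* Fill the cache with handles of a single NA class. */
void cache_init(struct handle_cache *c, enum na_class cls)
{
    for (int i = 0; i < CACHE_SIZE; i++) {
        c->slots[i].cls = cls;
        c->slots[i].from_cache = 1;
    }
    c->free_count = CACHE_SIZE;
    c->hits = 0;
    c->misses = 0;
}

/* Models the reset rule: a handle can only be reused within its NA class. */
int toy_reset(const struct handle *h, enum na_class addr_cls)
{
    return h->cls == addr_cls ? 0 : -1;
}

/* Try to reuse a cached handle; fall back to a manually created one. */
struct handle toy_create(struct handle_cache *c, enum na_class addr_cls)
{
    if (c->free_count > 0) {
        struct handle h = c->slots[c->free_count - 1];
        if (toy_reset(&h, addr_cls) == 0) {
            c->free_count--;
            c->hits++;
            return h;
        }
        /* Reset failed: the handle stays in the cache untouched. */
    }
    c->misses++;
    struct handle fresh = { addr_cls, 0 };
    return fresh;
}

/* Return cached handles to the cache; discard manually created ones. */
void toy_destroy(struct handle_cache *c, struct handle h)
{
    if (!h.from_cache)
        return;
    assert(c->free_count < CACHE_SIZE);
    c->slots[c->free_count++] = h;
}
```

In this model, a cache initialized with NA_OFI_TCP and then used only with NA_SM addresses never records a hit, and the cache stays full no matter how many create/destroy rounds are run.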
A possible solution would be to use two handle caches if
auto_sm is enabled and then use the correct cache based on the incoming address.
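A minimal sketch of the two-cache idea follows, again as a self-contained toy model rather than a patch against margo. The names dual_cache, dual_create and dual_destroy are hypothetical, and the sketch assumes that the NA class of the target address is known at creation time.

```c
#include <assert.h>

/* Toy model: hypothetical names, not part of the margo or mercury API. */
enum na_class { NA_OFI_TCP = 0, NA_SM = 1, NA_CLASS_COUNT = 2 };

struct handle {
    enum na_class cls;   /* NA class the handle was created for */
    int from_cache;      /* 1 if the handle belongs to a cache */
};

#define CACHE_SIZE 4

struct handle_cache {
    struct handle slots[CACHE_SIZE];
    int free_count;
};

/* One cache per NA class, indexed by the class of the address. */
struct dual_cache {
    struct handle_cache per_class[NA_CLASS_COUNT];
    unsigned hits, misses;
};

void dual_cache_init(struct dual_cache *d)
{
    for (int cls = 0; cls < NA_CLASS_COUNT; cls++) {
        struct handle_cache *c = &d->per_class[cls];
        for (int i = 0; i < CACHE_SIZE; i++) {
            c->slots[i].cls = (enum na_class)cls;
            c->slots[i].from_cache = 1;
        }
        c->free_count = CACHE_SIZE;
    }
    d->hits = 0;
    d->misses = 0;
}

/* Pick the cache that matches the address class, so a reset never
 * has to cross NA classes. */
struct handle dual_create(struct dual_cache *d, enum na_class addr_cls)
{
    struct handle_cache *c = &d->per_class[addr_cls];
    if (c->free_count > 0) {
        d->hits++;
        return c->slots[--c->free_count];
    }
    d->misses++;  /* matching cache exhausted: create manually */
    struct handle fresh = { addr_cls, 0 };
    return fresh;
}

/* Return a cached handle to the cache of its own class. */
void dual_destroy(struct dual_cache *d, struct handle h)
{
    if (!h.from_cache)
        return;
    struct handle_cache *c = &d->per_class[h.cls];
    assert(c->free_count < CACHE_SIZE);
    c->slots[c->free_count++] = h;
}
```

With this layout, repeated create/destroy rounds on NA_SM addresses are served from the shared memory cache, while NA_OFI_TCP requests keep using their own cache. The cost is a second set of preallocated handles, which is why it would only make sense when auto_sm is enabled.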