High memory usage of Margo servers that use dedicated progress threads
I am currently investigating unusually high memory consumption in the Margo server process. Essentially, if dedicated progress threads are used for the Margo server process, i.e. use_progress_thread is true, each thread requires up to ~500 MB of memory (though this upper bound seems to vary). Memory usage grows to that maximum simply by handling incoming RPCs. Because the number of Margo server threads seems to be directly connected to the memory usage, I suspect Margo or Argobots to be the cause.
We noticed this behavior on our clusters, where we currently use a total of 32 threads and the server process consumed more than 14 GB of memory. The behavior can also be reproduced on a local desktop computer, and it seems to be independent of the Mercury NA layer in use (we checked CCI+verbs, bmi+tcp, and na+sm). Interestingly, if no dedicated thread is used, i.e. the progress loop runs in the thread that calls margo_wait_for_finalize(), the memory footprint is only a few megabytes.
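For anyone who wants to reproduce the numbers, here is a minimal sketch of how the resident memory of the server process could be logged from inside the process. It assumes Linux and reads the VmRSS field from /proc/self/status. The helper name read_vmrss_kb is hypothetical and not part of Margo, Mercury, or Argobots.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Margo): returns the resident set size
 * of the calling process in kB, or -1 if it cannot be determined.
 * Assumes a Linux /proc filesystem. */
long read_vmrss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[256];
    long kb = -1;

    /* Scan for the line of the form "VmRSS:    12345 kB". */
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            if (sscanf(line + 6, "%ld", &kb) != 1)
                kb = -1;
            break;
        }
    }

    fclose(f);
    return kb;
}
```

Calling this periodically in the server, for example every few thousand handled RPCs, and printing the result should show whether the footprint grows with the number of RPCs and how it differs between the two progress modes.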
My Margo server and Margo client test applications are derived from the Margo examples (without the bulk transfer). The client sends a large number of minimal RPCs (one int back and forth) in a loop to the server. On that note, I also noticed that the RPC throughput (RPCs per second) on a local machine increases by a factor of 3 if the progress loop of the Margo server runs in the caller's thread context instead of in a single dedicated thread. Perhaps these two observations are connected.
Are there any solutions or explanations for these observations?