memory corruption in ofi_rxm with rapid client connections
Reported by Osamu Tatebe. The problem can be triggered with the tcp;ofi_rxm or verbs;ofi_rxm providers in libfabric, but not with the sockets provider or with bmi/tcp.
(Modify the .c files to initialize Mercury with the appropriate transport for testing.)
Run with a command similar to the following:
PORT=46027; i=0; while [ $i -lt 200 ]; do ./clienta tcp://220.127.116.11:$PORT & : ; i=$((i + 1)); done; ./clientb tcp://18.104.22.168:$PORT &
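The one-liner above can also be written as a small script, which makes it easier to vary the client count while narrowing down the trigger. This is only a sketch: the helper names spawn_clients and count_failures and the variables SPAWNED_PIDS and FAILED are illustrative and not part of the original reproducer. Unlike the one-liner, it waits for the clients and counts non-zero exit statuses.

```shell
# Hypothetical helpers (not part of the original reproducer).
SPAWNED_PIDS=""
FAILED=0

# spawn_clients COUNT CMD [ARGS...]: start COUNT background copies of CMD
# without waiting, and remember their PIDs in SPAWNED_PIDS.
spawn_clients() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &
        SPAWNED_PIDS="$SPAWNED_PIDS $!"
        i=$((i + 1))
    done
}

# count_failures: wait for every remembered PID, store the number of
# processes that exited with a non-zero status in FAILED, and print it.
count_failures() {
    FAILED=0
    for pid in $SPAWNED_PIDS; do
        wait "$pid" || FAILED=$((FAILED + 1))
    done
    SPAWNED_PIDS=""
    echo "$FAILED"
}

# Example use, mirroring the one-liner (addresses as in the original):
#   PORT=46027
#   spawn_clients 200 ./clienta "tcp://220.127.116.11:$PORT"
#   spawn_clients 1 ./clientb "tcp://18.104.22.168:$PORT"
#   count_failures
```

Call count_failures directly rather than inside $(...), since a subshell cannot wait on processes started by its parent.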
... which produces a stack trace similar to the following in valgrind:
==130661== Invalid read of size 4
==130661==    at 0x682122F: tcpx_pep_reject (in /home/tatebe/bootcamp/spack/opt/spack/linux-rhel7-x86_64/gcc-8.2.0/libfabric-1.8.0-7q5rafqefvgjhgkvhi2o6rigfi4dbeu2/lib/libfabric.so.1.11.0)
==130661==  Address 0xb3fde90 is 32 bytes inside a block of size 40 free'd
==130661==    at 0x4C2ACBD: free (vg_replace_malloc.c:530)
==130661==  Block was alloc'd at
==130661==    at 0x4C2B955: calloc (vg_replace_malloc.c:711)
A deeper stack trace can be produced by compiling the software stack with cflags="-g -fno-omit-frame-pointer" and enabling valgrind support in Argobots.
This is likely a bug in the libfabric rxm provider, but we need to confirm that and work out how to trigger it reliably, so that we can report it upstream if we need help fixing it.
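To judge how reliably a given configuration triggers the problem, it may help to count the invalid accesses valgrind reports across repeated runs. The sketch below assumes each process was run under valgrind with its --log-file option, writing to files named vg.<pid>.log; that naming and the helper name count_invalid_accesses are illustrative only.

```shell
# Hypothetical helper (not part of the original reproducer).
# count_invalid_accesses FILE...: print the total number of
# "Invalid read" / "Invalid write" reports in the given valgrind logs.
count_invalid_accesses() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the helper's exit status at zero.
    cat "$@" | grep -c -E '^==[0-9]+== Invalid (read|write) of size [0-9]+' || true
}

# Example use, assuming logs named vg.<pid>.log in the current directory:
#   count_invalid_accesses vg.*.log
```

A count of 0 over many runs with the sockets provider, against a non-zero count with tcp;ofi_rxm, would support the suspicion that the rxm provider path is responsible, though it would not prove it.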