Initial draft of flow-control in the portals4 netmod.

Portals4 by itself does not provide any flow-control. This needs to
be managed by an upper layer, such as MPICH. Before this patch, we
relied on a set of unexpected buffers posted to the portals library
to absorb unexpected messages. However, since portals asynchronously
pulls messages off the network, a delayed application could let the
unexpected buffers fill up, at which point the portal is disabled.
This would cause MPICH to abort.

In this patch, we implement an initial version of flow-control that
allows us to reenable the portal when it gets disabled. All of this
is done in the "rportals" wrappers implemented in the rptl.* files.
We create an extra control portal that is used only by rportals.
When the primary data portal gets disabled, the target sends PAUSE
messages to all other processes. Once each process confirms that it
has no outstanding packets on the wire (i.e., all packets have been
either ACKed or NACKed), it sends a PAUSE-ACK message back to the
target. When the target has received PAUSE-ACK messages from all
processes (thus confirming that network traffic to it has been
quiesced), it reenables the portal and sends an UNPAUSE message to
all processes.
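The PAUSE / PAUSE-ACK / UNPAUSE handshake described above can be
sketched as a pair of small state machines. This is only an
illustration of the bookkeeping, not the code in rptl.*: every name
below (sketch_target, sketch_origin, and the functions) is
hypothetical, and the actual sending of messages over the control
portal is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* All identifiers here are hypothetical; they are not the rptl.* names. */
enum sketch_target_state { SKETCH_ACTIVE, SKETCH_WAITING_FOR_PAUSE_ACKS };

struct sketch_target {
    enum sketch_target_state state;
    int num_peers;          /* number of other processes */
    int pause_acks_seen;    /* PAUSE-ACKs received in the current round */
};

struct sketch_origin {
    int ops_issued;         /* operations sent to the target */
    int ops_acked;          /* operations that were ACKed */
    int ops_nacked;         /* operations that were NACKed */
    bool paused;            /* true between PAUSE and UNPAUSE */
};

/* Target: the data portal was disabled. Returns how many PAUSE
 * messages the caller should send (one per peer, or 0 if a pause
 * round is already in progress). */
static int sketch_target_on_disabled(struct sketch_target *t)
{
    if (t->state != SKETCH_ACTIVE)
        return 0;
    t->state = SKETCH_WAITING_FOR_PAUSE_ACKS;
    t->pause_acks_seen = 0;
    return t->num_peers;
}

/* Target: a PAUSE-ACK arrived. Returns true once every peer has
 * answered, i.e. when the caller may reenable the portal and send
 * UNPAUSE to all peers. */
static bool sketch_target_on_pause_ack(struct sketch_target *t)
{
    assert(t->state == SKETCH_WAITING_FOR_PAUSE_ACKS);
    t->pause_acks_seen++;
    if (t->pause_acks_seen < t->num_peers)
        return false;
    t->state = SKETCH_ACTIVE;
    return true;
}

/* Origin: nothing is left on the wire when every issued operation
 * has been either ACKed or NACKed. */
static bool sketch_origin_quiesced(const struct sketch_origin *o)
{
    return o->ops_acked + o->ops_nacked == o->ops_issued;
}

/* Origin: a PAUSE arrived. Stop issuing new operations; returns true
 * if the PAUSE-ACK can be sent right away, false if the caller must
 * wait for more ACKs/NACKs first. */
static bool sketch_origin_on_pause(struct sketch_origin *o)
{
    o->paused = true;
    return sketch_origin_quiesced(o);
}

/* Origin: an UNPAUSE arrived; sending may resume. */
static void sketch_origin_on_unpause(struct sketch_origin *o)
{
    o->paused = false;
}
```

In this sketch the target counts replies rather than tracking which
peer answered, which assumes each peer sends exactly one PAUSE-ACK per
pause round.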

This patch does not yet deal with origin-side resource exhaustion.
This can happen, for example, if we run out of space on the event
queue on the origin side.

Signed-off-by: Ken Raffenetti <raffenet@mcs.anl.gov>