Problem with ssg_group_destroy and ssg_group_unobserve
I have a program that uses `ssg_group_id_load` to load a group from a file, then calls `ssg_group_observe` to get the most recent view of the group. If I later call `ssg_group_destroy`, the program blocks indefinitely in that function. If I call `ssg_group_unobserve` first, it no longer blocks, but `ssg_group_destroy` fails with the following error: `Error: SSG unable to find expected group ID`.
I'm not sure whether `ssg_group_unobserve` also destroys the group. I suppose it does, but its name doesn't suggest that it should.
I remember a similar problem where `ssg_group_leave` would also destroy the group (even though leaving a group doesn't necessarily mean that we want to destroy it). This calls for more clarity, and probably a change of API.
I would suggest the following changes:
- `ssg_group_refresh` for a non-member to get the most recent view of the group from one of the group members;
- `ssg_group_refresh` should not fail if the caller is a member; it should simply be a no-op;
- `ssg_group_leave` should not destroy the group id. It should simply send a notification to the other processes indicating that the caller is leaving the group. The group id can still be used to get information about the group (e.g. calling ...);
- Since `ssg_group_unobserve` would no longer exist and `ssg_group_leave` would not destroy the group, `ssg_group_destroy` would be required no matter how the group id was created (loaded from a file, deserialized, created by MPI or PMIx...), and would be the only way to destroy a group.
- If a process calls `ssg_group_destroy` while being a member and without having called `ssg_group_leave`, the group should be destroyed without notifying other processes (I think this is what happens now). Other processes will act as if the member had failed.
- An `int ssg_group_self_is_member(ssg_group_id_t gid)` function would be useful to ask whether the calling process is a member of the group (as opposed to an observer).
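
To make the proposed semantics concrete, here is a minimal self-contained sketch of the handle lifecycle described in the list above. It is a toy model, not SSG code: every `toy_*` name, the return codes, and the `view_version` counter are hypothetical and only stand in for the real calls. Returning an error when a non-member calls leave is also my assumption, since the proposal doesn't cover that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name below is hypothetical, not the SSG API. */
typedef enum {
    TOY_OK                 = 0,
    TOY_ERR_INVALID_HANDLE = -1,
    TOY_ERR_NOT_MEMBER     = -2
} toy_ret_t;

typedef struct {
    bool     valid;          /* false once toy_group_destroy has run      */
    bool     is_member;      /* member (true) or plain observer (false)   */
    bool     leave_notified; /* true if other processes were told we left */
    unsigned view_version;   /* stand-in for the local group view         */
} toy_group_t;

/* Handle obtained as a non-member, e.g. loaded from a file. */
void toy_group_load(toy_group_t *g)
{
    assert(g != NULL);
    g->valid = true;
    g->is_member = false;
    g->leave_notified = false;
    g->view_version = 0;
}

/* Handle obtained as a member, e.g. the caller created the group. */
void toy_group_create(toy_group_t *g)
{
    toy_group_load(g);
    g->is_member = true;
}

/* Returns 1 for a member, 0 for an observer or a destroyed handle. */
int toy_group_self_is_member(const toy_group_t *g)
{
    assert(g != NULL);
    return (g->valid && g->is_member) ? 1 : 0;
}

/* Non-members fetch the latest view; for members this is a no-op. */
toy_ret_t toy_group_refresh(toy_group_t *g, unsigned remote_version)
{
    assert(g != NULL);
    if (!g->valid) return TOY_ERR_INVALID_HANDLE;
    if (g->is_member) return TOY_OK; /* no-op, but not an error */
    g->view_version = remote_version;
    return TOY_OK;
}

/* Leaving notifies the others but keeps the handle usable. */
toy_ret_t toy_group_leave(toy_group_t *g)
{
    assert(g != NULL);
    if (!g->valid) return TOY_ERR_INVALID_HANDLE;
    if (!g->is_member) return TOY_ERR_NOT_MEMBER; /* assumption */
    g->is_member = false;
    g->leave_notified = true;
    return TOY_OK;
}

/* The only call that invalidates the handle; it never notifies anyone. */
toy_ret_t toy_group_destroy(toy_group_t *g)
{
    assert(g != NULL);
    if (!g->valid) return TOY_ERR_INVALID_HANDLE;
    g->valid = false;
    g->is_member = false;
    return TOY_OK;
}
```

In this model, an observer would go load, refresh, destroy, and a member would go create, leave, destroy. A member that skips leave and calls destroy directly ends with `leave_notified` still false, which corresponds to the "other processes act as if the member had failed" case.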
Some other modifications I would suggest:
- Right now a process's member id is not tied to a group; it is simply a hash of the process's address, which is why it is returned by `ssg_get_self_id`. I would suggest deprecating this function and adding `ssg_group_get_self_id`, which would also take a group id as argument (and return `SSG_MEMBER_ID_INVALID` if the caller is not a member). By default the member id would still be the hash of the address, but we could eventually add support for re-joining a group simply by changing the process's member id (taking the next one, for instance) and keeping track of the member id on a per-group basis.
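
As a rough illustration of the per-group member id idea, here is a self-contained sketch. Nothing in it is SSG code: the `toy_*` names are hypothetical, `TOY_MEMBER_ID_INVALID` being 0 is an assumption, and FNV-1a is only a stand-in for whatever hash SSG actually applies to the address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not the SSG API. */
typedef uint64_t toy_member_id_t;
#define TOY_MEMBER_ID_INVALID ((toy_member_id_t)0) /* assumed sentinel */

/* Per-group record of the caller's membership. */
typedef struct {
    bool            is_member;
    toy_member_id_t self_id; /* id used by the caller in this group */
} toy_membership_t;

/* FNV-1a (64-bit) over the address string, used here as a stand-in hash. */
toy_member_id_t toy_hash_addr(const char *addr)
{
    assert(addr != NULL);
    uint64_t h = 0xcbf29ce484222325ULL;
    for (const unsigned char *p = (const unsigned char *)addr; *p; ++p) {
        h ^= *p;
        h *= 0x100000001b3ULL;
    }
    /* Never hand out the sentinel as a real id. */
    return (h == TOY_MEMBER_ID_INVALID) ? 1 : h;
}

/* Default behavior: the member id is the hash of the address. */
void toy_membership_join(toy_membership_t *m, const char *addr)
{
    assert(m != NULL);
    m->self_id = toy_hash_addr(addr);
    m->is_member = true;
}

/* Leaving keeps the last id around so a rejoin can pick the next one. */
void toy_membership_leave(toy_membership_t *m)
{
    assert(m != NULL);
    m->is_member = false;
}

/* Rejoin under a fresh id: take the next one, skipping the sentinel. */
void toy_membership_rejoin(toy_membership_t *m)
{
    assert(m != NULL && !m->is_member);
    toy_member_id_t next = m->self_id + 1;
    m->self_id = (next == TOY_MEMBER_ID_INVALID) ? 1 : next;
    m->is_member = true;
}

/* Group-scoped query: invalid id when the caller is not a member. */
toy_member_id_t toy_membership_get_self_id(const toy_membership_t *m)
{
    assert(m != NULL);
    return m->is_member ? m->self_id : TOY_MEMBER_ID_INVALID;
}
```

With one `toy_membership_t` per group, the same process can hold different ids in different groups: joining two groups from the same address gives the same default id in both, and leaving then rejoining one of them changes the id only in that group.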