Commit 6b5993af authored by Su Huang, committed by Michael Blocksome

pamid: task 0 hang in MPI_Init() if MP_PRINTENV=yes



In MPIDI_Print_mpenv(), when calling MPIR_Gather_impl to gather all MP environment variables
from all tasks in a job, the errflag parameter was not initialized to 0 before it was
passed to the routine:
       mpi_errno = MPIR_Gather_impl(&sender, sizeof(MPIDI_printenv_t), MPI_BYTE, gatherer,
                                    sizeof(MPIDI_printenv_t),MPI_BYTE, 0,comm_ptr,
                                    (int *) &errflag);

To process the Gather collective call, each task issued MPIC_Recv, MPIC_Send and MPIC_Wait.

MPIC_Send() sends a message with MPIR_GATHER_TAG (defined as 0x3). Since the routine had a
non-zero errflag passed in,

    if (*errflag && MPIR_CVAR_ENABLE_COLL_FT_RET)
        MPIR_TAG_SET_ERROR_BIT(tag);

bit 30 of the tag (MPIR_TAG_ERROR_BIT, defined as 1 << 30) was set. Therefore, the tag was
changed from 0x3 to 0x40000003.
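
The tag arithmetic above can be checked with a small standalone sketch. The `EX_` names and the `ex_effective_tag` helper are hypothetical stand-ins, not MPICH identifiers. Only the values 0x3 and 1 << 30 come from the description above, and the real macro may do more than a plain OR.

```c
#include <assert.h>

/* Values taken from the commit message; the EX_ names are illustrative only. */
#define EX_GATHER_TAG    0x3
#define EX_TAG_ERROR_BIT (1 << 30)

/*
 * Hypothetical helper mirroring the check quoted above: when errflag is
 * non-zero and fault-tolerant collectives are enabled, the error bit is
 * ORed into the tag. Any garbage value in an uninitialized errflag counts
 * as non-zero.
 */
int ex_effective_tag(int tag, int errflag, int coll_ft_enabled)
{
    if (errflag && coll_ft_enabled)
        tag |= EX_TAG_ERROR_BIT;
    return tag;
}
```

With `errflag` set to 0 the function returns the tag unchanged. With any non-zero value, such as stack garbage, `EX_GATHER_TAG` becomes 0x40000003.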

On task 1, a message with this modified tag was sent to task 0. When the message arrived at
task 0, the receive for the message with the original tag of 0x3 had already been posted.
However, the tag in the arrived message differed from the tag in the posted receive,
so no match was found for the arrived message, which was the root cause of the hang.
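
A simplified model of why the receive never completes is sketched below. The `ex_envelope_t` type and `ex_match_posted` function are hypothetical. Real MPICH matching also involves the communicator context and wildcard handling, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced message envelope: only source rank and tag. */
typedef struct {
    int source;
    int tag;
} ex_envelope_t;

/*
 * Returns the index of the first posted receive whose source and tag both
 * equal those of the arrived message, or -1 if none matches. An exact tag
 * comparison means 0x40000003 never matches a receive posted for 0x3.
 */
int ex_match_posted(const ex_envelope_t *posted, size_t n, ex_envelope_t arrived)
{
    for (size_t i = 0; i < n; i++) {
        if (posted[i].source == arrived.source && posted[i].tag == arrived.tag)
            return (int)i;
    }
    return -1;
}
```

In this model, a receive posted for source 1 with tag 0x3 is found by an arriving message with tag 0x3. An arriving message with tag 0x40000003 yields -1, so the posted receive stays pending.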

MPIR_TAG_SET_ERROR_BIT was added for MPI 3.0 (pe rbrew and beyond) which explains why
the job does not fail with prior releases.

 (ibm) D197745
Signed-off-by: Michael Blocksome <blocksom@us.ibm.com>
parent 98b5e585
@@ -35,7 +35,7 @@
#include "mpidi_util.h"
#define PAMI_TUNE_MAX_ITER 2000
#define _DEBUG 1
/* Short hand for sizes */
#define ONE (1)
#define ONEK (1<<10)
@@ -461,7 +461,7 @@ int MPIDI_Print_mpenv(int rank,int size)
 char *popenptr;
 char tempstr[128];
 int mpi_errno;
-int errflag;
+int errflag=0;
 MPIDI_Set_mpich_env(rank,size);
 memset(&sender,0,sizeof(MPIDI_printenv_t));