deadlock with Cray compiler on CLE6 inside tcmalloc/mmap
Under certain conditions, the AMG mini-app reproducibly deadlocks immediately after launch on NERSC Cori. The problem only occurs on Cray Linux Environment v6 (CLE6) with the Cray compiler environment, and it occurs on both Haswell and Knights Landing (KNL) processors.
The problem manifests as all processes going to sleep; attaching a debugger reveals that the processes are stuck at a futex within an mmap call. This mmap is wrapped by Darshan, and when AMG is compiled without Darshan, it executes normally.
We have also been seeing separate issues with mmap on CLE6 when hugepages are enabled, and it isn't clear whether they are related to this one. The fact that this problem manifests even without Cray hugepages enabled suggests they are not, but the two may share a common root cause.
A tarball containing the AMG source code and this bug report can be found at NERSC in /global/project/projectdirs/m888/glock/amg-deadlock.tar.gz.
Unfortunately the problem does not manifest on Edison (CLE5), so reproducing
this problem without access to Cori may be difficult.
Reproducing on Cori/Knights Landing
Build AMG with
module swap PrgEnv-intel PrgEnv-cray
module swap craype-haswell craype-mic-knl
module load darshan/3.0.1
make clean
make -j 32
Then request an allocation with
salloc -N 1 -p regular_knl -t 30:00
and run with
srun -n 64 ./amg2013-cori-knl -laplace -P 4 4 4 -n 150 150 150 -solver 2
Reproducing on Cori/Haswell
Build AMG with
module swap PrgEnv-intel PrgEnv-cray
module load darshan/3.0.1
make clean
make -j 32
Then request an allocation with
salloc -N 1 -p regular -t 30:00
and run with
srun -n 32 ./amg2013-cori-hsw -laplace -P 4 4 2 -n 150 150 150 -solver 2 &
Diagnosis
The problem manifests as all MPI processes falling asleep almost immediately after job launch. Nothing appears on stdout, and ps indicates that no forward progress is happening:
glock@nid03192:/global/cscratch1/sd/glock/corip2/AMG.knl/run$ ps -U glock ux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
glock 257982 0.1 0.0 23896 3956 pts/0 Ss 11:34 0:00 /bin/bash
glock 258166 0.7 0.0 368216 8888 pts/0 Sl 11:36 0:00 srun --ntasks 64 --cpu_bind=cores numactl --preferred 1 ./amg2013 -laplace -P 4 4 4 -n 150 150 150
glock 258167 0.0 0.0 97880 1084 pts/0 S 11:36 0:00 srun --ntasks 64 --cpu_bind=cores numactl --preferred 1 ./amg2013 -laplace -P 4 4 4 -n 150 150 150
glock 258180 7.5 0.0 548072 1372 ? S 11:36 0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
glock 258181 7.6 0.0 554212 1048 ? R 11:36 0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
glock 258182 7.9 0.0 554212 1036 ? S 11:36 0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
glock 258183 8.0 0.0 554212 1044 ? S 11:36 0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
glock 258184 8.0 0.0 554212 1028 ? S 11:36 0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
glock 258185 8.2 0.0 554212 1052 ? S 11:36 0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
...
glock 258282 0.0 0.0 35440 1576 pts/0 R+ 11:37 0:00 ps -U glock ux
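Because a hung run prints nothing, it can be convenient to launch the job under a time limit and let the exit status distinguish a hang from a normal exit. The following is a minimal sketch, not part of the original reproducer: `run_with_hang_check` is a hypothetical helper name, and it assumes GNU coreutils `timeout` is available on the node. How `srun` reacts to the SIGTERM that `timeout` sends is also an assumption; the job step may still need to be cancelled by hand.

```shell
# Run a command with a time limit and report whether it finished or hung.
# Usage: run_with_hang_check SECONDS COMMAND [ARGS...]
# (hypothetical helper, not part of AMG or Darshan)
run_with_hang_check() {
    limit=$1
    shift
    # GNU coreutils `timeout` exits with status 124 when the limit expires.
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HANG: no exit within ${limit}s"
    else
        echo "EXIT: status $status"
    fi
    return "$status"
}
```

For example, `run_with_hang_check 300 srun -n 64 ./amg2013-cori-knl -laplace -P 4 4 4 -n 150 150 150 -solver 2` would report a hang if the job has not exited after 300 seconds. The 300-second limit is an arbitrary choice and should be set well above the runtime of a healthy run.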
The deadlock occurs here (example taken from running on KNL):
(gdb) bt
#0 0x00000000204419ec in sys_futex ()
#1 0x0000000020441b13 in base::internal::SpinLockDelay(int volatile*, int, int) ()
#2 0x0000000020441ec5 in SpinLock::SlowLock() ()
#3 0x00000000204e8702 in tc_calloc ()
#4 0x0000000020115440 in __wrap_mmap ()
#5 0x000000002043c1f8 in HugetlbSysAllocator::Alloc(unsigned long, unsigned long*, unsigned long) ()
#6 0x000000002043bd60 in TCMalloc_SystemAlloc(unsigned long, unsigned long*, unsigned long) ()
#7 0x000000002043d9e9 in tcmalloc::PageHeap::GrowHeap(unsigned long) [clone .part.5] ()
#8 0x000000002043dd3b in tcmalloc::PageHeap::New(unsigned long) ()
#9 0x00000000204393f8 in (anonymous namespace)::do_memalign(unsigned long, unsigned long) ()
#10 0x00000000204ec04c in tc_posix_memalign ()
#11 0x0000000020107734 in hypre_CAlloc () at hypre_memory.c:135
#12 0x000000002004808b in GenerateLaplacian () at par_laplace.c:166
#13 0x000000002001bc0b in BuildParLaplacian () at amg2013.c:2851
#14 0x0000000020019ee3 in main () at amg2013.c:1754
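This deadlock happens even if AMG is compiled without Cray's hugepages module. The stack trace is the same in both cases and on both Haswell and KNL processors. Reading the trace from the bottom up, tcmalloc is growing its heap (frames #5 through #10) when it calls mmap, which resolves to Darshan's __wrap_mmap (frame #4); the wrapper then calls tc_calloc (frame #3), which blocks on a tcmalloc spinlock (frames #0 through #2). The process therefore appears to be waiting on a lock while already inside the allocator, presumably one it holds itself.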
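To check that every rank is stuck in the same place, the backtrace can be collected from all processes at once rather than attaching to them one by one. This is a sketch under a few assumptions: `dump_backtraces` is a hypothetical helper name, `gdb` and `pgrep` are available on the compute node, and the user is allowed to attach to their own processes there.

```shell
# Print a gdb backtrace for every process owned by the current user
# whose process name matches exactly.
# Usage: dump_backtraces PROCESS_NAME
# (hypothetical helper, not part of AMG or Darshan)
dump_backtraces() {
    name=$1
    # pgrep -x requires an exact match on the process name.
    for pid in $(pgrep -u "$(id -un)" -x "$name"); do
        echo "=== PID $pid ==="
        # -batch makes gdb exit after running the -ex commands.
        gdb -p "$pid" -batch -ex bt 2>/dev/null
    done
}
```

Run on the compute node, `dump_backtraces amg2013 | grep -c sys_futex` counts the lines that mention sys_futex, which gives a rough count of the ranks sleeping there. The name amg2013 matches the ps listing above; the kernel truncates process names to 15 characters, so a longer binary name such as amg2013-cori-knl would need to be matched in its truncated form.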
Scope
The problem is limited to the Cray compiler on CLE6 (Cori). It does not appear with Intel compilers or on CLE5 (Edison).
| System | Compiler | HugePages | Darshan Version | Works? |
|---|---|---|---|---|
| Cori/KNL | Cray | no | 3.0.1 | NO |
| Cori/Haswell | Cray | no | 3.0.1 | NO |
| Cori/KNL | Cray | 8 MB | 3.0.1 | NO |
| Cori/Haswell | Cray | 8 MB | 3.0.1 | NO |
| Cori/KNL | Intel | 8 MB | 3.0.1 | yes |
| Cori/Haswell | Intel | 8 MB | 3.0.1 | yes |
| Edison | Cray | 8 MB | 2.3.1 | yes |
| Edison | Intel | 8 MB | 3.1.1 | yes |