Closed
Open
Opened Oct 17, 2016 by Glenn K. Lockwood@glock

deadlock with Cray compiler on CLE6 inside tcmalloc/mmap

The AMG mini-app reproducibly deadlocks immediately after startup under certain conditions on NERSC Cori. The problem occurs only on Cray Linux Environment v6 (CLE6) with the Cray compiler environment, and it affects both Haswell and Knights Landing processors.

The problem manifests as all processes going to sleep; attaching a debugger reveals that the processes are stuck at a futex within an mmap call. This mmap is wrapped by Darshan, and when AMG is compiled in the absence of Darshan, it executes normally.
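One quick way to confirm that a given binary actually contains Darshan's mmap wrapper is to look for the `__wrap_mmap` symbol that shows up in the backtrace further down. The sketch below assumes the binary is not stripped and that `nm` is available; the function name `has_wrapped_mmap` is made up for illustration.

```shell
# Hypothetical helper: succeeds if the symbol table read from stdin
# (the output of `nm`) lists __wrap_mmap, the wrapper seen in the backtrace.
has_wrapped_mmap() {
    awk '$NF == "__wrap_mmap" { found = 1 } END { exit !found }'
}

# Usage (binary name taken from the run commands in this report):
#   nm ./amg2013-cori-knl | has_wrapped_mmap && echo "mmap is wrapped"
```

A binary built without the darshan module loaded would be expected to fail this check, which matches the observation that such builds run normally.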

That said, we have been seeing separate issues with mmap on CLE6 when hugepages are enabled, and it isn't clear whether they are related to this one. The fact that this problem manifests even without Cray hugepages enabled suggests they are not, but the two may share a common root cause.

A tarball containing the AMG source code and this bug report can be found at NERSC in /global/project/projectdirs/m888/glock/amg-deadlock.tar.gz. Unfortunately the problem does not manifest on Edison (CLE5), so reproducing this problem without access to Cori may be difficult.

Reproducing on Cori/Knights Landing

Build AMG with

module swap PrgEnv-intel PrgEnv-cray
module swap craype-haswell craype-mic-knl
module load darshan/3.0.1

make clean
make -j 32

Then request an allocation with

salloc -N 1 -p regular_knl -t 30:00

and run with

srun -n 64 ./amg2013-cori-knl -laplace -P 4 4 4 -n 150 150 150 -solver 2

Reproducing on Cori/Haswell

Build AMG with

module swap PrgEnv-intel PrgEnv-cray
module load darshan/3.0.1

make clean
make -j 32

Then request an allocation with

salloc -N 1 -p regular -t 30:00

and run with

srun -n 32 ./amg2013-cori-hsw -laplace -P 4 4 2 -n 150 150 150 -solver 2 &
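Because the failure is a silent hang rather than a crash, it can help to wrap the run in a time limit when scripting the reproduction. This is a minimal sketch that assumes GNU coreutils `timeout` is available on the node; the helper name `run_with_hang_check` and the 120-second limit are assumptions, and the limit has to be longer than a healthy run would take.

```shell
# Hypothetical helper: run a command under a time limit and report a hang.
# Usage: run_with_hang_check SECONDS COMMAND [ARGS...]
run_with_hang_check() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    # GNU timeout exits with status 124 when the time limit is reached.
    if [ "$status" -eq 124 ]; then
        echo "HUNG: no exit after ${limit}s"
    else
        echo "EXITED: status $status"
    fi
    return "$status"
}

# Example (the 120 s limit is an arbitrary assumption):
#   run_with_hang_check 120 srun -n 32 ./amg2013-cori-hsw -laplace -P 4 4 2 -n 150 150 150 -solver 2
```

A timeout alone cannot tell a deadlock from a slow run, so it is best read together with the empty stdout and the process states described under Diagnosis.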

Diagnosis

The problem manifests as all MPI processes falling asleep almost immediately after job launch. Nothing appears on stdout, and ps indicates that no forward progress is happening:

glock@nid03192:/global/cscratch1/sd/glock/corip2/AMG.knl/run$ ps -U glock ux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
glock    257982  0.1  0.0  23896  3956 pts/0    Ss   11:34   0:00 /bin/bash
glock    258166  0.7  0.0 368216  8888 pts/0    Sl   11:36   0:00 srun --ntasks 64 --cpu_bind=cores numactl --preferred 1 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 
glock    258167  0.0  0.0  97880  1084 pts/0    S    11:36   0:00 srun --ntasks 64 --cpu_bind=cores numactl --preferred 1 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 
glock    258180  7.5  0.0 548072  1372 ?        S    11:36   0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
glock    258181  7.6  0.0 554212  1048 ?        R    11:36   0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
glock    258182  7.9  0.0 554212  1036 ?        S    11:36   0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
glock    258183  8.0  0.0 554212  1044 ?        S    11:36   0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
glock    258184  8.0  0.0 554212  1028 ?        S    11:36   0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
glock    258185  8.2  0.0 554212  1052 ?        S    11:36   0:02 ./amg2013 -laplace -P 4 4 4 -n 150 150 150 -solver 2
...
glock    258282  0.0  0.0  35440  1576 pts/0    R+   11:37   0:00 ps -U glock ux
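With 64 ranks, reading the STAT column by eye gets tedious. The following sketch tallies process states for one executable from `ps ux` output; it assumes the standard `ps ux` column layout shown above (STAT in column 8, command in column 11), and `count_states` is a hypothetical name.

```shell
# Hypothetical helper: read `ps ux` output on stdin and count process
# states (first letter of the STAT column) for one executable.
# Usage: ps -U "$USER" ux | count_states ./amg2013
count_states() {
    awk -v cmd="$1" '
        $11 == cmd { n[substr($8, 1, 1)]++ }
        END { printf "S=%d R=%d D=%d\n", n["S"], n["R"], n["D"] }
    '
}
```

Nearly all ranks in state S with a TIME column that stops advancing between samples is a rough indicator of the hang, not proof of it.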

The deadlock occurs here (example taken from running on KNL):

(gdb) bt
#0  0x00000000204419ec in sys_futex ()
#1  0x0000000020441b13 in base::internal::SpinLockDelay(int volatile*, int, int) ()
#2  0x0000000020441ec5 in SpinLock::SlowLock() ()
#3  0x00000000204e8702 in tc_calloc ()
#4  0x0000000020115440 in __wrap_mmap ()
#5  0x000000002043c1f8 in HugetlbSysAllocator::Alloc(unsigned long, unsigned long*, unsigned long) ()
#6  0x000000002043bd60 in TCMalloc_SystemAlloc(unsigned long, unsigned long*, unsigned long) ()
#7  0x000000002043d9e9 in tcmalloc::PageHeap::GrowHeap(unsigned long) [clone .part.5] ()
#8  0x000000002043dd3b in tcmalloc::PageHeap::New(unsigned long) ()
#9  0x00000000204393f8 in (anonymous namespace)::do_memalign(unsigned long, unsigned long) ()
#10 0x00000000204ec04c in tc_posix_memalign ()
#11 0x0000000020107734 in hypre_CAlloc () at hypre_memory.c:135
#12 0x000000002004808b in GenerateLaplacian () at par_laplace.c:166
#13 0x000000002001bc0b in BuildParLaplacian () at amg2013.c:2851
#14 0x0000000020019ee3 in main () at amg2013.c:1754
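To check that every rank is stuck in the same place, the backtrace can be collected non-interactively from each process. This is a sketch assuming `gdb` is usable on the compute node and that attaching to your own processes is permitted; `dump_backtraces` and the `GDB` override variable are hypothetical.

```shell
# Hypothetical helper: print a backtrace for each PID given as an argument.
# GDB can be overridden (e.g. for a dry run); it defaults to plain gdb.
dump_backtraces() {
    for pid in "$@"; do
        echo "=== PID $pid ==="
        "${GDB:-gdb}" -p "$pid" -batch -ex bt 2>&1
    done
}

# Usage on a compute node (process name taken from the ps listing above):
#   dump_backtraces $(pgrep -u "$USER" -x amg2013)
```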

It is worth noting that this deadlock happens even if AMG is compiled in the absence of Cray's hugepages module. The stack trace is the same in both cases and on both Haswell and KNL processors.
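The frames suggest re-entrancy: tcmalloc calls mmap while growing its heap (frames #5 to #8), Darshan's wrapper then calls back into tcmalloc through tc_calloc (frame #3), and that call blocks on a spin lock. A plausible reading, not confirmed in this report, is that the lock is one tcmalloc already holds at that point. The sketch below checks a saved backtrace for this signature so traces from different builds can be compared; the helper name is hypothetical and the patterns are taken only from the trace above.

```shell
# Hypothetical helper: read a gdb backtrace on stdin (innermost frame first)
# and succeed if tcmalloc appears both inside and outside __wrap_mmap.
is_reentrant_mmap_trace() {
    awk '
        /__wrap_mmap/                  { wrap = 1; next }
        !wrap && / in tc_[a-z_]+ /     { inner = 1 }   # allocator call made by the wrapper
        wrap && /tcmalloc::|TCMalloc_/ { outer = 1 }   # allocator frames that led to mmap
        END { exit !(wrap && inner && outer) }
    '
}

# Usage (bt.txt is an assumed file holding one saved backtrace):
#   is_reentrant_mmap_trace < bt.txt && echo "allocator re-entered via mmap wrapper"
```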

Scope

The problem is limited to the Cray compiler on CLE6 (Cori). It does not appear with Intel compilers or on CLE5 (Edison).

System        Compiler  HugePages  Darshan Version  Works?
Cori/KNL      Cray      no         3.0.1            NO
Cori/Haswell  Cray      no         3.0.1            NO
Cori/KNL      Cray      8 MB       3.0.1            NO
Cori/Haswell  Cray      8 MB       3.0.1            NO
Cori/KNL      Intel     8 MB       3.0.1            yes
Cori/Haswell  Intel     8 MB       3.0.1            yes
Edison        Cray      8 MB       2.3.1            yes
Edison        Intel     8 MB       3.1.1            yes

Milestone: 3.1.3
Reference: darshan/darshan#210