darshan issue #272
Issue created Apr 28, 2020 by Shane Snyder (@ssnyder, Owner)

Darshan log writes failing to Lustre filesystem

Reported by André Carneiro on the Darshan users mailing list:

* Using OpenMPI 3.1.5 and GCC 7:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0  0x7f4f6759f27f in ???
#1  0x7f4f687ababe in ???
#2  0x7f4f687add06 in ???
#3  0x7f4f687db6c0 in ???
#4  0x7f4f687dbddb in ???
#5  0x7f4f6879d6f1 in ???
#6  0x7f4f6871892b in ???
#7  0x7f4f691d0ae1 in MPI_File_write_at_all
at lib/darshan-mpiio.c:536
#8  0x7f4f691bea7f in darshan_log_append_all
at lib/darshan-core.c:1800
#9  0x7f4f691c1907 in darshan_log_write_name_record_hash
at lib/darshan-core.c:1761
#10  0x7f4f691c1907 in darshan_core_shutdown
at lib/darshan-core.c:546
#11  0x7f4f691be402 in MPI_Finalize
at lib/darshan-core-init-finalize.c:82
#12  0x7f4f68b6a798 in ???
#13  0x4023bb in ???
#14  0x401ae6 in ???
#15  0x7f4f6758b3d4 in ???
#16  0x401b16 in ???
#17  0xffffffffffffffff in ???
--------------------------------------------------------------------------

* Using Intel PSXE 2018 with Intel MPI:

forrtl: severe (71): integer divide by zero
Image              PC                Routine            Line        Source            
exec.exe           000000000045282E  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B8B5A5FE5D0  Unknown               Unknown  Unknown
libmpi_lustre.so.  00002B8B659D4FDF  ADIOI_LUSTRE_Get_     Unknown  Unknown
libmpi_lustre.so.  00002B8B659CFFD9  ADIOI_LUSTRE_Writ     Unknown  Unknown
libmpi.so.12.0     00002B8B59A4C15C  Unknown               Unknown  Unknown
libmpi.so.12       00002B8B59A4D1D5  PMPI_File_write_a     Unknown  Unknown
libdarshan.so      00002B8B58F90312  MPI_File_write_at     Unknown  Unknown
libdarshan.so      00002B8B58F7E63A  Unknown               Unknown  Unknown
libdarshan.so      00002B8B58F815B0  darshan_core_shut     Unknown  Unknown
libdarshan.so      00002B8B58F7DFF3  MPI_Finalize          Unknown  Unknown
libmpifort.so.12.  00002B8B592414DA  pmpi_finalize__       Unknown  Unknown
exec.exe           00000000004490A5  Unknown               Unknown  Unknown
exec.exe           00000000004032DE  Unknown               Unknown  Unknown
libc-2.17.so       00002B8B5AB2F3D5  __libc_start_main     Unknown  Unknown
exec.exe           00000000004031E9  Unknown               Unknown  Unknown

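So two different MPI implementations hit the same problem: both crash with an arithmetic exception (SIGFPE in one case, "integer divide by zero" in the other) inside the MPI-IO write that Darshan issues from darshan_core_shutdown during MPI_Finalize.

The user can work around the problem by writing the Darshan log to a non-Lustre file system. Having the user export DARSHAN_LOGHINTS="" also avoids the crash, so the failure seems related to how the MPI-IO hints Darshan sets for its log file interact with Lustre.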
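
To illustrate why an empty DARSHAN_LOGHINTS changes behavior, here is a minimal Python sketch of how a semicolon-separated "key=value" hint string could be turned into a set of MPI-IO hints. It is not Darshan's actual code (Darshan is written in C). The function name parse_loghints and the default string are assumptions, the latter based on the default described in the Darshan runtime documentation, and a given build may use a different one.

```python
import os

# Assumed default hint string (as described in the Darshan runtime docs);
# the value compiled into a particular Darshan build may differ.
ASSUMED_DEFAULT_HINTS = "romio_no_indep_rw=true;cb_nodes=4"


def parse_loghints(environ=None, default=ASSUMED_DEFAULT_HINTS):
    """Hypothetical helper: parse 'key=value;key=value' hints into a dict."""
    if environ is None:
        environ = os.environ
    # Unset variable: fall back to the default hints.
    # Variable set to "": no hints at all (the workaround from this issue).
    raw = environ.get("DARSHAN_LOGHINTS", default)
    hints = {}
    for item in raw.split(";"):
        key, sep, value = item.partition("=")
        if sep and key.strip():
            hints[key.strip()] = value.strip()
    return hints


if __name__ == "__main__":
    print(parse_loghints({}))                        # default hints apply
    print(parse_loghints({"DARSHAN_LOGHINTS": ""}))  # no hints are passed
```

With the variable unset, the sketch yields the two default hints; with it exported as an empty string, the result is an empty dict, so no hints would be passed when the log file is opened. That matches the observation that the crash disappears once the hints are cleared, although which hint actually triggers the divide by zero is not established here.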
