Darshan log writes failing to Lustre filesystem
Reported by André Carneiro on the Darshan users mailing list:
* Using OpenMPI 3.1.5 and GCC 7:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7f4f6759f27f in ???
#1 0x7f4f687ababe in ???
#2 0x7f4f687add06 in ???
#3 0x7f4f687db6c0 in ???
#4 0x7f4f687dbddb in ???
#5 0x7f4f6879d6f1 in ???
#6 0x7f4f6871892b in ???
#7 0x7f4f691d0ae1 in MPI_File_write_at_all
at lib/darshan-mpiio.c:536
#8 0x7f4f691bea7f in darshan_log_append_all
at lib/darshan-core.c:1800
#9 0x7f4f691c1907 in darshan_log_write_name_record_hash
at lib/darshan-core.c:1761
#10 0x7f4f691c1907 in darshan_core_shutdown
at lib/darshan-core.c:546
#11 0x7f4f691be402 in MPI_Finalize
at lib/darshan-core-init-finalize.c:82
#12 0x7f4f68b6a798 in ???
#13 0x4023bb in ???
#14 0x401ae6 in ???
#15 0x7f4f6758b3d4 in ???
#16 0x401b16 in ???
#17 0xffffffffffffffff in ???
--------------------------------------------------------------------------
* Using Intel PSXE 2018 with Intel MPI:
forrtl: severe (71): integer divide by zero
Image PC Routine Line Source
exec.exe 000000000045282E Unknown Unknown Unknown
libpthread-2.17.s 00002B8B5A5FE5D0 Unknown Unknown Unknown
libmpi_lustre.so. 00002B8B659D4FDF ADIOI_LUSTRE_Get_ Unknown Unknown
libmpi_lustre.so. 00002B8B659CFFD9 ADIOI_LUSTRE_Writ Unknown Unknown
libmpi.so.12.0 00002B8B59A4C15C Unknown Unknown Unknown
libmpi.so.12 00002B8B59A4D1D5 PMPI_File_write_a Unknown Unknown
libdarshan.so 00002B8B58F90312 MPI_File_write_at Unknown Unknown
libdarshan.so 00002B8B58F7E63A Unknown Unknown Unknown
libdarshan.so 00002B8B58F815B0 darshan_core_shut Unknown Unknown
libdarshan.so 00002B8B58F7DFF3 MPI_Finalize Unknown Unknown
libmpifort.so.12. 00002B8B592414DA pmpi_finalize__ Unknown Unknown
exec.exe 00000000004490A5 Unknown Unknown Unknown
exec.exe 00000000004032DE Unknown Unknown Unknown
libc-2.17.so 00002B8B5AB2F3D5 __libc_start_main Unknown Unknown
exec.exe 00000000004031E9 Unknown Unknown Unknown
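Two different MPI implementations hit the same problem. Both traces report an arithmetic exception (SIGFPE in one, an integer divide by zero in the other) inside MPI_File_write_at_all, called while Darshan writes its log from darshan_core_shutdown during MPI_Finalize. The Intel MPI trace places the fault in libmpi_lustre.so, in routines whose names begin with ADIOI_LUSTRE_.
The user can work around the problem by writing the Darshan log to a non-Lustre file system. Having the user export DARSHAN_LOGHINTS="" also avoids the crash, so the failure appears to be related to how the MPI-IO hints Darshan applies to its log file interact with Lustre.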
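
For reference, the two workarounds above might look like the following in a job script. This is a sketch rather than a verified fix: it assumes that a set-but-empty DARSHAN_LOGHINTS replaces Darshan's built-in default hints, as the report implies, and the log directory and launch command are hypothetical placeholders to adapt to the site.

```shell
# Workaround 1: clear the MPI-IO hints Darshan applies when writing its log.
# Assumption: an empty value overrides the built-in default hints.
export DARSHAN_LOGHINTS=""

# Workaround 2 (alternative): send the log to a non-Lustre file system.
# The directory below is a hypothetical placeholder.
# export DARSHAN_LOGPATH=/tmp/darshan-logs

# Launch the instrumented application as usual (hypothetical command line).
# mpirun -np 4 ./exec.exe
```

Either setting needs to be present in the environment of every MPI rank, so with some launchers it may have to be forwarded explicitly.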