python's multiprocessing produces inconsistent logs
Python applications which use
multiprocessing result in Darshan logs with really strange outputs; I think this is a result of multiprocessing using clone(2) with
child_stack=0, forcing copy-on-write semantics for the process stack.
import os import multiprocessing def f(x): fname = 'output.%d' % x with open(fname, 'w') as fp: fp.write("my pid is %d" % os.getpid()) return fname if __name__ == '__main__': print("I am the parent process and my pid is %d" % os.getpid()) with multiprocessing.Pool(5) as p: fnames = p.map(f, range(10)) for fname in fnames: print("Created %s with contents: %s" % (fname, open(fname, 'r').read()))
$ LD_PRELOAD="/global/homes/g/glock/apps.cori-haswell/darshan.forksafe/lib/libdarshan.so" DARSHAN_ENABLE_NONMPI=1 python test_multiprocessing.py I am the parent process and my pid is 9617 Created output.0 with contents: my pid is 9618 Created output.1 with contents: my pid is 9619 Created output.2 with contents: my pid is 9620 Created output.3 with contents: my pid is 9621 Created output.4 with contents: my pid is 9622 Created output.5 with contents: my pid is 9618 Created output.6 with contents: my pid is 9619 Created output.7 with contents: my pid is 9620 Created output.8 with contents: my pid is 9621 Created output.9 with contents: my pid is 9622
Five child processes are created, each writing to two files. The parent process then serially opens and reads the contents of those files. However the Darshan log seems pretty wild:
$ darshan-parser glock_python_id*.darshan | egrep -o 'POSIX_(READS|WRITES).*output..' POSIX_READS 2 /global/u2/g/glock/output.0 POSIX_WRITES 5 /global/u2/g/glock/output.0 POSIX_READS 2 /global/u2/g/glock/output.1 POSIX_WRITES 5 /global/u2/g/glock/output.1 POSIX_READS 2 /global/u2/g/glock/output.2 POSIX_WRITES 0 /global/u2/g/glock/output.2 POSIX_READS 2 /global/u2/g/glock/output.3 POSIX_WRITES 0 /global/u2/g/glock/output.3 POSIX_READS 2 /global/u2/g/glock/output.4 POSIX_WRITES 0 /global/u2/g/glock/output.4 POSIX_READS 2 /global/u2/g/glock/output.5 POSIX_WRITES 0 /global/u2/g/glock/output.5 POSIX_READS 2 /global/u2/g/glock/output.6 POSIX_WRITES 0 /global/u2/g/glock/output.6 POSIX_READS 2 /global/u2/g/glock/output.7 POSIX_WRITES 0 /global/u2/g/glock/output.7 POSIX_READS 2 /global/u2/g/glock/output.8 POSIX_WRITES 0 /global/u2/g/glock/output.8 POSIX_READS 2 /global/u2/g/glock/output.9 POSIX_WRITES 0 /global/u2/g/glock/output.9
All the writes from the child processes are missing except for two files whose writes appear over-represented.
It looks like the constructor/destructor hooks we use function with clone(2), but the children are somehow able to affect the memory space of the parent (or something). I don't have a good enough grasp of the semantics of clone to devise a workaround.