Darshan is an application-level I/O characterization tool that has traditionally been used in the HPC community to understand the file access characteristics of MPI applications. In recent years, however, Darshan has been redesigned to relax its dependence on MPI so that it can instrument other programming models and runtime environments that are gaining traction in HPC. In this article, we describe some of these improvements to Darshan and cover best practices for instrumenting applications that don’t use MPI, ranging from serial applications to Python multiprocessing frameworks (e.g., PyTorch, Dask, etc.).
- Darshan enhancements for non-MPI usage
- Best practices for non-MPI instrumentation in Darshan
- Example Darshan runtime library configuration
- Future work
Darshan enhancements for non-MPI usage
Support for non-MPI instrumentation in Darshan began with our 3.2.0 release (thanks in large part to contributions from Glenn Lockwood, Microsoft). These changes revolved around adopting new mechanisms for bootstrapping the Darshan library when a process launches and shutting it down when a process terminates. Traditionally, this was handled by intercepting the MPI_Init/MPI_Finalize routines that MPI applications conveniently call at application startup/shutdown. To support more general mechanisms for this, Darshan adopted GCC constructor/destructor attributes [1] for its startup/shutdown routines.
Beyond this initial redesign, additional changes have recently been made to the Darshan library based on our experiences in instrumenting various non-MPI applications (e.g., workflow systems, Python multiprocessing packages). These changes are outlined below.
- Processes that call fork()
  - Problem: Child processes from fork() calls inherit their parent’s memory, including Darshan library state. This can lead to duplicate accounting of the parent’s I/O statistics in the child process’s log.
  - Solution: Use pthread_atfork() handlers to get hooks into child process initialization, allowing Darshan library state to be reinitialized. Initial support was provided in Darshan’s 3.3.1 release.
- Processes that terminate abruptly using _exit() calls
  - Problem: Some multiprocessing frameworks use fork-join models that call “immediate” exit routines (i.e., _exit()). For example, we have observed this behavior in some configurations of Python’s multiprocessing package, which is commonly used by PyTorch and other frameworks. This immediate exit routine is generally used to prevent child processes from interfering with resources that may still be used by the parent process (e.g., by flushing buffers, calling atexit handlers, etc.). However, immediate exit also bypasses the Darshan library’s destructor routine, which finalizes Darshan and writes out its log file.
  - Solution: Darshan has been updated to intercept calls to _exit() in the same way it would traditionally intercept MPI_Finalize() for MPI applications. This change enables Darshan to cleanly shut down before the process starts its immediate termination. Initial support was provided in Darshan’s 3.4.5 release.
- Processes that terminate abruptly via kill signals
  - Problem: Some multiprocessing frameworks use fork-join models that simply terminate child processes via kill signals. We have also observed this behavior in some configurations of Python’s multiprocessing package. Termination via kill signals (i.e., SIGTERM) similarly bypasses Darshan’s typical shutdown procedure. Unfortunately, the only mechanism to interpose Darshan’s shutdown before this signal is a signal handler, but the Darshan shutdown procedure is not async-signal-safe (i.e., it cannot be safely called in a signal handler).
  - Solution: Darshan has a longstanding optional feature to store its log data in memory-mapped files as the application executes, instead of storing this data on the heap and writing it out to a log file at process termination time. This feature was originally envisioned to support cases where MPI applications don’t call MPI_Finalize() (e.g., because they hit their wall-time limit on a batch scheduled system), but it also helps preserve Darshan data in cases like this where processes are abruptly terminated. Initial support was provided in Darshan’s 3.1.0 release.
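The immediate-exit problem described above can be reproduced without Darshan at all. The following is a minimal Python sketch, not Darshan code: an atexit hook stands in for a library's exit-time finalization routine, and the names CHILD_SOURCE, finalize, and hook_ran are hypothetical. It runs a small child program twice and reports whether the hook got a chance to write its marker file.

```python
import os
import subprocess
import sys
import tempfile

# Source for a small child program. Its atexit hook is a stand-in for an
# exit-time finalization routine that writes a log (an illustration only).
CHILD_SOURCE = """
import atexit, os, sys

marker_path, mode = sys.argv[1], sys.argv[2]

def finalize():
    with open(marker_path, "w") as f:
        f.write("finalized\\n")

atexit.register(finalize)
if mode == "immediate":
    os._exit(0)   # immediate exit: exit-time hooks are skipped
sys.exit(0)       # normal exit: exit-time hooks run
"""


def hook_ran(mode: str) -> bool:
    """Run the child in the given mode and report whether its exit hook ran."""
    with tempfile.TemporaryDirectory() as tmpdir:
        marker_path = os.path.join(tmpdir, "marker")
        subprocess.run(
            [sys.executable, "-c", CHILD_SOURCE, marker_path, mode],
            check=True,
        )
        return os.path.exists(marker_path)


if __name__ == "__main__":
    print("normal exit ran hook:", hook_ran("normal"))
    print("immediate exit ran hook:", hook_ran("immediate"))
```

A child that leaves via os._exit() never writes the marker, which is the same reason a destructor-based shutdown is bypassed and why intercepting _exit() itself is needed.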
Best practices for non-MPI instrumentation in Darshan
- To take advantage of all of the extensions detailed above, use a Darshan release version >= 3.4.5.
- When building darshan-runtime, enable the mmap logs feature to help protect against processes that abruptly terminate via kill signals.
  - For Spack builds, use the +mmap_logs variant.
  - For darshan-runtime source builds, use the --enable-mmap-logs configure option.
- To interpose the Darshan library, you have two options [2]:
  - Set LD_PRELOAD=/path/to/darshan/lib/libdarshan.so to ensure Darshan instrumentation wrappers can intercept application I/O routines.
    - This option is necessary for Python applications, as there is no way to directly link the Darshan library into the Python binary.
  - Directly link the Darshan library on the command line using -ldarshan when building your application.
    - Darshan should precede all other libraries to ensure it’s first in link ordering, otherwise it may not intercept application I/O calls.
- Enable Darshan’s non-MPI mode by setting DARSHAN_ENABLE_NONMPI=1 in your environment.
  - Non-MPI mode requires this variable to be explicitly set so Darshan doesn’t inadvertently generate log files for extraneous commands (e.g., ls, git, etc.).
  - Instrumenting specific applications can then be accomplished by simply running a command like:
    DARSHAN_ENABLE_NONMPI=1 <binary> <cmd_args>
- If necessary, consider using Darshan library configuration files to increase Darshan’s default memory/record limits, to enable/disable certain Darshan modules, or to limit Darshan instrumentation to files matching some pattern (e.g., a mount point prefix, a file extension suffix).
  - This is particularly helpful for Python applications, which tend to access large numbers of shared libraries (.so), Python compiled code files (.pyc), etc., which can quickly exhaust Darshan’s record memory.
  - If you use traditional Darshan tools like darshan-parser or the PyDarshan job summary tool, an error message is reported if Darshan ran out of memory. In that case, a configuration file is needed to help ensure Darshan allocates and uses a sufficient amount of memory.
  - See the next section for example usage of config files.
- After your Darshan instrumented application terminates, check the /tmp directory (the default output location for Darshan mmap log files) for any Darshan logs generated by processes that terminate abruptly.
  - We recommend copying these log files somewhere permanent and using the darshan-convert utility to compress them into Darshan’s standard compressed format to save space, e.g.:
    darshan-convert /tmp/logfile.darshan /path/to/darshan/log/dir/logfile.darshan
  - Processes that terminate normally do not output logs to /tmp and instead output the logs in standard compressed format in your standard Darshan log output directory.
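The launch and cleanup steps above can be scripted. The following is a minimal Python sketch under stated assumptions: the helper names (darshan_env, convert_commands, run_instrumented) are hypothetical, the library and log paths are placeholders you must supply, and mmap logs are assumed to carry a ".darshan" suffix as in the example path above. It uses only the DARSHAN_ENABLE_NONMPI and LD_PRELOAD settings and the darshan-convert invocation form shown in this article.

```python
import os
import subprocess
from pathlib import Path
from typing import Dict, List, Optional


def darshan_env(lib_path: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with Darshan non-MPI settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env["DARSHAN_ENABLE_NONMPI"] = "1"
    # Keep any existing preloads, but list the Darshan library first.
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def convert_commands(mmap_dir: str, dest_dir: str) -> List[List[str]]:
    """Build one darshan-convert command per mmap log found in mmap_dir."""
    # Assumption: mmap logs end in ".darshan", as in the example path above.
    logs = sorted(Path(mmap_dir).glob("*.darshan"))
    return [["darshan-convert", str(src), str(Path(dest_dir) / src.name)] for src in logs]


def run_instrumented(cmd: List[str], lib_path: str, dest_dir: str,
                     mmap_dir: str = "/tmp") -> None:
    """Run cmd under Darshan, then compress leftover mmap logs into dest_dir."""
    subprocess.run(cmd, env=darshan_env(lib_path), check=True)
    for convert in convert_commands(mmap_dir, dest_dir):
        subprocess.run(convert, check=True)
```

Setting the variables only in the child's environment, rather than exporting them in the shell, keeps unrelated commands from being instrumented. Note that on a shared system the glob over /tmp may also pick up logs from other runs, so filtering further by file name may be worthwhile.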
Example Darshan runtime library configuration
Darshan runtime library configuration options can be expressed using a configuration file that can be passed to Darshan at runtime by setting the following environment variable: DARSHAN_CONFIG_PATH=/path/to/darshan.conf
An example configuration file is given below that demonstrates the types of settings you can control within the Darshan runtime library. Not all settings may be needed, depending on your workload and use case, and some experimentation is often needed to determine appropriate settings. This is a necessary trade-off: Darshan is designed for low-overhead, comprehensive instrumentation of applications, so increasing default memory limits or restricting the scope of instrumentation are not our default operational modes.
# allocate 4096 file records for POSIX and MPI-IO modules
# (Darshan only allocates 1024 per-module by default)
# NOTE: MODMEM setting may need to be bumped independent of this setting,
# as it does not force Darshan to use a larger instrumentation buffer
MAX_RECORDS 4096 POSIX,MPI-IO
# in this case, we want all modules to ignore record names
# with a ".pyc" or a ".so" file extension
# NOTE: multiple regex patterns can be provided at once, separated by commas
# NOTE: the '*' specifier can be used to apply settings for all modules
NAME_EXCLUDE \.pyc$,\.so$ *
# bump up Darshan's default record memory usage to 8 MiB
MODMEM 8
# bump up Darshan's default name record memory usage to 2 MiB
# NOTE: Darshan uses separate memory for storing record names (i.e., file names)
# that can also be exhausted, so this must be bumped independently of
# MODMEM in the case where lots of file name data is captured
NAMEMEM 2
# default modules not of interest can be disabled like this
MOD_DISABLE STDIO
# non-default modules like DXT tracing modules can be enabled like this
MOD_ENABLE DXT_POSIX,DXT_MPIIO
More extensive details on the Darshan configuration file format are provided in the Darshan runtime documentation.
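Before committing to a set of NAME_EXCLUDE patterns, it can help to try them against file names your application is known to open. The following is a minimal Python sketch that assumes Darshan's pattern matching behaves like an unanchored regex search for these simple patterns; Python's re module is used here only as a stand-in, and the helper name is_excluded and the sample paths are hypothetical.

```python
import re
from typing import Sequence

# Patterns copied from the NAME_EXCLUDE line in the example configuration.
NAME_EXCLUDE = (r"\.pyc$", r"\.so$")


def is_excluded(path: str, patterns: Sequence[str] = NAME_EXCLUDE) -> bool:
    """Return True if any exclusion pattern matches somewhere in the path."""
    return any(re.search(pattern, path) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical sample paths, for illustration only.
    samples = [
        "/opt/app/__pycache__/train.cpython-311.pyc",
        "/opt/app/lib/libexample.so",
        "/opt/app/lib/libexample.so.6",
        "/data/input/sample.h5",
    ]
    for path in samples:
        print(path, "-> excluded" if is_excluded(path) else "-> kept")
```

Under this interpretation, a versioned shared library name that ends in something like ".so.6" is not matched by the \.so$ pattern, so a broader pattern may be needed if such files are filling up Darshan's record memory.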
Future work
To help with the analysis of Darshan log data from multiprocessing frameworks that generate numerous Darshan logs, we are working to extend Darshan analysis tools to support aggregation of this data into single summary outputs. This will enable more comprehensive analysis of these frameworks, similar to how Darshan provides summaries of all processes in MPI applications in a single, concise summary. We expect our next release (3.4.7) to have some capabilities for analyzing data from multiple logs. Stay tuned for updates on this ongoing work.
[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
[2] Darshan’s non-MPI mode only works for dynamically-linked executables and requires a compiler that supports GCC constructor/destructor attributes (most do).