A Case Study in Debugging the OpenMPI MPI-IO Implementation with Darshan

The Darshan team recently encountered a sequence of bugs that produced corrupt log files when Darshan was linked against OpenMPI. This article provides some broader background, describes how we used Darshan itself to diagnose the problem, and summarizes what we learned from the experience.

Background

Darshan is an I/O characterization tool that produces concise summaries of how applications use a variety of different I/O interfaces. One of the interfaces that it instruments is MPI-IO (the part of the MPI specification that provides an abstraction for accessing files in parallel). MPI-IO is unique among the interfaces that Darshan instruments, however, because Darshan is itself also a user of MPI-IO. When an application terminates, Darshan writes the final compressed version of its instrumentation to a log file using MPI-IO.

MPI-IO is crucial to Darshan’s efficiency and portability because it has the discretion to reshape I/O traffic into more efficient access patterns for whatever platform you are running your code on. For example, data from collective writes is often aggregated into intermediate buffers that can satisfy the optimal concurrency, access size, and lock boundaries of an underlying storage system. This optimization is called “collective buffering”. The Darshan code is greatly simplified because it does not have to detect the underlying file system parameters or implement this optimization itself; it simply describes the data to be written, and MPI-IO does the rest.

Darshan also uses hints to further optimize how collective buffering is performed in Darshan’s specific use case. MPI-IO must optimize for the general case, but Darshan log files have specific properties: they are usually very small (far less than a megabyte), and they usually contain small amounts of data contributed by every process. For most file systems, this means that the time to write a Darshan log at scale is dominated by the cost of concurrent open() traffic, not by the cost of the actual data transfer. We therefore set hints when the log file is opened to give MPI-IO some clues. Specifically, Darshan sets cb_nodes=4 to suggest that no more than 4 processes are needed to aggregate data, and romio_no_indep_rw=true to indicate that most Darshan ranks will not perform independent I/O operations. Taken together, an MPI implementation can use this information to activate a “deferred open” mode, in which at most 4 processes open the Darshan log file and write data on behalf of all of the other processes. This greatly reduces the cost of writing small log files onto a parallel file system at scale.
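To make the effect of these hints concrete, here is a minimal sketch that models MPI-IO rather than calling it. It assumes a hypothetical even split of the file range into contiguous domains, one per aggregator; real implementations choose aggregators and file domains in their own ways. The function name and the aggregator numbering are illustrative only.

```python
# Toy model of collective buffering with "deferred open".
# Assumption: the file range is split evenly into contiguous domains,
# one per aggregator. Real MPI-IO implementations may divide it differently.

DARSHAN_HINTS = {"cb_nodes": 4, "romio_no_indep_rw": "true"}  # values from the text


def collective_write_plan(contributions, cb_nodes):
    """Map many small per-rank pieces onto a few large aggregator writes.

    contributions: list of (rank, offset, length), assumed dense and
    non-overlapping. Returns {aggregator_index: (offset, length)}.
    """
    start = min(off for _, off, _ in contributions)
    end = max(off + length for _, off, length in contributions)
    nprocs = len({rank for rank, _, _ in contributions})
    naggr = min(cb_nodes, nprocs)
    domain = -(-(end - start) // naggr)  # ceiling division
    plan = {}
    for aggr in range(naggr):
        lo = start + aggr * domain
        hi = min(lo + domain, end)
        if lo < hi:
            plan[aggr] = (lo, hi - lo)
    return plan


# 64 ranks, each contributing 100 bytes of log data.
pieces = [(rank, rank * 100, 100) for rank in range(64)]
plan = collective_write_plan(pieces, DARSHAN_HINTS["cb_nodes"])
opens_without_deferred_open = len(pieces)  # every rank opens the file
opens_with_deferred_open = len(plan)       # only the aggregators open it
print(plan, opens_without_deferred_open, opens_with_deferred_open)
```

In this model, 64 opens collapse to 4, and 64 small writes become 4 writes of 1600 bytes each.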

Darshan was originally developed at Argonne National Laboratory using MPI implementations derived from MPICH, which internally uses an MPI-IO implementation called “ROMIO.” Darshan also works equally well with MPI implementations derived from OpenMPI, which internally uses an MPI-IO implementation called “OMPIO”. Different MPI-IO implementations do not support the same hints, but there is no harm in attempting to set them for any MPI implementation. They are advisory parameters that implementations are not obligated to honor, and they do not impact data correctness.

The problem

Over time, the Darshan team has occasionally received bug reports (via Slack, mailing list, or GitHub) of corrupted Darshan logs. Darshan does not produce any runtime error messages in these cases, but subsequent analysis tools are unable to parse the log files. The most frustrating part of this problem has been our inability to independently reproduce it. The reports all involved OpenMPI, but that didn’t really tell us anything (or even indicate that OpenMPI had anything to do with the problem; OpenMPI is broadly used, and most Darshan log files produced with it look perfectly fine). The applications were different, the scales were different, the OpenMPI versions were different, and our attempts to locally reproduce the problem always failed.

The breakthrough

Wei-keng Liao of Northwestern University (a recent addition to the core Darshan development team after a long history of contributions) recently implemented expanded coverage of the MPI-IO API to account for large integer types in https://github.com/darshan-hpc/darshan/pull/1060. As part of this work, Wei-keng added GitHub CI tests to exercise the instrumentation and validate corresponding Darshan counters.

Although it had nothing to do with the feature being implemented, Wei-keng happened to find a permutation that reliably caused Darshan log file corruption every time we executed the GitHub CI action! The problem (as in previous reports) occurred with OpenMPI, but this time within a CI environment with 4 processes executing on a single virtual node.

We weren’t quite done yet, because despite having the code and configuration clearly documented, the same problem still didn’t necessarily occur in other (non-GitHub) environments.

Isolating the problem

Wei-keng looked at the precise offsets and sizes being written by Darshan and re-created them in a standalone MPI program (without the Darshan library). The program wrote a predetermined pattern and then validated the data itself, so that we could more easily try different permutations. We used strace to observe the system calls that it produced, and were surprised to see that it generated read/modify/write operations at the file system level. This meant that OpenMPI was performing “data sieving”. Data sieving is another popular MPI-IO optimization; if you need to write multiple discontiguous regions in a file, sometimes it is faster to read in the full span, modify the regions of interest, and then write the full span out rather than issue many small write operations. However, we confirmed that the Darshan log file was densely populated (no gaps) and was written with non-overlapping writes. There was no apparent reason for data sieving to be activated.
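The read/modify/write behavior of data sieving, and the reason it needs correct locking, can be illustrated with a small in-memory sketch. This is not OMPIO's code: the "file" is a Python bytearray, the two writers are hypothetical, and the interleaving is staged by hand to show what happens when two sieved writes with overlapping spans are not serialized.

```python
# Toy model of data sieving on an in-memory "file" (a bytearray).
# All names are illustrative; this is not how any MPI library is written.


def sieve_read(storage, regions):
    """Step 1: read the full span that covers all (offset, data) regions."""
    lo = min(off for off, _ in regions)
    hi = max(off + len(data) for off, data in regions)
    return lo, bytearray(storage[lo:hi])


def sieve_modify_write(storage, lo, span, regions):
    """Steps 2 and 3: patch the regions into the span, write the span back."""
    for off, data in regions:
        span[off - lo:off - lo + len(data)] = data
    storage[lo:lo + len(span)] = span


def serialized(a_regions, b_regions, size):
    """Writer A finishes its read/modify/write before writer B starts."""
    storage = bytearray(b"." * size)
    for regions in (a_regions, b_regions):
        lo, span = sieve_read(storage, regions)
        sieve_modify_write(storage, lo, span, regions)
    return bytes(storage)


def unlocked(a_regions, b_regions, size):
    """Both writers read their spans before either one writes back."""
    storage = bytearray(b"." * size)
    a_lo, a_span = sieve_read(storage, a_regions)
    b_lo, b_span = sieve_read(storage, b_regions)  # stale once A writes
    sieve_modify_write(storage, a_lo, a_span, a_regions)
    sieve_modify_write(storage, b_lo, b_span, b_regions)
    return bytes(storage)


# Interleaved, non-overlapping regions whose spans overlap.
writer_a = [(0, b"AA"), (4, b"AA")]
writer_b = [(2, b"BB"), (6, b"BB")]
print(serialized(writer_a, writer_b, 8))  # b'AABBAABB'
print(unlocked(writer_a, writer_b, 8))    # b'AABB..BB'
```

In the unlocked case, writer B writes back a stale copy of bytes 4 and 5, so writer A's second region is lost even though the two writers never targeted the same bytes.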

With this knowledge in hand, we identified that the problem only occurs in OpenMPI 5.0.5 or earlier, and that it was likely resolved by the bug fix to the ompio file locking strategy in https://github.com/open-mpi/ompi/pull/12759. Our CI action happened to use an old enough version of OpenMPI and issued a particular combination of write operations that triggered the bug.

This still left us with two issues: a) Why was OpenMPI performing data sieving in the first place? (regardless of the locking strategy) and b) What could we do in Darshan to mitigate the problem? Darshan must operate correctly even on production systems that have deployed older versions of OpenMPI.

OpenMPI issues

Using the information we learned above, we further simplified the reproducer, used Darshan to observe the access patterns it generated, and opened an issue on the OpenMPI GitHub repository describing some precise scenarios in which adversarial access patterns are generated (https://github.com/open-mpi/ompi/issues/13376). We do not have sufficient expertise in the code base to attempt a fix, but it is clear that under certain configurations, ompio distributes data to aggregators in a way that results in interleaved discontiguous regions at some ranks. This, in conjunction with an automatic data sieving optimization, forces ompio to write data using conflicting write operations that are serialized with advisory locks. While recent versions of OpenMPI have an improved file locking algorithm that avoids data corruption in this scenario, it is still not an efficient access pattern.
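The difference between a benign and an adversarial distribution of data to aggregators can be sketched as follows. Both assignments below are hypothetical and are not taken from ompio; the point is only that interleaved regions give each aggregator a span that overlaps its neighbor's, which is what makes sieved writes conflict.

```python
# Compare two hypothetical ways of assigning file regions to aggregators.
# An aggregator's "span" is the byte range from its first to its last region.


def aggregator_spans(assignment):
    """assignment: {aggregator: [(offset, length), ...]} -> {aggregator: (lo, hi)}"""
    return {
        aggr: (min(off for off, _ in segs), max(off + n for off, n in segs))
        for aggr, segs in assignment.items()
    }


def conflicting_pairs(assignment):
    """Return pairs of aggregators whose spans overlap."""
    spans = aggregator_spans(assignment)
    aggrs = sorted(spans)
    return [
        (a, b)
        for i, a in enumerate(aggrs)
        for b in aggrs[i + 1:]
        if spans[a][0] < spans[b][1] and spans[b][0] < spans[a][1]
    ]


blocks = [(i * 100, 100) for i in range(8)]  # eight 100-byte regions

contiguous = {0: blocks[:4], 1: blocks[4:]}       # each aggregator owns one run
interleaved = {0: blocks[0::2], 1: blocks[1::2]}  # regions alternate

print(conflicting_pairs(contiguous))   # []
print(conflicting_pairs(interleaved))  # [(0, 1)]
```

With the contiguous assignment, each aggregator can issue one plain write. With the interleaved one, a read/modify/write over either span touches bytes owned by the other aggregator, so the writes must be serialized.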

While exploring potential mitigations to use in Darshan, we considered that maybe we should simply turn off collective buffering (via another MPI_Info hint) in some situations to avoid the problem. While checking the impact of this potential mitigation, we discovered a second problem, reported in https://github.com/open-mpi/ompi/issues/13377, in that the logic for the collective_buffering= hint appears to be inverted. If that hint is provided, then it must be set to false to enable collective buffering. This counterintuitive behavior makes the hint particularly dangerous to use: any code that relies on the inverted semantics would silently change behavior if the semantics are corrected in the future. We had to continue looking for a different mitigation.

Darshan mitigation

In addition to alerting the OpenMPI developers to the ompio problems described above, we did eventually identify a Darshan mitigation that allows Darshan to work correctly with any OpenMPI version. If we detect that OpenMPI is present, then we simply set our default hints to cb_nodes=1 rather than cb_nodes=4 (the cb_nodes hint is honored by both MPICH/ROMIO and OpenMPI/OMPIO). When we set it to 1, OpenMPI will still perform collective buffering, but the data is buffered at a single process so that there is no risk of interleaved/overlapping/conflicting write operations.
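The selection logic is simple enough to sketch. The snippet below is an illustration, not Darshan's source: the is_openmpi flag stands in for however the MPI implementation is actually detected, the function names are hypothetical, and the semicolon-separated key=value string is assumed to match the format accepted by the DARSHAN_LOGHINTS variable.

```python
# Sketch of the mitigation: pick cb_nodes based on the MPI implementation.
# "is_openmpi" is a placeholder for the real detection mechanism.


def default_log_hints(is_openmpi):
    """Return the default hints for writing the Darshan log."""
    cb_nodes = 1 if is_openmpi else 4  # one aggregator avoids conflicting writes
    return {"romio_no_indep_rw": "true", "cb_nodes": str(cb_nodes)}


def format_hints(hints):
    """Render hints as 'key=value;key=value' (assumed hint-string format)."""
    return ";".join(f"{key}={value}" for key, value in hints.items())


print(format_hints(default_log_hints(is_openmpi=True)))
print(format_hints(default_log_hints(is_openmpi=False)))
```

Because hints are advisory, setting cb_nodes=1 is safe on any implementation; it only narrows the aggregation to a single process where the hint is honored.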

This mitigation will be included in the next Darshan minor release (probably Darshan 3.5.0).

Takeaways

Although the OpenMPI bugs we observed were originally triggered by Darshan’s internal logging code, we found Darshan itself to be very helpful in understanding the nature of the bugs. In particular, we used Darshan’s optional DXT tracing mode, which can be activated in any Darshan build by setting an environment variable, to compare the access patterns expressed at the MPI-IO level to the access patterns enacted at the POSIX level of the HPC I/O stack.
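The kind of comparison we made between the two levels can be sketched with simplified records. The tuples below are hypothetical stand-ins for trace entries (real DXT output carries more fields), and the two checks are our own illustration of what to look for, not a Darshan utility.

```python
# Hypothetical, simplified trace records: (kind, offset, length).
# Real DXT output carries more fields; this keeps only what the checks need.


def is_dense_and_disjoint(writes):
    """True if (offset, length) writes tile one range with no gaps or overlaps."""
    expected = None
    for off, length in sorted(writes):
        if expected is not None and off != expected:
            return False
        expected = off + length
    return True


def sieving_suspected(mpiio_ops, posix_ops):
    """Flag POSIX-level reads that the MPI-IO level never requested."""
    app_reads = any(kind == "read" for kind, _, _ in mpiio_ops)
    posix_reads = any(kind == "read" for kind, _, _ in posix_ops)
    return posix_reads and not app_reads


# What the application asked for: four dense, non-overlapping writes.
mpiio_trace = [("write", rank * 100, 100) for rank in range(4)]
# What reached the file system in this made-up example: overlapping
# read/modify/write cycles.
posix_trace = [
    ("read", 0, 300), ("write", 0, 300),
    ("read", 100, 300), ("write", 100, 300),
]

mpiio_writes = [(off, n) for kind, off, n in mpiio_trace if kind == "write"]
posix_writes = [(off, n) for kind, off, n in posix_trace if kind == "write"]

print(is_dense_and_disjoint(mpiio_writes))
print(is_dense_and_disjoint(posix_writes))
print(sieving_suspected(mpiio_trace, posix_trace))
```

A clean pattern at the MPI-IO level combined with reads and overlapping writes at the POSIX level is the signature that pointed us toward data sieving.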

We also learned not to take MPI-IO behavior for granted. The API and semantics are defined by the MPI specification, but implementors have the discretion to implement I/O transformations as they see fit. In our case, the Darshan log write access pattern happened to produce an adversarial access pattern in OMPIO that requires special handling to mitigate.

Finally, continuous integration proved itself to be an invaluable tool for supporting complex system software. We would not have been able to isolate this problem (much less have had any clue what the underlying causes were) without a pull request that introduced a thorough CI test for an entirely unrelated capability. Darshan is a mature software project that has gained CI capability in a piecemeal fashion; this case study encourages us to pursue a more comprehensive CI strategy in the future.

Continuously updated Darshan log repository now available from the ALCF Polaris system

The Darshan team is pleased to announce the public availability of the ALCF Polaris Darshan Log Repository. This is an anonymized, continuously updated collection of all production logs captured on the 560-node Polaris system at the ALCF. We hope that continuous publication of this data helps the computer science community to better understand current production workloads.

As of this writing (June 5, 2025) the repository contains over 1.2 million log files and is growing at an average rate of roughly 3,000 logs per day, though coverage rates and job workloads vary considerably from day to day.

See the Zenodo report for more information about how to download, analyze, and acknowledge use of the data in publications.

See the CUG 2025 presentation (corresponding paper pending proceedings publication) for more information about how the repository was created, examples of analyses that can be performed on it, and links to tools that can be used to reproduce those examples.

Darshan 3.2.1 bugfix release available

Due to a reported bug in last week’s 3.2.0 release of Darshan, we have decided to quickly release Darshan 3.2.1 for our users. It is available for download here.

This bugfix is somewhat critical, particularly in production environments, as the bug can lead to corrupted Darshan log file data and, potentially, application crashes (though we have not triggered any crashes in our testing). The issue was originally detected by noticing bogus values in the COMMON_ACCESS counters reported by the POSIX, MPIIO, and H5 modules.

In any case, we highly recommend any 3.2.0 users upgrade to this version to avoid any potential for crashes or corrupted Darshan log file data.

Please report any additional questions, issues, or concerns using the Darshan-users mailing list, or by opening an issue on the Darshan GitLab page.

Darshan version 3.2.0 is now officially available

Darshan 3.2.0 is now available for download here.

This release contains a number of new features, bug fixes, and other changes to Darshan. Some of the more notable changes that may be of interest to users:

  • Added detailed instrumentation of HDF5 file (H5F) and dataset (H5D) interfaces.
    • Must be explicitly enabled by passing “--enable-hdf5-mod=/path/to/hdf5/install” when configuring Darshan.
    • Due to ABI incompatibility from HDF5 version 1.8.x -> 1.10.x, special care must be taken to ensure users do not link applications with HDF5 versions that are incompatible with the version the Darshan library was built with (i.e., both HDF5 library versions must be either >=1.10 or <1.10). Using two incompatible HDF5 versions will lead to either link or runtime failures.
    • Support only intended for HDF5 versions 1.8.0+.
  • Added new feature allowing for instrumentation of non-MPI applications.
    • Darshan no longer strictly requires that instrumented applications use MPI, extending coverage to a breadth of new contexts.
    • Note that this feature is only functional in dynamic linking use cases.
    • Thanks to Glenn Lockwood (NERSC) for his help in implementing/testing this feature.
  • Added MPI-IO offset information to Darshan’s DXT tracing mechanism.
  • Updated Darshan compiler wrappers and Cray software modules to transparently and uniformly support dynamic and static linking cases. These methods previously only supported static linking use cases.
  • Re-implemented Darshan’s PMPI/MPI wrappers to help avoid deadlock with other monitoring tools that rely on PMPI.
  • Added new “--log-path” option to darshan-config utility to allow users to more easily query the directory Darshan logs are stored in.
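The HDF5 compatibility rule in the list above (both versions must fall on the same side of 1.10) can be expressed as a small check. This is only an illustration of the rule; the function names are hypothetical and are not part of Darshan or HDF5.

```python
# Illustrative check of the HDF5 ABI rule: the version Darshan was built
# with and the version the application links must both be >= 1.10 or
# both be < 1.10.


def parse_version(text):
    """Turn '1.10.5' into (1, 10); only major and minor matter here."""
    return tuple(int(part) for part in text.split(".")[:2])


def hdf5_abi_compatible(built_with, linked_with):
    """True if both versions are on the same side of the 1.10 ABI change."""
    return (parse_version(built_with) >= (1, 10)) == (
        parse_version(linked_with) >= (1, 10)
    )


print(hdf5_abi_compatible("1.8.20", "1.10.5"))
print(hdf5_abi_compatible("1.10.4", "1.12.0"))
```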

Please review darshan-runtime and darshan-util documentation for more details on the new HDF5 instrumentation module and the experimental non-MPI instrumentation mechanism. Additionally, consult the ChangeLog in the top-level of the source for a full list of changes associated with this release.

Note that we are currently aware of and looking into a couple of issues related to Lustre file systems that have been reported by Darshan users:

  • Crashes in Darshan’s Lustre module in newer Lustre versions (2.11.x in one reported case). Typically results in additional errors stating: “using old ioctl(LL_IOC_LOV_GETSTRIPE)”.
    • If you experience this problem with Darshan, a temporary workaround is to disable the Lustre module. This can only be done at configure time by passing “--disable-lustre-mod”.
  • Floating point exceptions or other warnings related to dividing by zero when writing Darshan log to a Lustre file system (at Darshan shutdown time).
    • We are still working out what combinations of MPI and Lustre libraries exhibit this problem, but a simple workaround for the time being is to run the command “export DARSHAN_LOGHINTS=” before running your application.

We hope to resolve these bugs quickly and intend to release an updated version of Darshan once they are.

Please report any additional questions, issues, or concerns using the Darshan-users mailing list, or by opening an issue on the Darshan GitLab page.

New experimental version of Darshan available for instrumenting non-MPI applications

An experimental pre-release of Darshan is now available that enables instrumentation of non-MPI workloads. It can be downloaded here. It is NOT recommended to use this version in production until we have had more time for users to test it.

See the darshan-runtime documentation (located in darshan-runtime/docs from the top-level Darshan repo) for more information on how to build Darshan without MPI support and also how to enable non-MPI instrumentation at application runtime.

Note that this instrumentation method only works on dynamically-linked executables — Darshan still does not support instrumentation of statically-linked non-MPI executables.

We encourage users that are interested in characterizing I/O in non-MPI contexts to try out this new functionality and let us know about any issues or questions you might have! Depending on user experience, we will try to get a release of this software suitable for production deployment soon.

Darshan at SC19 recap

In case you missed any of it, here’s a list of Darshan-related activities from SC that may be of interest to the community:

Darshan version 3.1.8 now available

Darshan 3.1.8 is now available for download here.

This release introduces a new trace triggering mechanism that allows users to specify triggers that dictate which files are traced using Darshan’s tracing module, DXT. Users just need to provide Darshan a configuration file describing the triggers and Darshan will decide at runtime which files to store trace data for. Types of triggers include file- and rank-based triggers (based on regex patterns), as well as file access characteristics triggers (to trace based on frequency of small or unaligned I/O accesses). Please refer to darshan-runtime documentation on the DXT module for more details.

Note that full tracing is disabled by default in Darshan and this release does not change that — this is just a mechanism to allow DXT users more control over tracing.

Please report any questions, issues, or concerns using the Darshan-users mailing list, or by opening an issue on the Darshan GitLab page.

Darshan 3.1.7 release is now available

Darshan version 3.1.7 is now available for download HERE! This version fixes a few bugs in the prior Darshan release and also contains a couple of new features:

  • Bug fix in handling of DXT module data in the darshan-convert utility
    • Reported by Mahzad Khoshlessan
  • Bug fix in darshan-parser backwards compatibility: Darshan logs generated by Darshan versions prior to 3.1.0 may have included STDIO counters that were not properly up-converted
    • Reported by Teng Wang
  • Bug fix to MiB reported in I/O performance estimate of darshan-job-summary when both POSIX and STDIO data are present
    • Reported/fixed by Glenn Lockwood
  • Added Darshan wrapper for ‘__open_2()’ call, needed for properly instrumenting open operations with some versions of gcc/glibc
    • Reported by Cormac Garvey
  • Added an instrumentation module for the MDHIM key/val storage system
  • Added support for properly handling ‘rename()’, ‘dup()’, ‘fileno()’, and ‘fdopen()’ operations in Darshan

Please report any questions, issues, or concerns using the Darshan-users mailing list, or by opening an issue on the Darshan GitLab page.