{"id":1696,"date":"2025-08-25T15:21:20","date_gmt":"2025-08-25T15:21:20","guid":{"rendered":"https:\/\/wordpress.cels.anl.gov\/darshan\/?p=1696"},"modified":"2025-08-25T15:51:34","modified_gmt":"2025-08-25T15:51:34","slug":"a-case-study-in-debugging-the-openmpi-mpi-io-implementation-with-darshan","status":"publish","type":"post","link":"https:\/\/wordpress.cels.anl.gov\/darshan\/2025\/08\/25\/a-case-study-in-debugging-the-openmpi-mpi-io-implementation-with-darshan\/","title":{"rendered":"A Case Study in Debugging the OpenMPI MPI-IO Implementation with Darshan"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">The Darshan team recently encountered a sequence of bugs that produced corrupt log files when Darshan was linked against OpenMPI. This article provides some broader background, how we used Darshan itself to diagnose the problem, and what we learned from the experience.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Background<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.mcs.anl.gov\/research\/projects\/darshan\/\">Darshan<\/a> is an I\/O characterization tool that produces concise summaries of how applications use a variety of different I\/O interfaces. One of the interfaces that it instruments is MPI-IO (the part of the MPI specification that provides an abstraction for accessing files in parallel). MPI-IO is unique among the interfaces that Darshan instruments, however, because Darshan is itself also a <em>user<\/em> of MPI-IO. When an application terminates, Darshan writes the final compressed version of its instrumentation to a log file using MPI-IO.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">MPI-IO is crucial to Darshan&#8217;s efficiency and portability because it has the discretion to reshape I\/O traffic into more efficient access patterns for whatever platform you are running your code on. 
For example, data from collective writes is often aggregated into intermediate buffers that can satisfy the optimal concurrency, access size, and lock boundaries of an underlying storage system. This optimization is called &#8220;collective buffering&#8221;. The Darshan code is greatly simplified because it does not have to detect the underlying file system parameters or implement this optimization itself; it simply describes the data to be written, and MPI-IO does the rest.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Darshan also uses hints to further optimize how collective buffering is performed in Darshan&#8217;s specific use case. MPI-IO must optimize for the general case, but Darshan log files have specific properties: they are usually very small (far less than a megabyte), and they usually contain small amounts of data contributed by every process. For most file systems, this means that the time to write a Darshan log at scale is dominated by the cost of concurrent <code>open()<\/code> traffic, not by the cost of the actual data transfer. We therefore set hints when the log file is opened to give MPI-IO some clues. Specifically, Darshan sets <code>cb_nodes=4<\/code> to suggest that no more than 4 processes are needed to aggregate data, and <code>romio_no_indep_rw=true<\/code> to indicate that most Darshan ranks will not perform independent I\/O operations. Taken together, an MPI implementation can use this information to activate a <a href=\"https:\/\/wordpress.cels.anl.gov\/romio\/2003\/08\/05\/deferred-open\/\">&#8220;deferred open&#8221;<\/a> mode, in which at most 4 processes open the Darshan log file and write data on behalf of all of the other processes. 
This greatly reduces the cost of writing small log files onto a parallel file system at scale.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Darshan was originally developed at Argonne National Laboratory using MPI implementations derived from <a href=\"https:\/\/www.mpich.org\/\">MPICH<\/a>, which internally uses an MPI-IO implementation called <a href=\"https:\/\/wordpress.cels.anl.gov\/romio\/\">&#8220;ROMIO.&#8221;<\/a> Darshan also works equally well with MPI implementations derived from <a href=\"https:\/\/www.open-mpi.org\/\">OpenMPI<\/a>, which internally uses an MPI-IO implementation called &#8220;OMPIO&#8221;. Different MPI-IO implementations do not support the same hints, but there is no harm in attempting to set them for any MPI implementation. They are advisory parameters that implementations are not obligated to honor, and they do not impact data correctness.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The problem<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Over time, the Darshan team has occasionally received bug reports (via Slack, mailing list, or GitHub) of corrupted Darshan logs. Darshan does not produce any runtime error messages in these cases, but subsequent analysis tools are unable to parse the log files. The most frustrating part of this problem has been our inability to independently reproduce it. The reports all involved OpenMPI, but that didn&#8217;t really tell us anything (or even indicate that OpenMPI had anything to do with the problem; OpenMPI is broadly used, and most of the Darshan log files produced with it look perfectly fine). 
The applications were different, the scales were different, the OpenMPI versions were different, and our attempts to locally reproduce the problem always failed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The breakthrough<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Wei-keng Liao of Northwestern University (a recent addition to the core Darshan development team after a long history of contributions) recently implemented expanded coverage of the MPI-IO API to account for large integer types in <a href=\"https:\/\/github.com\/darshan-hpc\/darshan\/pull\/1060\">https:\/\/github.com\/darshan-hpc\/darshan\/pull\/1060<\/a>. As part of this work, Wei-keng added GitHub CI tests to exercise the instrumentation and validate corresponding Darshan counters.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Although it had nothing to do with the feature being implemented, Wei-keng happened to find a permutation that reliably caused Darshan log file corruption every time we executed the GitHub CI action! The problem (as in previous reports) occurred with OpenMPI, but this time within a CI environment with 4 processes executing on a single virtual node.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We weren&#8217;t quite done yet, because despite having the code and configuration clearly documented, the same problem <em>still<\/em> didn&#8217;t necessarily occur on other (non-GitHub) environments.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Isolating the problem<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Wei-keng looked at the precise offsets and sizes being written by Darshan and re-created these in a standalone MPI program (without the Darshan library) that self-validated the data that it wrote using a predetermined pattern so that we could more easily try different permutations. We used <code>strace<\/code> to observe the system calls that it produced, and were surprised to see that it generated read\/modify\/write operations at the file system level. 
This meant that OpenMPI was performing &#8220;data sieving&#8221;. Data sieving is another popular MPI-IO optimization; if you need to write multiple discontiguous regions in a file, sometimes it is faster to read in the full span, modify the regions of interest, and then write the full span out rather than issue many small write operations. However, we confirmed that the Darshan log file was densely populated (no gaps) and was written with non-overlapping writes. There was no apparent reason for data sieving to be activated.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With this knowledge in hand, we identified that the problem only occurs in OpenMPI 5.0.5 or earlier, and that it was likely resolved by the bug fix to the ompio file locking strategy in <a href=\"https:\/\/github.com\/open-mpi\/ompi\/pull\/12759\">https:\/\/github.com\/open-mpi\/ompi\/pull\/12759<\/a>. Our CI action happened to use an old enough version of OpenMPI and issued a particular combination of write operations that triggered the bug.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This <em>still<\/em> left us with two issues: a) Why was OpenMPI performing data sieving in the first place (regardless of the locking strategy)? and b) What could we do in Darshan to mitigate the problem? Darshan must operate correctly even on production systems that have deployed older versions of OpenMPI.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">OpenMPI issues<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Using the information we learned above, we further simplified the reproducer, used Darshan to observe the access patterns it generated, and opened an issue on the OpenMPI GitHub repository describing some precise scenarios in which adversarial access patterns are generated (<a href=\"https:\/\/github.com\/open-mpi\/ompi\/issues\/13376\">https:\/\/github.com\/open-mpi\/ompi\/issues\/13376<\/a>). 
We do not have sufficient expertise in the code base to attempt a fix, but it is clear that under certain configurations, ompio distributes data to aggregators in a way that results in interleaved discontiguous regions at some ranks. This, in conjunction with an automatic data sieving optimization, forces ompio to write data using conflicting write operations that are serialized with advisory locks. While recent versions of OpenMPI have an improved file locking algorithm that avoids data corruption in this scenario, it is still not an efficient access pattern.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">While exploring potential mitigations to use in Darshan, we considered that maybe we should simply turn off collective buffering (via another <code>MPI_Info<\/code> hint) in some situations to avoid the problem. While checking the impact of this potential mitigation, we discovered a second problem, reported in <a href=\"https:\/\/github.com\/open-mpi\/ompi\/issues\/13377\">https:\/\/github.com\/open-mpi\/ompi\/issues\/13377<\/a>, in that the logic for the <code>collective_buffering=<\/code> hint appears to be inverted. If that hint is provided, then it must be set to <code>false<\/code> to enable collective buffering. This counterintuitive behavior makes usage of this hint particularly dangerous, because the semantics could be corrected in the future. We had to continue looking for a different mitigation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Darshan mitigation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In addition to alerting the OpenMPI developers of the ompio problems described above, we did eventually identify a Darshan mitigation that allows Darshan to work correctly with any OpenMPI version. If we detect that OpenMPI is present, then we simply set our default hints to <code>cb_nodes=1<\/code> rather than <code>cb_nodes=4<\/code> (the <code>cb_nodes<\/code> hint is honored by both MPICH\/ROMIO and OpenMPI\/OMPIO). 
When we set it to 1, OpenMPI will still perform collective buffering, but the data is buffered at a single process so that there is no risk of interleaved\/overlapping\/conflicting write operations.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This mitigation will be included in the next Darshan minor release (probably Darshan 3.5.0).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Takeaways<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Although the OpenMPI bugs we observed were originally triggered by Darshan&#8217;s internal logging code, we found Darshan itself to be very helpful in understanding the nature of the bugs. In particular, we used Darshan&#8217;s optional <a href=\"https:\/\/darshan.readthedocs.io\/en\/latest\/darshan-runtime\/doc\/darshan-runtime.html#using-the-darshan-extended-tracing-dxt-module\">DXT tracing mode<\/a>, which can be activated in any Darshan build by setting an environment variable, to compare the access patterns expressed at the MPI-IO level to the access patterns enacted at the POSIX level of the HPC I\/O stack.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We also learned not to take MPI-IO behavior for granted. The API and semantics are defined by the MPI specification, but implementors have the discretion to implement I\/O transformations as they see fit. In our case, the Darshan log write access pattern happened to produce an adversarial access pattern in OMPIO that requires special handling to mitigate.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, continuous integration proved itself to be an invaluable tool for supporting complex system software. We would not have been able to isolate this problem (much less have had any clue what the underlying causes were) without a pull request that introduced a thorough CI test for an entirely unrelated capability. 
Darshan is a mature software project that has gained CI capability in a piecemeal fashion; this case study encourages us to pursue a more comprehensive CI strategy in the future.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Darshan team recently encountered a sequence of bugs that produced corrupt log files when Darshan was linked against OpenMPI. This article provides some broader background, how we used Darshan itself to diagnose the problem, and what we learned from the experience. Background Darshan is an I\/O characterization tool that produces concise summaries of how &#8230; <a title=\"A Case Study in Debugging the OpenMPI MPI-IO Implementation with Darshan\" class=\"read-more\" href=\"https:\/\/wordpress.cels.anl.gov\/darshan\/2025\/08\/25\/a-case-study-in-debugging-the-openmpi-mpi-io-implementation-with-darshan\/\" aria-label=\"Read more about A Case Study in Debugging the OpenMPI MPI-IO Implementation with Darshan\">Read more<\/a><\/p>\n","protected":false},"author":444,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1696","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"acf":[],"_links":{"self":[{"href":"https:\/\/wordpress.cels.anl.gov\/darshan\/wp-json\/wp\/v2\/posts\/1696","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.cels.anl.gov\/darshan\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wordpress.cels.anl.gov\/darshan\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.cels.anl.gov\/darshan\/wp-json\/wp\/v2\/users\/444"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.cels.anl.gov\/darshan\/wp-json\/wp\/v
2\/comments?post=1696"}],"version-history":[{"count":7,"href":"https:\/\/wordpress.cels.anl.gov\/darshan\/wp-json\/wp\/v2\/posts\/1696\/revisions"}],"predecessor-version":[{"id":1707,"href":"https:\/\/wordpress.cels.anl.gov\/darshan\/wp-json\/wp\/v2\/posts\/1696\/revisions\/1707"}],"wp:attachment":[{"href":"https:\/\/wordpress.cels.anl.gov\/darshan\/wp-json\/wp\/v2\/media?parent=1696"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wordpress.cels.anl.gov\/darshan\/wp-json\/wp\/v2\/categories?post=1696"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wordpress.cels.anl.gov\/darshan\/wp-json\/wp\/v2\/tags?post=1696"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}