Archive for the ‘Uncategorized’ Category

Hintdump: a small utility for poking at MPI implementations.

April 7th, 2023

For years I’ve had a little tool in my home directory that does an MPI_File_open and reads out all the hints associated with that file. It was so simple and dumb I never put it on GitHub. Now a new crop of machines has me realizing it might be useful to more people. So here you go:

I can use this simple little tool to do a few things:

  • Verify ROMIO is using the ADIO driver I think it’s using (the romio_filesystem_type hint)
  • See what file systems this MPI implementation might support (try out different prefixes like ufs: or lustre:)
  • Verify stripe counts — I found one MPI implementation that would not honor stripe counts set by info objects, and would only honor stripe counts set with that file system’s utilities.

I hope you find this useful.


Quobyte file system

October 6th, 2020

I don’t have a lot of first-hand experience with the Quobyte file system, but the developers contributed a file system driver for ROMIO, making them my new best friends.

I think a lot about how to converge the worlds of HPC storage and data center storage. Looks like this new ROMIO driver might be a small step in that convergence.

Note to anyone out there considering a ROMIO driver: I’ll accept just about any contribution. ROMIO drivers are self-contained and opt-in, so there is not much risk that a new ROMIO driver will break MPICH. All I need is a plausible case that the contributor will stick around for bugs and updates, and at least one user out there somewhere.

The Quobyte file system driver will be in the next release of MPICH (MPICH-3.4). Configure ROMIO with the --with-file-system=quobytefs+... flag (including other file system drivers you might want). You can explicitly request the Quobyte driver at runtime (assuming ROMIO was configured to support it) with the ‘quobyte:’ prefix to your file name.


Useful Environment Variables

February 20th, 2019

I was surprised to find I hadn’t written anything about the environment variables one can set in ROMIO.
I don’t think I have a well-thought-out theory of when to use an environment variable vs when to use an MPI_Info key.
In this post, I’ll talk about ROMIO environment variables in general. The GPFS, PVFS2, and XFS drivers also have environment variables one can set, but those variables need to be set only in very specific instances.

  • ROMIO_FSTYPE_FORCE: ROMIO will pick a file system driver by calling stat(2) and looking at the fs type field. You can also prefix the path with a driver name (e.g. pvfs2:/path/to/file or ufs:/home/me/stuff), which bypasses the stat check. This environment variable provides a third way. Set the value to the prefix (e.g. export ROMIO_FSTYPE_FORCE="ufs:") and ROMIO will treat every file as if it resides on that file system. You are likely to get some strange behavior if you, for instance, try to make Lustre-specific ioctl() calls on a plain Unix file. I added the facility a while back for cases where it might be hard to modify the path, or where I wanted to rule out a bug in a ROMIO driver.
  • ROMIO_PRINT_HINTS: dump out the hints ROMIO is going to use on this file. Sometimes file systems will override user hints or otherwise communicate something back to the user through hints. Helpful to confirm what you think is going on (as in “hey, I requested this other optimization. Why is that optimization happening?”)
  • ROMIO_HINTS: used to select a custom “system hints” file. See system hints


New ROMIO features in MPICH-3.3

December 3rd, 2018

It has been a while since the last official MPICH release, but shortly before US Thanksgiving, the MPICH team released MPICH-3.3.
ROMIO’s most noteworthy changes include

  • Wholesale reformatting of all the code to a single coding style. Sorry about your fork or branch.
  • Use of the MPL utility library already used in MPICH, which eliminates a lot of duplicated code.
  • We started analyzing the code with Coverity. It found “a few” things for us to fix. Valgrind found a few things too.
  • The internal datatype representation (the “flattened” representation) is now stored as an attribute on the datatype and not in an internal global linked list.
  • Continued work to make ROMIO 64-bit clean.
  • Deleted a bunch of unused file system drivers.
  • Added DDN’s ‘IME’ driver.
  • Warn if a file view violates the MPI-IO rule of “monotonically non-decreasing file offsets”.
  • Added support for the Lustre lockahead optimization.


ROMIO at SC 2018

November 6th, 2018

If you would like to learn more about I/O in High Performance Computing, come check out our SC 2018 tutorial Parallel I/O in Practice. We will cover the hardware and software that make up the software stack on large parallel computers. MPI-IO takes up a big chunk of time, as do the I/O libraries which typically sit on top of MPI-IO.
If you are interested in ROMIO, you hopefully are familiar with Darshan for collecting statistics and otherwise characterizing your I/O patterns. These other Darshan-related events will likely be of interest to you:

And HDF5, which frequently sits atop ROMIO, will be having a Birds of a Feather session Wednesday.


Collective I/O to overlapping regions

September 6th, 2018

It is an error for multiple MPI processes to write to the same or to overlapping regions of a file. ROMIO will let you get away with this, but if your processes are writing different data, I can’t tell you what will end up in the file.
What about reads, though?
ROMIO’s two-phase collective buffering algorithm handles overlapping read requests the way you would hope: I/O aggregators read from the file and send the data to the right MPI process. N processes reading a config file, for example, will result in one read followed by a network exchange.
As an aside, ROMIO’s two-phase algorithm is general and so not as good as a “read and broadcast”. If you, the application or library writer, know you are going to have every process read the file, here is one spot (maybe the only spot) where I’d encourage you to (independently) read from one process and broadcast to everyone else.
I bet you are excited to go try this out on some code. Maybe you will have every process read the same elements out of an array. Did you get the performance you expected? Probably not. ROMIO tries to be clever. If the requests from processes are interleaved, ROMIO will carry on with two-phase collective I/O. If the requests are not interleaved, then ROMIO will fall back to independent I/O on the assumption that the expense of the two-phase algorithm will not be worth it.
You can see the check here:
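
As a rough illustration of that kind of interleaving test, here is a simplified sketch. It is not a copy of ROMIO's source: the function names and the plain `long long` offsets are made up for this example, and it assumes an empty request is encoded as an end offset smaller than the start offset.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified illustration, not ROMIO's source. st[i] and end[i] are the
 * first and last byte offsets touched by rank i. Counts the ranks whose
 * request starts before the previous rank's request has ended. */
int count_interleaved(const long long *st, const long long *end, size_t nprocs)
{
    int count = 0;
    size_t i;

    assert(st != NULL && end != NULL);
    for (i = 1; i < nprocs; i++) {
        /* The second test skips empty requests (assumed: end < st). */
        if (st[i] < end[i - 1] && st[i] <= end[i])
            count++;
    }
    return count;
}

/* Decision described in the text: any interleaving keeps two-phase
 * collective I/O, none falls back to independent I/O. */
int use_two_phase(const long long *st, const long long *end, size_t nprocs)
{
    return count_interleaved(st, end, nprocs) > 0;
}
```

Three ranks reading the byte ranges 0-99, 100-199 and 200-299 are not interleaved under this test, so they would be sent down the independent path. Ranges 0-149, 50-199 and 100-249 overlap their neighbours and stay collective.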
In 2018, two-phase is almost always a good idea — even if the requests are well-formed, collective buffering will map request sizes to underlying file system quirks, reduce the overall client count thanks to I/O aggregators, and probably place those aggregators strategically.
You can force ROMIO to always use collective buffering by setting the hint “romio_cb_read” to “enable”. On Blue Gene systems, that is the default setting already. On other platforms, the default is “automatic”, which triggers the interleaving check described above.


New driver for DDN's "Infinite Memory Engine" device

November 2nd, 2017

DataDirect Networks (DDN) has a storage product called “Infinite Memory Engine”. You can access this accelerated storage through POSIX, but there is also a library-level native interface.
DDN recently contributed a ROMIO driver to use the IME “native” interfaces, and I have merged this into MPICH master for an upcoming release.  Thanks!
For most people, this new feature will only be exciting if you have a DDN storage device with IME.  If you do have such a piece of hardware, you can ask your DDN rep where to get the RPMs for the IME-native library.
I wrote a mocked version for anyone who wants to compile-test this driver. All it does is directly invoke the POSIX versions. You can get ime-mockup from my gitlab repository.


Lustre tuning

March 7th, 2017

Getting the best performance out of Lustre can be a bit of a challenge. Here are some things to check if you tried out ROMIO on a Lustre file system and did not see the performance you were expecting.
The zeroth step in tuning Lustre is “consult your site-specific documentation”. Your admins will have information about how many Lustre servers are deployed, how much performance you should expect, and any site-specific utilities they have provided to make your life easier. Here are some of the more popular sites:

First, are you using the Lustre file system driver? Nowadays, you would have to go out of your way not to. One can read the romio_filesystem_type hint to confirm.
Next, what is the stripe count? Lustre typically defaults to a stripe count of 1, which means all reads and writes will go to just one server (OST in Lustre parlance). Most systems have tens of OSTs, so the default stripe count is really going to kill performance!
The ‘lfs’ utility can be used to get and set Lustre file information.

$ lfs getstripe /path/to/directory
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  2
    obdidx       objid       objid          group
         2    14114525    0xd75edd    0x280000400

This directory has a stripe_count of 1. That means any files created in this directory will also have a stripe count of one. This directory would be good for hosting small config files, but large HPC input decks or checkpoint files will not see good performance.
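
To see why a stripe count of 1 hurts, it helps to work through the layout arithmetic. The sketch below assumes the usual round-robin (RAID-0 style) striping. `stripe_index` is a made-up name and not part of any Lustre API, and it returns a position within the file's own layout rather than a physical OST number.

```c
#include <assert.h>

/* Illustrative round-robin layout arithmetic, not a Lustre API: which
 * stripe of the file, numbered 0..stripe_count-1, holds byte `offset`. */
unsigned stripe_index(unsigned long long offset,
                      unsigned long long stripe_size,
                      unsigned stripe_count)
{
    assert(stripe_size > 0 && stripe_count > 0);
    return (unsigned) ((offset / stripe_size) % stripe_count);
}
```

With the 1 MiB stripe size shown above and a stripe count of 1, every offset maps to stripe 0, so all traffic lands on a single OST. With a stripe count of 4, byte offsets 0, 1 MiB, 2 MiB and 3 MiB map to stripes 0, 1, 2 and 3, and large accesses are spread over four servers.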
When reading a file, there’s no way to adjust the stripe count. When the file is created, the striping is locked in place. You would have to create a directory with a large stripe count and copy the files into this new directory.

$ lfs setstripe -c 60  /my/new/directory

Now any new file created in “/my/new/directory” will have a stripe count of 60.
If you are creating a new file, you can set the stripe count in ROMIO with the “striping_factor” hint:

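    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");
    /* MPI_MODE_CREATE must be combined with an access mode such as MPI_MODE_WRONLY */
    MPI_File_open(MPI_COMM_WORLD, "foo.chkpt", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);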


HDF5-1.10.0 and more scalable metadata

June 9th, 2016

While not exactly ROMIO, the new HDF5 release comes with a nice optimization that benefits all MPI-IO implementations.
In order to know where the objects, datasets, and other information in an HDF5 file are located on disk, a process needs to read the HDF5 metadata. This metadata is scattered across the file, so to find out where everything is located a process will have to issue many tiny read requests. For a long time, each HDF5 process needed to issue these reads. There was no way for one process to examine the file and then tell the other processes about the file layout. When HDF5 programs ran on tens or hundreds of MPI processes, this read overhead was not so bad. As process counts grew larger and larger, for example on Blue Gene, these reads started taking up a huge amount of time.
The HDF Group has implemented collective metadata in HDF5-1.10.0. With collective metadata, only one process will read the metadata and broadcast to the other processes. This optimization has worked quite well for Parallel-NetCDF and we’re glad to see it in HDF5. Hopefully, other I/O libraries will learn this lesson and adopt similar scalable approaches.
If you do any reading of HDF5 datasets in parallel, go upgrade to HDF5-1.10.0.


Cleaning out old ROMIO file system drivers

January 5th, 2016

I’m itching to discard some of the little-used file system drivers in ROMIO. These are the drivers that are staying:

  • GPFS: IBM’s GPFS file system still sees several key deployments.
  • NFS: It’s everywhere, even though implementing MPI-IO consistency semantics over NFS is difficult at best.
  • TESTFS: I find this debugging-oriented file system useful occasionally.
  • UFS: the generic Unix file system driver will be useful for as long as POSIX APIs are present.
  • PANFS: Panasas still contributes patches.
  • XFS: SGI’s XFS file system is part of SGI’s MPT, and they still contribute
  • PVFS2: Recent versions are called “OrangeFS”, but the API is still the same and still provides several optimizations not available in other file system drivers.
  • Lustre: deployed on a big chunk of the fastest supercomputers.


These are the drivers on the Deprecate/Delete list:

  • PIOFS: IBM’s old parallel file system for the SP/2 machine.
  • BlueGene/L: superseded by the BlueGene driver, itself superseded by GPFS.
  • BlueGene: The architecture-specific pieces were merged into a “flavor” of gpfs for Blue Gene.
  • BGLockless: this hack (see the bglockless page) lived on far longer than it should have.
  • GridFTP: I don’t know if this even compiles any more.
  • NTFS: MPICH dropped Windows support several years ago.
  • PVFS: superseded by PVFS2 ten years ago.
  • HFS: The HP/Convex (remember Convex?) parallel file system. I found a mention of a machine deployed in 1995.
  • PFS: the Paragon (!) file system.
  • SFS: the “Supercomputing File System” from NEC.
  • ZoidFS: an old research project for a filesystem-independent protocol for I/O forwarding. While the ZoidFS driver might work, we know that folks trying to resurrect the old IOFSL project in 2015 are finding it…

Do you use a file system on the Deprecate/Delete list? Please let me ([email protected]) know!