For years I’ve had a little tool in my home directory that does an MPI_File_open and reads out all the hints associated with that file. It was so simple and dumb I never put it on GitHub. Now a new crop of machines has me realizing it might be useful to more people. So here you go:
https://github.com/roblatham00/hintdump
I can use this simple little tool to do a few things:
- Verify ROMIO is using the ADIO driver I think it’s using (the romio_filesystem_type hint)
- See what file systems this MPI implementation might support (try out different prefixes like ufs: or lustre:)
- Verify stripe counts — I found one MPI implementation that would not honor stripe counts set by info objects, and would only honor stripe counts set with that file system’s utilities.
I hope you find this useful.
Uncategorized
I don’t have a lot of first-hand experience with the Quobyte file system but the developers contributed a file system driver for ROMIO, making them my new best friends. https://www.quobyte.com/product
I think a lot about how to converge the worlds of HPC storage and data center storage. Looks like this new ROMIO driver might be a small step in that convergence.
Note to anyone out there considering a ROMIO driver: I’ll accept just about any contribution. ROMIO drivers are self-contained and opt-in. Not much risk that a new ROMIO driver will break MPICH. All I need is a plausible case that the contributor will stick around for bugs and updates, and at least one user out there somewhere.
The Quobyte file system driver will be in the next release of MPICH (MPICH-3.4). Configure ROMIO with the --with-file-system=quobytefs+... flag (including other file system drivers you might want). You can explicitly request the Quobyte driver at runtime (assuming ROMIO was configured to support it) with the ‘quobyte:’ prefix to your file name.
Uncategorized
I was surprised to find I hadn’t written anything about the environment variables one can set in ROMIO.
I don’t think I have a well-thought-out theory of when to use an environment variable vs when to use an MPI_Info key.
In this post, I’ll talk about ROMIO environment variables in general. The GPFS, PVFS2, and XFS drivers also have environment variables one can set, but those variables need to be set only in very specific instances.
- ROMIO_FSTYPE_FORCE: ROMIO will pick a file system driver by calling stat(2) and looking at the fs type field. You can also prefix the path with a driver name (e.g. pvfs2:/path/to/file or ufs:/home/me/stuff), which bypasses the stat check. This environment variable provides a third way. Set the value to the prefix (e.g. export ROMIO_FSTYPE_FORCE="ufs:") and ROMIO will treat every file as if it resides on that file system. You are likely to get some strange behavior if you, for instance, try to make Lustre-specific ioctl() calls on a plain Unix file. I added the facility a while back for cases where it might be hard to modify the path, or where I wanted to rule out a bug in a ROMIO driver.
- ROMIO_PRINT_HINTS: dump out the hints ROMIO is going to use on this file. Sometimes file systems will override user hints or otherwise communicate something back to the user through hints. Helpful to confirm what you think is going on (as in “hey, I requested this other optimization. Why is that optimization happening?”)
- ROMIO_HINTS: used to select a custom “system hints” file. See system hints
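The three ways of naming a driver (path prefix, ROMIO_FSTYPE_FORCE, or the stat-based fallback) can be illustrated with a small sketch. This is not ROMIO's code: pick_driver is a hypothetical helper, and the ordering it uses (an explicit prefix wins over the environment variable) is an assumption made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not ROMIO code: decide which driver name a file gets.
 *   filename: path as given to MPI_File_open, possibly "driver:/path"
 *   forced:   value of ROMIO_FSTYPE_FORCE (e.g. "ufs:"), or NULL if unset
 *   out:      receives the driver name without the trailing ':'
 * Returns 1 if a driver was named, 0 if the caller should fall back to the
 * stat(2)-based detection. */
int pick_driver(const char *filename, const char *forced,
                char *out, size_t outlen)
{
    const char *src = NULL;
    const char *colon = strchr(filename, ':');
    const char *slash = strchr(filename, '/');
    size_t n;

    /* Assumption: a colon only counts as a prefix if it comes before any '/' */
    if (colon != NULL && colon != filename && (slash == NULL || colon < slash))
        src = filename;
    else if (forced != NULL && forced[0] != '\0')
        src = forced;

    if (src == NULL || outlen == 0)
        return 0;

    /* Copy everything up to (not including) the first ':' */
    n = strcspn(src, ":");
    if (n >= outlen)
        n = outlen - 1;
    memcpy(out, src, n);
    out[n] = '\0';
    return 1;
}
```

In a real program the second argument would come from getenv("ROMIO_FSTYPE_FORCE"). Passing it in as a parameter just keeps the sketch easy to try with different values.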
Uncategorized
It has been a while since the last official MPICH release, but shortly before US Thanksgiving, the MPICH team released MPICH-3.3.
ROMIO’s most noteworthy changes include
- wholesale reformatting of all the code to the same coding styles. Sorry about your fork or branch.
- use the MPL utility library already used in MPICH. Eliminates a lot of duplicated code
- We started analyzing the code with Coverity. It found “a few” things for us to fix. Valgrind found a few things too.
- Internal datatype representation (the “flattened” representation) now stored as an attribute on the datatype and not in an internal global linked list.
- continued work to make ROMIO 64 bit clean
- Deleted a bunch of unused file system drivers
- Added DDN’s ‘IME’ driver.
- Warn if a file view violates the MPI-IO rule “monotonically non-decreasing file offsets”
- added support for Lustre lockahead optimization
Uncategorized
If you would like to learn more about I/O in High Performance Computing, come check out our SC 2018 tutorial Parallel I/O in Practice. We will cover the hardware and software that makes up the software stack on large parallel computers. MPI-IO takes up a big chunk of time, as do the I/O libraries which typically sit on top of MPI-IO.
If you are interested in ROMIO, you hopefully are familiar with Darshan for collecting statistics and otherwise characterizing your I/O patterns. These other Darshan-related events will likely be of interest to you:
And HDF5, which frequently sits atop ROMIO, will be having a Birds of a Feather session Wednesday.
Uncategorized
It is an error for multiple MPI processes to write to the same or to overlapping regions of a file. ROMIO will let you get away with this but if your processes are writing different data, I can’t tell you what will end up in the file in the end.
What about reads, though?
ROMIO’s two phase collective buffering algorithm handles overlapping read requests the way you would hope: I/O aggregators read from the file and send the data to the right MPI process. N processes reading a config file, for example, will result in one read followed by a network exchange.
As an aside, ROMIO’s two-phase algorithm is general and so not as good as a “read and broadcast”. If you, the application/library writer, know you are going to have every process read the file, here is one spot (maybe the only spot) where I’d encourage you to (independently) read from one process and broadcast to everyone else.
I bet you are excited to go try this out on some code. Maybe you will have every process read the same elements out of an array. Did you get the performance you expected? Probably not. ROMIO tries to be clever. If the requests from processes are interleaved, ROMIO will carry on with two phase collective I/O. If the requests are not interleaved, then ROMIO will fall back to independent I/O on the assumption that the expense of the two-phase algorithm will not be worth it.
You can see the check here: https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/common/ad_read_coll.c#L149
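To give a feel for what an interleaving test looks like, here is a simplified sketch modeled loosely on that check. It is not a copy of ROMIO's code: count_interleaved and the array layout are my own naming, and I am assuming each process has already shared the first and last byte offset of its request.

```c
/* Count ranks whose request starts before the previous rank's request ends.
 *   st[i]  : first byte offset touched by rank i
 *   end[i] : last byte offset touched by rank i (inclusive); a zero-length
 *            request is assumed to be encoded as end[i] == st[i] - 1
 * A nonzero result means the requests are interleaved. */
int count_interleaved(const long long *st, const long long *end, int nprocs)
{
    int i, count = 0;

    for (i = 1; i < nprocs; i++) {
        /* second test skips ranks that asked for nothing */
        if (st[i] < end[i - 1] && st[i] <= end[i])
            count++;
    }
    return count;
}
```

With the "automatic" setting described in this post, a result of zero corresponds to the case where ROMIO falls back to independent I/O, and anything else keeps the two-phase path.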
In 2018, two-phase is almost always a good idea — even if the requests are well-formed, collective buffering will map request sizes to underlying file system quirks, reduce the overall client count thanks to I/O aggregators, and probably place those aggregators strategically.
You can force ROMIO to always use collective buffering by setting the hint "romio_cb_read" to “enable”. On Blue Gene systems, that is the default setting already. On other platforms, the default is “automatic”, which triggers that check we mentioned.
Uncategorized
Data Direct has a storage product called “Infinite Memory Engine”. You can access this accelerated storage through POSIX but there is also a library-level native interface.
DDN recently contributed a ROMIO driver to use the IME “native” interfaces, and I have merged this into MPICH master for an upcoming release. Thanks!
For most people, this new feature will only be exciting if you have a DDN storage device with IME. If you do have such a piece of hardware, you can ask your DDN rep where to get the RPMs for the IME-native library.
I wrote a mocked version for anyone who wants to compile-time test this driver. All it does is directly invoke the POSIX versions. You can get ime-mockup from my gitlab repository
Uncategorized
Getting the best performance out of Lustre can be a bit of a challenge. Here are some things to check if you tried out ROMIO on a Lustre file system and did not see the performance you were expecting.
The zeroth step in tuning Lustre is “consult your site-specific documentation”. Your admins will have information about how many Lustre servers are deployed, how much performance you should expect, and any site-specific utilities they have provided to make your life easier. Here are some of the more popular sites:
First, are you using the Lustre file system driver? Nowadays, you would have to go out of your way not to. One can read the romio_filesystem_type hint to confirm.
Next, what is the stripe count? Lustre typically defaults to a stripe count of 1, which means all reads and writes will go to just one server (OST in Lustre parlance). Most systems have tens of OSTs, so the default stripe count is really going to kill performance!
The ‘lfs’ utility can be used to get and set lustre file information.
$ lfs getstripe /path/to/directory
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 2
obdidx objid objid group
2 14114525 0xd75edd 0x280000400
This directory has a stripe_count of 1. That means any files created in this directory will also have a stripe count of one. This directory would be good for hosting small config files, but large HPC input decks or checkpoint files will not see good performance.
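A little arithmetic shows why the stripe count matters. The sketch below assumes plain round-robin striping, where consecutive stripe_size chunks of the file go to consecutive stripes. The function name stripe_index is mine, and the value it returns is an index within the file's own set of stripes, not an actual OST number.

```c
#include <assert.h>

/* Which stripe (0 .. stripe_count-1) holds a given byte offset, assuming
 * simple round-robin striping. Illustrative only. */
int stripe_index(long long offset, long long stripe_size, int stripe_count)
{
    assert(offset >= 0 && stripe_size > 0 && stripe_count > 0);
    return (int) ((offset / stripe_size) % stripe_count);
}
```

With the values from the lfs output above (stripe size 1048576, stripe count 1), every offset maps to stripe 0, so every process ends up talking to the same server. With a stripe count of 60, a large contiguous write gets spread across 60 of them.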
When reading a file, there’s no way to adjust the stripe count. When the file is created, the striping is locked in place. You would have to create a directory with a large stripe count and copy the files into this new directory.
$ lfs setstripe -c 60 /my/new/directory
Now any new file created in “/my/new/directory” will have a stripe count of 60.
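If you are creating a new file, you can set the stripe count in ROMIO with the “striping_factor” hint:
MPI_Info info;
MPI_File fh;
MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "32");  /* stripe count for the new file */
MPI_File_open(MPI_COMM_WORLD, "foo.chkpt", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_Info_free(&info);  /* the file keeps its own copy of the hints */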
Uncategorized
While not exactly ROMIO, the new HDF5 release comes with a nice optimization that benefits all MPI-IO implementations.
In order to know where the objects, datasets, and other information in an HDF5 file are located on disk, a process needs to read the HDF5 metadata. This metadata is scattered across the file, so to find out where everything is located a process will have to issue many tiny read requests. For a long time, each HDF5 process needed to issue these reads. There was no way for one process to examine the file and then tell the other processes about the file layout. When HDF5 programs ran on tens or hundreds of MPI processes, this read overhead was not so bad. As process counts grew, as on Blue Gene for example, these reads started taking up a huge amount of time.
The HDF Group has implemented collective metadata in HDF5-1.10.0. With collective metadata, only one process will read the metadata and broadcast to the other processes. This optimization has worked quite well for Parallel-NetCDF and we’re glad to see it in HDF5. Hopefully, other I/O libraries will learn this lesson and adopt similar scalable approaches.
If you do any reading of HDF5 datasets in parallel, go upgrade to HDF5-1.10.0.
Uncategorized
I’m itching to discard some of the little-used file system drivers in ROMIO.
Keep:
- GPFS: IBM’s GPFS file system still sees several key deployments
- NFS: It’s everywhere, even though implementing MPI-IO consistency
semantics over NFS is difficult at best
- TESTFS: I find this debugging-oriented file system useful occasionally.
- UFS: the generic Unix file system driver will be useful for as long as
POSIX APIs are present.
- PANFS: Panasas still contributes patches.
- XFS: SGI’s XFS file system is part of SGI’s MPT, and they still contribute
patches
- PVFS2: Recent versions are called “OrangeFS”, but the API is still the same and still provides several optimizations not available in other file system drivers.
- Lustre: deployed on a big chunk of the fastest supercomputers.
Deprecate/Delete:
- PIOFS: IBM’s old parallel file system for the SP/2 machine.
- BlueGene/L: superseded by the BlueGene driver, itself superseded by GPFS.
- BlueGene: The architecture-specific pieces were merged into a “flavor” of
gpfs for Blue Gene.
- BGLockless: this hack (see the bglockless page) lived on far longer than it
should have.
- GridFTP: I don’t know if this even compiles any more.
- NTFS: MPICH has dropped Windows support for several years now.
- PVFS: superseded by PVFS2 ten years ago.
- HFS: The HP/Convex (remember convex?) parallel file system. I found a
mention of a machine deployed in 1995.
- PFS: the paragon (!) file system.
- SFS: the “Supercomputing File System” from NEC.
- ZoidFS: an old research project for a filesystem-independent protocol for
I/O forwarding. While the ZoidFS driver might work, we know that folks
trying to resurrect the old IOFSL project in 2015 are finding it…
challenging.
Do you use a file system on the Deprecate/Delete list? Please let me know!
Uncategorized