New ROMIO optimizations for Blue Gene /Q

June 5th, 2014

The IBM and Argonne teams have been digging into ROMIO’s collective I/O performance on the Mira supercomputer. The resulting optimizations made it into the MPICH-3.1.1 release, so it seemed like a good time to write them up.
No more “bglockless”: for Blue Gene /L and Blue Gene /P we wrote a ROMIO driver that never made fcntl-style user-space lock calls. This approach worked great for PVFS, which did not support locks anyway, and had the pleasant side effect of improving performance on GPFS too (as long as you did not care about specific workloads and MPI-IO features). Now we have removed all the extraneous locks from the default I/O driver. Even better, we kept the locks in the few cases where they are needed: shared file pointers and data sieving writes. One no longer needs to prefix the file name with ‘bglockless:’ or set the BGLOCKLESSMPIO_F_TYPE environment variable. It’s the way it should have been 5 years ago.
Alternate Aggregator Selection: Collective I/O on Blue Gene has long been the primary way to extract maximum performance. One good optimization is how ROMIO’s two-phase algorithm deals with GPFS file system block alignment. Even better is how it selects a subset of MPI processes to carry out I/O; the other MPI processes route their I/O through these “I/O aggregators”. On Blue Gene, there are some new ways to select which MPI processes should be aggregators:

  • Default: the N I/O aggregators are assigned depth-first based on connections to the I/O forwarding node. If a file is not very large, we can end up with many active I/O aggregators assigned to one of these I/O nodes, and some I/O nodes with only idle I/O aggregators.
  • “Balanced”: set the environment variable GPFSMPIO_BALANCECONTIG to 1 and the I/O aggregators will be selected in a more balanced fashion. With this setting, even small files will be assigned I/O aggregators across as many I/O nodes as possible. (There’s a limit: we don’t split file domains any smaller than the GPFS block size.)
  • “Point-to-point”: The general two-phase algorithm is built to handle the case where any process might want to send data to or receive data from any I/O aggregator. For simple I/O cases we want the benefits of collective I/O (aggregation to a subset of processes, file system alignment) but don’t need the full overhead of potential “all to all” traffic. Set the environment variable GPFSMPIO_P2PCONTIG to 1. If certain workload conditions are met (the data is contiguous, ranks write to the file in order so that lower MPI ranks write to earlier parts of the file, and the data has no holes), ROMIO will carry out point-to-point communication between an I/O aggregator and the much smaller subset of processes assigned to it.

We don’t have MPI Info hints for these yet, since they are so new.  Once we have some more experience using them, we can provide hints and guidance on when the hints might make sense.   For now, they are only used if  environment variables are set.
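To make the point-to-point conditions a little more concrete, here is a minimal sketch of the kind of check they imply. This is not ROMIO's actual code: `struct access` and `p2p_eligible` are made-up names, and the sketch assumes each rank contributes exactly one contiguous (offset, length) region. It also rejects overlapping regions, which is stricter than the conditions listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one rank's contiguous request. */
struct access {
    long long offset;  /* starting byte in the file */
    long long length;  /* number of bytes */
};

/* Returns true when the requests, indexed by MPI rank, are in rank order
 * and tile the file region with no holes (and, in this sketch, no overlap). */
bool p2p_eligible(const struct access *req, size_t nranks)
{
    assert(req != NULL || nranks == 0);
    for (size_t i = 1; i < nranks; i++) {
        long long prev_end = req[i - 1].offset + req[i - 1].length;
        if (req[i].offset != prev_end)
            return false;  /* hole, overlap, or out-of-order rank */
    }
    return true;
}
```

Three ranks writing 100 bytes each at offsets 0, 100 and 200 would pass this check. Swapping the first two ranks, or leaving a gap between them, would not.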
Deferred Open revisited: The old “deferred open” optimization, where specifying some hints would have only the I/O aggregators open the file, has not seen a lot of testing over the years. It turns out it was not working on Blue Gene. We re-worked the deferred open logic, and now it works again. Codes that open a file only to do a small amount of I/O should see an improvement in open times with this approach. Oddly, IOR does not show any benefit. We’re still trying to figure that one out.
No more seeks: An individual lseek() system call is not so expensive on Blue Gene /Q. However, if you have tens of thousands of lseek() system calls, they interact with the outstanding read() and write() calls and can sometimes stall for a long time. We have replaced ‘lseek() + read()’ and ‘lseek() + write()’ with pread() and pwrite().
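For readers who have not used the positional calls, the difference looks roughly like this. It is a standalone POSIX sketch, not an excerpt from ROMIO, and the two function names are invented for the comparison.

```c
#define _XOPEN_SOURCE 700  /* exposes pread() under strict compiler modes */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Two system calls: move the file offset, then read from it. */
ssize_t read_with_seek(int fd, void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        return -1;
    }
    return read(fd, buf, len);
}

/* One system call: read at `offset` without touching the file offset. */
ssize_t read_with_pread(int fd, void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
    return pread(fd, buf, len, offset);
}
```

The write side is symmetric: pwrite() takes the same extra offset argument and replaces an lseek() followed by write().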



bglockless

August 5th, 2013

Update: in MPICH-3.1.1 we finally scrapped bglockless (see this writeup on 3.1.1 and Blue Gene enhancements), but it’s still part of the system software on any BG /L, BG /P, or BG /Q machine. The following writeup is perhaps of historical interest, but it will be a while (maybe never) before MPICH-3.1.1 is the default MPI on Blue Gene /Q.

The IBM BGP MPI-IO implementation is designed for the “lowest common denominator”: NFS. Its ADIO file system driver performs some very conservative locking in order to get correct MPI-IO semantics out of what might be an NFS volume underneath. It’s possible, though, to select an alternate driver that gives better performance in most cases, and terrible, terrible performance in one specific case.
The MPI routine MPI_File_open takes a string “filename” argument. Normally, ROMIO does a stat of the file system to figure out what kind of file system that file lives on, and then selects a “file system driver” (one of the ADIO modules) that might contain file system specific optimizations.
If you provide a prefix, like “ufs:” for traditional unix files, or “pvfs2:” or even “gridftp:”, then that prefix overrides whatever magic detection routines ROMIO would run, and the corresponding “ADIO driver” will be selected.
For Blue Gene /L (L, I tell you!) I wrote a ROMIO driver that made no explicit fcntl() lock calls.  Those lock calls are normally not a big deal, but PVFS v2 did not support fcntl() locks.   I called this driver ‘bglockless’.
Our friends at IBM, in a conservative effort to ensure correctness for all possible file systems, wrapped every I/O operation in an fcntl() lock. 90% of these locks were unnecessary and served only to slow down I/O.
So the half-day “driver with no locks” project I wrote for PVFS took on a second life as the “make I/O go fast” driver.
Now here’s the catch, and why we can’t just make “bglockless” the default: certain I/O workloads, if locks are not available, must be carried out in an extremely inefficient manner. The specific case is strided independent writes to a file. In addition, certain rarely used functionality, like shared file pointers and ordered mode operations, is not implemented when locks are disabled.
For Blue Gene /P and /Q, one can set the environment variable BGLOCKLESSMPIO_F_TYPE to 0x47504653 (the GPFS file system magic number). ROMIO will then pretend GPFS is like PVFS and not issue any fcntl() lock commands.


system hints: hints via config file

September 26th, 2008

In ROMIO, setting hints looks like this:

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_buffer_size", "8388608");

Setting these hints in the program  can make sense in many cases — for example, you know something specific about the workload and wish to guide ROMIO’s optimizations a bit.  But what if you want to explore the impact of hints on your program?  There are a few options to do so:

  •  Modify your program to look at an environment variable and use that as the value for your hint.
  •  Take a command line parameter.
  • Repeatedly edit and re-compile your program.
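The first option might look something like the sketch below. The helper `hint_from_env` and the variable name `MY_CB_BUFFER_SIZE` are invented for this example. Neither ROMIO nor MPI defines them.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns the value of environment variable `env_name`, or `fallback`
 * when the variable is unset or empty. */
const char *hint_from_env(const char *env_name, const char *fallback)
{
    assert(env_name != NULL && fallback != NULL);
    const char *value = getenv(env_name);
    return (value != NULL && value[0] != '\0') ? value : fallback;
}

/* Intended use inside an MPI program (not compiled here):
 *   MPI_Info_set(info, "cb_buffer_size",
 *                hint_from_env("MY_CB_BUFFER_SIZE", "8388608"));
 */
```

With this in place, different buffer sizes can be tried from the job script without touching the source again.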

While good practice, these approaches require additional work. They also assume access to the source code, which is common but not guaranteed.
Additionally, we notice very few users set hints on their own.  They will gladly do so if we suggest it, but what would be great is if every application on a system ran with the best hints for that system.   Sometimes you can count on the system’s vendor to set the defaults, but it is our experience that vendor defaults are exceedingly conservative.
We added a new feature in ROMIO called “system hints”. You can now populate a config file with the same key-value pairs you would pass to MPI_Info_set and ROMIO will add those hints to your program.
Here’s an example of what that file might look like:

$ cat romio_hints
romio_cb_read enable
romio_cb_write enable
cb_config_list *:2

By default, ROMIO will look for /etc/romio-hints, but you can set the environment variable ROMIO_HINTS to select a different location (for example, your application’s working directory).
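As a rough illustration of the lookup and file format described above, here is a small sketch. It is not ROMIO's parser: `hints_file_path` and `parse_hint_line` are hypothetical names, and the sketch assumes every line holds exactly one key and one value separated by whitespace.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define HINT_KEY_MAX 64
#define HINT_VAL_MAX 256

/* Picks the hints file: ROMIO_HINTS if set, else the default location. */
const char *hints_file_path(void)
{
    const char *path = getenv("ROMIO_HINTS");
    return path != NULL ? path : "/etc/romio-hints";
}

/* Splits one "key value" line into its two whitespace-separated fields.
 * Returns 1 on success, 0 for blank or malformed lines. */
int parse_hint_line(const char *line,
                    char key[HINT_KEY_MAX], char val[HINT_VAL_MAX])
{
    assert(line != NULL && key != NULL && val != NULL);
    return sscanf(line, "%63s %255s", key, val) == 2;
}
```

Each pair that parses successfully is what one would otherwise have passed to MPI_Info_set by hand.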


citing ROMIO

February 1st, 2002

To cite ROMIO as a whole, cite “A Case for Using MPI’s Derived Datatypes to Improve I/O”:

author = {Rajeev Thakur and William Gropp and Ewing Lusk},
title = {A Case for Using {MPI's} Derived Datatypes to Improve {I/O}},
booktitle = {Proceedings of SC98: High Performance Networking and Computing},
year = {1998},
month = {November},
publisher = {ACM Press},
earlier = {thakur:mpi-tr},
URL = {},
keywords = {MPI, parallel I/O, pario-bib}

To cite specific optimizations such as data sieving or collective buffering, cite “Optimizing Noncontiguous Accesses in MPI-IO,”:

author = {Rajeev Thakur and William Gropp and Ewing Lusk},
title = {Optimizing Noncontiguous Accesses in {MPI-IO}},
journal = {Parallel Computing},
year = {2002},
month = {January},
volume = {28},
number = {1},
pages = {83--105},
URL = {},
keywords = {parallel I/O, parallel I/O, MPI-IO, collective I/O, data sieving,