New ROMIO optimizations for Blue Gene/Q

June 5, 2014 by Robert J. Latham

The IBM and Argonne teams have been digging into ROMIO’s collective I/O performance on the Mira supercomputer. These optimizations made it into the MPICH 3.1.1 release, so it seemed like a good time to write them up.
No more “bglockless”: for Blue Gene/L and Blue Gene/P we wrote a ROMIO driver that never called fcntl-style user-space locks. This approach worked great for PVFS, which did not support locks anyway, and had the pleasant side effect of improving performance on GPFS too (as long as you did not care about specific workloads and MPI-IO features). Now we have removed all the extraneous locks from the default I/O driver. Even better, we kept the locks in the few cases where they are needed: shared file pointers and data sieving writes. One no longer needs to prefix the file name with ‘bglockless:’ or set the BGLOCKLESSMPIO_F_TYPE environment variable. It’s the way it should have been 5 years ago.
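For concreteness, here is a minimal sketch (the GPFS path is made up, and the program is not taken from ROMIO) of what an open looks like now: a plain MPI_File_open, with no “bglockless:” prefix and no BGLOCKLESSMPIO_F_TYPE in the environment.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;

        MPI_Init(&argc, &argv);

        /* hypothetical GPFS path; no "bglockless:" prefix and no
         * BGLOCKLESSMPIO_F_TYPE environment variable needed */
        MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }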
Alternate Aggregator Selection: collective I/O on Blue Gene has long been the primary way to extract maximum performance. One good optimization is how ROMIO’s two-phase collective I/O deals with GPFS file system block alignment. Even better is how it selects a subset of MPI processes to carry out I/O; the other MPI processes route their I/O through these “I/O aggregators”. On Blue Gene, there are some new ways to select which MPI processes should be aggregators (a sketch of the kind of collective write these settings target follows the list):

  • Default: the N I/O aggregators are assigned depth-first based on connections to the I/O forwarding node. If a file is not very large, we can end up with many active I/O aggregators assigned to one of these I/O nodes, and some I/O nodes with only idle I/O aggregators.
  • “Balanced”: set the environment variable GPFSMPIO_BALANCECONTIG to 1 and the I/O aggregators will be selected in a more balanced fashion. With this setting, even small files will be assigned I/O aggregators across as many I/O nodes as possible. (There’s a limit: we don’t split file domains any smaller than the GPFS block size.)
  • “Point-to-point”: the general two-phase algorithm is built to handle the case where any process might want to send data to or receive data from any I/O aggregator. For simple I/O cases we want the benefits of collective I/O (aggregation to a subset of processes, file system alignment) without the full overhead of potential “all to all” traffic. Set the environment variable GPFSMPIO_P2PCONTIG to 1, and if certain workload conditions are met (the data is contiguous, ranks write to the file in order so that lower MPI ranks write to earlier parts of the file, and the data has no holes), ROMIO will carry out point-to-point communication between each I/O aggregator and the much smaller subset of processes assigned to it.
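To make those workload conditions concrete, here is a hedged sketch (the path, sizes, and data are invented, and nothing in the code is required by ROMIO) of the kind of collective write these selection schemes target: every rank writes one contiguous, rank-ordered, hole-free block, while GPFSMPIO_BALANCECONTIG or GPFSMPIO_P2PCONTIG would be exported in the job environment rather than set in the code.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int count = 1 << 20;   /* elements per rank; size is arbitrary */
        int rank;
        double *buf;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = calloc(count, sizeof(double));   /* placeholder data */

        /* hypothetical GPFS path; MPI_INFO_NULL because the new aggregator
         * selection is driven by environment variables, not hints */
        MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* contiguous, in rank order, no holes: rank i writes block i */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * count * sizeof(double),
                              buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }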

We don’t have MPI Info hints for these yet, since they are so new. Once we have some more experience using them, we can provide hints and guidance on when those hints might make sense. For now, these optimizations are enabled only through the environment variables.
Deferred Open revisited: the old “deferred open” optimization, where specifying certain hints would have only the I/O aggregators open the file, has not seen a lot of testing over the years. It turns out it was not working on Blue Gene. We reworked the deferred open logic, and now it works again. Codes that open a file only to do a small amount of I/O should see an improvement in open times with this approach. Oddly, IOR does not show any benefit; we’re still trying to figure that one out.
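For readers who have not tried it, here is a minimal sketch of requesting deferred open through ROMIO’s existing romio_no_indep_rw hint; the path is hypothetical, and the exact triggering conditions can depend on the ROMIO version and the other collective-buffering hints in effect.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* promise that only collective I/O will be performed, which is what
         * lets ROMIO defer the open to the I/O aggregators */
        MPI_Info_set(info, "romio_no_indep_rw", "true");

        MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* ... collective reads and writes only ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }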
No more seeks: an individual lseek() system call is not so expensive on Blue Gene/Q. However, if you have tens of thousands of lseek() system calls, they interact with the outstanding read() and write() calls and can sometimes stall for a long time. We have replaced ‘lseek() + read()’ and ‘lseek() + write()’ with pread() and pwrite().
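Purely to illustrate the change (these helper functions are a sketch, not ROMIO code), the old and new access patterns look like this:

    #include <unistd.h>

    /* old style: two system calls per access */
    ssize_t read_with_seek(int fd, void *buf, size_t len, off_t offset)
    {
        if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
            return -1;
        return read(fd, buf, len);
    }

    /* new style: one positioned read, with no separate lseek() to stall
     * behind other outstanding I/O */
    ssize_t read_positioned(int fd, void *buf, size_t len, off_t offset)
    {
        return pread(fd, buf, len, offset);
    }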