Archive for June, 2014

ROMIO and Intel-MPI

June 12th, 2014
Comments Off on ROMIO and Intel-MPI

ROMIO, in various forms, provides the MPI-IO implementation for just about every MPI implementation out there.   These implementations incorporate ROMIO’s hints when they pick up our source code, but they also add additional tuning parameters via environment variables.
The Intel MPI library uses ROMIO, but configures the file-system specific drivers a bit differently.   in MPICH, we select which file system drivers to support at compile-time with the --with-file-system configure flag.  These selected drivers are compiled directly into the MPICH library.  Intel-MPI builds its  file-system drivers as loadable modules, and relies on two environment variables to enable and select the drivers


Let’s say you had a Lustre file system, like [edit: archives have moved] This fellow on the HDF5 mailing list Then you would invoke mpiexec like this:

 mpiexec -env I_MPI_EXTRA_FILESYSTEM on \
        -env I_MPI_EXTRA_FILESYSTEM_LIST lustre -n 2 ./test

I found this information in the Intel MPI library Reference Manual, which contains a ton of other tuning parameters.
(Update 12 May 2015): Intel 5.0.2 and newer have GPFS support.  One would enable it the same way with the I_MPI_EXTRA_FILESYSTEM_LIST


(Update 7 April 2023): Intel MPI has added a few more file systems:  panasas (panfs) and DAOS (daos) are supported too.

gpfs, intel-mpi, lustre, tuning

New ROMIO optimizations for Blue Gene /Q

June 5th, 2014
Comments Off on New ROMIO optimizations for Blue Gene /Q

The IBM and Argonne teams have been digging into ROMIO’s collective I/O performance on the Mira supercomputer. These optimizations made it into the MPICH-3.1.1 release, so it seemed like a good time to write up a bit about these optimizations.
no more “bglockless: for Blue Gene /L and Blue Gene /P we wrote a ROMIO driver that never called fcntl-style user-space locks.  This approach worked great for PVFS, which did not support locks anyway, but had a pleasant side effect of improving performance on GPFS too (as long as you did not care about specific workloads and MPI-IO features).  Now, we removed all the extraneous locks from the default I/O driver.  Even better, we kept the locks in the few cases they were needed: shared file pointers and data sieving writes.  Now one does not need to prefix the file name with ‘bglockless:’ or set the BGLOCKLESSMPIO_F_TYPE  environment variable.   It’s the way it should have been 5 years ago.
Alternate Aggregator Selection:  Collective I/O on Blue Gene has long been the primary way to extract maximum performance.  One good optimization is how ROMIO’s two-phase optimization will deal with GPFS file system block alignment.   Even better is how it selects a subset of MPI processes to carry out I/O.  The other MPI processes route their I/O through these “I/O aggregators”.    On Blue Gene, there are some new ways to select which MPI processes should be aggregators:

  • Default: the N I/O aggregators are assigned depth-first based on connections to the I/O forwarding node.    If a file is not very large, we can end up with many active I/O aggregators assigned to one of these I/O nodes, and some I/O nodes with only idle I/O aggregators.
  • “Balanced”:  set the environment variable GPFSMPIO_BALANCECONTIG to 1 and the I/O aggregators will be selected in a more balanced fashion.  With this setting, even small files will be assigned I/O aggregators across as many I/O nodes as possible.  (there’s a limit: we don’t split file domains any smaller than the GPFS block size)
  • “Point-to-point”:  The general two-phase algorithm is built to handle the case where any process might want to send data to or receive data from  any I/O aggregator.  For simple I/O cases we want the benefits of collective I/O — aggregation to a subset of processes, file system alignment — but don’t need the full overhead of potential “all to all” traffic.   Set the environment variable “GPFSMPIO_P2PCONTIG”  to “1” and if certain workload conditions are met — contiguous data, ranks are writing to the file in order (lower mpi ranks write to earlier parts of the file), and data has no holes — then ROMIO will carry out point-to-point communication among an I/O aggregator and the much smaller subset of processes assigned to it.

We don’t have MPI Info hints for these yet, since they are so new.  Once we have some more experience using them, we can provide hints and guidance on when the hints might make sense.   For now, they are only used if  environment variables are set.
Deferred Open revisited: The old “deferred open” optimization, where specifying some hints would have only the I/O aggregators open the file, has not seen a lot of testing over the years.  Turns out it was not working on Blue Gene. We re-worked the deferred open logic, and now it works again.   Codes that open a file only to do a small amount of I/O should see an improvement in open times with this approach.  Oddly, IOR does not show any benefit.  We’re still trying to figure that one out.
no more seeks: An individual lseek() system call is not so expensive on Blue Gene /Q.  However, if you have tens of thousands of lseek() system calls, they  interact with the outstanding read() and write() calls and can sometimes stall for a long time.  We have replaced ‘lseek() + read()’ and ‘lseek() + write()’ with pread() and pwrite().