August 5th, 2013
bglockless

Update: in MPICH-3.1.1 we finally scrapped bglockless (see this writeup on 3.1.1 and Blue Gene enhancements), but it’s still part of the system software on any BG /L, BG /P, or BG /Q machine.  The following writeup is perhaps of historical interest, but it will be a while (maybe never) before MPICH-3.1.1 is the default MPI on Blue Gene /Q.

The IBM BGP MPI-IO implementation is designed to the “lowest common denominator”: NFS. So they’re performing some very conservative locking in their ADIO file system driver in order to try to get correct MPI-IO semantics out of what might be an NFS volume underneath.  It’s possible, though, to select an alternate driver that gives better performance in most cases, and terrible, terrible performance in one specific case.
The MPI routine MPI_File_open takes a string “filename” argument. Normally, ROMIO does a stat of the file system to figure out what kind of file system that file lives on, and then selects a “file system driver” (one of the ADIO modules) that might contain file system specific optimizations.
If you provide a prefix, like “ufs:” for traditional unix files, or “pvfs2:” or even “gridftp:”, then that prefix overrides whatever magic detection routines ROMIO would run, and the corresponding “ADIO driver” will be selected.
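As a rough illustration of the prefix mechanism, here is a minimal sketch in C that builds a prefixed file name. The helper `prefixed_name` is hypothetical (it is not part of MPI or ROMIO), and the example path is made up; the only thing taken from the text above is the “driver:path” form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not part of MPI or ROMIO: writes "<driver>:<path>"
 * into out. Returns 0 on success, -1 if the result does not fit. */
int prefixed_name(char *out, size_t outlen, const char *driver, const char *path)
{
    assert(out != NULL && driver != NULL && path != NULL);

    /* snprintf returns the length the full string would have had. */
    int n = snprintf(out, outlen, "%s:%s", driver, path);
    if (n < 0 || (size_t)n >= outlen)
        return -1;   /* encoding error or truncated output */
    return 0;
}
```

Calling `prefixed_name(buf, sizeof buf, "bglockless", "/gpfs/out.dat")` would leave “bglockless:/gpfs/out.dat” in `buf`, and that string is what you would hand to MPI_File_open as its filename argument in place of the bare path.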
For Blue Gene /L (L, I tell you!) I wrote a ROMIO driver that made no explicit fcntl() lock calls.  Those lock calls are normally not a big deal, but PVFS v2 did not support fcntl() locks.   I called this driver ‘bglockless’.
Our friends at IBM, in a conservative effort to ensure correctness for all possible file systems, wrapped every I/O operation in an fcntl() lock.  90% of these locks were unnecessary and served only to slow down I/O.
So, the half-day “driver with no locks” project I wrote for PVFS took on a second life as the “make I/O go fast” driver.
Now here’s the catch, and why we can’t just make “bglockless” the default: certain I/O workloads, if locks are not available, must be carried out in an extremely inefficient manner, specifically strided independent writes to a file.  In addition, certain rarely used functionality, like shared file pointers and ordered mode operations, is not implemented when locks are disabled.
For Blue Gene /P and /Q, one can set the environment variable BGLOCKLESSMPIO_F_TYPE to 0x47504653 (the GPFS file system magic number). ROMIO will then pretend GPFS is like PVFS and not issue any fcntl() lock commands.
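The magic number is less arbitrary than it looks: read one byte at a time, most significant first, it is just ASCII text. A small sketch in C to show this (both function names are made up for illustration and are not ROMIO code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Unpack a 32-bit file system magic number into its four bytes,
 * most significant byte first, as a NUL-terminated string. */
void magic_to_text(uint32_t magic, char out[5])
{
    assert(out != NULL);
    out[0] = (char)((magic >> 24) & 0xFF);
    out[1] = (char)((magic >> 16) & 0xFF);
    out[2] = (char)((magic >> 8) & 0xFF);
    out[3] = (char)(magic & 0xFF);
    out[4] = '\0';
}

/* Returns 1 if the magic number spells the given four-character tag. */
int magic_spells(uint32_t magic, const char *tag)
{
    char text[5];
    magic_to_text(magic, text);
    return strcmp(text, tag) == 0;
}
```

With this, `magic_spells(0x47504653u, "GPFS")` is true: 0x47 is ‘G’, 0x50 is ‘P’, 0x46 is ‘F’ and 0x53 is ‘S’.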