New ROMIO optimizations for Blue Gene/Q

June 5, 2014 by Robert J. Latham

The IBM and Argonne teams have been digging into ROMIO’s collective I/O performance on the Mira supercomputer. These optimizations made it into the MPICH 3.1.1 release, so it seemed like a good time to write them up.
No more “bglockless”: for Blue Gene/L and Blue Gene/P we wrote a ROMIO driver that never called fcntl-style user-space locks. This approach worked great for PVFS, which did not support locks anyway, and had the pleasant side effect of improving performance on GPFS too (as long as you did not care about specific workloads and MPI-IO features). Now we have removed all the extraneous locks from the default I/O driver. Even better, we kept the locks in the few cases where they are needed: shared file pointers and data sieving writes. One no longer needs to prefix the file name with ‘bglockless:’ or set the BGLOCKLESSMPIO_F_TYPE environment variable. It’s the way it should have been 5 years ago.
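For concreteness, here is a minimal sketch (the GPFS path is made up, and the program is not taken from ROMIO) of what an open looks like now: a plain MPI_File_open, with no “bglockless:” prefix and no BGLOCKLESSMPIO_F_TYPE in the environment.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;

        MPI_Init(&argc, &argv);

        /* hypothetical GPFS path; no "bglockless:" prefix and no
         * BGLOCKLESSMPIO_F_TYPE environment variable needed */
        MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }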
Alternate Aggregator Selection: collective I/O on Blue Gene has long been the primary way to extract maximum performance. One good optimization is how ROMIO’s two-phase collective I/O deals with GPFS file system block alignment. Even better is how it selects a subset of MPI processes to carry out I/O; the other MPI processes route their I/O through these “I/O aggregators”. On Blue Gene, there are some new ways to select which MPI processes should be aggregators (a sketch of the kind of collective write these settings target follows the list):

  • Default: the N I/O aggregators are assigned depth-first based on connections to the I/O forwarding node. If a file is not very large, we can end up with many active I/O aggregators assigned to one of these I/O nodes, and some I/O nodes with only idle I/O aggregators.
  • “Balanced”: set the environment variable GPFSMPIO_BALANCECONTIG to 1 and the I/O aggregators will be selected in a more balanced fashion. With this setting, even small files will be assigned I/O aggregators across as many I/O nodes as possible. (There’s a limit: we don’t split file domains any smaller than the GPFS block size.)
  • “Point-to-point”: the general two-phase algorithm is built to handle the case where any process might want to send data to or receive data from any I/O aggregator. For simple I/O cases we want the benefits of collective I/O (aggregation to a subset of processes, file system alignment) without the full overhead of potential “all to all” traffic. Set the environment variable GPFSMPIO_P2PCONTIG to 1, and if certain workload conditions are met (the data is contiguous, ranks write to the file in order so that lower MPI ranks write to earlier parts of the file, and the data has no holes), ROMIO will carry out point-to-point communication between each I/O aggregator and the much smaller subset of processes assigned to it.
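To make those workload conditions concrete, here is a hedged sketch (the path, sizes, and data are invented, and nothing in the code is required by ROMIO) of the kind of collective write these selection schemes target: every rank writes one contiguous, rank-ordered, hole-free block, while GPFSMPIO_BALANCECONTIG or GPFSMPIO_P2PCONTIG would be exported in the job environment rather than set in the code.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int count = 1 << 20;   /* elements per rank; size is arbitrary */
        int rank;
        double *buf;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = calloc(count, sizeof(double));   /* placeholder data */

        /* hypothetical GPFS path; MPI_INFO_NULL because the new aggregator
         * selection is driven by environment variables, not hints */
        MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* contiguous, in rank order, no holes: rank i writes block i */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * count * sizeof(double),
                              buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }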

We don’t have MPI Info hints for these yet, since they are so new. Once we have some more experience using them, we can provide hints and guidance on when those hints might make sense. For now, these optimizations are enabled only through the environment variables.
Deferred Open revisited: the old “deferred open” optimization, where specifying certain hints would have only the I/O aggregators open the file, has not seen a lot of testing over the years. It turns out it was not working on Blue Gene. We reworked the deferred open logic, and now it works again. Codes that open a file only to do a small amount of I/O should see an improvement in open times with this approach. Oddly, IOR does not show any benefit; we’re still trying to figure that one out.
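For readers who have not tried it, here is a minimal sketch of requesting deferred open through ROMIO’s existing romio_no_indep_rw hint; the path is hypothetical, and the exact triggering conditions can depend on the ROMIO version and the other collective-buffering hints in effect.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* promise that only collective I/O will be performed, which is what
         * lets ROMIO defer the open to the I/O aggregators */
        MPI_Info_set(info, "romio_no_indep_rw", "true");

        MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* ... collective reads and writes only ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }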
No more seeks: an individual lseek() system call is not so expensive on Blue Gene/Q. However, if you have tens of thousands of lseek() system calls, they interact with the outstanding read() and write() calls and can sometimes stall for a long time. We have replaced ‘lseek() + read()’ and ‘lseek() + write()’ with pread() and pwrite().
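Purely to illustrate the change (these helper functions are a sketch, not ROMIO code), the old and new access patterns look like this:

    #include <unistd.h>

    /* old style: two system calls per access */
    ssize_t read_with_seek(int fd, void *buf, size_t len, off_t offset)
    {
        if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
            return -1;
        return read(fd, buf, len);
    }

    /* new style: one positioned read, with no separate lseek() to stall
     * behind other outstanding I/O */
    ssize_t read_positioned(int fd, void *buf, size_t len, off_t offset)
    {
        return pread(fd, buf, len, offset);
    }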