ROMIO

aggregation selection on Blue Gene

May 15, 2015 by Latham, Robert J.

For a lot of workloads, simply using collective I/O provides a big performance boost. Sometimes, though, it’s necessary to tune collective I/O a bit. The hint “cb_nodes” provides a way to select how many MPI processes will become aggregators. On Blue Gene, though, the story is a little more complicated.
We’ll start with Blue Gene/L and /P, even though those machines are now obsolete. The concepts from the older machines still apply, if in a slightly different form. The 163,840 cores of the Intrepid Blue Gene/P system are configured in a hierarchy. To improve the scalability of the Blue Gene architecture, dedicated “I/O nodes” (IONs) act as system call proxies between the compute nodes and the storage nodes. On Intrepid, we call the collection of an ION and its compute nodes a “pset”. Each Intrepid pset contains one ION and 64 4-core compute nodes.
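As a quick check on the numbers above, the sketch below works out how many psets (and therefore I/O nodes) the quoted Intrepid figures imply. The variable names are illustrative only.

```python
# Figures quoted above for Intrepid (Blue Gene/P).
TOTAL_CORES = 163840
NODES_PER_PSET = 64
CORES_PER_NODE = 4

cores_per_pset = NODES_PER_PSET * CORES_PER_NODE   # cores served by one I/O node
compute_nodes = TOTAL_CORES // CORES_PER_NODE      # total compute nodes
psets = TOTAL_CORES // cores_per_pset              # one I/O node per pset

print(f"{compute_nodes} compute nodes, {psets} psets / I/O nodes")
```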
The MPI standard defines ‘collective’ routines. Unlike the ‘independent’ routines, all processes in a given MPI communicator call the routine together. The MPI implementation, with the knowledge of which tasks participate in a call, can then perform significant optimizations. These collective routines provide tremendous performance benefits for both networking and I/O.
The Blue Gene MPI-IO library, based on ROMIO, makes some adjustments to the ROMIO collective buffering optimization. First, data accesses are aligned to file system block boundaries. Such an alignment reduces lock contention in the write case and can yield big performance improvements.
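To illustrate the alignment idea, here is a minimal sketch that splits a byte range of a file among aggregators so that every internal boundary falls on a file system block boundary. This is not ROMIO's actual code: the function name `aligned_file_domains`, the 4 MiB block size, and the even-split policy are all assumptions made for the example.

```python
def aligned_file_domains(start, end, n_aggregators, block_size):
    """Split [start, end) into at most n_aggregators contiguous pieces
    whose internal boundaries are multiples of block_size.

    Hypothetical helper for illustration; not ROMIO's algorithm.
    """
    if end <= start:
        return []
    # Count the blocks touched by the access, including partial ones.
    first_block = start // block_size
    last_block = (end - 1) // block_size
    n_blocks = last_block - first_block + 1
    # Hand out whole blocks as evenly as possible.
    base, extra = divmod(n_blocks, n_aggregators)
    domains = []
    block = first_block
    for i in range(n_aggregators):
        count = base + (1 if i < extra else 0)
        if count == 0:
            break  # fewer blocks than aggregators
        lo = max(start, block * block_size)
        hi = min(end, (block + count) * block_size)
        domains.append((lo, hi))
        block += count
    return domains


BLOCK = 4 * 1024 * 1024  # assumed file system block size
for lo, hi in aligned_file_domains(1_000_000, 20_000_000, 4, BLOCK):
    print(lo, hi)
```

Because each block belongs to exactly one piece, no two aggregators in this sketch ever write into the same file system block, which is the situation that would otherwise lead to lock contention.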
Second, and perhaps most importantly from a scalability perspective, the “I/O aggregators” selected for the I/O phase of two-phase I/O are a small subset of the total number of processes. On Blue Gene, the MPI-IO hint “bgl_nodes_pset” defines a ratio: for each pset allocated to a job, that many nodes will be designated as aggregators. The default ratio for a job running in “virtual node” mode is one aggregator for every 32 MPI processes. Furthermore, these aggregators are distributed over the topology of the application so that no node has more than one aggregator and no pset contains more than “bgl_nodes_pset” aggregators.
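The default ratio can be checked with a little arithmetic. The sketch assumes an Intrepid-style pset (64 compute nodes with 4 cores each, as described earlier) and that virtual node mode runs one MPI process per core; the helper name `default_aggregators` is hypothetical.

```python
NODES_PER_PSET = 64          # Intrepid pset size
PROCS_PER_NODE_VN = 4        # virtual node mode: one MPI process per core (assumed)
PROCS_PER_AGGREGATOR = 32    # default ratio quoted above


def default_aggregators(n_psets):
    """Aggregators for a job spanning n_psets full psets (illustrative)."""
    procs_per_pset = NODES_PER_PSET * PROCS_PER_NODE_VN
    aggs_per_pset = procs_per_pset // PROCS_PER_AGGREGATOR
    return n_psets * aggs_per_pset


print(default_aggregators(1))
print(default_aggregators(16))
```

Under these assumptions the default works out to 8 aggregators per pset, each on a different node of that pset.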
On Mira (Blue Gene/Q) the story is a bit more complicated. I/O nodes are no longer statically assigned to compute nodes. Rather, there is a pool of I/O nodes. When a job is launched, some portion of those I/O nodes gets assigned to the compute nodes.
On Mira, a set of 128 compute nodes (known as a pset) has one I/O node acting as an I/O proxy. For every I/O node there are two network links of 2 GB/s, each leading to a distinct compute node that acts as a bridge. Therefore, for every 128-node partition, there are nb = 1 × 2 = 2 bridges. The I/O traffic from compute nodes passes through these bridge nodes on the way to the I/O node. The I/O nodes are connected to the storage servers through quad-data-rate (QDR) InfiniBand links. On BG/Q the programmer can set the number of aggregators per pset, na_pset (the hint on BG/Q has been renamed to “bg_nodes_pset”). One can determine the total number of aggregators of an application, na, knowing na_pset, n, and nb with the following equation:

[Equation image: “Computing the number of aggregators on Blue Gene is… not straightforward”]

The number of bridge nodes is hardware dependent. For the Argonne machines, Mira’s nb is always 1, but on Vesta it is 4 and on Cetus it is 8.
Sophisticated applications wishing to do their own I/O subsetting should be aware of these default parameters and optimizations. In some cases, applications will try to subset to a small number of nodes and find greatly reduced I/O performance.
