Archive

Archive for the ‘tuning’ Category

aggregation selection on Blue Gene

May 15th, 2015

For a lot of workloads, simply using collective I/O provides a big performance boost.  Sometimes, though, it’s necessary to tune collective I/O a bit.  The hint “cb_nodes” provides a way to select how many MPI processes will become aggregators.   On Blue Gene, though, the story is a little more complicated.
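To make this concrete, here is roughly what passing that hint looks like from an application. This is just a minimal sketch: the file name “testfile” and the value of 8 aggregators are placeholders, not recommendations.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* Ask ROMIO for (at most) 8 aggregator processes.  Both the file name
     * and the value "8" are placeholders, not recommendations. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "8");

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... collective reads and writes go here ... */
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
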
We’ll start with Blue Gene /L and /P, even though those machines are now obsolete. The concepts on the older machines still apply, if in a slightly different form. The 163840 cores on the Intrepid BlueGene/P system are configured in a hierarchy. To improve the scalability of the BlueGene architecture, dedicated “I/O nodes” (ION) act as system call proxies between the compute nodes and the storage nodes. On Intrepid, we call the collection of an ION and its compute nodes a “pset”. Each Intrepid pset contains one ION and 64 4-core compute nodes.
The MPI standard defines ‘collective’ routines. Unlike the ‘independent’ routines, all processes in a given MPI communicator call the routine together. The MPI implementation, with the knowledge of which tasks participate in a call, can then perform significant optimizations. These collective routines provide tremendous performance benefits for both networking and I/O.
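On the I/O side, a collective write from the application’s point of view looks something like this sketch, with every rank contributing its own chunk in a single call. The file name and the 1 MiB chunk size are arbitrary choices for illustration.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1024 * 1024)   /* 1 MiB per process; arbitrary for this sketch */

int main(int argc, char **argv)
{
    int rank;
    char *buf;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(CHUNK);
    memset(buf, rank, CHUNK);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Every process in the communicator calls the collective write together;
     * each one contributes its own chunk at a rank-based offset, and the
     * MPI-IO layer is free to rearrange the traffic behind the scenes. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * CHUNK, buf, CHUNK,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
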
The BlueGene MPI-IO library, based on ROMIO, makes some adjustments to the ROMIO collective buffering optimization. First, data accesses are aligned to file system block boundaries. Such an alignment reduces lock contention in the write case and can yield big performance improvements.
Second, and perhaps most importantly from a scalability perspective, the “I/O aggregators” selected for the I/O phase of two-phase are a small subset of the total number of processors. On BlueGene, the MPI-IO hint “bgl_nodes_pset” defines a ratio: for each pset allocated to the job, that many nodes will be designated as aggregators. The default ratio for a job running in “virtual node” mode is one aggregator for every 32 MPI processes. Furthermore, these aggregators are distributed over the topology of the application so that no node has more than one aggregator and no pset contains more than “bgl_nodes_pset” aggregators.
On Mira (Blue Gene/Q) the story is a bit more complicated. I/O nodes are no longer statically assigned to compute nodes. Rather, there is a pool of I/O nodes, and when a job is launched, some portion of those I/O nodes gets assigned to the job’s compute nodes.
On Mira, a set of 128 compute nodes (known as a pset) has one I/O node acting as an I/O proxy. For every I/O node there are two 2 GB/s network links toward two distinct compute nodes acting as bridges. Therefore, for every 128-node partition, there are nb = 1 × 2 = 2 bridges. The I/O traffic from compute nodes passes through these bridge nodes on the way to the I/O node. The I/O nodes are connected to the storage servers through quad-data-rate (QDR) InfiniBand links. On BG/Q the programmer can set the number of aggregators per pset, na_pset (the hint has been renamed to “bg_nodes_pset”). One can determine the total number of aggregators na for an application knowing na_pset, the number of compute nodes n, and nb with the following equation:

[Equation shown as an image in the original post; caption: Computing the number of aggregators on Blue Gene is… not straightforward]

The number of bridge nodes is hardware dependent. Among the Argonne machines, Mira’s nb is always 1, while on Vesta it is 4 and on Cetus it is 8.
Sophisticated applications wishing to do their own I/O subsetting should be aware of these default parameters and optimizations. In some cases, applications will try to subset down to a small number of nodes and find greatly reduced I/O performance.
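Rather than trying to work out the aggregator count by hand, one practical option is to ask the library what it actually decided: ROMIO reports its effective hints, including “cb_nodes”, through MPI_File_get_info. Here is a minimal sketch; the file name is a placeholder, and the exact set of keys returned varies by platform and MPI version.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nkeys, i, flag;
    char key[MPI_MAX_INFO_KEY + 1], value[MPI_MAX_INFO_VAL + 1];
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Ask the library what it settled on; "cb_nodes" in this list is the
     * effective number of aggregators for this file. */
    MPI_File_get_info(fh, &info);

    if (rank == 0) {
        MPI_Info_get_nkeys(info, &nkeys);
        for (i = 0; i < nkeys; i++) {
            MPI_Info_get_nthkey(info, i, key);
            MPI_Info_get(info, key, MPI_MAX_INFO_VAL, value, &flag);
            if (flag)
                printf("%s = %s\n", key, value);
        }
    }

    MPI_Info_free(&info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
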

tuning

ROMIO and Intel-MPI

June 12th, 2014

ROMIO, in various forms, provides the MPI-IO implementation for just about every MPI implementation out there. These implementations incorporate ROMIO’s hints when they pick up our source code, but they also add their own tuning parameters via environment variables.
The Intel MPI library uses ROMIO, but configures the file-system-specific drivers a bit differently. In MPICH, we select which file system drivers to support at compile time with the --with-file-system configure flag. These selected drivers are compiled directly into the MPICH library. Intel-MPI builds its file-system drivers as loadable modules, and relies on two environment variables to enable and select them:

  • I_MPI_EXTRA_FILESYSTEM
  • I_MPI_EXTRA_FILESYSTEM_LIST

Let’s say you had a Lustre file system, like this fellow on the HDF5 mailing list [edit: archives have moved]. Then you would invoke mpiexec like this:

 mpiexec -env I_MPI_EXTRA_FILESYSTEM on \
        -env I_MPI_EXTRA_FILESYSTEM_LIST lustre -n 2 ./test

I found this information in the Intel MPI Library Reference Manual, which contains a ton of other tuning parameters.
(Update 12 May 2015): Intel MPI 5.0.2 and newer have GPFS support. One would enable it the same way, with “gpfs” in I_MPI_EXTRA_FILESYSTEM_LIST:

 mpiexec -env I_MPI_EXTRA_FILESYSTEM on \
        -env I_MPI_EXTRA_FILESYSTEM_LIST gpfs

(Update 7 April 2023): Intel MPI has added a few more file systems: Panasas (panfs) and DAOS (daos) are supported too.

gpfs, intel-mpi, lustre, tuning

Deferred Open

August 5th, 2003

When I came to Argonne in 2002, my second project was to implement “deferred open”, where we would skip opening the file if certain hints were given. We never got around to writing a paper about this optimization, though. There’s a brief mention in the ROMIO users guide, but it wouldn’t hurt to have a bit more documentation about this feature.
First, some background. ROMIO has an optimization for collective I/O called “two-phase collective buffering”. When writing, ROMIO selects a subset of processes as “I/O aggregators”. These aggregators are the MPI processes that actually write data to the file, after collecting data from all the other processes. When reading, the I/O aggregators read the data in some file-system-friendly way, then scatter it out to the other MPI processes. Observe that in two-phase, the non-aggregator processes never touch the file. We use this observation to implement a deferred open strategy for non-aggregators.
To enable deferred open, two hint conditions must be true (see the sketch after this list):

  • romio_cb_write and romio_cb_read must not be “disable”. Neither one defaults to “disable” on any file system, though, so it’s rare to find this condition unmet.
  • romio_no_indep_rw must be “true”.  With this hint, the user has told ROMIO “I will not do any independent I/O”.   ROMIO will then attempt to avoid opening the file on any non-aggregator processes.
  • optional: The cb_config_list and cb_nodes hints can be given to further control which nodes are aggregators
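Putting the pieces together, here is a minimal sketch of what enabling deferred open looks like from the application side. The file name, the “*:1” cb_config_list value (one aggregator per host), and the toy collective write are placeholders for illustration.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    char buf[8] = "example";
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Promise ROMIO that only collective I/O will happen on this file. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_no_indep_rw", "true");
    MPI_Info_set(info, "cb_config_list", "*:1");  /* optional: one aggregator per host */

    /* Non-aggregator processes can return from this open without ever
     * touching the file. */
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Stick to collective operations; an independent write would still work,
     * but it would force the deferred open to actually happen. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * (MPI_Offset)sizeof(buf), buf,
                          (int)sizeof(buf), MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
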

The “deferred” part comes from the fact that MPI Info tunables are hints, not contracts. The user might lie to ROMIO, setting “romio_no_indep_rw” to “true” and then going right ahead with a bunch of independent I/O operations. In that case, ROMIO will open the file just before the independent I/O operation happens: we say the open has been deferred.
 

features, tuning