{"id":277,"date":"2015-05-15T15:46:14","date_gmt":"2015-05-15T15:46:14","guid":{"rendered":"http:\/\/press3.mcs.anl.gov\/romio\/?p=277"},"modified":"2015-05-15T15:46:14","modified_gmt":"2015-05-15T15:46:14","slug":"aggregation-selection-on-blue-gene","status":"publish","type":"post","link":"https:\/\/wordpress.cels.anl.gov\/romio\/2015\/05\/15\/aggregation-selection-on-blue-gene\/","title":{"rendered":"aggregation selection on Blue Gene"},"content":{"rendered":"<p>For a lot of workloads, simply using collective I\/O provides a big performance boost.\u00a0 Sometimes, though, it&#8217;s necessary to tune collective I\/O a bit.\u00a0 The hint &#8220;cb_nodes&#8221; provides a way to select how many MPI processes will become aggregators.\u00a0\u00a0 On Blue Gene, though, the story is a little more complicated.<br \/>\nWe&#8217;ll start with Blue Gene \/L and \/P, even though those machines are now obsolete. The concepts on the older machines still apply, if in a slightly different form. The 163840 cores on the Intrepid BlueGene\/P system are configured in a hierarchy. To improve the scalability of the BlueGene architecture, dedicated &#8220;I\/O nodes&#8221; (ION) act as system call proxies between the compute nodes and the storage nodes. On Intrepid, we call the collection of an ION and its compute nodes a &#8220;pset&#8221;. Each Intrepid pset contains one ION and 64 4-core compute nodes.<br \/>\nThe MPI standard defines &#8216;collective&#8217; routines. Unlike the &#8216;independent&#8217; routines, all processes in a given MPI communicator call the routine together. The MPI implementation, with the knowledge of which tasks participate in a call, can then perform significant optimizations. These collective routines provide tremendous performance benefits for both networking and I\/O.<br \/>\nThe BlueGene MPI-IO library, based on ROMIO, makes some adjustments to the ROMIO collective buffering optimization. 
First, data accesses are aligned to file system block boundaries. Such an alignment reduces lock contention in the write case and can yield big performance improvements.<br \/>\nSecond, and perhaps most importantly from a scalability perspective, the &#8220;I\/O aggregators&#8221; selected for the I\/O phase of two-phase I\/O are a small subset of the total number of processors. On BlueGene, the MPI-IO hint &#8220;bgl_nodes_pset&#8221; defines a ratio. For each pset allocated to a job, that many nodes will be designated as aggregators. The default ratio for a job running in &#8220;virtual node&#8221; mode is one aggregator for every 32 MPI processes. Furthermore, these aggregators are distributed over the topology of the application so that no node has more than one aggregator and no pset contains more than &#8220;bgl_nodes_pset&#8221; aggregators.<br \/>\nOn Mira (Blue Gene \/Q) the story is a bit more complicated. I\/O nodes are no longer statically assigned to compute nodes. Rather, there is a pool of I\/O nodes. When a job is launched, some portion of those I\/O nodes gets assigned to the compute nodes.<br \/>\nOn Mira, a set of 128 compute nodes (known as a pset) has one I\/O node acting as an I\/O proxy. For every I\/O node there are two network links of 2 GB\/s toward two distinct compute nodes acting as bridges. Therefore, for every 128-node partition, there are n<sub>b<\/sub> = 1 \u00d7 2 = 2 bridges. The I\/O traffic from compute nodes passes through these bridge nodes on the way to the I\/O node. The I\/O nodes are connected to the storage servers through quad-data-rate (QDR) InfiniBand links. 
On BG\/Q the programmer can set the number of aggregators per pset n<sub>a_pset<\/sub> (the hint on BG\/Q has been renamed to &#8220;bg_nodes_pset&#8221;).\u00a0 One can determine the total number of aggregators of an application n<sub>a<\/sub> knowing n<sub>a_pset<\/sub>, n, and n<sub>b<\/sub> with the following equation:<br \/>\n<figure id=\"attachment_279\" aria-describedby=\"caption-attachment-279\" style=\"width: 224px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/wordpress.cels.anl.gov\/romio\/wp-content\/uploads\/sites\/31\/2015\/05\/CodeCogsEqn.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-279\" src=\"https:\/\/wordpress.cels.anl.gov\/romio\/wp-content\/uploads\/sites\/31\/2015\/05\/CodeCogsEqn.png\" alt=\"Computing the number of aggregators on Blue Gene is... not straightforward\" width=\"224\" height=\"34\" \/><\/a><figcaption id=\"caption-attachment-279\" class=\"wp-caption-text\">Computing the number of aggregators on Blue Gene is&#8230; not straightforward<\/figcaption><\/figure><br \/>\nThe number of bridge nodes is hardware dependent.\u00a0 For the Argonne machines,\u00a0 Mira&#8217;s\u00a0 n<sub>b<\/sub> is always 1, but on Vesta, it&#8217;s 4 and on Cetus it is 8.<br \/>\nSophisticated applications wishing to do their own I\/O subsetting should be aware of these default parameters and optimizations. In some cases, applications will try to subset to a small number of nodes and find greatly reduced I\/O performance.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>For a lot of workloads, simply using collective I\/O provides a big performance boost.\u00a0 Sometimes, though, it&#8217;s necessary to tune collective I\/O a bit.\u00a0 The hint &#8220;cb_nodes&#8221; provides a way to select how many MPI processes will become aggregators.\u00a0\u00a0 On Blue Gene, though, the story is a little more complicated. 
We&#8217;ll start with Blue Gene &hellip;<\/p>\n","protected":false},"author":362,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[11],"tags":[],"class_list":["post-277","post","type-post","status-publish","format-standard","hentry","category-tuning"],"acf":[],"_links":{"self":[{"href":"https:\/\/wordpress.cels.anl.gov\/romio\/wp-json\/wp\/v2\/posts\/277","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.cels.anl.gov\/romio\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wordpress.cels.anl.gov\/romio\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.cels.anl.gov\/romio\/wp-json\/wp\/v2\/users\/362"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.cels.anl.gov\/romio\/wp-json\/wp\/v2\/comments?post=277"}],"version-history":[{"count":0,"href":"https:\/\/wordpress.cels.anl.gov\/romio\/wp-json\/wp\/v2\/posts\/277\/revisions"}],"wp:attachment":[{"href":"https:\/\/wordpress.cels.anl.gov\/romio\/wp-json\/wp\/v2\/media?parent=277"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wordpress.cels.anl.gov\/romio\/wp-json\/wp\/v2\/categories?post=277"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wordpress.cels.anl.gov\/romio\/wp-json\/wp\/v2\/tags?post=277"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}