## December 13, 2021

1 *Materials Science and Chemistry at Exascale: Challenges and Opportunities*, **Jack Deslippe**, Lawrence Berkeley Lab

The end of Dennard scaling (the ability to keep increasing processor frequency at constant power density) and the approaching end of Moore's Law (transistor density scaling) are leading to significant changes in computer architecture. As the High Performance Computing (HPC) community marches towards "Exascale" (systems with 10^18 peak floating-point operations per second), the Top500 list of the world's HPC systems is increasingly dominated by supercomputers employing accelerators (e.g., GPUs) and so-called many-core energy-efficient processors. Many experts in the field predict a growing diversity of computer architectures, exemplified by recent hardware like Google TPUs and Amazon's custom ARM processors, with neuromorphic and quantum computers possibly on the horizon.

How will all this affect the computational materials science community? I'll discuss the important trends and speculate on the opportunities our community will need to seize, and the challenges it must overcome, to continue to push the limits of scale and fidelity in scientific predictions and data analysis. I'll discuss the materials science problems being tackled at today's largest scales and being readied for the first US exascale systems, both in simulation and modeling and in data analysis from current and next-generation light sources. I'll present case studies of particular science applications that are combining new methods and algorithms with HPC scale to study systems with unprecedented size, accuracy, and timescales.

2 *Readying Quantum Monte Carlo for Exascale*, **Paul Kent**, Oak Ridge National Laboratory

The diversity and scale of upcoming Exascale architectures pose a severe challenge to scientific codes. High performance, performance portability, and a minimum of sustained developer effort are desired. At the same time, existing architectures must remain supported and keep running at high efficiency. Readying the Quantum Monte Carlo code QMCPACK for Exascale has required a redesign of the main algorithms, large parts of the implementation, and the adoption of improved development practices. Due to these changes, QMCPACK is now on track to run with record performance on many accelerated architectures from a largely common codebase. Importantly, the algorithms that have been adopted will enable problems with a wide range of electron counts to run with high efficiency, including molecules and two-dimensional nanomaterials that are typically smaller in electron count than solid-state problems. We believe that other electronic structure and quantum chemistry packages will benefit from adopting similar strategies for portability.

3 *Exploring new hardware and software with NWChem*, **Jeff Hammond**, NVIDIA Corporation

NWChem, the greatest of all quantum chemistry software packages, is an excellent tool for evaluating new programming models and for benchmarking new systems. This talk will present some of my recent results with NWChem on ARM CPUs and NVIDIA GPUs. On the ARM side, I use the atomic orbital DFT code to compare the Ampere Altra Q80 server processor, based on the Neoverse N1 core, with x86 server processors from Intel and AMD. These experiments don't require any code changes, as NWChem has worked out of the box on ARM platforms since 2014. On the GPU side, I've ported CCSD(T) methods to A100 GPUs using Fortran standard parallelism, OpenACC, OpenMP, and cuTENSOR, to evaluate the performance-portability tradeoffs between ISO languages, directives, and performance libraries.

4 *Intel® oneAPI and Intel® oneAPI Math Kernel Library Overview*, **Marius Cornea**, Intel Corporation

The presentation will describe oneAPI, an initiative led by Intel and backed by many members of industry and academia. Intel® oneAPI is based on widely used standards and aims to create a language and programming model that make it easy for users to program once for many types of Intel (and non-Intel) hardware devices. We will talk about the challenges of programming for multiple architectures and vendors, and will point users to the Intel® oneAPI Toolkit, a complete set of developer tools for CPUs, GPUs, FPGAs, and other accelerators. We will also describe how oneAPI can enable very good performance, mainly through optimized libraries. An example of such a library is oneMKL, the oneAPI evolution of Intel® MKL, the Intel® Math Kernel Library. Finally, we will present some of its characteristics along with a few examples.

5 *Template Task Graph: Novel Programming Model for High-Performance Scientific Computation*, **Ed Valeyev**, Virginia Tech

This talk will present Template Task Graph (TTG), a novel flowgraph programming model, and its open-source C++ implementation, which, by marrying the ideas of control and data flowgraph programming, supports compact specification and efficient distributed execution of dynamic and irregular applications. TTG offers ease of composition without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to efficiently manage task creation and execution as well as data and resource management. The current TTG implementation supports distributed-memory execution over two different task runtimes, PaRSEC and MADNESS. The performance of several paradigmatic applications with various degrees of irregularity implemented in TTG will be illustrated on large distributed-memory platforms and compared to state-of-the-art implementations.

6 *Multireference Electronic Structure with Graphics Processing Units*, **Bryan Fales**, Stanford University

Direct nonadiabatic molecular dynamics simulations provide an efficient way of modeling nonequilibrium physical processes. The potential energy surfaces required for these simulations are often constructed from multireference, wave-function-based ab initio electronic structure (EST) calculations. In many cases formation of these potentials dominates the cost of the simulation. Graphics processing units (GPUs) are well suited for accelerating many of these EST calculations, provided care is taken to recast the problem in a framework amenable to parallel computation. The zeroth-order multireference method, complete active space self-consistent field (CASSCF), relies on configuration interaction (CI) for its multireference character. In this presentation we'll review GPU parallelization of the CI method, then turn our attention to the solution of one of the long-standing problems faced by CI when a Slater determinant basis is used: the propensity for electronic spin contamination.

7 *NWChemEx: High performance computational chemistry for the future*, **Wibe de Jong**, Lawrence Berkeley National Laboratory

The scientific complexity and high levels of fidelity required to solve many current and future chemistry, biochemistry, and materials challenges require software advances both in scalability across modern computer architectures and in reducing the explicit scaling dependence on the size of the molecular system. As part of the Exascale Computing Project funded by the Department of Energy's Advanced Scientific Computing Research program, we have been developing the NWChemEx software to take advantage of both of these advances and to produce a flexible, extensible software development architecture for the molecular sciences. This talk will present some of the current bottlenecks that must be overcome for exascale computing and the approaches used in NWChemEx to overcome them – such as execution-schedule-aware tensor frameworks, composable simulations, and code generation targeting accelerator-based hardware. In addition, current capabilities with both canonical and reduced-scaling local methods, and current performance, will be discussed.

8 *Data Parallel C++ (DPC++) as a heterogeneous programming model*, **Abhishek Bagusetty**, Argonne National Laboratory

DPC++ is a cross-architecture programming language, based on C++ and SYCL and part of oneAPI, an industry initiative to develop an open, high-level language that can unify and simplify application development across diverse computing architectures. This talk discusses execution and memory models and dives into the fundamental building blocks of the DPC++ programming model, including default selection and queues, buffers, command-group function objects, accessors, device kernels, and Intel-specific extensions.

## December 14, 2021

9 *The CECAM Electronic Structure Library*, **Micael Oliveira**, Max Planck Institute for the Structure and Dynamics of Matter

The CECAM Electronic Structure Library (ESL) is a broad collaboration of electronic structure software developers, brought together in an effort to factor out shared pieces of software into libraries that can be contributed to and used by the community. Besides allowing developers to share the burden of developing and maintaining complex pieces of software, these libraries can also become targets for re-coding by software engineers as hardware evolves, ensuring that electronic structure codes remain at the forefront of HPC trends. In this talk I will discuss the different aspects and aims of the ESL, how it has developed since its inception in 2014, and the current status of the different projects that are part of it.

10 *Libraries for electronic structure community*, **Anton Kozhevnikov**, CSCS/ETHZ

The complexity of HPC platforms and programming models should be reflected in the way we develop scientific software. Modularity of the codes and separation of concerns are key. In this talk I'll give a short overview of the various libraries created or co-designed at CSCS for the electronic structure community.

11 *Simplifying Multilevel Quantum Chemistry Procedures through Psi4 and QCArchive*, **Lori Burns**, Georgia Tech

The Psi4 quantum chemistry (QC) program is reworking its outer Python layer to facilitate high-throughput computing. For users, this allows naturally parallel procedures such as composite methods or many-body routines to run in parallel with minimal changes to the input. Central to this effort is interfacing with The Molecular Sciences Software Institute’s Quantum Chemistry Archive (QCA) project to provide database storage and promote standard interfaces for communication between software projects in the field. Capabilities to call Psi4 and other QC programs through increasingly uniform input suitable for software generation will also be discussed.

12 *Expanding Analysis Paradigms for Systems with Many-body Correlations*, **Aurora Clark**, Washington State University

Within real physical systems all particles interact, whether it is in the correlated motions of electrons or the self-assembly of amphiphilic molecules to form an emulsion. High performance computing power combined with the latest advancements of data science are now creating computationally tractable approaches that account for the full impact of many-body interactions across length and timescales. This work will highlight advances to discrete and computational mathematics, and the integration of new mathematics methods with physics-aware learning strategies, that have the potential to extend the impact of many-body theories beyond traditional domains of electronic structure theory, and into complex condensed matter systems.

13 *Programming Exascale machines with performance portable OpenMP*, **Ye Luo**, Argonne National Laboratory

As Exascale machines are being developed around the world, applications need to be ported using performance-portable programming models. Easy-to-use OpenMP offers a rich set of features including threading, tasking, SIMD, and GPU offload to help developers achieve optimal performance by making the best use of all available resources, while keeping source code neat with directives and minimal API calls. It is one of the popular choices in the scientific community. We will show how OpenMP can be used and also demonstrate its performance in real-world applications.

14 *Efficient Tensor Methods and Software for Simulation of Quantum Systems*, **Edgar Solomonik**, University of Illinois at Urbana-Champaign

We describe advances in tensor methods and software for approximate modelling of electronic structure in quantum chemistry and condensed matter physics. On the algorithms side, we propose approximation techniques and a new optimization method to accelerate tensor decompositions. We also describe a new general mechanism for tensor contractions with Abelian group symmetry. On the software side, we introduce new libraries for large scale tensor computations: Cyclops, Koala, and AutoHOOT. Koala uses tensor network states (projected entangled pair states) to simulate time evolution of a quantum system. AutoHOOT provides efficient high-order automatic differentiation for tensor optimization problems. Both libraries leverage Cyclops to achieve distributed-memory parallelization of sparse and dense tensor operations.
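To make the tensor-decomposition setting concrete, here is a minimal NumPy sketch of CP (canonical polyadic) decomposition fitted by alternating least squares (ALS), the classic baseline that accelerated optimization methods of the kind described in the talk improve upon. This is an illustrative toy, not code from Cyclops, Koala, or AutoHOOT; the function names are chosen here for exposition.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: bring `mode` to the front, flatten the rest (C order)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product of factor matrices A (I x R) and B (J x R)."""
    R = A.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, R)

def cp_als(T, rank, n_iter=200, seed=0):
    """Fit a rank-`rank` CP model T[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r] by ALS."""
    rng = np.random.default_rng(seed)
    F = [rng.standard_normal((s, rank)) for s in T.shape]
    for _ in range(n_iter):
        for n in range(3):
            others = [F[m] for m in range(3) if m != n]  # keep original mode order
            kr = khatri_rao(others[0], others[1])
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            # Normal-equation solve of the linear least-squares subproblem.
            F[n] = unfold(T, n) @ kr @ np.linalg.pinv(gram)
    return F

# Exactly rank-2 synthetic tensor: ALS should recover it to high accuracy.
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((s, 2)) for s in (4, 5, 6))
T = np.einsum('ir,jr,kr->ijk', A, B, C)
Fa, Fb, Fc = cp_als(T, rank=2)
T_hat = np.einsum('ir,jr,kr->ijk', Fa, Fb, Fc)
rel_err = np.linalg.norm(T - T_hat) / np.linalg.norm(T)
```

Each ALS sweep solves one linear least-squares problem per mode while the other factors are held fixed; the accelerated and structure-exploiting variants discussed in the talk attack exactly these repeated solves and contractions.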

15 *Fast Coulomb matrix construction via a hierarchical block low-rank representation of the ERI tensor*, **Edmond Chow**, Georgia Institute of Technology

The continuous fast multipole method (CFMM) is well known for its asymptotically linear complexity for constructing the Coulomb matrix in quantum chemistry. However, in practice, CFMM must evaluate a large number of interactions directly, being unable to utilize multipole expansions for interactions between overlapping continuous charge distributions. Instead of multipole expansions, we propose a technique for compressing the interactions between charge distributions into low-rank form, resulting in far fewer interactions that must be computed directly. We apply the H2 hierarchical matrix representation to the electron repulsion integral (ERI) tensor with Gaussian basis sets to rapidly calculate the Coulomb matrices in Hartree-Fock and density functional theory calculations. The hierarchical matrix approach has very modest storage requirements, allowing large calculations to be performed in memory without recomputing ERIs. Like CFMM, the hierarchical matrix approach is asymptotically linear scaling, but the latter requires severalfold less memory (or severalfold less computation, if quantities are computed dynamically) due to being able to efficiently employ low-rank approximations for far more blocks.
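The premise of hierarchical (H2) compression, that interaction blocks between well-separated charge groups are numerically low-rank, can be illustrated with a small NumPy experiment. This is a generic demonstration with point charges, not the talk's compression algorithm or the actual ERI tensor (which involves Gaussian charge distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated clusters of point charges (a stand-in for distant
# charge distributions in an H2-partitioned Coulomb matrix).
X = rng.standard_normal((60, 3))                            # cluster near origin
Y = rng.standard_normal((60, 3)) + np.array([50.0, 0, 0])   # cluster far away

# Far-field Coulomb interaction block K[i, j] = 1 / |x_i - y_j|.
K = 1.0 / np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

# Singular values decay rapidly for well-separated blocks, so a truncated
# SVD gives an accurate low-rank factorization at a fraction of the storage.
U, s, Vt = np.linalg.svd(K)
rank = int(np.sum(s > 1e-8 * s[0]))        # numerical rank at 1e-8 tolerance
K_lr = (U[:, :rank] * s[:rank]) @ Vt[:rank]
rel_err = np.linalg.norm(K - K_lr) / np.linalg.norm(K)
```

Blocks between overlapping or nearby distributions lack this decay, which is why CFMM must evaluate them directly; the approach in the talk compresses those interactions as well.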

16 *Staggered mesh method for post-Hartree-Fock calculations of periodic systems*, **Lin Lin**, University of California, Berkeley

We will discuss recent progress in understanding finite-size effects in periodic electronic structure calculations, in particular for MP2 and RPA calculations. We propose a new method, called the staggered mesh method, which is very simple to implement and can significantly reduce the finite-size error for a large class of materials.

References:

X. Xing, L. Lin, Staggered mesh method for correlation energy calculations of solids: Random phase approximation in direct ring coupled cluster doubles and adiabatic connection formalisms [arXiv:2109.12430]

X. Xing, X. Li, L. Lin, Unified analysis of finite-size error for periodic Hartree-Fock and second order Møller-Plesset perturbation theory [arXiv:2108.00206]

X. Xing, X. Li, L. Lin, Staggered mesh method for correlation energy calculations of solids: Second order Møller-Plesset perturbation theory, J. Chem. Theory Comput. 17, 4733, 2021 [arXiv:2102.09652]

17 *Sparse Matrix Algorithms and Data Structures for Linear Scaling Density Functional Theory*, **William Dawson**, RIKEN Center for Computational Science

In this talk, I will present NTPoly, a library for computing the functions of sparse matrices. NTPoly can serve as a replacement for traditional eigensolver approaches by exploiting the locality that exists in large systems. I will begin by introducing the algorithms and data structures that power the NTPoly library. Then I will present some recent improvements to NTPoly since its integration into ELSI. I will continue by presenting some lessons learned about integrating NTPoly into DFT codes. Finally, I will overview some of our recent applications of large scale DFT.
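Computing matrix functions as a replacement for eigensolvers typically means iterating low-order polynomials of the matrix; one of the best-known members of this family is McWeeny density-matrix purification. The dense NumPy toy below illustrates the idea (NTPoly itself works with sparse matrices and thresholds small entries to exploit locality; this sketch is illustrative, not NTPoly code):

```python
import numpy as np

def mcweeny_purify(D, n_iter=20):
    """McWeeny iteration D <- 3 D^2 - 2 D^3: eigenvalues in [0, 0.5) flow to 0
    and eigenvalues in (0.5, 1] flow to 1, yielding an idempotent density matrix
    (a projector onto the occupied subspace) without diagonalization."""
    for _ in range(n_iter):
        D2 = D @ D
        D = 3.0 * D2 - 2.0 * D2 @ D
    return D

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))   # random orthogonal basis
# Start from a symmetric matrix whose eigenvalues are already near 0 or 1,
# as would come from a truncated polynomial expansion of the step function.
evals = np.array([0.9, 0.85, 0.95, 0.1, 0.15, 0.05])
D0 = Q @ np.diag(evals) @ Q.T
D = mcweeny_purify(D0)
# D is now idempotent (D @ D == D) with trace equal to the electron count, 3.
```

Because each step is only matrix multiplication, the same iteration runs on sparse matrices, which is what makes the linear-scaling regime accessible for systems with decaying density matrices.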

## December 15, 2021

18 *Density functional theory calculations of large systems: Interplay between fragments, observables, and computational complexity*, **Luigi Genovese**, Atomistic Simulation Laboratory, CEA Grenoble (France)

In the past decade, developments of computational technology around Density Functional Theory (DFT) calculations have considerably increased the system sizes which can be practically simulated. The advent of robust high performance computing algorithms which scale linearly with system size has unlocked numerous opportunities for researchers. This enables computational physicists and chemists to investigate systems of sizes comparable to those routinely considered by experimentalists, leading to collaborations with a wide range of techniques and communities. This has important consequences for the investigation paradigms which should be applied to reduce the intrinsic complexity of quantum mechanical calculations of many thousands of atoms. It becomes important to consider portions of the full system in the analysis, which have to be identified, analyzed, and employed as building blocks from which decomposed physico-chemical observables can be derived. After introducing the state of the art in the large-scale DFT community and some details about the computational implementations of linear-scaling codes, we will illustrate the emerging research practices in this rapidly expanding field, and the knowledge gaps which need to be bridged to face the stimulating challenge of simulating increasingly realistic systems.

19 Withdrawn

20 *Designing a Python Interface for Molecular Dynamics and Monte Carlo Particle Simulations on GPUs: HOOMD-Blue v3.0*, **Joshua Anderson**, University of Michigan

HOOMD-blue [1] is a general-purpose toolkit that performs molecular dynamics and Monte Carlo simulations of particles. In this talk, I describe the design and implementation of the next major release of HOOMD-blue, v3.0. This release includes a new object-oriented Python interface, adds integrations with commonly used packages like NumPy and CuPy for zero-copy memory access, allows users to implement customizations to the simulation run loop, offers a flexible system for accessing and logging computed quantities, and adds support for AMD GPUs via HIP. As one example, HOOMD-blue v3.0 integrates with the MoSDeF [2] toolkit, allowing users to generate complex initial conditions with mBuild, atom-type the system with foyer, define force fields with gmso, and then run those simulations with HOOMD-blue – all in a script-driven, reproducible workflow scalable to large parameter spaces. We have also increased the amount of testing we perform in v3.0, including new and rewritten unit and validation tests using the pytest framework, which run on GitHub Actions. Docker and Singularity images package HOOMD-blue binaries for high-performance computing clusters such as PSC Bridges-2 and SDSC Expanse. We also provide generic container images for workstations as well as a conda package on conda-forge.

[1]: Anderson, J. A., Glaser, J., & Glotzer, S. C. (2020). HOOMD-blue: A Python package for high-performance molecular dynamics and hard particle Monte Carlo simulations. Computational Materials Science, 173, 109363. http://glotzerlab.engin.umich.edu/hoomd-blue/

[2]: Thompson, M. W., Gilmer, J. B., Matsumoto, R. A., Quach, C. D., Shamaprasad, P., Yang, A. H., Iacovella, C. R., McCabe, C., & Cummings, P. T. (2020). Towards molecular simulations that are transparent, reproducible, usable by others, and extensible (TRUE). Molecular Physics, 118(9–10), e1742938. https://mosdef.org/

21 *Periodic Coulomb Tree Method: An Alternative to Parallel Particle Mesh Ewald*, **Henry Boateng**, San Francisco State University

Particle mesh Ewald (PME) is efficient on low processor counts due to the use of the fast Fourier transform (FFT). However, due to the high communication cost of the FFT, PME scales poorly in parallel. We will present a periodic Coulomb tree (PCT) method for electrostatic interactions in periodic boundary conditions as an alternative to PME in parallel simulations. We will provide parallel timing comparisons of PME and PCT on up to 1024 cores.

22 *INQ: a state-of-the-art implementation of (TD)DFT for GPUs*, **Xavier Andrade**, Lawrence Livermore National Laboratory

In this talk I will present INQ, a new implementation of density functional theory (DFT) and time-dependent DFT (TDDFT) written from scratch to work on graphical processing units (GPUs).

Besides GPU support, INQ makes use of modern code design features and techniques, to make development fast and simple, and to ensure the quality of the program. By designing the code around algorithms, rather than against specific implementations and numerical libraries, we provide a concise and modular code that is simple to understand, flexible, and extensible.

What we achieve is a fairly complete DFT/TDDFT implementation in roughly 12,000 lines of open-source C++ code. It represents a modular platform for community-driven application development on emerging high-performance computing architectures. The code is freely accessible at http://gitlab.com/npneq/inq .

In TDDFT simulations on GPU-based supercomputers, INQ achieves excellent performance. It can handle hundreds to thousands of atoms, with simulation times of a second or less per time step, and scales to thousands of GPUs.

23 *GPU-Acceleration of Large-Scale Full-Frequency GW Calculations*, **Victor Yu**, Argonne National Laboratory

Many-body perturbation theory is a powerful method to simulate accurate electronic excitations in molecules and materials starting from the output of density functional theory calculations. However, its widespread application to large systems has been hindered by the high computational cost. We present a GPU acceleration study of the full-frequency GW method for periodic systems, as implemented in the WEST code [http://west-code.org]. We discuss the use of (1) optimized GPU libraries, e.g., cuFFT and cuBLAS, (2) a hierarchical parallelization strategy that minimizes CPU-GPU, GPU-GPU, and CPU-CPU data transfer operations, (3) asynchronous GPU kernels that overlap with MPI communications, and (4) mixed precision in selected portions of the code. We demonstrate a substantial speedup of the GPU-accelerated version of WEST with respect to its CPU version, and we show good strong and weak scaling using up to 25,920 GPUs on the OLCF Summit supercomputer. The GPU version of WEST yields electronic structures using the full-frequency GW method for realistic nanostructures and interfaces comprising up to 10,368 electrons. This work was supported by MICCoM, as part of the Computational Materials Sciences Program funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences.
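Item (4), mixed precision, trades accuracy for reduced memory traffic and faster arithmetic; a standard safeguard is to keep bulk data in single precision while performing sensitive reductions in double. A tiny NumPy illustration of the underlying precision gap and of that accumulation pattern (a generic sketch, not WEST code):

```python
import numpy as np

# float32 carries ~7 decimal digits (eps ~ 1.2e-7): a contribution of
# 1e-8 added to 1.0 is rounded away entirely, while float64 (~16 digits)
# retains it.
lost = np.float32(1.0) + np.float32(1e-8)       # still exactly 1.0 in float32
kept = np.float64(1.0) + np.float64(1e-8)       # strictly greater than 1.0

# Mixed-precision pattern: store the large array in float32 (half the
# memory traffic of float64) but accumulate the reduction in float64,
# so many tiny contributions are not lost.
x = np.full(1_000_000, np.float32(1e-8))
acc = np.sum(x, dtype=np.float64)               # double-precision accumulator
```

In a GW code the analogous choice is which tensors and contractions tolerate single precision and which accumulations must stay in double to preserve the final quasiparticle energies.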

24 *B.MPI3 and Multi C++ libraries, generic common code for electronic structure codes and for various architectures*, **Alfredo Correa**, Lawrence Livermore National Laboratory

I will introduce the B.MPI3 and Multi libraries for handling MPI communication and multidimensional arrays in generic modern C++. The libraries are currently used by the QMCPACK and INQ (TDDFT) codes. I will present some design principles and common uses of the libraries.

25 *Applying Quantum Mechanics to Materials: Real-Space Codes*, **Jim Chelikowsky**, University of Texas at Austin

The advent of quantum mechanics provides us with a framework for computing the properties of materials based solely on the constituent atomic species. However, the computational load required to make successful materials predictions based on quantum mechanics, as originally cast, is intense and restricts applications to systems with only a small number of atoms.

As noted elsewhere: In recent years, real-space numerical methods have attracted attention as they are mathematically robust, very accurate, and well suited for modern, massively parallel computing resources [1]. We will discuss recent advances in such real-space methods for the electronic structure problem as implemented in the codes PARSEC and NanoGW [2-6]. The former is a ground-state code that solves the Kohn-Sham equation on a grid. The latter is an excited-state code that solves the GW/BSE equations using the wave functions computed in real space by PARSEC. Applications include diverse materials ranging from magnetic clusters to liquid metals, doped nanostructures, defects in bulk materials, heterogeneous interfaces, and complex organic molecules.

[1] L. Frediani and D. Sundholm, Phys. Chem. Chem. Phys. 17, 31357 (2015).

[2] K.-H. Liou, C. Yang and J.R. Chelikowsky, Comp. Phys. Commun. 254, 107330 (2020).

[3] K.-H. Liou, A. Biller, L. Kronik and J.R. Chelikowsky, J. Chem. Theory and Comp. 17, 4039 (2021).

[4] W. Gao and J.R. Chelikowsky, J. Chem. Theory and Comp. 15, 5299 (2019).

[5] W. Gao and J.R. Chelikowsky, J. Chem. Theory and Comp. 16, 2216 (2020).

[6] Website: real-space.org for downloads.

26 *NWChemEx Plane Wave Density Functional Theory and Ab Initio Molecular Dynamics Using Intel and Nvidia Graphical Processing Units*, **Eric Bylaska**, Pacific Northwest National Laboratory, Richland, WA 99354 USA

Ab initio molecular dynamics (AIMD) simulations are an important technique, as they enable scientists to directly model the chemistry and dynamics of molecular and condensed-phase systems while retaining a first-principles-based description of their interactions. A drawback of this method is its tremendous computational requirements, because the electronic Schrödinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. Graphical processing units (GPUs) are attractive because they have the potential to provide the computational power needed to carry out simulations of interesting problems in chemistry. In this talk, we describe our initial efforts to refactor the AIMD plane-wave method of NWChemEx from an MPI-only implementation into a scalable, hybrid code that employs MPI together with Intel oneAPI DPC++ or Nvidia CUDA to exploit the capabilities of current and future many-GPU architectures. We describe the DPC++ and CUDA kernels required to obtain performance for the three-dimensional fast Fourier transforms (3D-FFTs) and for the multiplication of the tall-and-skinny matrices that form the other two core parts of the AIMD algorithm, i.e., the Lagrange multiplier and non-local pseudopotential kernels. Our relatively straightforward implementation, in which the key kernels are computed solely on a GPU, shows an order-of-magnitude speedup in the simulation of a test case for the adsorption of hydrogen on a copper nanoparticle. We had less success in obtaining speedups with hybrid MPI-GPU kernels, especially for the 3D-FFT kernel. Further development is needed to reduce the significant data transfer costs between the host and the GPU.