## ATPESC (Argonne Training Program on Extreme-Scale Computing)

### Computer Architecture and Structured Parallel Programming

James Reinders, Intel. August 3, 2015, Pheasant Run, St. Charles, IL, 09:30–10:15





#### I have been fortunate, and I like to share.



I saw example after example get performance and performance portability with "just parallel programming." I summarize this as "inspired by 61 cores."

<u>Volume 2</u> includes the following chapters:

- Foreword by Dan Stanzione, TACC
- Chapter 1: Introduction
- Chapter 2: Numerical Weather Prediction Optimization
- Chapter 3: WRF Goddard Microphysics Scheme Optimization
- Chapter 4: Pairwise DNA Sequence Alignment Optimization
- Chapter 5: Accelerated Structural Bioinformatics for Drug Discovery
- Chapter 6: Amber PME Molecular Dynamics Optimization
- Chapter 7: Low Latency Solutions for Financial Services
- Chapter 8: Parallel Numerical Methods in Finance
- Chapter 9: Wilson Dslash Kernel From Lattice QCD Optimization
- Chapter 10: Cosmic Microwave Background Analysis: Nested Parallelism In Practice
- Chapter 11: Visual Search Optimization
- Chapter 12: Radio Frequency Ray Tracing
- Chapter 13: Exploring Use of the Reserved Core
- Chapter 14: High Performance Python Offloading
- Chapter 15: Fast Matrix Computations on Asynchronous Streams
- Chapter 16: MPI-3 Shared Memory Programming Introduction
- Chapter 17: Coarse-Grain OpenMP for Scalable Hybrid Parallelism
- Chapter 18: Exploiting Multilevel Parallelism with OpenMP
- Chapter 19: OpenCL: There and Back Again
- Chapter 20: OpenMP vs. OpenCL: Difference in Performance?
- Chapter 21: Prefetch Tuning Optimizations
- Chapter 22: SIMD functions via OpenMP
- Chapter 23: Vectorization Advice
- Chapter 24: Portable Explicit Vectorization Intrinsics
- Chapter 25: Power Analysis for Applications and Data Centers

## Computer Architecture is FUN AGAIN



#### We need to make sure software is not collateral damage.

**Performance and Performance Portability** should be a requirement.



## Processors, coprocessors, GPUs & FPGAs

## What if we talked about them this way?

Processors, CPUs, coprocessors, GPUs, FPGAs, accelerators



## Interesting

Shared vs. Discrete Memory Spaces / Memory System Design

Integration of combinations vs. Discrete Building Blocks

Fully Capable Programming Support vs. Restrictive Programming

Hardware Configurability

Power Consumption

Scalability







## A cliché about someone missing the "big picture" because they focus too much on details: They "cannot see the forest for the trees."



# I ♥ architecture.

but...





#### Can you teach parallel programming without first teaching computer architecture? (Or without just teaching a single API?)













**Trees:** cores, hardware threads, vectors, offload, heterogeneous, cloud, caches, NUMA

**Forest:** parallelism, locality



## Teach the Forest

Increase exposing parallelism. Increase locality of reference.

Why? Because it's programming that addresses the universal needs of computers today and in the future.

> THIS IS YOUR MISSION



# Why so many cores?




### Why Multicore?

The "Free Lunch" is over, really. But Moore's Law continues!



### Processor Clock Rate over Time



#### Transistors per Processor over Time

Continues to grow exponentially (Moore's Law).




#### Moore's Law

Number of components (transistors) doubles about every 18-24 months.





#### Core and Thread Counts



- Single core, single thread, ruled for decades.
- Multithread: grow die area a small % for additional hardware thread(s) sharing resources.
- Multicore/Many Core: 100% die area for each additional hardware thread, without sharing.

Width: once, multibyte, multiword, many words.



## Is this the Architecture Track?





## CPU



#### These were simpler times.



## CPU + cache



Memories got "further away" (meaning: CPU speed increased faster than memory speeds)

A closer "cache" for frequently used data helps performance when memory is no longer a single clock cycle away.




## CPU + caches



Memories keep getting "further away" (this trend continues today).

More "caches" help even more (with temporal reuse of data).



## CPU with caches



As transistor density increased (Moore's Law), cache capabilities were integrated onto CPUs.

Higher-performance external (discrete) caches persisted for some time while integrated cache capabilities increased.



## CPU / Coprocessors

Coprocessors, which first appeared in the 1970s, were floating-point (FP) accelerators for CPUs without FP capabilities.





As transistor density increased (Moore's Law), FP capabilities were integrated onto CPUs.

Higher-performance discrete FP "accelerators" persisted for a little while as integrated FP capabilities increased.



Interest in providing hardware support for displays increased as the use of graphics grew (games being a key driver).

This led to graphics processing units (GPUs) attached to CPUs to create video displays.





GPU speeds and CPU speeds increase faster than memory speeds. Direct connection to memory is best done via caches (on the CPU).






As transistor density increased (Moore's Law), GPU capabilities were integrated onto CPUs.

Higher performance external (discrete) GPUs persist while integrated GPU capabilities increase.





A many core coprocessor (Intel<sup>®</sup> Xeon Phi<sup>™</sup>) appears, purpose built for accelerating technical computing.





# CPU / Coprocessors

As transistor density increases (Moore's Law), many-core capabilities will be integrated to create a many-core CPU ("Knights Landing").





# Nodes

"Nodes" are building blocks for clusters.

With or without GPUs. Displays not needed.







# Clusters

Clusters are made by connecting nodes, regardless of node type.





# NIC (Network Interface Controller) integration

As transistor density increases (Moore's Law), NIC capabilities will be integrated onto CPUs.









## What matters when programming?

- Parallelism
- Locality







# Amdahl who?



© 2015, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. \*Other names and brands may be claimed as the property of others.

## How much parallelism is there?

#### Amdahl's Law

#### Gustafson's observations on Amdahl's Law









#### Amdahl's law

"...the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude."

– Amdahl, 1967

#### Amdahl's law – an observation

"...speedup should be measured by scaling the problem to the number of processors, not by fixing the problem size."

- Gustafson, 1988
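The two quotes above correspond to two standard speedup formulas, which a small sketch can make concrete. The function names and the sample parallel fraction `p = 0.95` are illustrative assumptions, not figures from the talk. Note also that the two formulas measure the fraction differently: Amdahl's relative to the serial run, Gustafson's relative to the parallel run.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Fixed problem size: speedup = 1 / ((1 - p) + p / n)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)


def gustafson_speedup(parallel_fraction, n_workers):
    """Problem size scaled with the machine: speedup = (1 - p) + p * n."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * n_workers


if __name__ == "__main__":
    p = 0.95  # assumed parallel fraction, chosen only for illustration
    for n in (4, 61, 244):
        print(n, round(amdahl_speedup(p, n), 2), round(gustafson_speedup(p, n), 2))
```

With a fixed problem size the speedup can never exceed `1 / (1 - p)` no matter how many workers are added, while the scaled speedup keeps growing with the worker count. That is the sense in which workloads need to keep growing.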









## How much parallelism is there?

# Plenty – but the workloads need to continue to grow!


# Intel<sup>®</sup> Xeon Phi<sup>™</sup> coprocessor

#### Twice: More than one sustained TeraFlop/sec

- ASCI Red, December 1996: 1 TF/s with 7264 Intel<sup>®</sup> Pentium Pro processors, 72 cabinets.
- Knights Corner, November 2011: 1 TF/s on one chip, 22nm, one PCI Express slot.

#### Twice: More than *three* sustained TeraFlop/sec

- ASCI Red, 1999: upgraded to **3.1 TF/s** with 9298 Intel<sup>®</sup> Pentium II Xeon processors, 72 cabinets.
- Knights Landing, 2015: **3 TF/s**, 14nm, **one processor**.



#### I want you to understand why







### Design Question: Computation?





A few powerful vs. Many less powerful.

Diagrams for discussion purposes only, not a precise representation of any product of any company.



## Design Question





A few powerful vs. Many much less powerful and very restrictive.




### If you were plowing a field, which would you rather use... two strong oxen, or 1024 chickens?







## Design Question





A few powerful vs. Many less powerful.

## Same programming models, languages, optimizations and tools.




## vision

# span from *few cores* to *many cores* with consistent models, languages, tools, and techniques



Optimizations for Intel® Xeon® and Intel® Xeon Phi™ products share the same:

- Languages
- Directives
- Libraries
- Tools


#### Picture worth many words




© 2013, James Reinders & Jim Jeffers, diagram used with permission

#### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessors



Up to 61 cores, 1.1 GHz, 244 threads.

Up to 16GB memory.

Up to 352 GB/s bandwidth.

Runs Linux OS.

Standard tools, models, languages.

1 TFLOP/s DP FP peak.

Better for parallelism than a processor: up to 2.2X performance, up to 4X more power efficient.
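As a rough sanity check, the thread count and peak figure above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes 4 hardware threads per core, 8 double-precision lanes per vector unit (512-bit vectors), and a fused multiply-add counted as 2 floating-point operations per lane per cycle. None of these three values is stated on the slide, and the function name is hypothetical.

```python
def peak_dp_gflops(cores, clock_ghz, dp_lanes, flops_per_lane_per_cycle):
    """Theoretical peak = cores x clock x vector lanes x flops per lane per cycle."""
    return cores * clock_ghz * dp_lanes * flops_per_lane_per_cycle


CORES = 61                # from the slide
CLOCK_GHZ = 1.1           # from the slide
THREADS_PER_CORE = 4      # assumption: 244 threads / 61 cores
DP_LANES = 8              # assumption: 512-bit vectors / 64-bit doubles
FLOPS_PER_LANE = 2        # assumption: fused multiply-add counts as 2 flops

hardware_threads = CORES * THREADS_PER_CORE
peak = peak_dp_gflops(CORES, CLOCK_GHZ, DP_LANES, FLOPS_PER_LANE)

if __name__ == "__main__":
    print(hardware_threads, "hardware threads")
    print(round(peak, 1), "GFLOP/s theoretical peak (double precision)")
```

Under these assumptions the result lands just above 1 TFLOP/s, consistent with the slide. It also shows that most of the peak comes from vector width, so code that does not vectorize leaves most of it unused.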





## it is an SMP-on-a-chip running Linux

`% cat /proc/cpuinfo | head -5`

| Field      | Value        |
|------------|--------------|
| processor  |              |
| vendor_id  | GenuineIntel |
| cpu family | 11           |
| model      | 1            |
| model name | 0b/01        |

`% cat /proc/cpuinfo | tail -26`

| Field            | Value                                                                           |
|------------------|---------------------------------------------------------------------------------|
| processor        | 243                                                                             |
| vendor_id        | GenuineIntel                                                                    |
| cpu family       | 11                                                                              |
| model            |                                                                                 |
| model name       | 0b/01                                                                           |
| stepping         |                                                                                 |
| cpu MHz          | 1090.908                                                                        |
| cache size       | 512 KB                                                                          |
| physical id      | 0                                                                               |
| siblings         | 244                                                                             |
| core id          | 60                                                                              |
| cpu cores        | 61                                                                              |
| apicid           | 243                                                                             |
| initial apicid   | 243                                                                             |
| fpu              | yes                                                                             |
| fpu_exception    | yes                                                                             |
| cpuid level      |                                                                                 |
| wp               | yes                                                                             |
| flags            | fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall lm lahf_lm |
| bogomips         | 2192.10                                                                         |
| clflush size     | 64                                                                              |
| cache_alignment  |                                                                                 |
| address sizes    | 40 bits physical, 48 bits virtual                                               |
| power management |                                                                                 |

## Next Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor: Knights Landing





- 14nm process
- standalone CPU or PCIe coprocessor
- up to 72 cores ("Silvermont" based)
- AVX-512 SIMD
- on-package high-bandwidth memory
- on-package NIC (optional)

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.





Based on an actual customer example. Shown to illustrate a point about common techniques. Your results may vary!

Fortran code using MPI, single threaded originally. Run on Intel<sup>®</sup> Xeon Phi<sup>™</sup> coprocessor natively (no offload).













**Processors and Intel® Xeon Phi™ products** span from few cores to many cores with consistent models, languages, tools, and techniques

#### Picture worth many words

© 2013, James Reinders & Jim Jeffers, diagram used with permission

GPUs, FPGAs, and probably other "accelerators" will still offer "alternatives." They all start by trading off flexibility (generality). The question to ask is "is it worth it?" The answer is "sometimes."

*James' assertion:* Non-general solutions will give way to general solutions when we have them. (If?)

# How do I "think parallel"?



## **Parallel Patterns: Overview**





# Map

**Examples:** gamma correction and thresholding in images; color space conversions; Monte Carlo sampling; ray tracing.

- *Map* invokes a function on every element of an index set.
- The index set may be abstract or associated with the elements of an array.
- Corresponds to "parallel loop" where iterations are independent.
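A minimal sketch of the map pattern, using gamma correction (one of the examples named above) as the elemental function. The helper names, the gamma value, and the choice of a thread pool are illustrative assumptions. In CPython a thread pool mainly shows the structure rather than delivering a speedup for pure arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def gamma_correct(value, gamma=2.2):
    """Elemental function: the result depends only on its own input element."""
    return 255.0 * (value / 255.0) ** (1.0 / gamma)


def parallel_map(func, items, max_workers=4):
    """Invoke func on every element; iterations are independent of each other."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(func, items))


if __name__ == "__main__":
    pixels = [0, 64, 128, 255]
    print(parallel_map(gamma_correct, pixels))
```

Because no iteration reads another iteration's result, the same loop body can be spread across threads, vector lanes, or both without changing the answer.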



# Reduce

**Examples:** averaging of Monte Carlo samples; convergence testing; image comparison metrics; matrix operations.

- Reduce combines every element in a collection into one using an associative operator: x+(y+z) = (x+y)+z
- For example: *reduce* can be used to find the sum or maximum of an array.
- Vectorization may require that the operator *also* be *commutative*: x+y = y+x
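The role of associativity can be sketched by splitting a reduction into chunks: each chunk is reduced independently and the partial results are then combined in order. The function name `chunked_reduce` and its parameters are hypothetical, and the chunks are reduced one after another here for simplicity, although each partial reduction could run on its own worker.

```python
from functools import reduce
import operator


def chunked_reduce(op, values, identity, n_chunks=4):
    """Reduce each chunk independently, then combine the partial results."""
    values = list(values)
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # Each partial reduction touches only its own chunk, so these are independent.
    partials = [reduce(op, chunk, identity) for chunk in chunks]
    # Combining partials left to right relies on associativity only.
    return reduce(op, partials, identity)


if __name__ == "__main__":
    print(chunked_reduce(operator.add, range(1, 101), 0))
    print(chunked_reduce(max, [3, 9, 2, 7], float("-inf")))
```

String concatenation is associative but not commutative, and it still gives the right answer here because chunk order is preserved. A scheme that combined partials in arbitrary order, as some vectorized reductions do, would additionally need commutativity. The `identity` argument must be a true identity for `op`, since it is applied once per chunk.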



# Stencil

**Examples:** image filtering including convolution, median, anisotropic diffusion

- *Stencil* applies a function to neighbourhoods of an array.
- Neighbourhoods are given by a set of relative offsets.
- Boundary conditions need to be considered.
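A one-dimensional sketch of the stencil pattern, with the neighbourhood given as relative offsets and a constant-value boundary condition. The function name, the weighted-sum form, and the choice of boundary handling are illustrative assumptions. Other boundary policies (clamping, wrapping, skipping edge cells) are equally valid.

```python
def stencil_1d(data, offsets, weights, boundary=0.0):
    """Weighted sum over a neighbourhood described by relative offsets."""
    n = len(data)
    out = []
    for i in range(n):  # every output element is independent, as in a map
        acc = 0.0
        for off, w in zip(offsets, weights):
            j = i + off
            # Boundary condition: out-of-range neighbours read a constant.
            neighbour = data[j] if 0 <= j < n else boundary
            acc += w * neighbour
        out.append(acc)
    return out


if __name__ == "__main__":
    # Three-point sum with offsets -1, 0, +1.
    print(stencil_1d([1.0, 2.0, 3.0], (-1, 0, 1), (1.0, 1.0, 1.0)))
```

The sketch reads only from `data` and writes only to a separate `out` list, which is what keeps the iterations independent. Neighbouring outputs reuse overlapping inputs, which is where locality of reference comes in.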

# Pipeline

- Pipeline uses a sequence of stages that transform a flow of data
- Some stages may retain state
- Data can be consumed and produced incrementally: "online"

**Examples:** image filtering, data compression and decompression, signal processing



- Parallelize pipeline by
  - Running different stages in parallel
  - Running *multiple copies* of stateless stages in parallel
- Running multiple copies of stateless stages in parallel requires reordering of outputs
- Need to manage buffering between stages
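The points above can be sketched with threads and bounded queues: a serial source stage, several copies of a stateless middle stage, and a serial sink that restores the original order. All names here (`run_pipeline`, `transform`, `buffer_size`) are hypothetical, and error handling is omitted: an exception inside `transform` would stall this sketch.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(items, transform, n_workers=3, buffer_size=8):
    """Serial source -> n copies of a stateless stage -> in-order serial sink."""
    work = queue.Queue(maxsize=buffer_size)     # bounded buffers between stages
    results = queue.Queue(maxsize=buffer_size)

    def source():
        for seq, item in enumerate(items):
            work.put((seq, item))               # tag each item with its position
        for _ in range(n_workers):
            work.put(_DONE)                     # one sentinel per worker copy

    def worker():
        while True:
            msg = work.get()
            if msg is _DONE:
                results.put(_DONE)
                return
            seq, item = msg
            results.put((seq, transform(item)))

    threads = [threading.Thread(target=source)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    # Sink: results may arrive out of order, so hold them until their turn.
    output, pending, next_seq, finished = [], {}, 0, 0
    while finished < n_workers:
        msg = results.get()
        if msg is _DONE:
            finished += 1
            continue
        seq, value = msg
        pending[seq] = value
        while next_seq in pending:
            output.append(pending.pop(next_seq))
            next_seq += 1
    for t in threads:
        t.join()
    return output


if __name__ == "__main__":
    print(run_pipeline(range(10), lambda x: x * x))
```

The sequence numbers are what make it safe to run multiple copies of the middle stage, and the bounded queues keep a fast stage from running arbitrarily far ahead of a slow one.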



## For More Information

#### **Structured Parallel Programming**

- Michael McCool
- Arch Robison
- James Reinders

Uses Cilk Plus and TBB as primary frameworks for examples.

Appendices concisely summarize Cilk Plus and TBB.

www.parallelbook.com

(pointers to teaching materials, ours and others!)



# Computer Architecture is FUN AGAIN



#### We need to make sure software is not collateral damage.

**Performance and Performance Portability** should be a requirement.



# It's your Forest

Increase exposing parallelism. Increase locality of reference.

# **YOUR MISSION**

# Questions?



# james.r.reinders@intel.com

# Break Now

We resume @ 10:45am (to talk about TBB, OpenMP, SIMD/vectors)



#### James Reinders. Parallel Programming Evangelist. Intel.

James is involved in multiple engineering, research and educational efforts to increase use of parallel programming throughout the industry. He joined Intel Corporation in 1989, and has contributed to numerous projects including the world's first TeraFLOP/s supercomputer (ASCI Red) and the world's first TeraFLOP/s microprocessor (Intel<sup>®</sup> Xeon Phi<sup>™</sup> coprocessor). James has been an author on numerous technical books, including VTune<sup>™</sup> Performance Analyzer Essentials (Intel Press, 2005), Intel<sup>®</sup> Threading Building Blocks (O'Reilly Media, 2007), Structured Parallel Programming (Morgan Kaufmann, 2012), Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor High Performance Programming (Morgan Kaufmann, 2013), Multithreading for Visual Effects (A K Peters/CRC Press, 2014), High Performance Parallelism Pearls Volume 1 (Morgan Kaufmann, Nov. 2014), and High Performance Parallelism Pearls Volume 2 (Morgan Kaufmann, Aug. 2015). James is working on a refresh of both the Xeon Phi<sup>™</sup> book (original Feb. 2013, revised with KNL information by mid-2016) and the TBB book (original June 2007, revised by 2017).



## Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804