## PIM & Memory: The Need for a Revolution in Architecture

Peter M. Kogge, McCourtney Prof. of CSE, Univ. of Notre Dame, IBM Fellow (retired), 7/29/13



http://www.ket.org/pressroom/2000/38/MPT\_OliverTwist\_1200.jpg


ATPESC July 29, 2013


## Comment

- Following has *lots* of charts and pictures
- Key take-aways are *trends*
- Original charts in 2008 Exascale report
- Updates in SC11 paper
- More updates shortly

Acknowledgement: The data in this presentation was funded in part by the US Dept. of Energy, Sandia National Labs, as part of their XGC project.

### We All Know The Story: Unbroken Growth in Rmax



### 2004: The Power Wall Changed Architecture



# **Key Memory Characteristics**

- **Capacity**: esp. per node/socket/core...
- Bandwidth: esp. per flop
- Latency: as a function of size
- Energy: esp. compared to computation

Looking Forward: Problems in All Areas!

My view: Architecture must focus on memory, not computation

## And What Do We See in Apps?



## The 2008 Exascale Report

ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems

Peter Kogge (Editor & Study Lead), Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally, Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, Sherman Karp, Stephen Keckler, Dean Klein, Robert Lucas, Mark Richards, Al Scarpelli, Steven Scott, Allan Snavely, Thomas Sterling, R. Stanley Williams, Katherine Yelick





September 28, 2008

This work was sponsored by DARPA IPTO in the ExaScale Computing Study with Dr. William Harrod as Program Manager, AFRL contract number FA8650-07-C-7724. This report is published in the interest of scientific and technical information exchange and its publication does not constitute the Government's approval or disapproval of its ideas or findings.

#### NOTICE

Using Government drawings, specifications, or other data included in this document for any purpose other than Government procurement does not in any way obligate the U.S. Government. The fact that the Government formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation; or convey any rights or permission to manufacture, use, or sell any patented invention that may relate to them.

APPROVED FOR PUBLIC RELEASE, DISTRIBUTION UNLIMITED.


- Goal: "Exascale" 1000X Petascale
  - Exa supercomputer
  - Peta rack
  - Tera embedded
- 2015 Exa supercomputer in 20MW = 20pJ/flop
- 4 problems
  - Power/Energy
  - Memory
  - Resiliency
  - Programming
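
The 20pJ/flop figure follows from dividing the power budget by the target rate. A short worked version, assuming "Exa" here means $10^{18}$ flop/s:

$$
\frac{20\ \text{MW}}{10^{18}\ \text{flop/s}}
= \frac{2 \times 10^{7}\ \text{J/s}}{10^{18}\ \text{flop/s}}
= 2 \times 10^{-11}\ \text{J/flop}
= 20\ \text{pJ/flop}
$$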

## Energy per Flop is Dropping: But Not Fast Enough




# Topics

- Today's architectures
- Memory as a Technology
- Why is memory a growing problem
- The first attempt at alternative architectures: **Processing In Memory**
- The emerging future: Processing Near Memory



## Memory in Today's Architectures




# **Today's Architecture Classes**

- **Heavyweight**: traditional 100+W multi-core
  - Often requires supporting chip set
- Lightweight: lower power single chip system
  - Lower performance but denser packaging
- Hybrid/Heterogeneous: Heavyweight/GPU combination, with radically different ISAs
- **Big/Little**: Multi-core, same ISA, but different core sizes
- But wait! There's more when we try for very large shared memory
  - And more on the way

# **Today's Heavyweight Blade**





**A Power 7 Drawer** 



# Lightweight: e.g., BlueGene/Q







Integrated

- NIC
- Memory controllers


http://www.heise.de/newsticker/meldung/SC-2010-IBM-zeigt-BlueGene-Q-mit-17-Kernen-1138226.html


# Other Lightweight Systems Emerging



Calxeda quad-socket, quad-core ARMs


**HP Moonshot 1500 System**

- Up to 45 Moonshot server/blade/cards
- Dual LAN switch modules (6 x 10GbE uplinks)
- Redundant power and cooling
- SMBIOS 2.6.1 & PXE support
- Vertically installed cards: Intel Atom S1260 2GHz, 8GB DRAM (unbuffered), dual-port 1 GbE LAN, single SATA HDD or SSD




## Heterogeneous Architectures



http://www.nvidia.com/object/fermi\_architecture.html


A Titan Blade




Mix of heavyweight masters and GPU compute engines

Figure: **Conventional Computer** with host processor and host memory, plus stream processors

# **Big Little Architectures**

- Heterogeneous multi-core with same ISA
- "Bigger" cores have higher performance (more instructions per second)
  - But are less energy efficient
- "Littler" cores have less performance
  - But are much more energy efficient
- Ability to move program states from core to core
- Examples:
  - ARM Cortex-A15 and A7, A53 and A57
  - Intel Xeon and Xeon Phi (#1 system in June 2013)

# **Memory In Any of These**

- On the end of a memory channel and NOT on the processor chip
- At most 2-4 such channels per socket
  - Limited by off-chip pins
- At most 4 sockets sharing memory over specialized interfaces before complexity becomes too great
- Energy of access/transport becoming dominant
- Increasingly deep cache structures on processor socket
  - With complex rules for coherency/consistency
  - And very complex protocols for "atomic" operations
  - And punt to software when non-local access

### **Accessing Remote Memory Today**



## Cray MTA (and Follow-ons)

- Heavily multi-threaded cores
  - With fast thread create/switch
- True PGAS memory
  - With non-local load/store detected/managed/routed by hardware
- 2 tag bits per memory word
  - Full/empty
  - extended
- Extended load/store semantics to interact with full/empty words
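
As a rough software model of what a full/empty tag bit buys, the sketch below treats one memory word as a one-slot mailbox: a store waits until the word is empty and leaves it full, and a load waits until it is full and leaves it empty. The class name `FullEmptyWord` and the method names `write_ef` and `read_fe` are illustrative assumptions, not the machine's actual instruction names, and a hardware implementation would not use a lock and condition variable.

```python
import threading


class FullEmptyWord:
    """Software model of one memory word with a full/empty tag bit."""

    def __init__(self):
        self._value = None
        self._full = False  # tag bit: False = empty, True = full
        self._cond = threading.Condition()

    def write_ef(self, value):
        """Wait until the word is empty, store the value, then mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_fe(self):
        """Wait until the word is full, take the value, then mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value

    def is_full(self):
        with self._cond:
            return self._full


def producer_consumer(n):
    """Pass n values through a single word; the tag bit does the synchronization."""
    word = FullEmptyWord()
    received = []

    def consumer():
        for _ in range(n):
            received.append(word.read_fe())

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(n):
        word.write_ef(i)  # blocks until the previous value has been consumed
    worker.join()
    return received
```

Because each store blocks until the previous value has been taken, the producer and consumer stay in lockstep without any separate lock word or flag in memory.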

#### Cray uRiKA High Density Node



http://www.adms-conf.org/uRiKA\_ADMS\_keynote.pdf

- All non-local memory "equally remote"
- Relatively less dense memory (6TB/rack)
- Atomics still require interaction with remote host


## SGI UV 2000 ccNUMA



http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Developer/books/LX_86_AppTune/sgi_html/ch05.html


- Each blade
  - 2 8-core sockets
  - Up to 128GB
  - Separate Hub chip
- 32 blades/rack
  - 512 cores
  - only 4TB/rack
- +: Cache-coherent shared memory
  - But via complex offchip directories

- Up to 64TB
- Limited atomics

### 2008 Exascale Report Strawman



## Memory: The Technology




## **Basic DRAM**

From Computer Desktop Encyclopedia © 2005 The Computer Language Co. Inc.



http://encyclopedia2.thefreedictionary.com/dynamic+RAM




## **Intel 1103: Splitting the Address**






- 1k x 1bit part from outside
- But 10b address split in 2
  - 5b Row Address: which of 32 32b words
  - 5b Column Address: which bit of that word
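
The row/column split can be sketched in a few lines. This is a minimal model, assuming the row address occupies the upper 5 bits and the column address the lower 5 (the slide does not say which half is which); the function names are illustrative.

```python
ROW_BITS = 5  # selects one of 32 rows, each a 32-bit word
COL_BITS = 5  # selects one bit within the selected row


def split_address(addr):
    """Split a 10-bit address into (row, column)."""
    if not 0 <= addr < (1 << (ROW_BITS + COL_BITS)):
        raise ValueError("address must fit in 10 bits")
    row = addr >> COL_BITS               # upper 5 bits (assumed ordering)
    col = addr & ((1 << COL_BITS) - 1)   # lower 5 bits
    return row, col


def read_bit(array, addr):
    """Read one bit from a 32 x 32 array stored as 32 integers."""
    row, col = split_address(addr)
    word = array[row]          # the row access fetches all 32 bits
    return (word >> col) & 1   # the column address then picks one of them
```

Note that even in this tiny part the row access touches 32 bits to deliver 1, which is the same pattern that later shows up at much larger scale.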


### "Please Sir, I want more" Multiple Sockets with Coherent Shared Memory



- Now addresses must be sorted by socket
  - before they are routed to correct socket
  - before they are routed to correct channel
  - before they are processed by memory controllers

# A Bigger Die

- Cannot organize Gb chips as 1G rows by 1b
- Must break into "Blocks"
  - Typically ~1Kb x 1Kb
- Arrange blocks into "Banks"
- Address now:
  - Which bank
  - Which block in bank
  - Which row in block
  - Which bits in row




### **But Now We Can Run Banks** "Concurrently"



From Micron MT41J256M8: 32M x 8 x 8 Banks



# A "Simple" DIMM



- All chips get same address/command
- Each chip contributes its 4 or 8 bits to data bus

- Interface speed rated in "Transfers/sec"
- DIMM "looks like" 8 concurrent banks of 64b
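
A quick back-of-the-envelope for how chip width and transfer rate turn into DIMM bandwidth. This is a sketch only: it ignores ECC chips and command overhead, and the 1600 MT/s rate used in the example is an assumed value, not one taken from the slide.

```python
def dimm_data_width_bits(chips, bits_per_chip):
    """Width of the data bus when every chip contributes its bits in parallel."""
    return chips * bits_per_chip


def peak_bandwidth_bytes_per_sec(transfers_per_sec, bus_width_bits=64):
    """Peak bandwidth: each transfer moves one bus-width of data."""
    return transfers_per_sec * bus_width_bits // 8


# Sixteen x4 chips or eight x8 chips both fill a 64-bit data bus.
x4_width = dimm_data_width_bits(16, 4)
x8_width = dimm_data_width_bits(8, 8)

# Assumed example rate: 1600 million transfers per second.
example_bw = peak_bandwidth_bytes_per_sec(1_600_000_000)
```

Under these assumptions the example works out to 12.8 GB/s peak per channel, which is the ceiling no matter how many banks or ranks sit behind the bus.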


### What Does the Memory Controller Do?

- Stream of addresses from core(s)
- Sort by bank number
- Within same bank, sort by row #
- For same row of same bank:
  - Issue initial row read request
  - Issue word reads and writes to that row
  - Close row when done to refresh memory
- Remember, sets for other banks can be executed concurrently
  - Sequentially interleaved over single common memory channel
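
The sorting steps above can be sketched as a toy scheduler. This is an illustration under simplifying assumptions: requests arrive as `(bank, row, column, op)` tuples, there are no timing constraints or refresh cycles, and the command names `OPEN_ROW` and `CLOSE_ROW` are placeholders rather than real DRAM command mnemonics.

```python
from collections import defaultdict


def schedule(requests):
    """Group requests by bank, then by row, and emit per-bank command lists."""
    by_bank = defaultdict(lambda: defaultdict(list))
    for bank, row, col, op in requests:
        by_bank[bank][row].append((col, op))

    per_bank = {}
    for bank in sorted(by_bank):
        cmds = []
        for row in sorted(by_bank[bank]):
            cmds.append(("OPEN_ROW", bank, row))       # initial row read
            for col, op in by_bank[bank][row]:
                cmds.append((op, bank, row, col))      # word reads and writes
            cmds.append(("CLOSE_ROW", bank, row))      # close row when done
        per_bank[bank] = cmds
    return per_bank


def interleave(per_bank):
    """Round-robin the per-bank command lists onto one shared channel."""
    queues = [list(cmds) for _, cmds in sorted(per_bank.items())]
    channel = []
    while any(queues):
        for q in queues:
            if q:
                channel.append(q.pop(0))
    return channel
```

For example, four requests split across two banks produce two independent command sets, and the channel order alternates between them so one bank's row access overlaps the other's.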




## "Please Sir, I want more"



- All share same wires to microprocessor
- But can only talk to 1 DIMM at a time
- Add a "DIMM #" field to the address, called the "Rank"
- Now Memory Controller must sort by rank also
- Capacitive loading from all DIMMs slows transfers


## "Please Sir, I want more"



- Put multiple ranks on same DIMM
- Still can only talk to 1 rank at a time
- Electrical loading problem continues
  - Each rank still loading same bus

- Even worse with multiple multi-rank DIMMs


### "Please Sir, I want more" Load Reduced DIMMs



- Helps improve electrical transfer speeds
- But still deal with multiple ranks, banks, blocks, rows
- And typically increase latency


### "Please Sir, I want more" Multiple Ranks Move "On Die"




### **Towards a Single DIMM Per Channel**




### "Please Sir, I want more" Multiple Memory Channels/Socket



- Multiple cores on socket all contribute to address streams
- Now addresses must be sorted by channel before they are processed by memory controllers


## Memory: The Growing Problem




## **The Traditional Rule of Thumb**



## **Capacity per Socket**





## **Memory Density Increasing**



## **But DRAM Die Sizes Are Flattening or Decreasing**



## So Memory Density Growth/ Die is Slowing



## Off-Chip Signaling Rates Have Hit A Ceiling



## But Growth In Chip I/O is at Best Slow



## With Max "Per Unit Logic" Off-Chip B/W Decaying



## With Even Less B/W When We Consider Real Clocks



## **Relook at Exascale Strawman**



| Operation               | Energy (pJ/bit) |
|-------------------------|-----------------|
| Register File Access    | 0.16            |
| SRAM Access             | 0.23            |
| DRAM Access             | 1               |
| On-chip movement        | 0.0187          |
| Thru Silicon Vias (TSV) | 0.011           |
| Chip-to-Board           | 2               |
| Chip-to-optical         | 10              |
| Router on-chip          | 2               |

| Step             | Target | pJ     | # Occurrences | Total pJ | % of Total |
|------------------|--------|--------|---------------|----------|------------|
| Read Alphas      | Remote | 13,819 | 4             | 55,276   | 16.5%      |
| Read pivot row   | Remote | 13,819 | 4             | 55,276   | 16.5%      |
| Read 1st Y[i]    | Local  | 1,380  | 88            | 121,5??  | ?          |
| Read Other Y[i]s | L1     | 39     | 264           | 10,2??   | ?          |
| Write Y's        | L1     | 39     | 352           | 13,900   | 4.2%       |
| Flush Y's        | Local  | 891    | 88            | 78,380   | 23.4%      |
| Total            |        |        |               | 334,656  |            |
| Ave per Flop     |        |        |               | 475      |            |

**If this is true, 1 EF/s = 0.5 GW!** 
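
The 0.5 GW figure is simply the average energy per flop scaled up to an exaflop per second. A minimal check, assuming the table's average of 475 pJ/flop and taking 1 EF/s as 10^18 flop/s (the helper name is illustrative):

```python
PJ = 1e-12       # joules per picojoule
EXAFLOP = 1e18   # flop/s, assumed meaning of 1 EF/s


def system_power_watts(energy_per_flop_pj, flops_per_sec):
    """Power = energy per operation x operations per second."""
    return energy_per_flop_pj * PJ * flops_per_sec


strawman_watts = system_power_watts(475, EXAFLOP)  # average from the table
target_watts = system_power_watts(20, EXAFLOP)     # the 20pJ/flop goal
gap = strawman_watts / target_watts

print(f"strawman: {strawman_watts / 1e6:.0f} MW, "
      f"target: {target_watts / 1e6:.0f} MW, gap: {gap:.1f}x")
```

Under these assumptions the data-movement-heavy strawman lands at roughly 475 MW, more than twenty times the 20MW budget, and almost all of that energy is memory access and transport rather than arithmetic.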



## **Processing In Memory**




## The 3 1994 Approaches to Petaflops



### PIM

- Only way to get a lot of memory is a lot of memory!
- Current memory *wastes* 98% of actual data fetched within DRAM chip
- Bulk of energy is spent on
  - shipping small piece of requested data off chip
  - transporting it up and down cache hierarchy
  - over long on-chip distances
- Obvious solution: place cores on memory
- But still permit large multi-chip systems

## TERASYS SIMD PIM (circa 1993)






- Memory part for CRAY-3
- "Looked like" SRAM memory
  - With extra command port
- 128K SRAM bits (2k x 64)
- 64 1-bit ALUs
- SIMD ISA
- Fabbed by National
- Also built into workstation with 64K processors
- 5-48X Y-MP on 9 NSA benchmarks

## RTAIS: Search In Memory (circa 1993)






- Application: "Linda in Memory"
- Designed from onset to perform wide ops "at the sense amps"

- More than SIMD: flexible mix of VLIW
- "Object oriented" multi-threaded memory interface
- Result: 1 card 60X faster than state-of-art R3000 card

## EXECUBE: SPMD on Chip (1993)

- First DRAM-based Multi-core on a Chip
- Designed from onset for "glueless" one-part-type scalability



## **An Array of EXECUBEs**






## Mitsubishi M32R/D (circa 1997)




- 32-bit fixed point CPU + 2 MB DRAM
- "Memory-like" Interface
- Utilize wide word I/F from DRAM macro for cache line

## Linden DAAM Chip (1998)

- Designed for in-memory text search
- 16 Mbit DRAM divided into 64 blocks
- 64 1-bit Processing Elements per block



Fig. 1 from Lipovski and Yu, "The Dynamic Associative Access Memory Chip and its Application to SIMD Processing and Full-text Database Retrieval", 1999


## **DIVA:** Smart DIMMs for Irregular Data Structures



## **Berkeley VIRAM**

- System Architecture: single chip media processing
- ISA: MIPS Core + Vectors + DSP ops
- 13 MB DRAM in 8 banks
- Includes floating point
- 2 Watts @ 200 MHz, 1.6 GFlops






## The HTMT Architecture & PIM Functions



## **Micron Yukon**

- 0.15μm eDRAM / 0.18μm logic process
- 128Mbits DRAM
- 2048 data bits per access
- 256 8-bit integer processors
  - Configurable in multiple topologies
- On-chip programmable controller
- Operates like an SDRAM







Kogge, "Of Piglets and Threadlets: Architectures for Self-Contained, Mobile, Memory Programming", IWIA, Maui, HI, Jan. 2004

- Single Address Space Visible to all Hosts & Gossamer Cores
- Hosts can launch:
  - Reads and Writes of Memory
  - Threadlets for execution on Gossamer core
- Gossamer Cores can:
  - Spawn new threadlets
  - Migrate threadlets to other cores


## **Processing Near Memory**






### "Please Sir, I want more" The Emergence of Hybrid 3D Memory



http://www.micron.com/products/hybrid-memory-cube




- Stackable memory chips (no cores)
- "Through Silicon Vias" (TSVs)
- Logic chip on bottom
  - Multiple memory controllers
  - More sophisticated off-stack interfaces than DDR
- Prototype demonstrated in 2011
- 1st product expected in 2015 timeframe
  - Spec: http://www.hybridmemorycube.org
  - Capacity: up to 8GB: 8X single chip
  - Bandwidth: up to 480GB/s: 40X
  - Lots of room on logic chip
- Bottom line: huge increase in
  - Memory density
  - Bandwidth

## **The HMC Architecture**



- All vaults run independently
- Each vault looks like a set of M dual independent banks



### **Enhancing a Conventional Architecture**

- CPU(s) see a sea of memory stacks, all "far"
- True address-based routing







### What Might be the Bandwidth/Stack



## **Memory Stack Power Estimates**



## **SNL Xcaliber Architecture**



Figure labels: DRAM Layer 1, Layer 2, Layer 3, ..., DRAM Layer N; Logic Layer; EMP

## What's In an Exascale Address?

1. Which Node
2. Which Socket
3. Which Channel
4. Which "DIMM"
5. Which Stack
6. Which Vault
7. Which Bank set
8. Which Bank
9. Which Block
10. Which Row
11. Which Word

Optimal Data Placement Now an 11-Dimensional Problem. And this doesn't include Cache Hierarchy.
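
To make the 11 levels concrete, the sketch below packs and unpacks an address as 11 bit fields, most significant first. Every bit width in `LAYOUT` is a made-up placeholder chosen only so the example runs; the slides give no field sizes, and a real system would likely interleave or hash these fields rather than lay them out contiguously.

```python
# (field name, bit width) from most to least significant.
# All widths are hypothetical, for illustration only.
LAYOUT = [
    ("node", 16), ("socket", 2), ("channel", 2), ("dimm", 2),
    ("stack", 2), ("vault", 4), ("bank_set", 1), ("bank", 3),
    ("block", 4), ("row", 10), ("word", 7),
]
TOTAL_BITS = sum(bits for _, bits in LAYOUT)


def decode(addr, layout=LAYOUT):
    """Split a flat address into its hierarchy fields."""
    total = sum(bits for _, bits in layout)
    if not 0 <= addr < (1 << total):
        raise ValueError("address out of range for this layout")
    fields = {}
    shift = total
    for name, bits in layout:
        shift -= bits
        fields[name] = (addr >> shift) & ((1 << bits) - 1)
    return fields


def encode(fields, layout=LAYOUT):
    """Pack hierarchy fields back into a flat address."""
    addr = 0
    for name, bits in layout:
        value = fields[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
        addr = (addr << bits) | value
    return addr
```

Two addresses that differ only in `word` hit the same open row, while two that differ in `node` cross the whole machine, so where the software places data within this tuple decides both latency and energy.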


### But ...

- What if we add cores to the logic die on each stack?
- Now opportunity for real PGAS architecture *in the small*!






### Thought Experiment: Memory Stack Only Version

- Same stack as from X-caliber
  - Multiple DRAM, NVRAM vaults
  - Internal crossbar for full interconnect
  - 8 external ports (still wire)
- Multiple stacks on something like a DIMM
- Remove Processor sockets and NIC chips
- Use stack external ports for all routing
- Keep routing on global address
- And grow up logic chip processing
  - "Conventional Core" per vault

### **Memory + Processing**



## **Energy/Flop Extrapolation**



#### We can see the Goal!

Projections made on basis of ITRS and other public data




## Conclusions

- Memory is essential for computing
- But rapidly becoming severe limitation
- Limitations stem from architecture that separates memory from computation
- PIM: attempt to overcome
- 3D stacks will enable massive "Processing Near Memory"
  - That have a shot at usable extreme scale

