## **IBM Research Europe** PHRYCTORIA: A Messaging System for Transprecision OpenCAPI-attached FPGA Accelerators

<u>Dionysios Diamantopoulos,</u> Mitra Purandare, Burkhard Ringlein and Christoph Hagleitner

Cloud FPGAs & Tape Group, Cloud & AI Systems Research Department, did@zurich.ibm.com

© 2020 IBM Corporation



#### IBM Research is Leading in Reduced Precision Scaling




Source: https://www.ibm.com/blogs/research/2018/12/8-bit-precision-training/

## Did anyone say Moore's Law End?

## Transistor scaling



Intel, IEDM 2019, Germanium-based GAAFET PMOS device layer on top of a more traditional silicon FinFET NMOS

- 5 chipmakers/foundries in the **16nm/14nm** market: GlobalFoundries, Intel, Samsung, TSMC and UMC, plus SMIC (14nm finFETs).
- GlobalFoundries and UMC last year halted their respective 7nm process efforts.
- Currently, TSMC's 7nm process is at its peak (orders from AMD for its Ryzen 3000-series CPUs and Navi graphics cards), with heavy investment in <mark>5nm</mark>.
- Compared to <mark>7nm</mark>, Samsung's <mark>5nm</mark> finFET technology provides up to a 25% increase in logic area efficiency with 20% lower power or 10% higher performance.
- TSMC expects mass **3nm** production in 2022.
- A nanosheet FET is a type of gate-all-around (GAA) architecture, but it is not the only possible scenario. "The industry is very conservative. They will try to extend the finFET as much as possible," IMEC's Naoto Horiguchi said. "At <mark>3nm</mark>, we have a window to use a finFET. But we need several process innovations for finFET in terms of overall improvement."
- TSMC announced starting <mark>2nm</mark> development (Apr. 2020)

Dionysios Diamantopoulos / RAW-IPDPS / May 18-19 2020 / New Orleans, Louisiana, USA / © 2020 IBM Corporation https://semiengineering.com/5nm-vs-3nm/





System scaling: Beyond the transistor, e.g. Intel's EMIB (Embedded Multi-die Interconnect Bridge) and Foveros to connect chiplets in both 2 and 3 dimensions (HBM in CPU-GPU)



## Did anyone say Moore's Law End?

## Cost scaling



- After <mark>5nm</mark>, the next full node is <mark>3nm</mark>. But <mark>3nm</mark> is not for the faint of heart.
- The cost to design a <mark>3nm</mark> device ranges from \$500 million to \$1.5 billion, according to IBS.
- Process development costs range from \$4 billion to \$5 billion, while a fab runs \$15 billion to \$20 billion, according to IBS.
- "Transistor costs at <mark>3nm</mark> are expected to be 20% to 25% higher than at 5nm based on same level of maturity," IBS' Jones said. "Expect 15% more performance and with 25% less power consumption compared to <mark>5nm</mark> finFETs."

https://semiengineering.com/5nm-vs-3nm/


#### The Growing Cost of Keeping Up With Moore's Law

The cost of following Moore's law has increased exponentially. Semiconductor companies are now debating whether investment in keeping up with the expected pace of chip advancement is worthwhile.

[Figure: cost in million USD (axis from 0 to 600) of moving to the next level of chip design, by process node from 65nm onward. Source: IBS]

- The cost to design a 28nm planar device ranges from \$10 million to \$35 million (Gartner).
- The cost to design a 7nm system-on-a-chip (SoC) ranges from \$120 million to \$420 million (Gartner).
- 5nm is a completely new process with updated EDA tools and IP. The cost to design a 5nm device ranges from \$210 million to \$680 million (Gartner).

## Silicon alternatives for rapid enterprise-ready specialization

**A GPU** is effective at processing the <u>same set of operations</u> in parallel – single instruction, multiple data (SIMD). A GPU has a well-defined instruction-set, and fixed word sizes – for example single, double, or half-precision integer and floating point values.









**An FPGA** is effective at processing the same or different operations in parallel – multiple instructions, multiple data (MIMD). An FPGA does not have a predefined instruction-set, or a fixed data width.

Figures source: AWS - Announcing Amazon EC2 F1 Instances with Custom FPGAs, Bringing Hardware Acceleration closer to the programmer, Ecoscale-ExaNest workshop, 2017


Reconfigurable logic, reconfigurable memory, reconfigurable interconnects



**ASICs** 

## Silicon alternatives for rapid POWER<sup>TM</sup> specialization

#### Byte-addressable



#### **GPU** Byte-addressable

#### **FPGA** External: Byte-addressable Internal : >Bit-addressable



video processing, genomics, analytics

FAST

- Energy efficient
- Optimized for scale-out servers
- Purpose built for inferencing
- 40x better low-latency throughput than CPUs
- 2x decoding performance over prior generation GPUs


Note: Used material from "IC922 Seller Deck (2020-Mar-24)"



- Reconfigurable to optimize for shifting workloads
- 20x throughput versus CPUs, 3x lower latency

- Custom accelerator designed for inferencing workloads

**New ASIC** 

- Designed for power efficiency and cost savings

## A Look at an Enterprise's Portfolio for the AI Era: From Mission-Critical Workloads to AI and Cloud Computing Leadership

Accelerated Compute

Data, Inferencing, and Cloud

PowerVM and high RAS

L922



- Industry leading reliability and computing capability
- PowerVM ecosystem focus for outstanding utilization
- Focus on memory capacity with up to 4TB of RAM

1st TOP500!

200.795 PFlop/s

..and Sierra 2<sup>nd</sup>



AC922

- Industry first and only in advanced IO with 2<sup>nd</sup> **Generation CPU - GPU NVLink** delivering ~5.6x higher data throughput
- Up to 4 integrated NVIDIA "Volta" GPUs air cooled (GTH) 600<sup>+</sup> and up to 6 GPUs with water cooled (GTX) version
  - OpenCAPI support (FPGA, ASIC)

#### Memory coherence




- Storage dense, high
- Optimized inferencing server with up to 6 Nvidia T4 GPUs at GA and additional accelerators in roadmap<sup>1</sup>
- OpenCAPI support<sup>1</sup> (FPGA, ASIC)
- Price/performance server

- 1U and 2U form factors
- Advanced IO with PCIe **4.0/CAPI 2.0 (FPGA, ASIC)**
- Up to 44 cores (2U) or 40 cores (1U) at lower frequency

Note: Used material from "IC922 Seller Deck (2020-Mar-24)"

## Proposed POWER Processor Technology and I/O Roadmap

| 2010: POWER7 | 2012: POWER7+ | 2014: POWER8 | 2016: POWER8 w/ NVLink |
|---|---|---|---|
| POWER7 Architecture | POWER7 Architecture | POWER8 Architecture | POWER8 Architecture |
| 8 cores, 45nm | 8 cores, 32nm | 12 cores, 22nm | 12 cores, 22nm |
| New Micro-Architecture | Enhanced Micro-Architecture | New Micro-Architecture | Enhanced Micro-Architecture with NVLink |
| New Process Technology | New Process Technology | New Process Technology | |
| 65 GB/s | 65 GB/s | 210 GB/s | 210 GB/s |
| PCIe Gen2 | PCIe Gen2 | PCIe Gen3 | PCIe Gen3 |
| N/A | N/A | N/A | 20 GT/s, 160 GB/s |
| N/A | N/A | CAPI 1.0 | CAPI 1.0, NVLink |




## **PHRYCTORIA** motivation: I/O + low-bit = ?

Traditional communication mechanisms for modern low-precision data-types (e.g. brainfloat16, int5) cannot exploit the bandwidth of emerging communication links for FPGA accelerators (e.g. OpenCAPI, PCIe Gen4).



The name PHRYCTORIA is inspired by the ancient Greek communication system "ΦΡΥΚΤΩΡΙΑ" (1900 B.C.).



- The number of bits transferred per second defines the throughput.
- However, out of the transferred bits, the number of bits needed to reconstruct a datum of a certain data type at the accelerator end defines the goodput.
- For a transprecision data type, the share of bits needed to reconstruct a datum out of the total number of bits transferred for that datum is very low.
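To make the throughput/goodput distinction concrete, the minimal sketch below computes the share of useful bits when a narrow datum travels in a wider container. The function names and the example of a 5-bit integer are illustrative assumptions; the 25 GB/s figure is the OpenCAPI link rate quoted in this deck.

```python
def goodput_ratio(useful_bits: int, transferred_bits: int) -> float:
    """Fraction of the transferred bits that is needed to reconstruct the datum."""
    if transferred_bits <= 0 or not 0 <= useful_bits <= transferred_bits:
        raise ValueError("expected 0 <= useful_bits <= transferred_bits, transferred_bits > 0")
    return useful_bits / transferred_bits


def effective_goodput(link_throughput: float, useful_bits: int, transferred_bits: int) -> float:
    """Goodput in the same unit as link_throughput (for example GB/s)."""
    return link_throughput * goodput_ratio(useful_bits, transferred_bits)


if __name__ == "__main__":
    # Assumed example: an int5 value padded to common native container widths.
    for container in (8, 16, 32, 64):
        ratio = goodput_ratio(5, container)
        rate = effective_goodput(25.0, 5, container)
        print(f"int5 in {container:2d}-bit container: {ratio:.1%} useful, {rate:.2f} GB/s goodput")
```

Under these assumptions, an int5 padded to 32 bits uses under a sixth of the link for payload, which is the gap the messaging system targets.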

## PHRYCTORIA messaging system

**Contributions:** 

- supporting message schemas expressed in the protobuf IDL to specify the data structures communicated between the POWER host and the FPGA accelerator
- automatically generating the serialization/deserialization functions for accelerator and software interfaces from the protobuf schemas
- leveraging varint encoding for transprecision data types to increase the goodput of the communication




#### Where does protobuf fit in?

Platform-independent message representation formats, like Protocol buffers (protobufs), XML, or JSON, allow applications running on different systems to exchange data.

#### What do we propose in practice?

We address the issue of low goodput for transprecision FPGA accelerators by leveraging the varint encoding of protobuf for serialization/deserialization of transprecision data types.
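As an illustration of why varint encoding helps narrow data types, here is a minimal sketch of the base-128 varint scheme used by the protobuf wire format for unsigned integers. The function names are hypothetical and this is not the code generated by PHRYCTORIA; it only shows that small values occupy fewer bytes on the wire.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


if __name__ == "__main__":
    for v in (1, 31, 300):
        wire = encode_varint(v)
        print(v, wire.hex(), f"{len(wire)} byte(s) instead of 4 for a fixed 32-bit field")
```

With this scheme any value that fits in 7 bits, such as a 5-bit integer, is carried in a single byte.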

# PHRYCTORIA conceptual system



The PHRYCTORIA system assumes the coexistence of FPGA-based accelerator cards along with general-purpose processors in a shared-memory symmetric multiprocessor (SMP) system. The architecture assumes that the reconfigurable logic devices can directly access the same virtual address space as the general-purpose processors. A specific OpenCAPI Accelerator logic (OC-Accel logic), overlaid on the FPGA's resources, enables coherent access to the CPU's on-chip coherent interconnection bus.


The PHRYCTORIA system has two components: the PHRYCTORIA SW, layered as an application on the OS, and the PHRYCTORIA HW, layered inside the AFU as an interface between the OC-Accel logic and the accelerators.

# PHRYCTORIA integrated development environment



To facilitate a rapid development and deployment environment, the **OpenCAPI** consortium introduced the OC-Accel Integrated Development Environment (IDE).


This framework consists of HW components, SW components and automation tools that allow users to quickly develop, debug and evaluate FPGA accelerators coherently attached to CPUs using the OpenCAPI communication link.

We introduce the PHRYCTORIA IDE by extending both the software and the hardware component lists of the OC-Accel IDE.



#### Integrated Development Environment (IDE)

**OC-Accel's SW** is extended by the **PHRYCTORIA Automatic Code Generator** (ACG)

AND

**OC-Accel's HW** is extended by the **PHRYCTORIA AFU**

# PHRYCTORIA Design-time Run-time System



Given a protobuf schema from the user, the Automatic Code Generator (ACG) generates a customized HW interface template and an accelerator SW interface. The SW interface provides the functions to send/receive streams of serialized data to/from the FPGA. The generated HW interface and the OC-Accel logic are synthesized, placed and routed to produce the FPGA bitstream.


The accelerator SW interface is compiled to a dynamically loadable library for the run-time system.

# PHRYCTORIA Design-time Run-time System



During run-time, an app can send/receive a volume of data to/from the FPGA using the generated SW libs and the protobuf libraries.

After serializing the data with protobuf encoding, a user-space virtual address pointer, pointing to these data, is passed to the OC-Accel runtime system.

This system performs basic discovery and checks the availability and readiness status of the FPGA device prior to scheduling a "job" descriptor to this device. The runtime can interact with the generated HW interface, via registers, to control the SERDES. All transactions are served over the 25 GB/s OpenCAPI link.

# PHRYCTORIA Internal AFU structure



The depicted micro-architecture is designed to scale with the number of accelerator instances. This is enabled by forming clusters of accelerators and embedded SRAMs (BRAMs/URAMs).

Each cluster is managed by an attached controller, an FSM (C++/HLS) responsible for executing the program that runs within the cluster. SRAMs can be configured either as RAMs or as FIFOs, depending on the access pattern. Decoder/Encoder units break large data transfers from AXI4-MM into words and steer them to the SRAMs. They are interfaced to the AXI4-MM bus through optional FIFOs in order to absorb the bus latency (AXI4 in burst mode).
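The word-splitting step of the Decoder can be pictured with the following software sketch. It is only a behavioural model under stated assumptions: the function names, the 4-byte word size in the example and the round-robin steering policy are hypothetical, since the slides do not specify how words are distributed across SRAMs.

```python
from typing import Iterator, List


def split_into_words(burst: bytes, word_bytes: int) -> Iterator[bytes]:
    """Break one large transfer into fixed-size words."""
    if word_bytes <= 0:
        raise ValueError("word_bytes must be positive")
    if len(burst) % word_bytes:
        raise ValueError("burst length must be a multiple of the word size")
    for offset in range(0, len(burst), word_bytes):
        yield burst[offset:offset + word_bytes]


def steer_round_robin(words: Iterator[bytes], num_srams: int) -> List[List[bytes]]:
    """Distribute words over SRAM buffers in round-robin order (assumed policy)."""
    if num_srams <= 0:
        raise ValueError("num_srams must be positive")
    srams: List[List[bytes]] = [[] for _ in range(num_srams)]
    for index, word in enumerate(words):
        srams[index % num_srams].append(word)
    return srams


if __name__ == "__main__":
    burst = bytes(range(16))  # stand-in for one AXI4-MM burst
    for i, contents in enumerate(steer_round_robin(split_into_words(burst, 4), 2)):
        print(f"SRAM {i}: {[w.hex() for w in contents]}")
```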

A Transprecision Casting unit may be involved in order to cast the data to the right representation. It is mandatory for all datatypes transferred as uniform unsigned integer types (i.e. only native float and double do not need casting).
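As a rough software analogue of such a cast, the sketch below reinterprets a W-bit payload that arrived as an unsigned integer as a signed two's-complement value, the way a signed W-bit type would view the same bits. The slides do not describe the casting logic itself, so the two's-complement convention and the function names are assumptions made for illustration.

```python
def cast_to_signed(raw: int, width: int) -> int:
    """Reinterpret an unsigned W-bit payload as a signed two's-complement value."""
    if width <= 0:
        raise ValueError("width must be positive")
    if not 0 <= raw < (1 << width):
        raise ValueError("raw value does not fit in the given width")
    sign_bit = 1 << (width - 1)
    return raw - (1 << width) if raw & sign_bit else raw


def cast_to_unsigned(value: int, width: int) -> int:
    """Inverse direction: keep only the low W bits before serialization."""
    if width <= 0:
        raise ValueError("width must be positive")
    return value & ((1 << width) - 1)


if __name__ == "__main__":
    # Assumed example: 5-bit signed integers carried as unsigned payloads.
    for raw in (0b01111, 0b10000, 0b11111):
        print(f"{raw:05b} -> {cast_to_signed(raw, 5)}")
```

Sending the value as an unsigned payload keeps negative numbers compact on the wire, at the price of this extra reinterpretation step at the receiver.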

## PHRYCTORIA Supported Transprecision Data-types

|                                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | l                                                             | I                                                                                                                                                                                                                                                                                                                     | I I                                 |                                                                                                                                                                             | 1                                   | 1          |  |  |  |
|------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------|------------|--|--|--|
|                                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                                                               | N                                                                                                                                                                                                                                                                                                                     | ative data-t                        | ypes                                                                                                                                                                        |                                     |            |  |  |  |
|                                          | char                                                                                                                                                                                                                                                                                                                                                                                                                                                            | int8/<br>uint8                                                | int16/<br>uint16                                                                                                                                                                                                                                                                                                      | int32/<br>uint32                    | int64/<br>uint64                                                                                                                                                            | float                               | double     |  |  |  |
| Bits                                     | 8                                                                                                                                                                                                                                                                                                                                                                                                                                                               | 8                                                             | 16                                                                                                                                                                                                                                                                                                                    | 32                                  | 64                                                                                                                                                                          | 32                                  | 64         |  |  |  |
| protobuf<br>wire type                    | 2 (string)                                                                                                                                                                                                                                                                                                                                                                                                                                                      | 0 (varint)                                                    | 0 (varint)                                                                                                                                                                                                                                                                                                            | 0 (varint)                          | 0 (varint)                                                                                                                                                                  | 5 (32-bit)                          | 1 (64-bit) |  |  |  |

| Transprecision data-types | ap_int<W> / ap_uint<W> | ap_fixed<W> / ap_ufixed<W> | floatx<w,t> |
|---|---|---|---|
| Bits | W | W | w+t+1 |
| Native type used through unions | 1 ≤ W ≤ 7: uint8<br>8 ≤ W ≤ 15: uint16<br>16 ≤ W ≤ 31: uint32<br>32 ≤ W ≤ 63: uint64 | 1 ≤ W ≤ 7: uint8<br>8 ≤ W ≤ 15: uint16<br>16 ≤ W ≤ 31: uint32<br>32 ≤ W ≤ 63: uint64 | 1 ≤ w+t+1 ≤ 7: uint8<br>8 ≤ w+t+1 ≤ 15: uint16<br>16 ≤ w+t+1 ≤ 31: uint32<br>32 ≤ w+t+1 ≤ 63: uint64 |
| protobuf wire type | 0 (varint) | 0 (varint) | 0 (varint) |

We support arbitrary signed/unsigned integer types (ap_int<W>/ap_uint<W>) and arbitrary signed/unsigned fixed-point types (ap_fixed<W>/ap_ufixed<W>) using the HLS arbitrary precision types library.



In addition, we support arbitrary floating-point numbers using the FloatX library. FloatX emulates custom-precision types rather than implementing arbitrary floating-point types.

IEEE 754 floating-point standard: $v = (-1)^s * 2^e * m$

| Field | S (sign) | E (biased exponent) | T (trailing significand field) |
|---|---|---|---|
| Width | 1 bit | w bits | t = p - 1 bits |
| Bits (MSB to LSB) | S | E<sub>0</sub> ... E<sub>w-1</sub> | d<sub>1</sub> ... d<sub>p-1</sub> |

- floatx<11,52> ≡ double (64-bit)
- floatx<8,23> ≡ float (32-bit)
- floatx<5,10> ≡ half (16-bit)
- floatx<w,t> ≡ 1+w+t bits
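The sign/exponent/significand layout above can be turned into a small decoder. The following is a minimal Python sketch, not code from PHRYCTORIA or FloatX: the function name `decode_floatx` is hypothetical, and it assumes the usual IEEE 754 conventions (exponent bias of 2^(w-1) - 1, an implicit leading 1 for normal numbers, subnormals when the exponent field is zero).

```python
import math


def decode_floatx(bits: int, w: int, t: int) -> float:
    """Decode a (1 + w + t)-bit pattern laid out as sign | exponent | significand."""
    sign = (bits >> (w + t)) & 0x1
    exp_field = (bits >> t) & ((1 << w) - 1)   # biased exponent E
    frac_field = bits & ((1 << t) - 1)         # trailing significand T
    bias = (1 << (w - 1)) - 1

    if exp_field == (1 << w) - 1:              # all ones: infinity or NaN
        magnitude = math.inf if frac_field == 0 else math.nan
    elif exp_field == 0:                       # zero or subnormal: no implicit 1
        magnitude = math.ldexp(frac_field, 1 - bias - t)
    else:                                      # normal: implicit leading 1
        magnitude = math.ldexp((1 << t) + frac_field, exp_field - bias - t)
    return -magnitude if sign else magnitude


# The same value 1.0 in two formats of identical total width (16 bits).
half_one = decode_floatx(0x3C00, w=5, t=10)    # half layout
floatx_8_7_one = decode_floatx(0x3F80, w=8, t=7)
```

Both patterns occupy 1+w+t = 16 bits, so they would travel in the same native container even though they split exponent and significand differently.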


## HOW PHRYCTORIA manages to improve the communication goodput...



We depict a naive example of transmitting an array-A of 200 integers with value "01", from the host to the FPGA.

Left: OpenCAPI's DL interface operates on the LLC of the processor, so array-A is fetched as packed data in units of 128 bytes (POWER9's cache line). Right: the serialization function (automatically generated by ACG) sends 202 bytes, i.e. 200 bytes for the small integers and two extra bytes for pb headers.

The serialized data fits in just 2 cache lines. On the FPGA side, each cache line delivers 126 elements (128 - 2 for headers) instead of the 32 that fit when every integer occupies 4 bytes, i.e. a 126/32 = 3.9x improvement per clock.
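The arithmetic of this example can be reproduced with a short script. This is a minimal sketch under stated assumptions: `encode_varint` is a generic base-128 varint encoder written for illustration, not the ACG-generated serializer, and the 2-byte header size is taken from the slide rather than derived from the protobuf wire format.

```python
import math

CACHE_LINE_BYTES = 128   # POWER9 cache line, as stated on the slide
HEADER_BYTES = 2         # pb header size quoted on the slide (assumed, not derived)
NATIVE_INT_BYTES = 4     # size of one unserialized 32-bit integer


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


array_a = [1] * 200
payload = b"".join(encode_varint(v) for v in array_a)

serialized_bytes = HEADER_BYTES + len(payload)
native_bytes = NATIVE_INT_BYTES * len(array_a)

serialized_lines = math.ceil(serialized_bytes / CACHE_LINE_BYTES)
native_lines = math.ceil(native_bytes / CACHE_LINE_BYTES)

elements_per_line_native = CACHE_LINE_BYTES // NATIVE_INT_BYTES
elements_per_line_serialized = CACHE_LINE_BYTES - HEADER_BYTES
gain = elements_per_line_serialized / elements_per_line_native

print(serialized_bytes, serialized_lines, native_lines, round(gain, 1))
```

Each value "01" needs a single varint byte, which is where the reduction from 4 bytes to 1 byte per element comes from.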

# PHRYCTORIA Automatic Code Generator (ACG)



e.g. the host sends the FPGA messages with two data-types, a natively supported int32 and a transprecision floatX<8,7>.

Protobuf supports the serialization of native data-types and, through protoc, the automatic generation of the SW functions that manipulate data in this schema.

As a first step, a Python script parses the protobuf schema and converts the transprecision data-types to byte-aligned native ones (to be processed by protoc).
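A schema-rewriting step of this kind could look roughly as follows. This is an illustrative sketch only, not the actual ACG script: the way transprecision types are spelled inside the schema (`floatx<8,7>`, `ap_uint<40>`), the regular expression, and the function names are all assumptions. Because protobuf itself offers no 8-bit or 16-bit scalar types, the sketch maps every width up to 32 bits to `uint32` and wider ones to `uint64`.

```python
import re

# Hypothetical spelling of transprecision types inside a .proto-like schema.
_TRANSPRECISION = re.compile(
    r"\b(ap_u?int|ap_u?fixed|floatx)\s*<\s*(\d+)\s*(?:,\s*(\d+)\s*)?>",
    re.IGNORECASE,
)


def _native_type(match: re.Match) -> str:
    """Pick a protobuf varint type wide enough for the matched transprecision type."""
    kind = match.group(1).lower()
    first = int(match.group(2))
    second = match.group(3)
    if kind == "floatx":
        if second is None:
            raise ValueError("floatx needs both parameters: floatx<w,t>")
        bits = 1 + first + int(second)   # sign + exponent + trailing significand
    else:
        bits = first                     # total width W
    if not 1 <= bits <= 64:
        raise ValueError(f"unsupported width: {bits} bits")
    return "uint32" if bits <= 32 else "uint64"


def convert_schema(schema: str) -> str:
    """Replace every transprecision type with a native type that protoc accepts."""
    return _TRANSPRECISION.sub(_native_type, schema)


SCHEMA = """\
message Sample {
  int32 id = 1;
  floatx<8,7> weight = 2;
  ap_uint<40> counter = 3;
}
"""

converted = convert_schema(SCHEMA)
print(converted)
```

Native fields such as `int32 id` pass through untouched, so the converted schema can be handed to protoc as-is.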

For every protoc-generated function that handles a converted data-type, and only for those, a wrapper is generated that performs the conversion using unions.
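The union-based conversion can be imitated in Python with `struct`, which reinterprets the same bytes as a different type. The sketch below is a hypothetical illustration, not generated PHRYCTORIA code: `SampleMessage` stands in for a protoc-generated class, the accessor names are invented, and the narrowing from float32 to floatx<8,7> uses plain truncation of the low significand bits, which may differ from the rounding the real system applies.

```python
import struct
from dataclasses import dataclass


@dataclass
class SampleMessage:
    """Stand-in for a protoc-generated message (hypothetical)."""
    id: int = 0
    weight: int = 0   # raw 16-bit pattern of a floatx<8,7> value


def float_to_floatx_8_7_bits(value: float) -> int:
    """View value as float32 bits and keep sign, 8 exponent bits, top 7 significand bits."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16   # truncation: the low 16 significand bits are dropped


def floatx_8_7_bits_to_float(raw: int) -> float:
    """Expand a 16-bit floatx<8,7> pattern back to a float by zero-filling the low bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (raw & 0xFFFF) << 16))
    return value


def set_weight(msg: SampleMessage, value: float) -> None:
    """Wrapper around the native setter: store the transprecision bit pattern."""
    msg.weight = float_to_floatx_8_7_bits(value)


def get_weight(msg: SampleMessage) -> float:
    """Wrapper around the native getter: recover the transprecision value."""
    return floatx_8_7_bits_to_float(msg.weight)


msg = SampleMessage(id=7)
set_weight(msg, 3.14159)
print(hex(msg.weight), get_weight(msg))
```

The message only ever stores an unsigned integer, so the varint encoder sees a 16-bit pattern rather than a 32-bit float.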

# PHRYCTORIA Evaluation: Synthetic Data Set



Allocate different sizes of native data-types on the host and either send them as structured byte-aligned data to the FPGA over OpenCAPI, or first serialize them and then send them as a pb stream.

The evaluation was carried out with both host and FPGA acting as Rx and Tx (H2D and D2H). The intention was to show that the OpenCAPI throughput is not degraded for native data-types. The figure depicts the average over all native data-types, as the deviation across the seven data-types was marginal (2.3%) for a sequence of 1k repeated tests.

The bandwidth is not severely degraded for native data-types (0.92%). We intentionally loaded data that could not benefit from *varint* encoding, by setting at least one bit in every byte of every native type.
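Why such data cannot benefit from varint encoding follows from the encoded length, which depends only on the position of the highest set bit. A minimal sketch (the helper `varint_len` is illustrative, not part of PHRYCTORIA or protobuf):

```python
def varint_len(value: int) -> int:
    """Number of bytes a base-128 varint needs for a non-negative integer."""
    if value < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    # 7 payload bits per byte; zero still occupies one byte.
    return max(1, -(-value.bit_length() // 7))


small = varint_len(1)                     # a small value shrinks to one byte
dense_min = varint_len(0x01010101)        # smallest uint32 with a bit set in every byte
dense_max = varint_len(0xFFFFFFFF)        # largest uint32
print(small, dense_min, dense_max)
```

A 32-bit value with a bit set in its top byte therefore needs 4 or 5 varint bytes, no fewer than its native 4 bytes, which makes this data a worst case for the serializer.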

## PHRYCTORIA Evaluation: Transprecision-Data Set

- **6.3-7.4x** for < 8 bits
- **2.9-3.4x** for 8-15 bits
- for 16-32 bits and for 64 bits: values shown only in the slide's figure

# PHRYCTORIA-Evaluation: Realistic NLP Data Set - AI2-Reasoning



Accelerators for an NLP domain that need to access the respective dataset over the OpenCAPI link.


The AI2 Reasoning Challenge (ARC) dataset from the Allen Institute for AI contains 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question answering.

Of the 12 columns of data, 7 can be represented with 8-bit unsigned integers, 2 with strings and 3 with 16 bits. The first gain comes from this narrowing of the data-types; the second comes directly from the varint encoding.



The data-set was reduced from 298.8 MB to 250.3 MB by the first step, and to 61.1 MB by serialization. 6.9x goodput gain.

# PHRYCTORIA adaptation to accelerator's data-flow.



An important parameter of the PHRYCTORIA HW design is the depth of the optional 128B-wide FIFOs between the AXI4-MM AFU interface and the decoder/encoder. These FIFOs can absorb the latency of the accelerator and keep the AXI4 bus operating in burst mode. For example, an accelerator with II=2 consumes 1 AXI word every 2 cycles, but the VHLS scheduler is not able to sustain an AXI4 burst, so 2 full AXI4 transactions are issued. The FIFOs buffer temporally consecutive AXI4 words, one per cycle, in order to saturate the OpenCAPI-AXI4 bandwidth.

## PHRYCTORIA Logic Utilization AD9V3, Xilinx VU3P FPGA - 251.775MHz

**73%** empty space to fit your accelerators

[Figure: AD9V3 floorplan of the VU3P device by clock region]

## PHRYCTORIA Logic Utilization AD9H7, Xilinx VU37P FPGA - 251.775MHz

Empty space to fit your accelerators

AD9H7 impressively met the timing constraint of 2.482375 ns (402.840 MHz)

[Figure: AD9H7 floorplan of the VU37P device by clock region, spanning its SLRs, with the High Bandwidth Memory]

## Take-away

- We explored the possibility to sustain high useful throughput on systems with OpenCAPI-enabled FPGAs for emerging transprecision data-types.
- By adopting the widely used protobuf IDL description and the OC-Accel framework we built a system, named PHRYCTORIA, that automatically generates the appropriate interfaces on a host SW and on FPGA HW.
- We showed that the inherent varint encoding of protobuf used in PHRYCTORIA sustains an effective throughput similar to the practical best of OpenCAPI for
  - synthetic datasets with uniform transprecision data-types and
  - a real NLP data-set with combined transprecision data-types

#### Future work

- Integrate with accelerators from different domains (linear algebra, quantitative finance, computer vision, DSP).
- Support synthesizable versions of arbitrary floating point.
- Support the posit arithmetic system.
- Support FPGA on-board DRAM and FPGA in-package HBM.

## Thank you

Dionysios Diamantopoulos

<u>did@zurich.ibm.com</u> <u>https://researcher.watson.ibm.com/researcher/view.php?person=zurich-DID</u>


## AC922 Nvidia V100 vs. IC922 Nvidia T4

|                              | Tesla V100 PCIe         | Tesla V100 SXM2 |
|------------------------------|-------------------------|-----------------|
| GPU Architecture             | NVIDIA Volta            | NVIDIA Volta    |
| NVIDIA Tensor Cores          | 640                     | 640             |
| NVIDIA CUDA® Cores           | 5,120                   | 5,120           |
| Double-Precision Performance | 7 TFLOPS                | 7.5 TFLOPS      |
| Single-Precision Performance | 14 TFLOPS               | 15 TFLOPS       |
| Tensor Performance           | 112 TFLOPS              | 120 TFLOPS      |
| GPU Memory                   | 16 GB HBM2              | 16 GB HBM2      |
| Memory Bandwidth             | 900 GB/sec              | 900 GB/sec      |
| ECC                          | Yes                     | Yes             |
| Interconnect Bandwidth*      | 32 GB/sec               | 300 GB/sec      |
| System Interface             | PCIe Gen3               | NVIDIA NVLink   |
| Form Factor                  | PCIe Full Height/Length | SXM2            |
| Max Power Consumption        | 250 W                   | 300 W           |
| Thermal Solution             | Passive                 | Passive         |
| Compute APIs                 | … DirectCompute, … OpenACC | … DirectCompute, … OpenACC |

- T4 Specs: https://www.nvidia.com/en-us/data-center/tesla-t4/
- V100 Specs: https://www.nvidia.com/en-us/data-center/v100/



