Mario Porrmann
Osnabrück University
27. April 2022
Performance Evaluation and Benchmarking – Reconfigurable Accelerators in VEDLIoT
Big Picture
[Figure: VEDLIoT overview diagram. Applications: Smart Home, Industrial IoT, Automotive AI, Open Call. Middleware: Optimizer, Emulation, Benchmarking & Deployment. Hardware Platforms (Microservers & Accelerators): deployment tiers Embedded/Far Edge, Near Edge, Cloud; RECS platforms uRECS, t.RECS, RECS|Box; modules Jetson AGX, NVIDIA Xavier, COM-HPC Xilinx Zynq UltraScale+, SMARC Xilinx Zynq UltraScale+, Coral SoM, Xilinx Kria, RPi CM4, ARVSOM. Cross-cutting topics: Requirements, Security & Safety, Safety & Robustness, Modelling & Verification, Monitoring, Trusted Execution & Communication, RISC-V extensions.]
• FPGA-based Accelerators in VEDLIoT
• Dynamic Reconfiguration of Accelerators
• First Results on Performance and Energy Efficiency
• Workflow for Configurable Soft SoCs
FPGA Infrastructure
• FPGA base architecture
• Integration of the required interfaces and accelerators
• Support for dynamic run-time reconfiguration
• Exchange accelerators on the FPGA at run-time to increase resource efficiency and flexibility
• FPGA task deployment mechanism
• Migration of a task from one FPGA to another FPGA
[Figure: logic cell counts shown on the slide: 85k, 2800k, 25.2M, 75.6M]
Basic FPGA Infrastructure
• FPGA base architecture for the µ.RECS
• Block-based design enabling easy customization of the FPGA platform in the µ.RECS
• Front-end based on Xilinx Vitis with additional (optional) IP cores from LiteX
• Scripting approach for complete system design
• Easy porting to new FPGAs and FPGA platforms, esp. µ.RECS, t.RECS, RECS|Box
• Flexible integration of accelerators
• Integration of the required interfaces for communication (Ethernet, PCIe, etc.) as well as sensors and actuators targeted in the use cases
• PetaLinux enables easy access to the system and to integrated accelerators for software developers
• µ.RECS testbed for early evaluation
[Figure: block diagram of the SMARC module. The SoC combines a Processing System (dual/quad Arm Cortex-A53, dual Arm Cortex-R5, memory subsystem, interrupt controller, I/O interfaces, platform management, system functions and configuration) with the FPGA fabric (accelerator(s) attached via AXI and AXI-Lite, Xilinx/LiteX memory controller, I/O controller, clock, HDMI, CSI). External connections: DDR (PS), DDR (PL), eMMC, Flash, SD, SATA, PCIe x4, GigE, USB, HDMI, CSI, GPIO, UART.]
FPGA Base Architecture for µ.RECS
[Figure: the same SMARC module block diagram as on the previous slide (Processing System, FPGA fabric with accelerator(s), external interfaces).]
First Reference Design Based on Xilinx DPU
• Baseline for evaluation of FPGA accelerators developed in VEDLIoT
• Xilinx Deep Learning Processor Unit (DPU)
• Programmable engine for convolutional neural networks
• Easy integration as an IP core in Xilinx UltraScale+ MPSoCs
• Configurable hardware architecture (e.g., parallelism, memory/DSP usage)
• Evaluation on various platforms with Xilinx UltraScale+ MPSoCs
• ZU3EG on Avnet Ultra96-v2 (154k Logic Cells)
• ZU4EG in the µ.RECS testbed (192k Logic Cells)
• ZU15EG on Trenz TE0808 MPSoC Module (747k Logic Cells)
• ZU19EG on Trenz COM-HPC Module in t.RECS (1,143k Logic Cells)
DPU     Peak ops/clock   Peak performance (300 MHz) [GOPS]   Peak performance (200 MHz) [GOPS]
B512    512              153.6                               102.4
B2304   2304             691.2                               460.8
B4096   4096             1228.8                              819.2
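The peak figures in the table follow directly from the operations per clock and the clock frequency. The following minimal Python sketch reproduces them; the function and dictionary names are illustrative and not part of any Xilinx tool.

```python
# Peak operations per clock cycle for each DPU configuration (from the table).
DPU_OPS_PER_CLOCK = {"B512": 512, "B2304": 2304, "B4096": 4096}


def peak_gops(ops_per_clock: int, clock_mhz: float) -> float:
    """Peak throughput in GOPS: ops per clock times clock rate (MHz / 1000 = GHz)."""
    return ops_per_clock * clock_mhz / 1000.0


def peak_table(clock_mhz: float) -> dict:
    """Peak GOPS of every listed DPU configuration at the given clock."""
    return {name: round(peak_gops(ops, clock_mhz), 1)
            for name, ops in DPU_OPS_PER_CLOCK.items()}


if __name__ == "__main__":
    for mhz in (300, 200):
        print(f"{mhz} MHz:", peak_table(mhz))
```

These are upper bounds; the achieved throughput depends on the network and on how well the DPU is kept busy.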
First Reference Design Based on Xilinx DPU
• Example implementation utilizing the µ.RECS testbed
• SMARC module SECO RUSSELL
• Xilinx Zynq UltraScale+ XCZU4EG-1 FPGA
• Quad-core Arm Cortex-A53, dual-core Arm Cortex-R5
• 88k 6-input look-up tables (LUTs)
• 176k flip-flops (FFs)
• 728 DSP slices
• 128 36 Kb BRAM blocks (4.5 Mb total)
• 48 288 Kb URAM blocks (13.5 Mb total)
• 2 GByte 64-bit DDR4 SDRAM (PS)
• 512 MByte 64-bit DDR4 SDRAM (PL)
Resource utilization by DPU configuration

Design            Resource   B512              B2304             B4096
Complete Design   LUTs       34,456 (39.2%)    47,107 (53.6%)    56,685 (64.5%)
                  FFs        43,557 (24.8%)    78,215 (44.5%)    107,732 (61.3%)
                  DSPs       110 (15.1%)       422 (58.0%)       690 (94.8%)
                  BRAMs      13.5 (10.5%)      61 (47.7%)        81 (63.3%)
                  URAMs      16 (33.3%)        40 (83.3%)        48 (100%)
Base Design       LUTs       8,439 (9.6%)      8,434 (9.6%)      8,456 (9.6%)
                  FFs        10,205 (5.8%)     10,205 (5.8%)     10,205 (5.8%)
                  DSPs       0 (0%)            0 (0%)            0 (0%)
                  BRAMs      4 (3.1%)          4 (3.1%)          4 (3.1%)
                  URAMs      0 (0%)            0 (0%)            0 (0%)
DPU               LUTs       26,017 (29.6%)    38,673 (44.0%)    48,229 (54.9%)
                  FFs        33,352 (19.0%)    68,010 (38.7%)    97,527 (55.5%)
                  DSPs       110 (15.1%)       422 (58.0%)       690 (94.8%)
                  BRAMs      9.5 (7.4%)        57 (44.5%)        77 (60.2%)
                  URAMs      16 (33.3%)        40 (83.3%)        48 (100%)
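As a cross-check of the table, the complete design is the sum of the base design and the DPU, and each percentage is that sum divided by the device total. The sketch below does this for the resources whose exact device totals are given on the slide (DSP slices, BRAM and URAM blocks); the names are illustrative only.

```python
# Device totals of the XCZU4EG as listed on the slide.
DEVICE_TOTALS = {"DSPs": 728, "BRAMs": 128, "URAMs": 48}

# Resource counts from the table: (base design, DPU) per configuration.
USAGE = {
    "B512":  {"DSPs": (0, 110), "BRAMs": (4, 9.5), "URAMs": (0, 16)},
    "B2304": {"DSPs": (0, 422), "BRAMs": (4, 57),  "URAMs": (0, 40)},
    "B4096": {"DSPs": (0, 690), "BRAMs": (4, 77),  "URAMs": (0, 48)},
}


def utilization(config: str) -> dict:
    """Map each resource to (complete-design count, percent of the device)."""
    result = {}
    for resource, (base, dpu) in USAGE[config].items():
        total = base + dpu
        percent = round(100.0 * total / DEVICE_TOTALS[resource], 1)
        result[resource] = (total, percent)
    return result


if __name__ == "__main__":
    for config in USAGE:
        print(config, utilization(config))
```

For B4096 this shows why the configuration is the largest that fits this device: all URAM blocks and nearly all DSP slices are in use.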
Efficient Utilization of the Xilinx DPU
• Multithreading is crucial for high performance
• Environment supporting semi-automatic realization and evaluation of multithreading during application development
• Execution Time – Single-threaded
• Execution Time – Multi-threaded
[Figure: execution timelines. Single-threaded: Read Data, Preproc., DPU Processing, and Post. run one after another (t0 to t4). Multi-threaded: reading and preprocessing of further inputs overlap with DPU processing and post-processing (t0 to ttotal).]
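The multi-threaded timeline can be sketched as a three-stage pipeline in which each stage runs in its own thread and hands results to the next stage through a queue. This is a minimal illustration only: `preprocess`, `run_dpu`, and `postprocess` are hypothetical placeholders, not the Vitis AI API or the VEDLIoT environment.

```python
import queue
import threading

SENTINEL = None  # marks the end of the input stream


def preprocess(frame):
    # Placeholder for reading and preprocessing one input (hypothetical).
    return frame * 2


def run_dpu(tensor):
    # Placeholder for the accelerator call (hypothetical).
    return tensor + 1


def postprocess(result):
    # Placeholder for decoding the accelerator output (hypothetical).
    return f"result={result}"


def stage(func, inbox, outbox):
    """Apply func to every item from inbox; forward the sentinel and stop."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            break
        outbox.put(func(item))


def run_pipeline(frames):
    """Run all frames through the three stages and return results in order."""
    q_raw, q_pre, q_dpu, q_out = (queue.Queue(maxsize=4) for _ in range(4))

    def feed():
        for frame in frames:
            q_raw.put(frame)
        q_raw.put(SENTINEL)

    workers = [
        threading.Thread(target=feed),
        threading.Thread(target=stage, args=(preprocess, q_raw, q_pre)),
        threading.Thread(target=stage, args=(run_dpu, q_pre, q_dpu)),
        threading.Thread(target=stage, args=(postprocess, q_dpu, q_out)),
    ]
    for worker in workers:
        worker.start()

    results = []
    while True:  # drain the output while the stages are still running
        item = q_out.get()
        if item is SENTINEL:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

In Python, the stages only overlap in time when they release the interpreter lock, which is typically the case for I/O and for calls that wait on an accelerator. The bounded queues keep a fast stage from running arbitrarily far ahead of a slow one.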
Efficient Utilization of the Xilinx DPU
• Performance and power monitoring for single- and multi-threaded implementations
• Detailed power measurements on RECS platforms
• Power-aware profiling and optimization
Example DSE Using Different DPU Configurations
[Figures: three consecutive slides with this title contain only plots, which are not recoverable from the text extraction.]
Benchmark Performance of DL Accelerators
[Figure: YoloV4 benchmark plot of Performance [GOPS] (axis ticks 10 to 10000) against Power [Watt] (axis ticks 2 to 128). Series: INT8, FP16, FP32. Labelled points include ZU3 and ZU15; the remaining data labels are not recoverable from the text extraction.]
Dynamic Reconfiguration of DL Accelerators
• Change the characteristics of the DL accelerator at run-time (e.g., change performance-power trade-off or performance-accuracy trade-off)
[Figure: the SMARC module block diagram extended for dynamic reconfiguration. A PR-Region in the FPGA fabric holds either Accelerator A or Accelerator B; it is attached through AXI and AXI-Lite blocks labelled CB and a Disconnect block, next to a block labelled DFX.]
[Figure: variant of the dynamic reconfiguration diagram with two PR-Regions, each behind its own Disconnect block, holding separate accelerators.]
Reconfigurable DL Accelerators
• Accelerator to be used for the co-design approach: generation of dataflow architectures based on C++ templates
• Support for inference and training
• Targeting CNNs, deep reinforcement learning, and federated learning
• Definition of parameterizable layer templates in C++ (e.g., convolution, fully connected, pooling, and activation functions)
• Parameterizable, e.g., quantization (from low bit-width INT to float)
• Optimized for high-level synthesis
• All layers integrate three functions (if required): inference/forward propagation, backpropagation, and update function
• Inference utilizes only the forward path
• Learning (DeepRL) utilizes the full functionality of the layer templates
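To illustrate the three functions a layer integrates, here is a minimal Python analogue of a fully connected layer with a forward, a backward, and an update function. It is a sketch of the structure only: the actual VEDLIoT templates are C++ for high-level synthesis, and the class name, the plain gradient-descent update, and the absence of quantization are assumptions made for this example.

```python
class DenseLayer:
    """Fully connected layer y = W x + b with forward, backward, and update."""

    def __init__(self, weights, bias, learning_rate=0.1):
        self.w = [row[:] for row in weights]  # shape: outputs x inputs
        self.b = bias[:]
        self.lr = learning_rate
        self.x = None        # input cached by forward() for backward()
        self.grad_w = None
        self.grad_b = None

    def forward(self, x):
        """Inference path; the only function needed when not training."""
        self.x = x[:]
        return [sum(wi * xi for wi, xi in zip(row, x)) + bi
                for row, bi in zip(self.w, self.b)]

    def backward(self, grad_out):
        """Store parameter gradients and return the gradient w.r.t. the input."""
        self.grad_w = [[g * xi for xi in self.x] for g in grad_out]
        self.grad_b = grad_out[:]
        return [sum(self.w[o][i] * grad_out[o] for o in range(len(grad_out)))
                for i in range(len(self.x))]

    def update(self):
        """Apply one gradient-descent step to weights and bias."""
        for o, row in enumerate(self.w):
            for i in range(len(row)):
                row[i] -= self.lr * self.grad_w[o][i]
            self.b[o] -= self.lr * self.grad_b[o]


if __name__ == "__main__":
    layer = DenseLayer([[1.0, 2.0]], [0.5], learning_rate=0.5)
    print(layer.forward([1.0, 1.0]))   # forward path only: inference
    print(layer.backward([1.0]))       # training additionally uses backward ...
    layer.update()                     # ... and update
    print(layer.w, layer.b)
```

An inference-only accelerator would instantiate just the forward path, which is what keeps its resource footprint smaller than that of a training-capable one.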
Soft SoC Platform
• Generation of soft SoC platforms
• Utilize RISC-V soft cores
• Generic interface to AI accelerators
• Modelled in an open-source emulation environment
• Utilize LiteX SoC generator
• Run-time reconfiguration
• Accelerators
• Processor cores
[Figure: FPGA with a base architecture, an AI accelerator, and a run-time reconfiguration interface.]
Configurable SoC for ML Workflows
• Configurable soft SoC generator provides a platform for low-power AI accelerator exploration
• The generator produces a system with the set of peripherals required for a specific task
• Scalable from MCU-class to Linux-capable platforms
• A generic, vendor-independent accelerator integration interface makes it well suited as an AI research platform
• Portable across different hardware, based on open-source tooling
• CFUs (Custom Function Units): custom accelerators designed for specific workflows, tightly coupled with the CPU
• Accessed via custom RISC-V instructions
• Can be implemented in high-level hardware description languages, e.g., Python-based Amaranth
Configurable SoC for ML Workflows
• CFUs offer great flexibility
• Test various dedicated accelerators for specific workflows
• Renode simulation framework extended with CFU support
• Co-simulating functional models of the SoC with verilated, cycle-accurate CFUs
• Invaluable tool for development
• Massive continuous integration testing
• Different CFU implementations
• Different inputs
• Allows for automatic result comparison and analysis
• Everything open-sourced
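The idea of comparing different CFU implementations on different inputs can be sketched with two functional models of the same custom instruction. The example assumes the usual RISC-V R-type shape of two 32-bit source operands and one 32-bit result; the operation itself (a dot product of four packed signed 8-bit lanes) and all function names are hypothetical, and nothing here uses the Renode API.

```python
import struct


def to_int8(byte: int) -> int:
    """Interpret an 8-bit value as a signed two's-complement integer."""
    return byte - 256 if byte >= 128 else byte


def cfu_reference(rs1: int, rs2: int) -> int:
    """Reference model: dot product of four packed signed 8-bit lanes."""
    acc = 0
    for lane in range(4):
        a = to_int8((rs1 >> (8 * lane)) & 0xFF)
        b = to_int8((rs2 >> (8 * lane)) & 0xFF)
        acc += a * b
    return acc & 0xFFFFFFFF  # the result register is 32 bits wide


def cfu_candidate(rs1: int, rs2: int) -> int:
    """Second implementation of the same operation, written differently."""
    lanes1 = struct.unpack("<4b", struct.pack("<I", rs1))
    lanes2 = struct.unpack("<4b", struct.pack("<I", rs2))
    return sum(a * b for a, b in zip(lanes1, lanes2)) & 0xFFFFFFFF


def compare(reference, candidate, inputs):
    """Return every operand pair on which the two models disagree."""
    return [(a, b) for a, b in inputs if reference(a, b) != candidate(a, b)]


if __name__ == "__main__":
    test_inputs = [(0x01020304, 0x01010101), (0x000000FF, 0x00000002),
                   (0x80808080, 0x7F7F7F7F), (0xFFFFFFFF, 0xFFFFFFFF)]
    print("mismatches:", compare(cfu_reference, cfu_candidate, test_inputs))
```

In a continuous integration setup, the candidate would be a cycle-accurate model of the hardware rather than a second software function, and the comparison would run over a much larger input set.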
Very Efficient Deep Learning for IoT – VEDLIoT
• Platform
• Hardware: Scalable, heterogeneous, distributed
• Accelerators: Efficiency boost by FPGA and ASIC technology
• Toolchain: Optimizing Deep Learning for IoT
• Use cases
• Industrial IoT
• Automotive
• Smart Home
• Open call
• Open for submissions until 8 May
• Early use and evaluation of VEDLIoT technology
Follow our work
⇒ https://twitter.com/VEDLIoT
⇒ https://www.linkedin.com/company/vedliot/
⇒ https://vedliot.eu
Be part of it
⇒ Open call NOW!
⇒ Allows early use and evaluation of VEDLIoT technology
Thank you for your attention.
Deep Learning Accelerators
[Figure: four panels. Heterogeneous DL Accelerator: DL model, Compiler, and a DL accelerator that is a CPU, GPU, or TPU. Reconfigurable DL Accelerator: DL model and HW Spec, Compiler, DL accelerator on an FPGA. Dynamically Reconfigurable DL Accelerator: DL model and several HW Specs, Compilers, DL accelerator on an FPGA. Co-Designed DL Accelerator: DL model, Co-Design Compiler, DL accelerator on an FPGA.]
Dynamic Reconfiguration of DL Accelerators
• Utilize dynamic reconfiguration
• Change the complete DL model and the corresponding accelerator at run-time
depending on application requirements
• Change the characteristics of the DL accelerator at run-time
(e.g., change performance-power trade-off or performance-accuracy trade-off)
• Enable accelerator to be partially reconfigured for different phases of the application
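A run-time decision about which accelerator variant to load could look like the following sketch, which picks the fastest variant that fits a given power budget. The power and throughput numbers are the two-thread YoloV4 measurements from the reference-design table in this deck; the `Variant` class, the `select_variant` function, and the selection rule are assumptions for illustration, not the VEDLIoT mechanism.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Variant:
    name: str
    power_w: float            # measured power while running YoloV4
    inferences_per_s: float   # measured throughput


# Two-thread measurements at 300 MHz, taken from the reference-design table.
VARIANTS = (
    Variant("B3136", power_w=12.49, inferences_per_s=10.76),
    Variant("B4096", power_w=15.42, inferences_per_s=15.12),
)


def select_variant(power_budget_w: float,
                   variants: Sequence[Variant] = VARIANTS) -> Optional[Variant]:
    """Return the fastest variant within the power budget, or None if none fits."""
    fitting = [v for v in variants if v.power_w <= power_budget_w]
    if not fitting:
        return None
    return max(fitting, key=lambda v: v.inferences_per_s)


if __name__ == "__main__":
    for budget in (10.0, 13.0, 20.0):
        print(budget, select_variant(budget))
```

In a real system, the selected variant would then be loaded into the reconfigurable region; that step is platform-specific and not shown here.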
First Reference Design Based on Xilinx DPU
Example
• Performance and power evaluation for YoloV4
• Trade-off latency vs. performance

Platform: SM-B71 on SOM-DB2500 Carrier

DPU                                   B3136 x1, 300 MHz                  B4096 x1, 300 MHz
Number of threads                     1          2          4           1          2          4
Latency [ms]                          120.34     198.62     383.72      93.42      144.51     276.33
Achieved performance [Inferences/s]   8.28       10.76      10.76       10.66      15.12      15.12
Achieved performance [GOPS]           500.11     649.90     649.90      643.86     913.25     913.25
Peak performance [GOPS]               940.8      940.8      940.8       1228.8     1228.8     1228.8
Performance Ratio                     53.16%     69.08%     69.08%      52.40%     74.32%     74.32%

Cost Metrics
Power [W]                             11.20      12.49      12.51       13.14      15.42      15.44
Idle Power [W]                        0.07/7.09  0.07/7.09  0.07/7.09   0.07/7.56  0.07/7.56  0.07/7.56
Energy/Inference [J]                  1.352      1.161      1.173       1.233      1.020      1.021
Power Efficiency [GOPS/W]             44.65      52.03      51.95       49.00      59.23      59.15
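The derived rows of the table can be recomputed from the measured ones. The sketch below does this for the single-threaded B3136 column, assuming the usual definitions (energy per inference as power divided by throughput, power efficiency as achieved GOPS per watt, performance ratio as achieved over peak GOPS); the function names are illustrative. Because the table inputs are themselves rounded, recomputed values can differ slightly from the printed ones.

```python
def energy_per_inference(power_w: float, inferences_per_s: float) -> float:
    """Energy in joules spent on one inference."""
    return power_w / inferences_per_s


def power_efficiency(achieved_gops: float, power_w: float) -> float:
    """Achieved throughput per watt, in GOPS/W."""
    return achieved_gops / power_w


def performance_ratio(achieved_gops: float, peak_gops: float) -> float:
    """Achieved throughput as a percentage of the peak throughput."""
    return 100.0 * achieved_gops / peak_gops


if __name__ == "__main__":
    # B3136, one thread, 300 MHz (values from the table).
    power_w, inf_per_s = 11.20, 8.28
    achieved, peak = 500.11, 940.8
    print(f"Energy/Inference: {energy_per_inference(power_w, inf_per_s):.3f} J")
    print(f"Power efficiency: {power_efficiency(achieved, power_w):.2f} GOPS/W")
    print(f"Performance ratio: {performance_ratio(achieved, peak):.2f} %")
```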
