Information Classification: General
December 8-10 | Virtual Event
Klessydra-T: Designing Vector Coprocessors for
Multi-Threaded Edge-Computing Cores
Mauro Olivieri
Professor
Sapienza University of Rome
#RISCVSUMMIT
DIGITAL SYSTEM LAB @ SAPIENZA UNIVERSITY OF ROME
Francesco Lannutti, collaborator @ Synopsys
Marcello Barbirotta, PhD candidate
Mauro Olivieri, Associate Professor
Francesco Menichelli, Assistant Professor
Antonio Mastrandrea, Research Fellow
Abdallah Cheikh, Research Fellow
Luigi Blasi, PhD candidate @ DSI GmbH
Francesco Vigli, PhD candidate @ ELT SpA
Stefano Sordillo, PhD candidate
OUTLINE
INTRODUCTION & MOTIVATION
THE KLESSYDRA-T ARCHITECTURE
• Interleaved Multi-Threading baseline
• Parameterized vector acceleration schemes
• Klessydra vector intrinsic functions
BENCHMARK WORKLOADS
• Convolution, Matmul, FFT
• Homogeneous and composite workload
RESULTS
• Cycle count and absolute execution time
• Maximum clock frequency and hardware resource utilization
• Energy efficiency
CONCLUSIONS
APPLICATION CONTEXT AND MOTIVATION
▪ Recognized drivers are pushing towards (extreme) edge computing: availability, energy saving, security, etc., with implications for both SW design and HW design
▪ HW design challenges of extreme edge computing devices:
  • Local energy budget
  • Cost & size
  • Computing power
▪ General setting:
  • Possibly taking advantage of inherently multi-threaded application routines
  • Inevitability of hardware acceleration support
THE PULPINO-COMPATIBLE KLESSYDRA CORE FAMILY
PULPino feat. Klessydra S0 core
  • Starting point
  • M mode v1.10
  • RV32I user ISA
  • single hart
PULPino feat. Klessydra T0 cores
  • M mode v1.10
  • RV32I user ISA
  • Atomic ext. (partial)
  • multiple PC & CSR
  • multiple interleaved harts
PULPino feat. Klessydra F0 cores
  • "space-qualified" core
  • T0 microarchitecture
  • + configurable HW/SW fault-tolerance support
PULPino feat. Klessydra T1 cores
  • "edge computing" core
  • extends T0 microarchitecture
  • RV32IM
  • + configurable multiple scratchpad memories
  • + configurable vector unit
  • extended ISA
THE KLESSYDRA IMT MICROARCHITECTURE
▪ Baseline Klessydra T03 core features:
  • thread context switch at each clock cycle
  • in-order, single-issue instruction execution
  • feed-forward pipeline structure (no hardware support for pipeline hazard handling)
  • bare-metal execution (RISC-V M mode)
▪ The vector-accelerated Klessydra-T13 core has been designed as a superset of the basic Klessydra-T03 microarchitecture.
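[Figure: block diagram of the Klessydra T03 pipeline, showing the Fetch, Decode, Execute and WB stages, the register file, the per-hart PC and CSR units, the harc updater, the debug unit, and the program and data memories, with instructions from harts a, b and c interleaved in the pipeline.]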
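The per-cycle thread context switch listed among the T03 features can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: the struct `imt_model`, the functions `imt_init` and `imt_step`, the fixed count of three harts, and the ascending round-robin order are hypothetical and are not taken from the Klessydra RTL. It shows that consecutive fetch slots always belong to different harts, which is plausibly what allows a feed-forward pipeline to do without hazard-handling hardware.

```c
#include <stdint.h>

#define NUM_HARTS 3u /* harts a, b, c as drawn on the slide; the count is an assumption */

/* Hypothetical model of interleaved multi-threading state. */
typedef struct {
    uint32_t pc[NUM_HARTS];     /* one program counter per hart */
    uint32_t issued[NUM_HARTS]; /* instructions fetched so far by each hart */
    unsigned harc;              /* hart counter: the hart that fetches in this cycle */
} imt_model;

/* Reset all harts to PC 0 and start fetching from hart 0. */
void imt_init(imt_model *m)
{
    for (unsigned h = 0; h < NUM_HARTS; h++) {
        m->pc[h] = 0;
        m->issued[h] = 0;
    }
    m->harc = 0;
}

/* Advance one clock cycle: the current hart fetches one 32-bit instruction,
 * then the hart counter moves on, so the next cycle belongs to another hart.
 * Returns the index of the hart that fetched in this cycle. */
unsigned imt_step(imt_model *m)
{
    unsigned h = m->harc;
    m->pc[h] += 4;                  /* sequential fetch, no branches modeled */
    m->issued[h] += 1;
    m->harc = (h + 1u) % NUM_HARTS; /* context switch at every cycle */
    return h;
}
```

Calling `imt_step` repeatedly yields the hart sequence 0, 1, 2, 0, 1, 2, and each hart advances its own PC once every three cycles.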
THE KLESSYDRA-T1 MICROARCHITECTURE
FAMILY
[Figure: block diagrams of the Klessydra-T13 core. The T03 pipeline (Fetch, Decode, Execute, WB, register file, per-hart PC and CSR, harc updater, debug unit, program and data memories) is extended in the execution stage with the scalar EXEC unit, the LSU, the MFU (x F instances, x D lanes; Add, Sub, Shift, Mul, Accum and ReLU operators; accelerator init/exec control and mapping) and the SPMI (SPM0 to SPMN-1, x N, each with x D banks; bank interleaving, input mapping, output mapping and data reordering). MAU_req and MAU_busy signals are shown between the units.]
Klessydra T13 core features
▪ multiple units in the execution stage
  • scalar execution unit (EXEC)
  • vector-oriented multi-purpose functional unit (MFU) with Scratchpad Memory support
  • Load/Store unit (LSU)
▪ possible concurrent execution of instructions of different types
HARDWARE ACCELERATION PARAMETRIC
SCHEMES
The parametric coprocessor architecture in T13 cores, comprising the MFU and the SPMIs, can be configured at synthesis level according to the following values:
• the number of parallel lanes D in the MFU, which defines the DLP degree and also corresponds to the number of SPM banks in each SPMI block
• the number of MFUs F
• the SPM bank capacity B
• the number of SPMs N
• the number of SPMIs M
• the sharing scheme of MFUs and SPMIs among the harts, i.e. heterogeneous or symmetric
▪ M=1, F=1, D=1: SISD
▪ M=1, F=1, D=2,4,8: Pure SIMD
▪ M=3, F=3, D=1: Symmetric MIMD
▪ M=3, F=3, D=2,4,8: Symmetric MIMD + SIMD
▪ M=3, F=1, D=1: Heterogeneous MIMD
▪ M=3, F=1, D=2,4,8: Heterogeneous MIMD + SIMD
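The mapping from the synthesis parameters (M, F, D) to the scheme names listed above can be written down as a small lookup. This is a sketch for illustration only: the function name `klessydra_scheme` is hypothetical, and the function covers just the six combinations that appear on the slide, returning `"unlisted"` for anything else rather than guessing how other combinations would behave.

```c
/* Hypothetical helper: name the acceleration scheme for a given
 * number of SPMIs (M), number of MFUs (F) and number of lanes (D).
 * Only the combinations listed on the slide are recognized. */
const char *klessydra_scheme(int M, int F, int D)
{
    int simd;

    /* The slide lists D = 1 (no SIMD) and D = 2, 4, 8 (SIMD). */
    if (D != 1 && D != 2 && D != 4 && D != 8)
        return "unlisted";
    simd = (D > 1);

    if (M == 1 && F == 1)
        return simd ? "Pure SIMD" : "SISD";
    if (M == 3 && F == 3)
        return simd ? "Symmetric MIMD + SIMD" : "Symmetric MIMD";
    if (M == 3 && F == 1)
        return simd ? "Heterogeneous MIMD + SIMD" : "Heterogeneous MIMD";
    return "unlisted";
}
```

For example, `klessydra_scheme(3, 1, 4)` names the configuration with three SPMIs sharing a single four-lane MFU, i.e. the heterogeneous MIMD + SIMD scheme.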
KLESSYDRA VECTOR EXTENSION AND INTRINSIC
FUNCTIONS
Assembly syntax ((r) denotes memory addressing via register r)

Instruction                  Short description
kmemld   (rd),(rs1),(rs2)    load vector into scratchpad region
kmemstr  (rd),(rs1),(rs2)    store vector into main memory
kaddv    (rd),(rs1),(rs2)    add vectors in scratchpad region
ksubv    (rd),(rs1),(rs2)    subtract vectors in scratchpad region
kvmul    (rd),(rs1),(rs2)    multiply vectors in scratchpad region
kvred    (rd),(rs1),(rs2)    reduce vector by addition
kdotp    (rd),(rs1),(rs2)    vector dot product into register
ksvaddsc (rd),(rs1),(rs2)    add vector + scalar into scratchpad
ksvaddrf (rd),(rs1),rs2      add vector + scalar into register
ksvmulsc (rd),(rs1),(rs2)    multiply vector by scalar into scratchpad
ksvmulrf (rd),(rs1),rs2      multiply vector by scalar into register
kdotpps  (rd),(rs1),(rs2)    vector dot product and post scaling
ksrlv    (rd),(rs1),rs2      vector logical shift within scratchpad
ksrav    (rd),(rs1),rs2      vector arithmetic shift within scratchpad
krelu    (rd),(rs1)          vector ReLU within scratchpad
kvslt    (rd),(rs1),(rs2)    compare vectors and create mask vector
ksvslt   (rd),(rs1),rs2      compare vector-scalar and create mask
kvcp     (rd),(rs1)          copy vector within scratchpad region
The instructions supported by the coprocessor sub-system are exposed to the programmer in the form of very simple intrinsic functions, fully integrated in the RISC-V gcc compiler toolchain.
CSR_MVSIZE(Row_size); // set vector length
for (i = Zeropad_offset; i < Row_size - Zeropad_offset; i++) { // scan the Output Matrix rows
    k_element = 0; // restart from the first kernel element for each output row
    for (FM_row_pointer = -Zeropad_offset; FM_row_pointer <= Zeropad_offset; FM_row_pointer++) {
        for (column_offset = 0; column_offset < kernel_size; column_offset++) {
            FM_offset = (i + FM_row_pointer) * Row_size + column_offset; // set pointer in SPM space
            ksvmulsc(SPM_D, (SPM_A + FM_offset), (SPM_B + k_element++)); // temporary vector result
            ksrav(SPM_D, SPM_D, scaling_factor); // scaling for fixed-point alignment
            OM_offset = (Row_size * i) + Zeropad_offset; // set pointer in SPM space
            kaddv((SPM_C + OM_offset), (SPM_C + OM_offset), SPM_D); // update Output Matrix row
        }
    }
}
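To make the inner-loop body of the convolution example easier to follow, the three intrinsics it uses can be emulated in plain C. This is a sketch under loud assumptions, not the toolchain's implementation: every `ref_` name is hypothetical, the semantics are inferred only from the short descriptions in the instruction table, and the vector length is modeled in elements (the deck does not say which unit `CSR_MVSIZE` expects).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference models; none of the ref_ names belong
 * to the Klessydra toolchain. */

static size_t ref_mvsize = 0; /* modeled vector length, in elements (assumption) */

void ref_set_mvsize(size_t elements) { ref_mvsize = elements; }

/* Models ksvmulsc: dst[i] = src[i] * (*scalar), scalar read through a pointer. */
void ref_ksvmulsc(int32_t *dst, const int32_t *src, const int32_t *scalar)
{
    for (size_t i = 0; i < ref_mvsize; i++)
        dst[i] = (int32_t)((int64_t)src[i] * (int64_t)*scalar);
}

/* Models ksrav: arithmetic right shift of every element.
 * Note: >> on negative signed values is implementation-defined in C;
 * mainstream compilers perform an arithmetic shift. */
void ref_ksrav(int32_t *dst, const int32_t *src, unsigned shift)
{
    for (size_t i = 0; i < ref_mvsize; i++)
        dst[i] = src[i] >> shift;
}

/* Models kaddv: element-wise vector addition. */
void ref_kaddv(int32_t *dst, const int32_t *a, const int32_t *b)
{
    for (size_t i = 0; i < ref_mvsize; i++)
        dst[i] = a[i] + b[i];
}

/* One multiply, scale, accumulate step, in the same order as the
 * ksvmulsc / ksrav / kaddv sequence in the slide's inner loop. */
void ref_mac_row(int32_t *out_row, const int32_t *in_row,
                 const int32_t *kernel_elem, unsigned shift, int32_t *tmp)
{
    ref_ksvmulsc(tmp, in_row, kernel_elem); /* temporary vector result */
    ref_ksrav(tmp, tmp, shift);             /* fixed-point rescaling */
    ref_kaddv(out_row, out_row, tmp);       /* accumulate into the output row */
}
```

With a vector length of 4, an input row {1, 2, 3, 4}, a kernel element of 8, a shift of 2 and an output row initialized to ones, `ref_mac_row` leaves {3, 5, 7, 9} in the output row.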
BENCHMARK WORKLOADS AND EVALUATION SETUP
▪ 2D convolution
  • 32-bit data elements in fixed-point representation
  • 3x3 filter size
  • matrix sizes of 4x4, 8x8, 16x16, and 32x32 elements
  • additional analysis of filter sizes larger than 3x3 on 32x32 matrices
▪ FFT
  • 256 complex samples
▪ Matmul
  • square matrices of 64x64 elements
▪ Workload types
  • Homogeneous workload (3 harts running the same program)
  • Composite workload (3 harts running different programs)
ANALYZED PERFORMANCE FIGURES ON FPGA SOFT-CORE IMPLEMENTATION
• Average total cycle count per hart
• Maximum clock frequency
• Absolute execution time
• Hardware resource utilization
• Average energy per algorithmic operation
SUMMARY OF PERFORMANCE RESULTS
▪ 3X cycle count speed-up relative to an RV32IM IMT core without acceleration (Klessydra T03)
▪ 2X cycle count speed-up when compared to the single-threaded, DSP-extended RI5CY core
MICROARCHITECTURE SYNTHESIS RESULTS (FPGA element utilization and maximum frequency)
Core / configuration  DLP     FF    LUT  B-RAM  DSP  LUT-RAM  Max freq (MHz)
T13 SISD                1   2488   6982      6   11      264           144.4
T13 SIMD                2   2627   8400      6   15      264           146.0
T13 SIMD                4   3301  11366      6   23      264           137.2
T13 SIMD                8   4800  17331     12   39      264           137.7
T13 Sym. MIMD           1   3512  10458     18   19      264           148.2
T13 Sym. MIMD + SIMD    2   4712  15943     18   31      264           131.7
T13 Sym. MIMD + SIMD    4   6753  25089     18   55      264           120.0
T13 Sym. MIMD + SIMD    8  10854  43419     36  103      264           105.1
T13 Het. MIMD           1   3012  10182     18   11      264           117.2
T13 Het. MIMD + SIMD    2   3871  15577     18   15      264           128.9
T13 Het. MIMD + SIMD    4   5015  23282     18   23      264           122.0
T13 Het. MIMD + SIMD    8   7325  42944     36   39      264           108.6
Klessydra T03           -   1418   4281      0    7      176           221.1
RI5CY                   -   2527   7674      0    6        0            91.4
ZeroRiscy               -   1933   5275      0    1        0           117.2

AVERAGE CYCLE COUNT PER COMPUTATION KERNEL, HOMOGENEOUS WORKLOAD
Core / configuration  DLP  Conv 4x4  Conv 8x8  Conv 16x16  Conv 32x32  FFT 256  MatMul 64x64
T13 SISD                1      1105      3060        9727       34201    33033        728187
T13 SIMD                2       895      2245        6261       20374    25647        602458
T13 SIMD                4       824      1768        4607       13444    22812        543164
T13 SIMD                8       824      1613        3692       10069    21555        484436
T13 Sym. MIMD           1       626      1493        3887       13536    18726        462066
T13 Sym. MIMD + SIMD    2       629      1190        3123        8681    16827        378748
T13 Sym. MIMD + SIMD    4       560      1190        2543        7148    15993        328962
T13 Sym. MIMD + SIMD    8       560      1152        2543        6006    15726        316270
T13 Het. MIMD           1       663      1521        4153       13565    22839        556463
T13 Het. MIMD + SIMD    2       638      1274        3280        9167    18468        425978
T13 Het. MIMD + SIMD    4       573      1213        2688        7473    16887        360863
T13 Het. MIMD + SIMD    8       573      1079        2580        6285    17604        328178
Klessydra T03           -      1819      5737       20714       79230    47256       2679304
RI5CY                   -      1377      4247       15088       57020    37344       1360854
ZeroRiscy               -      2510      8111       29583      113793    61158       4006241

AVERAGE CYCLE COUNT PER COMPUTATION KERNEL, COMPOSITE WORKLOAD
Core / configuration  DLP  Conv 32x32  FFT 256  MatMul 64x64
T13 SISD                1       66043    80874        476771
T13 SIMD                2       21976    60019        645705
T13 SIMD                4       16850    29144        431773
T13 SIMD                8       11324    22482        414420
T13 Sym. MIMD           1       20953    17824        292564
T13 Sym. MIMD + SIMD    2       16144    15839        222370
T13 Sym. MIMD + SIMD    4       15868    14942        182580
T13 Sym. MIMD + SIMD    8       15581    14613        168031
T13 Het. MIMD           1       27155    37111        265567
T13 Het. MIMD + SIMD    2       15973    24611        251201
T13 Het. MIMD + SIMD    4       16042    19175        181290
T13 Het. MIMD + SIMD    8       13921    17298        187877
Klessydra T03           -      138959    46733       2775779
RI5CY                   -       81534    37350       1369572
ZeroRiscy               -      197010    61163       4043376
• The clock speed exhibited the sharpest drops as the DLP grew larger.
• In the symmetric MIMD scheme, the large HW overhead forced FPGA
slices on the same critical path to be placed far from each other, thus
increasing interconnect delay.
• Pipelining the heterogeneous MIMD crossbar to reduce the critical path introduces additional HW overhead, compromising the area advantage.
• Small matrix convolutions and FFT on the accelerated core reached up to 2X cycle count reduction over the single-threaded, DSP-extended RI5CY core.
• Large matrix convolutions and MatMul benefit from vector acceleration, reaching 9X cycle count reduction relative to RI5CY.
ABSOLUTE EXECUTION TIME SPEED-UP
• Assuming maximum clock frequency for each core
• ZeroRiscy core taken as common reference
• In pure SIMD configurations, the speed-up grows linearly with the DLP
• Going from SISD/SIMD to MIMD+SIMD improved the speed-up in all cases, despite the frequency drop associated with the MIMD hardware.
• The symmetric MIMD+SIMD schemes exhibit up to 17X speed-up over ZeroRiscy for Convolution 32x32 and up to 13X speed-up for the composite workload.
• Heterogeneous MIMD configurations maintain an almost perfect overlap with the symmetric MIMD.
• The non-accelerated Klessydra-T03 exhibits an absolute performance gain over RI5CY and ZeroRiscy.
[Figure: absolute execution time speed-up (vertical axis from 0 to 18) for each core configuration: SISD (DLP 1); pure SIMD (DLP 2, 4, 8); Sym. MIMD (DLP 1); Sym. MIMD+SIMD (DLP 2, 4, 8); Het. MIMD (DLP 1); Het. MIMD+SIMD (DLP 2, 4, 8); Klessydra T03 (no accel.); RI5CY (DSP extension); ZeroRiscy (no accel.). Series: Conv.2D 4x4, Conv.2D 8x8, Conv.2D 16x16, Conv.2D 32x32, FFT 256, MatMul 64x64, Composite.]
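The absolute-time speed-up can be reproduced from the cycle counts and maximum frequencies reported in the results table: execution time is the cycle count divided by the clock frequency, and the speed-up is the ratio of the reference time to the accelerated time. The sketch below is a worked example with hypothetical function names (`exec_time_us`, `speedup_vs_reference`). It uses the homogeneous Conv 32x32 figures for ZeroRiscy (113793 cycles at 117.2 MHz) and for symmetric MIMD + SIMD with DLP 8 (6006 cycles at 105.1 MHz).

```c
/* Execution time in microseconds: cycles divided by frequency in MHz. */
double exec_time_us(double cycles, double fmax_mhz)
{
    return cycles / fmax_mhz;
}

/* Absolute execution-time speed-up of a core over a reference core,
 * each running at its own maximum clock frequency. */
double speedup_vs_reference(double ref_cycles, double ref_fmax_mhz,
                            double cycles, double fmax_mhz)
{
    return exec_time_us(ref_cycles, ref_fmax_mhz) /
           exec_time_us(cycles, fmax_mhz);
}

/* Worked example with values copied from the results table
 * (Conv 32x32, homogeneous workload). */
double conv32_cycle_ratio(void)
{
    return 113793.0 / 6006.0; /* cycle counts only, frequency ignored */
}

double conv32_time_speedup(void)
{
    return speedup_vs_reference(113793.0, 117.2, 6006.0, 105.1);
}
```

On cycle counts alone the ratio is just under 19, while accounting for the lower clock of the MIMD + SIMD configuration brings the absolute-time speed-up to about 17, in line with the "up to 17X" bullet.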
ENERGY EFFICIENCY
• The result of this analysis is expressed as energy per algorithmic operation for the FPGA soft-core implementations, normalized to ZeroRiscy, taken as the reference.
• The most energy-efficient designs turned out to be the T13 symmetric MIMD configurations.
• The heterogeneous MIMD approach exhibited an almost complete overlap in energy consumption with the symmetric MIMD.
• The pure SIMD schemes resulted in a larger energy consumption than the other schemes, due to the impossibility of efficiently exploiting TLP.
[Figure: energy per algorithmic operation normalized to ZeroRiscy (vertical axis from 0 to 1.3) for each core configuration: SISD (DLP 1); pure SIMD (DLP 2, 4, 8); Sym. MIMD (DLP 1); Sym. MIMD+SIMD (DLP 2, 4, 8); Het. MIMD (DLP 1); Het. MIMD+SIMD (DLP 2, 4, 8); Klessydra T03 (no accel.); RI5CY (DSP extension); ZeroRiscy (no accel.). Series: Conv.2D 4x4, Conv.2D 8x8, Conv.2D 16x16, Conv.2D 32x32, FFT 256, MatMul 64x64, Composite.]
LARGER CONVOLUTION FILTERS
Cycle counts are in thousands (kCycles), T is the execution time in us, E is the energy in uJ.
                         Filter (5x5)            Filter (7x7)            Filter (9x9)           Filter (11x11)
Core            DLP kCycles  T (us)  E (uJ) kCycles  T (us)  E (uJ) kCycles  T (us)  E (uJ) kCycles  T (us)  E (uJ)
T13 SIMD          2    52.7     362    50.6   101.2     694    97.1   165.8    1136   159.1   246.5    1689   236.6
T13 SIMD          8    24.6     179    34.4    46.1     335    64.5    74.7     543   104.7   110.6     803   154.8
T13 Sym MIMD      2    19.5     148    26.9    35.8     272    49.4    57.4     436    79.2    84.4     641   116.5
T13 Sym MIMD      8    11.8     113    28.9    19.2     183    46.9    29.8     284    72.7    42.9     408   104.7
T13 Het MIMD      2    20.5     159    28.3    37.5     291    51.8    60.2     467    83.1    88.5     687   122.1
T03 (no accel.)   -     247    1120   215.5   514.8    2328   447.9   881.2    3985   766.6  1369.1    6191  1191.1
RI5CY             -     180    1971   252.0   385.3    4218   539.4   662.5    7252   927.5  1000.2   10949  1400.3
ZeroRiscy         -   318.9    2721   226.4   674.5    5754   478.9  1129.7    9637   802.1  1697.8   14482  1205.4
• The matrix being convolved has 32x32 elements
• The speed-up and energy efficiency trends continue as the filter dimensions grow, reaching a 35X speed-up over the ZeroRiscy reference
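The trend stated in the last bullet can be checked directly against the larger-filter table. The sketch below is a worked example, not part of the original analysis: the array and function names are hypothetical, and the values are copied from the ZeroRiscy row and the T13 Sym MIMD DLP 8 row (execution time in us and energy in uJ for the 5x5, 7x7, 9x9 and 11x11 filters).

```c
#include <stddef.h>

#define N_FILTERS 4 /* 5x5, 7x7, 9x9, 11x11 */

/* Values copied from the larger-filter table. */
const double t_zeroriscy_us[N_FILTERS] = {2721.0, 5754.0, 9637.0, 14482.0};
const double t_sym_mimd8_us[N_FILTERS] = {113.0, 183.0, 284.0, 408.0};
const double e_zeroriscy_uj[N_FILTERS] = {226.4, 478.9, 802.1, 1205.4};
const double e_sym_mimd8_uj[N_FILTERS] = {28.9, 46.9, 72.7, 104.7};

/* Execution-time speed-up of T13 Sym MIMD (DLP 8) over ZeroRiscy
 * for filter index i (0 = 5x5, 3 = 11x11). */
double filter_time_speedup(size_t i)
{
    return t_zeroriscy_us[i] / t_sym_mimd8_us[i];
}

/* Energy of T13 Sym MIMD (DLP 8) as a fraction of the ZeroRiscy energy. */
double filter_energy_fraction(size_t i)
{
    return e_sym_mimd8_uj[i] / e_zeroriscy_uj[i];
}

/* Returns 1 if the speed-up grows with every increase in filter size. */
int speedup_grows_with_filter(void)
{
    for (size_t i = 1; i < N_FILTERS; i++)
        if (filter_time_speedup(i) <= filter_time_speedup(i - 1))
            return 0;
    return 1;
}
```

With these values the speed-up rises from about 24X for the 5x5 filter to about 35X for the 11x11 filter, while the energy fraction falls from roughly 13% to under 9% of the ZeroRiscy energy.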
CONCLUSIONS
▪ The MIMD-SIMD vector coprocessor schemes enable tuning the TLP and DLP
  • >15X absolute time speed-up, -85% energy per operation.
▪ Kernels that are less effectively vectorizable can still benefit from SPMs and TLP in an IMT core
  • 2X-3X speed-up.
▪ Fully symmetric MIMD and heterogeneous MIMD give very similar results
  • functional unit contention has less impact than SPM contention.
  • coprocessor contention can be effectively mitigated by functional unit heterogeneity
▪ Pure DLP acceleration always gives inferior results compared to a balanced TLP/DLP acceleration.
  • The IMT microarchitecture benefits from TLP and DLP acceleration in a single core.
▪ In the absence of hardware acceleration, IMT still exhibits a performance advantage over single-thread execution
  • Simplified hardware structure philosophy
Thank you for joining
Contribute to the RISC-V conversation on social!
#RISCVSUMMIT #KLESSYDRA @mauro_olivieri_
https://github.com/klessydra
Mauro.Olivieri@uniroma1.it

More Related Content

What's hot (20)

PPTX
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Cuff
RISC-V International
 
PDF
RISC-V 30908 patra
RISC-V International
 
PPTX
Static partitioning virtualization on RISC-V
RISC-V International
 
PDF
Andes open cl for RISC-V
RISC-V International
 
PDF
Andes andes clarity for risc-v vector processor
RISC-V International
 
PDF
Andes RISC-V processor solutions
RISC-V International
 
PPTX
RISC-V growth and successes in technology and industry - embedded world 2021
RISC-V International
 
PPTX
Reverse Engineering of Rocket Chip
RISC-V International
 
PDF
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V International
 
PPTX
Educating the computer architects of tomorrow's critical systems with RISC-V
RISC-V International
 
PPTX
RISC-V assembly
Peter Cheung
 
PPTX
Architecture Exploration of RISC-V Processor and Comparison with ARM Cortex-A53
KarthiSugumar
 
PDF
LAS16-403: GDB Linux Kernel Awareness
Linaro
 
PDF
LAS16-TR03: Upstreaming 201
Linaro
 
PDF
System Design on Zynq using SDSoC
Sundance Multiprocessor Technology Ltd.
 
PDF
BUD17 Socionext SC2A11 ARM Server SoC
Linaro
 
PDF
Semi dynamics high bandwidth vector capable RISC-V cores
RISC-V International
 
PPTX
Closing the RISC-V compliance gap via fuzzing
RISC-V International
 
PDF
BKK16-400A LuvOS and ACPI Compliance Testing
Linaro
 
PPTX
Esperanto accelerates machine learning with 1000+ low power RISC-V cores on a...
RISC-V International
 
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Cuff
RISC-V International
 
RISC-V 30908 patra
RISC-V International
 
Static partitioning virtualization on RISC-V
RISC-V International
 
Andes open cl for RISC-V
RISC-V International
 
Andes andes clarity for risc-v vector processor
RISC-V International
 
Andes RISC-V processor solutions
RISC-V International
 
RISC-V growth and successes in technology and industry - embedded world 2021
RISC-V International
 
Reverse Engineering of Rocket Chip
RISC-V International
 
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V International
 
Educating the computer architects of tomorrow's critical systems with RISC-V
RISC-V International
 
RISC-V assembly
Peter Cheung
 
Architecture Exploration of RISC-V Processor and Comparison with ARM Cortex-A53
KarthiSugumar
 
LAS16-403: GDB Linux Kernel Awareness
Linaro
 
LAS16-TR03: Upstreaming 201
Linaro
 
System Design on Zynq using SDSoC
Sundance Multiprocessor Technology Ltd.
 
BUD17 Socionext SC2A11 ARM Server SoC
Linaro
 
Semi dynamics high bandwidth vector capable RISC-V cores
RISC-V International
 
Closing the RISC-V compliance gap via fuzzing
RISC-V International
 
BKK16-400A LuvOS and ACPI Compliance Testing
Linaro
 
Esperanto accelerates machine learning with 1000+ low power RISC-V cores on a...
RISC-V International
 

Similar to Klessydra t - designing vector coprocessors for multi-threaded edge-computing cores (20)

PDF
Custom Computer Engine for Optimizing for the Inner kernel of Matrix Multipli...
Ardavan Pedram
 
PPTX
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Deepak Shankar
 
PPTX
Project Slides for Website 2020-22.pptx
AkshitAgiwal1
 
PPTX
DATE 2020: Design, Automation and Test in Europe Conference
LEGATO project
 
PPTX
Mirabilis_Presentation_DAC_June_2024.pptx
Deepak Shankar
 
PPTX
Introduction to architecture exploration
Deepak Shankar
 
PDF
Architect Cheatsheet
Karthik Ethirajan
 
PDF
00364438
Rob Yates
 
PDF
E3MV - Embedded Vision - Sundance
Sundance Multiprocessor Technology Ltd.
 
PPTX
Webinar on RISC-V
Deepak Shankar
 
PPTX
Exploration of Radars and Software Defined Radios using VisualSim
Deepak Shankar
 
PDF
POWER10 innovations for HPC
Ganesan Narayanasamy
 
PPTX
Industrial trends in heterogeneous and esoteric compute
Perry Lea
 
PPTX
Zvika Rozenshein,General Manager, EngineeringIQ
chiportal
 
PDF
ERTS_Unit 1_PPT.pdf
VinothkumarUruman1
 
PPTX
Data-Level Parallelism in Microprocessors
Dilum Bandara
 
PPTX
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
VEDLIoT Project
 
PPT
Introduction to Embedded Systems and its Applications
Gaurav Verma
 
PDF
Systems on chip (so c)
sandeep1721
 
PDF
RISC V in Spacer
klepsydratechnologie
 
Custom Computer Engine for Optimizing for the Inner kernel of Matrix Multipli...
Ardavan Pedram
 
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021
Deepak Shankar
 
Project Slides for Website 2020-22.pptx
AkshitAgiwal1
 
DATE 2020: Design, Automation and Test in Europe Conference
LEGATO project
 
Mirabilis_Presentation_DAC_June_2024.pptx
Deepak Shankar
 
Introduction to architecture exploration
Deepak Shankar
 
Architect Cheatsheet
Karthik Ethirajan
 
00364438
Rob Yates
 
E3MV - Embedded Vision - Sundance
Sundance Multiprocessor Technology Ltd.
 
Webinar on RISC-V
Deepak Shankar
 
Exploration of Radars and Software Defined Radios using VisualSim
Deepak Shankar
 
POWER10 innovations for HPC
Ganesan Narayanasamy
 
Industrial trends in heterogeneous and esoteric compute
Perry Lea
 
Zvika Rozenshein,General Manager, EngineeringIQ
chiportal
 
ERTS_Unit 1_PPT.pdf
VinothkumarUruman1
 
Data-Level Parallelism in Microprocessors
Dilum Bandara
 
HiPEAC Computing Systems Week 2022_Mario Porrmann presentation
VEDLIoT Project
 
Introduction to Embedded Systems and its Applications
Gaurav Verma
 
Systems on chip (so c)
sandeep1721
 
RISC V in Spacer
klepsydratechnologie
 
Ad

More from RISC-V International (20)

PDF
WD RISC-V inliner work effort
RISC-V International
 
PDF
RISC-V Online Tutor
RISC-V International
 
PPTX
London Open Source Meetup for RISC-V
RISC-V International
 
PPTX
RISC-V Introduction
RISC-V International
 
PPTX
Ziptillion boosting RISC-V with an efficient and os transparent memory comp...
RISC-V International
 
PDF
Standardizing the tee with global platform and RISC-V
RISC-V International
 
PPTX
Security and functional safety
RISC-V International
 
PPTX
RISC-V 30910 kassem_ summit 2020 - so_c_gen
RISC-V International
 
PPTX
RISC-V 30906 hex five multi_zone iot firmware
RISC-V International
 
PPTX
RISC-V 30946 manuel_offenberg_v3_notes
RISC-V International
 
PDF
RISC-V software state of the union
RISC-V International
 
PDF
Ripes tracking computer architecture throught visual and interactive simula...
RISC-V International
 
PPTX
Porting tock to open titan
RISC-V International
 
PPTX
Open j9 jdk on RISC-V
RISC-V International
 
PDF
Open source manufacturable pdk for sky water 130nm process node
RISC-V International
 
PPTX
Gernot heiser unsw sydney and se l4 foundation
RISC-V International
 
PPTX
Fueling the datasphere how RISC-V enables the storage ecosystem
RISC-V International
 
PPTX
Easily emulating full systems on amazon fpg as
RISC-V International
 
PPTX
Developing for polar fire soc
RISC-V International
 
PPTX
Data trustworthiness at the edge
RISC-V International
 
WD RISC-V inliner work effort
RISC-V International
 
RISC-V Online Tutor
RISC-V International
 
London Open Source Meetup for RISC-V
RISC-V International
 
RISC-V Introduction
RISC-V International
 
Ziptillion boosting RISC-V with an efficient and os transparent memory comp...
RISC-V International
 
Standardizing the tee with global platform and RISC-V
RISC-V International
 
Security and functional safety
RISC-V International
 
RISC-V 30910 kassem_ summit 2020 - so_c_gen
RISC-V International
 
RISC-V 30906 hex five multi_zone iot firmware
RISC-V International
 
RISC-V 30946 manuel_offenberg_v3_notes
RISC-V International
 
RISC-V software state of the union
RISC-V International
 
Ripes tracking computer architecture throught visual and interactive simula...
RISC-V International
 
Porting tock to open titan
RISC-V International
 
Open j9 jdk on RISC-V
RISC-V International
 
Open source manufacturable pdk for sky water 130nm process node
RISC-V International
 
Gernot heiser unsw sydney and se l4 foundation
RISC-V International
 
Fueling the datasphere how RISC-V enables the storage ecosystem
RISC-V International
 
Easily emulating full systems on amazon fpg as
RISC-V International
 
Developing for polar fire soc
RISC-V International
 
Data trustworthiness at the edge
RISC-V International
 
Ad

Recently uploaded (20)

PDF
Smart Air Quality Monitoring with Serrax AQM190 LITE
SERRAX TECHNOLOGIES LLP
 
PPTX
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
PDF
CIFDAQ Weekly Market Wrap for 11th July 2025
CIFDAQ
 
PDF
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
PDF
Building Resilience with Digital Twins : Lessons from Korea
SANGHEE SHIN
 
PDF
July Patch Tuesday
Ivanti
 
PPTX
AUTOMATION AND ROBOTICS IN PHARMA INDUSTRY.pptx
sameeraaabegumm
 
PDF
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
PPTX
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
PDF
Presentation - Vibe Coding The Future of Tech
yanuarsinggih1
 
PDF
Impact of IEEE Computer Society in Advancing Emerging Technologies including ...
Hironori Washizaki
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
PPTX
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
PDF
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
PDF
Complete JavaScript Notes: From Basics to Advanced Concepts.pdf
haydendavispro
 
PDF
The Builder’s Playbook - 2025 State of AI Report.pdf
jeroen339954
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
HubSpot Main Hub: A Unified Growth Platform
Jaswinder Singh
 
PDF
TrustArc Webinar - Data Privacy Trends 2025: Mid-Year Insights & Program Stra...
TrustArc
 
Smart Air Quality Monitoring with Serrax AQM190 LITE
SERRAX TECHNOLOGIES LLP
 
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
CIFDAQ Weekly Market Wrap for 11th July 2025
CIFDAQ
 
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
Building Resilience with Digital Twins : Lessons from Korea
SANGHEE SHIN
 
July Patch Tuesday
Ivanti
 
AUTOMATION AND ROBOTICS IN PHARMA INDUSTRY.pptx
sameeraaabegumm
 
HCIP-Data Center Facility Deployment V2.0 Training Material (Without Remarks ...
mcastillo49
 
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
Presentation - Vibe Coding The Future of Tech
yanuarsinggih1
 
Impact of IEEE Computer Society in Advancing Emerging Technologies including ...
Hironori Washizaki
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
Complete JavaScript Notes: From Basics to Advanced Concepts.pdf
haydendavispro
 
The Builder’s Playbook - 2025 State of AI Report.pdf
jeroen339954
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
HubSpot Main Hub: A Unified Growth Platform
Jaswinder Singh
 
TrustArc Webinar - Data Privacy Trends 2025: Mid-Year Insights & Program Stra...
TrustArc
 

Klessydra t - designing vector coprocessors for multi-threaded edge-computing cores

  • 1. Information Classification: General December 8-10 | Virtual Event Klessydra-T: Designing Vector Coprocessors for Multi-Threaded Edge-Computing Cores Mauro Olivieri Professor Sapienza University of Rome #RISCVSUMMIT
  • 2. Information Classification: General Francesco Lannutti collaborator @Synopsys DIGITAL SYSTEM LAB @ SAPIENZA UNIVERSITY OF ROME Marcello Barbirotta PhD candidate Mauro Olivieri Associate Professor Francesco Menichelli Assistant Professor Antonio Mastrandrea Research Fellow Abdallah Cheikh Research Fellow Luigi Blasi PhD cand. @DSI Gmbh Francesco Vigli PhD cand. @ ELT Spa Stefano Sordillo PhD candidate
  • 3. Information Classification: General INTRODUCTION & MOTIVATION THE KLESSYDRA-T ARCHITECTURE • Interleaved Multi-Threading baseline • Parameterized vector acceleration schemes • Klessydra vector intrinsic functions BENCHMARK WORKLOADS • Convolution, Matmul, FFT • Homogeneous and composite workload RESULTS • Cycle count and absolute execution time • Maximum clock frequency and hardware resource utilization • Energy efficiency CONCLUSIONS OUTLINE
  • 4. Information Classification: General 19/04/2021 Page 4 APPLICATION CONTEXT AND MOTIVATION  There are recognized drives towards (extreme) edge computing: availability, energy saving, security, etc., having implications on both SW design and HW design  HW design challenges of extreme edge computing devices: • Local energy budget • Cost & size • Computing power  General setting: • Possibly taking advantage of inherently multi-threaded application routines • Inevitability of hardware acceleration support
  • 5. Information Classification: General • “space-qualified” core, • T0 microarchitecture • + configurable HW/SW fault- tolerance support • “edge computing” core • extends T0 microarchitecture • RV32IM • + configurable multiple scratchpad memories • + configurable vector unit • extended ISA • Starting point • M mode v1.10 • RV32I user ISA • single hart • M mode v1.10 • RV32I user ISA • Atomic ext. (partial) • multiple PC & CSR • multiple interleaved harts PULPino feat. Klessydra S0 core PULPino feat. Klessydra T0 cores PULPino feat. Klessydra F0 cores PULPino feat. Klessydra T1 cores 19/04/2021 Page 5 core courtesy of THE PULPINO-COMPATIBLE KLESSYDRA CORE FAMILY
  • 6. Information Classification: General THE KLESSYDRA IMT MICROARCHITECTURE  Baseline Klessydra T03 core features: • Thread context switch at each clock cycle • in-order, single issue instruction execution • feed-forward pipeline structure (no hardware support for pipeline hazard handling) • bare metal execution (RISCV M mode)  The vector-accelerated Klessydra-T13 core has been designed as a superset of the basic Klessydra-T03 microarchitecture. Regfile Decode PC PC CSR Data Mem WB Debug Updater harc Updater hart a hart b hart c Fetch Prg Mem Execute Program memory Data memory
  • 7. THE KLESSYDRA-T1 MICROARCHITECTURE FAMILY
    [Figure: T13 block diagram. The Execute stage holds EXEC, the MFU (Add, Sub, Shft, Mul, Accum, Relu, with input mapping, output mapping and data reorder), the SPMI (interleaved banks Bank0 to BankN, SPM0 to SPMN-1) and the LSU, next to the Fetch, Decode, Regfile, PC, CSR, WB, Debug and harc Updater blocks; x F, x D and x N mark replicated units; handshake signals MAU_req and MAU_busy]
    Klessydra T13 core features
    • multiple units in the execution stage
      • scalar execution unit (EXEC)
      • vector-oriented multi-purpose functional unit (MFU) with Scratchpad Memory support
      • Load/Store unit (LSU)
    • possible concurrent execution of instructions of different types
  • 8. HARDWARE ACCELERATION PARAMETRIC SCHEMES
    The parametric coprocessor architecture in T13 cores, composed of the MFU and the SPMIs, can be configured at synthesis time through the following parameters:
      • the number of parallel lanes D in the MFU, which defines the DLP degree and also corresponds to the number of SPM banks in each SPMI block
      • the number of MFUs F
      • the SPM bank capacity B
      • the number of SPMs N
      • the number of SPMIs M
      • the sharing scheme of MFUs and SPMIs among the harts, i.e. heterogeneous or symmetric
    Resulting schemes:
      • M=1, F=1, D=1: SISD
      • M=1, F=1, D=2,4,8: Pure SIMD
      • M=3, F=3, D=1: Symmetric MIMD
      • M=3, F=3, D=2,4,8: Symmetric MIMD + SIMD
      • M=3, F=1, D=1: Heterogeneous MIMD
      • M=3, F=1, D=2,4,8: Heterogeneous MIMD + SIMD
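The mapping from the synthesis parameters to the six named schemes can be written down as a small lookup. The following is a sketch under stated assumptions: the struct `kless_cfg`, the enum `kless_scheme` and the function `classify_scheme` are hypothetical names, not part of the Klessydra sources, and only the (M, F, D) combinations listed on the slide are recognised.

```c
#include <assert.h>

/* Hypothetical container for three of the synthesis-time parameters. */
struct kless_cfg {
    unsigned M; /* number of SPMIs */
    unsigned F; /* number of MFUs */
    unsigned D; /* parallel lanes in the MFU (DLP degree) */
};

enum kless_scheme {
    SCHEME_SISD,
    SCHEME_PURE_SIMD,
    SCHEME_SYM_MIMD,
    SCHEME_SYM_MIMD_SIMD,
    SCHEME_HET_MIMD,
    SCHEME_HET_MIMD_SIMD,
    SCHEME_NOT_LISTED
};

/* Classify a configuration according to the slide's list of schemes. */
enum kless_scheme classify_scheme(struct kless_cfg c)
{
    assert(c.M >= 1u && c.F >= 1u && c.D >= 1u);
    int simd = (c.D > 1u);               /* more than one lane adds SIMD */

    if (c.M == 1u && c.F == 1u)          /* one SPMI, one MFU */
        return simd ? SCHEME_PURE_SIMD : SCHEME_SISD;
    if (c.M == 3u && c.F == 3u)          /* one MFU and SPMI per hart */
        return simd ? SCHEME_SYM_MIMD_SIMD : SCHEME_SYM_MIMD;
    if (c.M == 3u && c.F == 1u)          /* per-hart SPMI, shared MFU */
        return simd ? SCHEME_HET_MIMD_SIMD : SCHEME_HET_MIMD;
    return SCHEME_NOT_LISTED;
}
```

For example, `classify_scheme((struct kless_cfg){3, 1, 4})` falls in the heterogeneous MIMD + SIMD case, where the three harts keep their own SPMI but share a single four-lane MFU.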
  • 9. KLESSYDRA VECTOR EXTENSION AND INTRINSIC FUNCTIONS
    In the assembly syntax, (r) denotes memory addressing via register r.

    | Assembly syntax | Short description |
    |---|---|
    | kmemld (rd),(rs1),(rs2) | load vector into scratchpad region |
    | kmemstr (rd),(rs1),(rs2) | store vector into main memory |
    | kaddv (rd),(rs1),(rs2) | add vectors in scratchpad region |
    | ksubv (rd),(rs1),(rs2) | subtract vectors in scratchpad region |
    | kvmul (rd),(rs1),(rs2) | multiply vectors in scratchpad region |
    | kvred (rd),(rs1),(rs2) | reduce vector by addition |
    | kdotp (rd),(rs1),(rs2) | vector dot product into register |
    | ksvaddsc (rd),(rs1),(rs2) | add vector + scalar into scratchpad |
    | ksvaddrf (rd),(rs1),rs2 | add vector + scalar into register |
    | ksvmulsc (rd),(rs1),(rs2) | multiply vector by scalar into scratchpad |
    | ksvmulrf (rd),(rs1),rs2 | multiply vector by scalar into register |
    | kdotpps (rd),(rs1),(rs2) | vector dot product and post scaling |
    | ksrlv (rd),(rs1),rs2 | vector logic shift within scratchpad |
    | ksrav (rd),(rs1),rs2 | vector arithmetic shift within scratchpad |
    | krelu (rd),(rs1) | vector ReLU within scratchpad |
    | kvslt (rd),(rs1),(rs2) | compare vectors and create mask vector |
    | ksvslt (rd),(rs1),rs2 | compare vector-scalar and create mask |
    | kvcp (rd),(rs1) | copy vector within scratchpad region |

    The instructions supported by the coprocessor subsystem are exposed to the programmer in the form of very simple intrinsic functions, fully integrated in the RISC-V GCC compiler toolchain.
    CSR_MVSIZE(Row_size); // set vector length
    for (i = Zeropad_offset; i < Row_size-Zeropad_offset; i++) { // scan the Output Matrix rows
      k_element = 0;
      for (FM_row_pointer = -Zeropad_offset; FM_row_pointer <= Zeropad_offset; FM_row_pointer++) {
        for (column_offset = 0; column_offset < kernel_size; column_offset++) {
          FM_offset = (i+FM_row_pointer)*Row_size + column_offset; // set pointer in SPM space
          ksvmulsc(SPM_D, (SPM_A + FM_offset), (SPM_B + k_element++)); // temporary vector result
          ksrav(SPM_D, SPM_D, scaling_factor); // scaling for fixed point alignment
          OM_offset = (Row_size*i) + Zeropad_offset; // set pointer in SPM space
          kaddv((SPM_C + OM_offset), (SPM_C + OM_offset), SPM_D); // update Output Matrix row
        }
      }
    }
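To make the inner-loop body of the convolution easier to follow, here is a scalar reference model of the three intrinsics it uses, runnable on any host. It is a sketch built on assumptions: the `ref_` function names and the `ref_vlen` variable are hypothetical, the semantics are inferred from the one-line descriptions in the instruction table, the vector length is counted in 32-bit elements (the slide does not say whether `CSR_MVSIZE` counts elements or bytes), and products are assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for the vector length set by CSR_MVSIZE, in elements. */
static size_t ref_vlen = 0;

void ref_set_vlen(size_t n_elements) { ref_vlen = n_elements; }

/* Model of ksvmulsc: dst[i] = src[i] * scalar (scalar read through a pointer). */
void ref_ksvmulsc(int32_t *dst, const int32_t *src, const int32_t *scalar)
{
    for (size_t i = 0; i < ref_vlen; i++)
        dst[i] = src[i] * *scalar;   /* assumes no 32-bit overflow */
}

/* Model of ksrav: arithmetic right shift of every element by `shamt`. */
void ref_ksrav(int32_t *dst, const int32_t *src, unsigned shamt)
{
    assert(shamt < 32u);
    for (size_t i = 0; i < ref_vlen; i++) {
        int32_t x = src[i];
        /* Portable arithmetic shift: shift the complement for negative x. */
        dst[i] = (x < 0) ? ~(~x >> shamt) : (x >> shamt);
    }
}

/* Model of kaddv: element-wise vector addition. */
void ref_kaddv(int32_t *dst, const int32_t *a, const int32_t *b)
{
    for (size_t i = 0; i < ref_vlen; i++)
        dst[i] = a[i] + b[i];
}
```

Calling `ref_ksvmulsc`, `ref_ksrav` and `ref_kaddv` in that order reproduces one multiply, rescale and accumulate step of the loop above: a feature-map row is scaled by one kernel coefficient, realigned for fixed point, and added into the output row.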
  • 10. BENCHMARK WORKLOADS AND EVALUATION SETUP
    • 2D convolution
      • 32-bit data elements in fixed-point representation
      • 3x3 filter size
      • matrix sizes of 4x4, 8x8, 16x16, and 32x32 elements
      • additional analysis of filter sizes larger than 3x3 on 32x32 matrices
    • FFT
      • 256 complex samples
    • Matmul
      • Square matrices of 64x64 elements
      • Homogeneous workload (3 harts running the same program)
      • Composite workload (3 harts running different programs)
    ANALYZED PERFORMANCE FIGURES ON FPGA SOFT-CORE IMPLEMENTATION
      • Average total cycle count per hart
      • Maximum clock frequency
      • Absolute execution time
      • Hardware Resource Utilization
      • Average energy per algorithmic operation
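  • 11. SUMMARY OF PERFORMANCE RESULTS
    • 3X cycle count speed-up relative to an RV32IM IMT core without acceleration (Klessydra T03)
    • 2X cycle count speed-up when compared to the single-threaded, DSP-extended RI5CY core
    MICROARCHITECTURE SYNTHESIS RESULTS (columns FF to Max freq, FPGA element utilization) and AVERAGE CYCLE COUNT PER COMPUTATION KERNEL (remaining columns; Hom. = homogeneous workload, Comp. = composite workload)

    | Core | Configuration | DLP | FF | LUT | B-RAM | DSP | LUT-RAM | Max freq [MHz] | Hom. Conv 4x4 | Hom. Conv 8x8 | Hom. Conv 16x16 | Hom. Conv 32x32 | Hom. FFT 256 | Hom. MatMul 64x64 | Comp. Conv 32x32 | Comp. FFT 256 | Comp. MatMul 64x64 |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | Klessydra T13 | SISD | 1 | 2488 | 6982 | 6 | 11 | 264 | 144.4 | 1105 | 3060 | 9727 | 34201 | 33033 | 728187 | 66043 | 80874 | 476771 |
    | Klessydra T13 | SIMD | 2 | 2627 | 8400 | 6 | 15 | 264 | 146.0 | 895 | 2245 | 6261 | 20374 | 25647 | 602458 | 21976 | 60019 | 645705 |
    | Klessydra T13 | SIMD | 4 | 3301 | 11366 | 6 | 23 | 264 | 137.2 | 824 | 1768 | 4607 | 13444 | 22812 | 543164 | 16850 | 29144 | 431773 |
    | Klessydra T13 | SIMD | 8 | 4800 | 17331 | 12 | 39 | 264 | 137.7 | 824 | 1613 | 3692 | 10069 | 21555 | 484436 | 11324 | 22482 | 414420 |
    | Klessydra T13 | Sym. MIMD | 1 | 3512 | 10458 | 18 | 19 | 264 | 148.2 | 626 | 1493 | 3887 | 13536 | 18726 | 462066 | 20953 | 17824 | 292564 |
    | Klessydra T13 | Sym. MIMD + SIMD | 2 | 4712 | 15943 | 18 | 31 | 264 | 131.7 | 629 | 1190 | 3123 | 8681 | 16827 | 378748 | 16144 | 15839 | 222370 |
    | Klessydra T13 | Sym. MIMD + SIMD | 4 | 6753 | 25089 | 18 | 55 | 264 | 120.0 | 560 | 1190 | 2543 | 7148 | 15993 | 328962 | 15868 | 14942 | 182580 |
    | Klessydra T13 | Sym. MIMD + SIMD | 8 | 10854 | 43419 | 36 | 103 | 264 | 105.1 | 560 | 1152 | 2543 | 6006 | 15726 | 316270 | 15581 | 14613 | 168031 |
    | Klessydra T13 | Het. MIMD | 1 | 3012 | 10182 | 18 | 11 | 264 | 117.2 | 663 | 1521 | 4153 | 13565 | 22839 | 556463 | 27155 | 37111 | 265567 |
    | Klessydra T13 | Het. MIMD + SIMD | 2 | 3871 | 15577 | 18 | 15 | 264 | 128.9 | 638 | 1274 | 3280 | 9167 | 18468 | 425978 | 15973 | 24611 | 251201 |
    | Klessydra T13 | Het. MIMD + SIMD | 4 | 5015 | 23282 | 18 | 23 | 264 | 122.0 | 573 | 1213 | 2688 | 7473 | 16887 | 360863 | 16042 | 19175 | 181290 |
    | Klessydra T13 | Het. MIMD + SIMD | 8 | 7325 | 42944 | 36 | 39 | 264 | 108.6 | 573 | 1079 | 2580 | 6285 | 17604 | 328178 | 13921 | 17298 | 187877 |
    | Klessydra T03 | - | - | 1418 | 4281 | 0 | 7 | 176 | 221.1 | 1819 | 5737 | 20714 | 79230 | 47256 | 2679304 | 138959 | 46733 | 2775779 |
    | RI5CY | - | - | 2527 | 7674 | 0 | 6 | 0 | 91.4 | 1377 | 4247 | 15088 | 57020 | 37344 | 1360854 | 81534 | 37350 | 1369572 |
    | ZeroRiscy | - | - | 1933 | 5275 | 0 | 1 | 0 | 117.2 | 2510 | 8111 | 29583 | 113793 | 61158 | 4006241 | 197010 | 61163 | 4043376 |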
  • 12. SUMMARY OF PERFORMANCE RESULTS: MICROARCHITECTURE SYNTHESIS RESULTS
    [Synthesis and cycle-count table repeated from slide 11]
    • The clock speed exhibited the sharpest drops as the DLP grew larger.
    • In the symmetric MIMD scheme, the large HW overhead forced FPGA slices on the same critical path to be placed far from each other, thus increasing interconnect delay.
    • Pipelining the heterogeneous MIMD crossbar to reduce the critical path introduces additional HW overhead, compromising the area advantage.
  • 13. SUMMARY OF PERFORMANCE RESULTS
    [Synthesis and cycle-count table repeated from slide 11]
    • Small matrix convolutions and FFT on the accelerated core reached up to 2X cycle count reduction over the single-threaded, DSP-extended RI5CY core.
    • Large matrix convolutions and MatMul benefit from vector acceleration, reaching 9X cycle count reduction relative to RI5CY.
  • 14. ABSOLUTE EXECUTION TIME SPEED-UP
    • Assuming maximum clock frequency for each core
    • ZeroRiscy core taken as common reference
    • In pure SIMD configurations, the speed-up grows linearly with the DLP
    • Going from SISD/SIMD to MIMD+SIMD improved the speed-up in all cases, despite the frequency drop associated with the MIMD hardware.
    • The symmetric MIMD+SIMD schemes exhibit up to 17X speed-up over ZeroRiscy for Convolution 32x32 and up to 13X speed-up for the composite workload.
    • Heterogeneous MIMD configurations maintain an almost perfect overlap with the symmetric MIMD.
    • The non-accelerated Klessydra-T03 exhibits an absolute performance gain over RI5CY and ZeroRiscy
    [Figure: bar chart of absolute execution time speed-up (axis 0 to 18) for the T13 configurations (SISD; pure SIMD, Sym. MIMD+SIMD and Het. MIMD+SIMD at DLP 1, 2, 4, 8), Klessydra T03 (no accel.), RI5CY (DSP extension) and ZeroRiscy (no accel.); series: Conv.2D 4x4, 8x8, 16x16, 32x32, FFT 256, MatMul 64x64, Composite]
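The absolute speed-up on this slide combines the cycle counts with each core's maximum clock frequency. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the only inputs are figures from the synthesis table: Conv 32x32 under the homogeneous workload takes 113793 cycles on ZeroRiscy at 117.2 MHz and 6006 cycles on the symmetric MIMD+SIMD T13 with DLP 8 at 105.1 MHz.

```c
#include <assert.h>

/* Execution time in microseconds: cycle count divided by frequency in MHz. */
double exec_time_us(double cycles, double fmax_mhz)
{
    assert(fmax_mhz > 0.0);
    return cycles / fmax_mhz;
}

/* Speed-up in cycle count only, ignoring clock frequency. */
double cycle_speedup(double ref_cycles, double cycles)
{
    assert(cycles > 0.0);
    return ref_cycles / cycles;
}

/* Absolute speed-up, with each core running at its own maximum frequency. */
double time_speedup(double ref_cycles, double ref_mhz,
                    double cycles, double mhz)
{
    return exec_time_us(ref_cycles, ref_mhz) / exec_time_us(cycles, mhz);
}
```

With those inputs, `cycle_speedup(113793, 6006)` is about 19, while `time_speedup(113793, 117.2, 6006, 105.1)` is about 17, matching the "up to 17X" bullet; the gap between the two comes from the lower clock frequency of the larger MIMD+SIMD configuration.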
  • 15. ENERGY EFFICIENCY
    • The result of this analysis is expressed as energy per algorithmic operation for the FPGA soft-core implementations, normalized to ZeroRiscy, taken as reference.
    • The most energy-efficient designs proved to be the T13 symmetric MIMD configurations
    • The heterogeneous MIMD approach exhibited an almost complete overlap in energy consumption with the symmetric MIMD
    • The pure SIMD schemes consumed more energy than the other schemes, because they cannot efficiently exploit TLP.
    [Figure: bar chart of energy per algorithmic operation normalized to ZeroRiscy (axis 0 to 1.3) for the T13 configurations (SISD; pure SIMD, Sym. MIMD+SIMD and Het. MIMD+SIMD at DLP 1, 2, 4, 8), Klessydra T03 (no accel.), RI5CY (DSP extension) and ZeroRiscy (no accel.); series: Conv.2D 4x4, 8x8, 16x16, 32x32, FFT 256, MatMul 64x64, Composite]
  • 16. LARGER CONVOLUTION FILTERS
    Cycles = cycle count x1000, T = execution time [us], E = energy [uJ]

    | Core | DLP | 5x5 Cycles | 5x5 T | 5x5 E | 7x7 Cycles | 7x7 T | 7x7 E | 9x9 Cycles | 9x9 T | 9x9 E | 11x11 Cycles | 11x11 T | 11x11 E |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | T13 SIMD | 2 | 52.7 | 362 | 50.6 | 101.2 | 694 | 97.1 | 165.8 | 1136 | 159.1 | 246.5 | 1689 | 236.6 |
    | T13 SIMD | 8 | 24.6 | 179 | 34.4 | 46.1 | 335 | 64.5 | 74.7 | 543 | 104.7 | 110.6 | 803 | 154.8 |
    | T13 Sym MIMD | 2 | 19.5 | 148 | 26.9 | 35.8 | 272 | 49.4 | 57.4 | 436 | 79.2 | 84.4 | 641 | 116.5 |
    | T13 Sym MIMD | 8 | 11.8 | 113 | 28.9 | 19.2 | 183 | 46.9 | 29.8 | 284 | 72.7 | 42.9 | 408 | 104.7 |
    | T13 Het MIMD | 2 | 20.5 | 159 | 28.3 | 37.5 | 291 | 51.8 | 60.2 | 467 | 83.1 | 88.5 | 687 | 122.1 |
    | T03 (no accel.) | - | 247 | 1120 | 215.5 | 514.8 | 2328 | 447.9 | 881.2 | 3985 | 766.6 | 1369.1 | 6191 | 1191.1 |
    | RI5CY | - | 180 | 1971 | 252.0 | 385.3 | 4218 | 539.4 | 662.5 | 7252 | 927.5 | 1000.2 | 10949 | 1400.3 |
    | ZeroRiscy | - | 318.9 | 2721 | 226.4 | 674.5 | 5754 | 478.9 | 1129.7 | 9637 | 802.1 | 1697.8 | 14482 | 1205.4 |

    • The matrix being convolved is 32x32 elements
    • The speed-up and energy efficiency trends continue as the filter dimensions grow, reaching 35X speed-up over the ZeroRiscy reference
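The 35X figure can be checked directly against the T and E columns of the larger-filter table. The helpers below are a hypothetical sketch, not part of any Klessydra tooling; the inputs are the 11x11 entries for ZeroRiscy (14482 us, 1205.4 uJ) and for the T13 symmetric MIMD core with DLP 8 (408 us, 104.7 uJ).

```c
#include <assert.h>

/* Speed-up from two measured execution times in the same unit. */
double speedup_from_times(double t_ref, double t)
{
    assert(t > 0.0);
    return t_ref / t;
}

/* Energy saved relative to the reference core, as a percentage. */
double energy_reduction_pct(double e_ref, double e)
{
    assert(e_ref > 0.0);
    return 100.0 * (1.0 - e / e_ref);
}
```

`speedup_from_times(14482, 408)` comes out at roughly 35.5, which is where the 35X claim lands, and `energy_reduction_pct(1205.4, 104.7)` gives a saving of roughly 91% for the same pair of rows.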
  • 17. CONCLUSIONS
    • The MIMD-SIMD vector coprocessor schemes enable tuning the TLP and DLP
      • >15X absolute time speed-up, -85% energy per operation.
    • Kernels that are less effectively vectorizable can still benefit from SPMs and TLP in an IMT core
      • 2X-3X speed-up.
    • Fully symmetric MIMD and heterogeneous MIMD give very similar results
      • functional unit contention has less impact than SPM contention.
      • coprocessor contention can be effectively mitigated by functional unit heterogeneity
    • Pure DLP acceleration always gives inferior results compared to a balanced TLP/DLP acceleration.
      • The IMT microarchitecture benefits from TLP and DLP acceleration in a single core.
    • In the absence of hardware acceleration, IMT still exhibits a performance advantage over single-thread execution
      • Simplified hardware structure philosophy
  • 18. December 8-10 | Virtual Event
    Thank you for joining
    Contribute to the RISC-V conversation on social! #RISCVSUMMIT #KLESSYDRA @mauro_olivieri_
    https://github.com/klessydra
    [email protected]