Implementing AI: Hardware Challenges

• Knowledge Transfer Network (KTN) is Innovate UK’s network partner
• Innovate UK drives productivity and economic growth by supporting
businesses to develop and realise the potential of new ideas,
including those from the UK’s world-class research base.
• Connecting with Knowledge Transfer Network can “lead to potential
collaborations, horizon-expanding events, bespoke support and
innovation insights relevant to your needs.”
• Nigel Rix, Head of Enabling Technology: Nigel.rix@ktn-uk.org

eFutures aims to strengthen and support a network of people
working in electronic systems across the UK
• Building new links and increasing involvement with industry
• Mapping the national electronics research, to ensure the work across the UK is known and noted
• Encouraging and funding innovative multi-disciplinary/multi-university proposals
• Communicating with our network via a monthly magazine, social media and new website
• Running events that support our network and our strategy
• Piloting an academic Mentoring Scheme
• Launching a Big Ideas Challenge – more details soon
• Ideas warmly welcomed. Please get involved!
Twitter @efuturesuk
Sign up to our mailing list: efutures@qub.ac.uk

Agenda
10:15 - Professor Themis Prodromakis, Director of the Centre for Electronics Frontiers at the University of Southampton & founder of SoneT.ai
10:35 - Iain Wallace, Rovco
10:55 - Dr Jose Nunez-Yanez, Reader in adaptive and energy efficient computing, University of Bristol
11:15 - Matt Holdsworth, Lattice Semiconductor
11:35 - 12:00 Panel Q&A, hosted by Professor Roger Woods, Queen’s University Belfast & CTO Analytics Engines
BREAK
14:00 - 15:00 Workshop
15:30 - 16:30 1-to-1 Meetings via Meeting Mojo

Memristive Technologies
from functional oxides to AI on a chip
Themis Prodromakis
Professor of Nanotechnology
Zepler Institute, University of Southampton

Zepler Institute for Photonics & Nanoelectronics

Outline
Modern electronics challenges & the AI era needs
Memristors:
• Technology
• Tools & Infrastructure
Application Examples – beyond memory
Conclusion

Our AI is as good as our access to data
ENGINEERING CHALLENGE: “The fundamental design of separate memory and
processing places a limit on what can be achieved.”

Can we continue scaling?
The end of Moore’s law???

Memristor (Memory-resistor)
[Figures: Chua’s symmetry argument; STM image of HP’s memristor cross-bar; cross-section of a memristor’s core.]
L. Chua, “Memristor-the missing circuit element,” IEEE Trans. Circuit Theory, vol. 18, 1971.
R. Williams, “How we found the missing memristor,” IEEE Spectrum, vol. 45, 2008.

E-Beam lithography of Sub 15 nm ultrahigh density cross-bar memory chips
Memristors fabrication
Scientific Reports, 6, 32614, 2016.

Metal-oxide memristors memory capacity up to 7-bit states per cell
[Figure panels: resistance (kΩ) versus time (hrs) and versus time (ms) for programmed states S1 to S47; cumulative distribution function (%) versus resistance (kΩ).]
Memristors as analogue memory
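
As a rough illustration of what 7 bits per cell means, the sketch below maps a normalised value onto one of 2^7 = 128 discrete states and back. The evenly spaced levels and the function names are assumptions made for illustration; real device states are programmed resistance levels that need not be uniformly spaced.

```python
BITS_PER_CELL = 7              # "up to 7-bit states per cell" from the slide
LEVELS = 2 ** BITS_PER_CELL    # 128 distinguishable states


def value_to_level(x, levels=LEVELS):
    """Quantise x in [0, 1] to the nearest of `levels` evenly spaced states.

    Even spacing is an illustrative assumption, not a device property.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    return round(x * (levels - 1))


def level_to_value(level, levels=LEVELS):
    """Map a stored state index back to its normalised value."""
    return level / (levels - 1)


stored = value_to_level(0.3)
recovered = level_to_value(stored)
print(LEVELS, stored, recovered)
```

Under this assumption the worst-case quantisation error is half a step, i.e. 0.5 / 127 of the full range.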

Application Demonstrators
Examples – beyond memory

Example #1
In-silico ML implementations

Emulating synapses with memristors
Eric Kandel, Nobel Prize in Physiology or Medicine 2000

Unsupervised learning in probabilistic memristor neural network
Switching vs. resistive state relation at fixed voltage levels -> exploit to encode conditional probabilities
[Figure labels: desired switching level; approx. operating V; unsupervised learning]
Nature Communications, 7, 12611, 2016.

• Network shows capability of learning in unsupervised manner and handles mistakes rather well.
• Copes with cases where class centres drift over time.

• Whilst ‘learn once’ systems have their uses, ideally one wants something more flexible
(e.g. if class centres drift over time).

Example #2
Energy-efficient Bayesian Inference

Bayesian Inference
“Hardware-Level Bayesian Inference”, Neural Information Processing Systems (NIPS), 2017.
Computing directly in the probability domain
Vector-Matrix-Vector Scalar multiply
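
One way to read "Vector-Matrix-Vector Scalar multiply" is that the evidence term of Bayes' rule is a prior vector times a likelihood matrix times an observation vector, which yields a single scalar. The sketch below shows that arithmetic in plain Python. This reading, the function names and the numbers are assumptions for illustration; it is not the circuit from the cited paper.

```python
def evidence(prior, likelihood, obs_vector):
    """Scalar p(obs) = prior^T . L . obs_vector (vector-matrix-vector product)."""
    return sum(
        p * sum(l * o for l, o in zip(row, obs_vector))
        for p, row in zip(prior, likelihood)
    )


def posterior(prior, likelihood, obs_index):
    """Bayes' rule: p(class | obs) = p(class) * p(obs | class) / p(obs)."""
    n_obs = len(likelihood[0])
    one_hot = [1.0 if j == obs_index else 0.0 for j in range(n_obs)]
    z = evidence(prior, likelihood, one_hot)
    return [p * row[obs_index] / z for p, row in zip(prior, likelihood)]


# Hypothetical two-class problem: rows are classes, columns are observations.
prior = [0.5, 0.5]
likelihood = [[0.9, 0.1],
              [0.2, 0.8]]
post = posterior(prior, likelihood, obs_index=0)
print(post)
```

In a crossbar-style implementation the matrix entries would be stored as device states, so the multiply-accumulate happens where the data lives rather than in a separate processor.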

Example #3
Empowering new design paradigms

Our world is analogue!
Our electronics is mainly digital!

Fusing Analogue and Digital Paradigms
Charge-based computing

In silico classifiers

Example #4
Employ device physics for sensory data compression

On-node processing of rich data with single nanoscale devices
Memristive Sensors

Memristive Sensors
Spike detection & sorting with single nanoscale devices
Nature Communications, vol. 7, 12805, 2016.
RSC Faraday Discussions, 213, 511-520, 2019.

Memristive Sensors
Spike sorting with single nanoscale devices
Nature Communications, vol. 7, 12805, 2016.
RSC Faraday Discussions, 213, 511-520, 2019.

Example #5
Bio-hybrid systems: Linking Brain and Silicon Neurons

“Memristive synapses connect brain and silicon spiking neurons”, Scientific Reports, 10, 2590, 2020
A geographically distributed bio-hybrid neural network
Internet of Neuroelectronics

Unique solutions that address technology gaps across
4 computational pillars
Thinking
AI on a chip
Our chipsets will equip AI systems with sensing, recognition, learning and
reasoning capabilities, paving the way towards “Thinking Machines”.
“AI on chips” will embed intelligence everywhere

What could the future look like?

A pathway to keep your data private!

Bioelectronic Medicines
Feynman: “What I cannot create, I do not understand”
Can we replace parts of our brain?
Can we extend our brain’s capacity?
Can we…???
Augmented Intelligence

What’s next?
Challenges vs opportunities

Monolithic integration on CMOS
180nm TSMC node:
- Custom design kit
- Primitive cells (symbol, layout, extracted, Verilog-A)
- HV infrastructure
- Memory array design
Under development:
- Shared design library (IP, analogue cells, etc.)
- Scalable on-chip instrumentation
Top level reticle, overall size: 10.9mm x 13.8mm

t.prodromakis@soton.ac.uk
Acknowledgments
This work was supported by:
EU-FP7 RAMP, EP/K017829/1 and EP/R024642/1,
the Royal Academy of Engineering and the Royal Society.

Heterogeneous and adaptive computing for
energy efficient AI
Jose Nunez-Yanez
University of Bristol/Royal Society industrial fellow

Talk structure
§ The energy and performance challenge in AI.
§ Addressing this challenge with custom
hardware.
§ Optimizing energy and performance with
adaptive voltage scaling and heterogeneous
circuits.
§ Conclusions and future work.

AI is an energy guzzler
§ AI can be extremely power-hungry for both training and
inference:
§ Training is especially power intensive, but it only needs to
be done a limited number of times and can be done at
locations with no resource constraints.
§ Inference is less complex, but it needs to be done
continuously and potentially in constrained
environments (e.g. mobile computing, edge
computing, etc.)

AI hardware accelerators available
§ Hardware accelerators deliver high throughput, energy
efficiency and low latency, with power profiles ranging from
watts (e.g. Google TPU, Intel NCS, Intel/Xilinx FPGAs,
Graphcore) to milliwatts with embedded processors based
on ARM/RISC-V with parallel (e.g. RISC-V GAP) or
subthreshold voltage computing (e.g. ETA/Ambiq).
§ Challenge: how to combine and deploy these different
architectures to obtain optimal operating points for energy
efficiency and performance.

Case study: FPGA + TPU
§ Hardware consists of a ZCU102 board with
2 Xilinx DPUs and 2 Google TPU units.
§ The Xilinx DPU is a soft FPGA overlay that
adapts to the DNN complexity and FPGA
resources.
§ The Google Edge TPU is an ASIC also
based on a systolic array architecture with
similar 8-bit precision.
§ We use a single framework for both types of
devices based on TensorFlow and train only
once. Then we can freeze the network and
customize it for TPU/DPU.
[Figures: DPU architecture; TPU architecture]

Host is a Zynq UltraScale+ MPSoC device => (ARM + FPGA)
§ The Zynq processing platform is a system on a chip (SoC) processor with embedded programmable logic: processing system (PS) + programmable logic (PL).
§ Google TPU attached to high-performance USB3 interface.
[Figure: Zynq UltraScale (high performance)]

Object detection with SSD (Single Shot Detection)
§ Host ARM schedules detections in TPUs and DPUs.
§ 1 DPU power ~5.2 W and 1 TPU power ~1.2 W.
§ 1 DPU obtains up to 80 FPS and 1 TPU 35 FPS, giving 115 FPS combined.
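
The slide does not say which scheduling policy the host uses. As a minimal sketch, assuming a greedy policy that hands each frame to whichever device would finish it soonest, the quoted per-device rates (80 FPS and 35 FPS) do add up to the combined 115 FPS. The policy and all names below are hypothetical, not the authors' implementation.

```python
from fractions import Fraction

# Per-device throughput quoted on the slide (frames per second).
DEVICE_FPS = {"DPU": 80, "TPU": 35}


def schedule(num_frames, device_fps=DEVICE_FPS):
    """Greedy earliest-finish dispatch (a hypothetical policy for illustration)."""
    service = {name: Fraction(1, fps) for name, fps in device_fps.items()}
    busy_until = {name: Fraction(0) for name in device_fps}
    counts = {name: 0 for name in device_fps}
    for _ in range(num_frames):
        # Choose the device that would complete this frame first.
        best = min(device_fps, key=lambda n: busy_until[n] + service[n])
        busy_until[best] += service[best]
        counts[best] += 1
    makespan = max(busy_until.values())  # seconds until the last frame is done
    return counts, makespan


counts, makespan = schedule(115)
print(counts, float(115 / makespan), "FPS")
```

Exact fractions are used for the service times so that ties between devices are resolved deterministically rather than by floating-point rounding.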

Power Subsystem in ZCU102 board enables voltage scaling investigation
A series of PMBus commands (over I2C) are required to set the output voltage.
PMBus: Open-Standard Digital Power Management
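
To make the PMBus step more concrete, the sketch below builds the payload for a VOUT_COMMAND write using the PMBus LINEAR16 format, where the voltage equals an unsigned 16-bit mantissa times 2 to a signed exponent reported by VOUT_MODE. The VOUT_MODE readback value (0x14) is a hypothetical example, the regulators on the board may use a different format, and the code only assembles bytes: it does not talk to hardware, and it omits the device address and any packet error check byte.

```python
import struct

VOUT_MODE = 0x20     # PMBus command code: reports the output-voltage data format
VOUT_COMMAND = 0x21  # PMBus command code: sets the output voltage


def linear16_exponent(vout_mode_byte):
    """Extract the signed 5-bit exponent from a VOUT_MODE byte in linear mode."""
    if vout_mode_byte >> 5 != 0:
        raise ValueError("VOUT_MODE does not select the linear format")
    raw = vout_mode_byte & 0x1F
    return raw - 32 if raw & 0x10 else raw  # two's complement, 5 bits


def encode_vout(volts, exponent):
    """Return the unsigned 16-bit mantissa such that volts ~= mantissa * 2**exponent."""
    mantissa = round(volts / 2.0 ** exponent)
    if not 0 <= mantissa <= 0xFFFF:
        raise ValueError("voltage out of range for this exponent")
    return mantissa


def vout_command_frame(volts, exponent):
    """Command code followed by the data word, low byte first (SMBus write word)."""
    return bytes([VOUT_COMMAND]) + struct.pack("<H", encode_vout(volts, exponent))


exponent = linear16_exponent(0x14)          # hypothetical VOUT_MODE readback
frame = vout_command_frame(0.85, exponent)  # request 0.85 V
print(exponent, frame.hex())
```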

Better FPGA energy efficiency with Adaptive Voltage Scaling
§ Elongate is a tool and a set of IP blocks that control the frequency and voltage and detect optimal operating points using in-situ detectors.
§ Elongate instruments the FPGA design with in-situ timing detectors.
[Figure: Elongate implementation flow. Stages: HLS (High Level Synthesis), SYN, Elongate, MAP, PLACE & ROUTE, BITGEN. Files: OpenCL/C++ source, .VHD/.V source, .v/.vhd netlist, .NCD netlist, .TWR timing, .BIT bitstream. Other inputs: user constraints, NTC component library.]
Example of timing detector for logic
§ Soft-macro detectors create different paths for the slow flip-flop (SFF) and the main flip-flop (MFF).
§ Discrepancies between MFF and SFF are detected by an XOR gate.
§ MFF replicates the functionality of the original flip-flop in the critical path.
[Figure labels: D input, data steering (Generate 0, Generate 1, 0/1), MUXF5, MUXF8, MFF, SFF, XOR, synchronizer FF, Q output, detector output.]
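
The detector idea can be sketched as a simple behavioural model: the SFF sees the same data through a slower path, so it misses the clock edge before the MFF does, and the XOR of the two flags that the timing margin is nearly exhausted. The delays, the clock period and the function names below are illustrative assumptions, not values from the Elongate design.

```python
def capture(new_value, old_value, arrival_ns, period_ns):
    """Return what a flip-flop latches: the new bit if it arrived in time, else the stale bit."""
    return new_value if arrival_ns <= period_ns else old_value


def timing_detector(new_value, old_value, path_delay_ns, sff_extra_delay_ns, period_ns):
    """Model MFF, SFF and the XOR comparison for one clock cycle."""
    mff = capture(new_value, old_value, path_delay_ns, period_ns)
    # The SFF path is deliberately slower, so it fails first as delays grow.
    sff = capture(new_value, old_value, path_delay_ns + sff_extra_delay_ns, period_ns)
    return mff, sff, mff ^ sff


# Hypothetical 5 ns clock period with 0.5 ns of extra delay on the SFF path.
safe = timing_detector(1, 0, path_delay_ns=4.0, sff_extra_delay_ns=0.5, period_ns=5.0)
warning = timing_detector(1, 0, path_delay_ns=4.75, sff_extra_delay_ns=0.5, period_ns=5.0)
print(safe, warning)
```

In the second case the MFF still holds the correct value while the detector output is already raised, which is what allows voltage to be lowered until the first warning without corrupting results.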

AI architecture with voltage and frequency scalability
[Figure: block diagram with an ARM A53 MP host, AXI interconnects and CCI (cache coherent interface), four BNN cores (BNN_ZU0 to BNN_ZU3) each with a DMA engine (DMA0 to DMA3) and 128-bit AXI master/slave ports (HPM0, HPM1, HP0 to HP3), an ELO CONTROL FREQ/PHASE block with ELO_CLK and ELO_CLK_PHASE, detector enable and error signals, LEDs (locked, error, debug), and peripheral and PMBus interfaces connected over I2C to the VCCINT power rail voltage regulators.]
§ Only one FPGA core is instrumented with Elongate detectors.
§ All cores use the same voltage and frequency.
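
A control loop built on such a detector might look like the sketch below: start at the highest voltage, step down while the detector stays quiet, and stop at the last voltage that produced no error. The callback `detector_fires`, the step size and the loop itself are hypothetical; the default limits follow the 0.55 V to 0.85 V range quoted elsewhere in these slides, and integer millivolts are used to avoid floating-point drift.

```python
def find_operating_voltage_mv(detector_fires, v_max_mv=850, v_min_mv=550, step_mv=10):
    """Lower the supply until the in-situ detector would fire, then hold.

    detector_fires(mv) is a hypothetical callback that applies a voltage
    (in millivolts) and reports whether the timing detector raised an error.
    """
    v = v_max_mv
    if detector_fires(v):
        raise RuntimeError("timing errors even at the maximum voltage")
    while v - step_mv >= v_min_mv and not detector_fires(v - step_mv):
        v -= step_mv
    return v


# Stand-in for hardware: pretend the detector fires below 620 mV.
def fake_detector(mv):
    return mv < 620


chosen = find_operating_voltage_mv(fake_detector)
print(chosen, "mV")
```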

Elongate complexity overheads (LUTs and FFs)
[Figure: overhead (%) versus path timing (%), both axes from 0 to 14, for the series FF ZY, LUTS ZY, FF ZU and LUTS ZU.]

Power scalability in Zynq ultrascale
§ Voltage levels range from 0.55 V to 0.85 V for the 16nm Zynq Ultrascale device.
§ Elastic power/performance with up to 85% power reduction or 2x performance.
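
For intuition about why voltage matters so much, the first-order CMOS model says dynamic power scales with the square of the supply voltage times the clock frequency. The sketch below applies that textbook relation to the two voltage limits on the slide. It is an approximation that ignores static power, and it is not a derivation of the measured 85% figure.

```python
def relative_dynamic_power(v_new, v_ref, f_ratio=1.0):
    """First-order model: P_dyn is proportional to V^2 * f (static power ignored)."""
    return (v_new / v_ref) ** 2 * f_ratio


# Same clock frequency, supply lowered from 0.85 V to 0.55 V.
same_freq = relative_dynamic_power(0.55, 0.85)
print(f"dynamic power at 0.55 V: {same_freq:.1%} of the 0.85 V value")
```

Any reduction in clock frequency at the lower voltage enters through `f_ratio` and reduces power further, at the cost of throughput.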

Adaptive Voltage Scaling Applied to convolutional neural network
[Figure: accuracy (%) versus frequency (MHz) for the series run_0.85v, run_0.75v, run_0.65v and run_0.55v. Labelled points (frequency in MHz, accuracy in %): (380.95, 78.5), (352.38, 78.4), (292.59, 78.4), (180.95, 78.3).]

Conclusions and Future work
§ FPGAs enable custom neural network circuits with
many levels of precision from 1-bit to floating point.
§ FPGA hardware instrumentation enables significantly better
energy efficiency and performance at run-time.
§ Heterogeneous hardware can work together with different
network precisions and power operating points.
§ Explore novel ways to combine heterogeneous hardware that
includes other architectures in addition to CPUs/TPUs and
FPGAs to deliver energy proportional AI.
§ More details:
§ J. Nunez-Yanez, "Energy Proportional Neural Network Inference
with Adaptive Voltage and Frequency Scaling," in IEEE
Transactions on Computers, vol. 68, no. 5, pp. 676-687, 1 May 2019.

Acknowledgement
• Thanks to the Royal Society for the
MINET industrial fellow award and to Xilinx
for the hardware/software support.

Matt Holdsworth, Lattice Semiconductor

LATTICE RISING
2020
Delivering Milliwatt AI to the Edge with
Ultra-Low Power FPGAs
Matt Holdsworth
FAE, Lattice Semiconductor
matt.holdsworth@latticesemi.com

- NASDAQ: LSCC
Rapidly Emerging Edge Computing Trend
Driven by Latency, Privacy, and Bandwidth Limitations
[Figure labels: Edge, Networking, Cloud; IoT, Communication Gateway, Wireless / Wireline Access, Core Network]
Unit growth for edge devices with AI will explode, increasing at over 110% CAGR
over the next five years – Semico Research

HARDWARE PLATFORMS
▪ UPduino + Himax Shield – iCE40 UltraPlus FPGA (1 mW, 5.5 mm², 1/8/16 bits)
▪ Embedded Vision Development Kit – ECP5 FPGA (1 W, 100 mm², 1/8/16 bits)
IP CORES
▪ CNN Compact Accelerator
▪ CNN Accelerator
SOFTWARE TOOLS
▪ Neural Network Compiler
REFERENCE DESIGNS / DEMOS
▪ Key Phrase Detection, Object Counting, Object Identification, Human Presence Detection, Face Tracking, Hand Gesture Detection
CUSTOM DESIGN SERVICES
Smart Car, Smart Home, Smart City, Smart Factory
Ultra Low Power, Small Form Factor, Customizable Neural Network Accelerators

Focus Applications
Object Detection, Human Machine Interface (HMI), Object Identification
▪ Defect detection in smart security and embedded vision cameras
▪ Feature extraction enabling navigation of robots
▪ Key Phrase detection to control smart appliances

Reference Design / Demo – Human Presence Detection
FEATURES
Sensor: CMOS image sensor
Speed: 5 frames per second
Power: 7 mW on iCE40 UltraPlus
ALWAYS ON HUMAN DETECTION IN APPLIANCE
LOW POWER HUMAN DETECTION FOR WAKE ON APPROACH FOR LAPTOPS AND PRINTERS

Reference Design / Demo – Object Counting
FEATURES
Sensor: CMOS image sensor
Speed: 17 frames per second - Lower Latency
Power: 850 mW on ECP5-85K
HUMAN DETECTION IN VIDEO SECURITY DEVICES
HUMAN COUNTING IN RETAIL CAMERA APPLICATIONS
DEFECT DETECTION AND OPERATOR COMPLIANCE IN SMART FACTORY CAMERAS
[Figure label: Defect Detected, Type: Crack]

Popular sensAI Accelerator Use Cases
Stand-alone, Preprocessing, Post Processing

Hardware Platforms
Modular Platforms for Rapid Prototyping
HM01B0 UPduino Shield Board
Key features
▪ Video and Audio sensors
▪ Compact 22 x 50 mm
▪ Includes HM01B0 image sensor board
▪ Arduino Micro form factor UltraPlus board
Embedded Vision Development Kit
Key features
▪ ECP5 FPGA consuming under 1 W of power
▪ Flexible video connectivity with support for MIPI
CSI-2, eDP, HDMI, GigE Vision, USB 3.0, and more

Software Tools
Key Features
▪ Implement networks developed using
standard frameworks into Lattice FPGAs
without prior RTL experience
▪ Rapidly analyze, simulate, and compile
CNNs/BNNs for implementation on Lattice
sensAI IP cores

Customizable Reference Designs
[Figure: design flow spanning ML Frameworks, Lattice sensAI Components and Lattice FPGA Design Tools. Blocks: Training Dataset, Training Scripts, NN Models, Training, Trained Model, NN Compiler, Quantized Weights and Instructions, NN IP, System Interface, FPGA Design, FPGA Bitstream.]

ALWAYS-ON HUMAN COUNTING
Higher Performance and Lower Power with CrossLink-NX
PERFORMANCE: MCU 1 fps; ECP5-45K 5 fps (5x FASTER); NX-40K 10 fps (2x FASTER)
POWER: SoC 2W; ECP5-45K 400mW (5x LOWER); NX-40K 200mW (2x LOWER)
[Figure labels: Sensors, Sensor Interface, Lattice CrossLink-NX, SRAM (weights / activations), MCU, Results]
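
Speed and power can be folded into a single figure of merit, energy per processed frame. The sketch below reads the slide's numbers as ECP5-45K at 5 fps and 400 mW and NX-40K at 10 fps and 200 mW. That pairing is an interpretation of the chart labels, consistent with the "2x FASTER" and "2x LOWER" callouts, and the function name is made up for illustration.

```python
def energy_per_frame_mj(power_mw, fps):
    """Energy per processed frame in millijoules: mW / (frames per second) = mJ per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return power_mw / fps


# (power in mW, frames per second), as read from the slide.
devices = {"ECP5-45K": (400, 5), "NX-40K": (200, 10)}
energy = {name: energy_per_frame_mj(p, f) for name, (p, f) in devices.items()}
ratio = energy["ECP5-45K"] / energy["NX-40K"]
print(energy, ratio)
```

Under this reading, halving power while doubling frame rate compounds to a fourfold reduction in energy per frame.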

Summary of Latest sensAI Updates
HIGHER ACCURACY, HIGHER SPEED, LOWER POWER, REFERENCE DESIGNS
▪ CrossLink-NX, the dedicated embedded vision and AI inference FPGA, provides the highest accuracy at the lowest power
▪ ECP5 FPGA extends support to MobileNet and ResNet for higher speed processing at high accuracy
▪ iCE40 UltraPlus, the ultra-low power edge AI accelerator, now delivers higher accuracy at the lowest power
▪ New and updated demos and end-to-end reference designs: Key Phrase Detection, Human Identification, Human Presence Detection, Object Counting with MobileNet

Reference Design
▪ Where to find sensAI page
• Applications -> AI/Machine Learning

Where to find Demos and Reference Designs
▪ Demos:
• Provided as bitstream and
Quickstart Guide
• Allows easy demonstration of
functionality.
▪ RDs:
• Complete solution: RTL Code,
Training Scripts, Dataset,
Complete User Guide
• Allows user to reproduce
solution and reuse in own
design framework.

