AI Crash Course- Supercomputing

Intel FPGAs for AI, Supercomputing 2018

Scale Your Innovation
Why FPGAs Win in Deep Learning
Enabling real-time AI in a wide range of embedded, edge, and data center applications
First to market to accelerate evolving AI workloads
▪ Precision
▪ Latency
▪ Sparsity
▪ Adversarial Networks
▪ Reinforcement Learning
▪ Neuromorphic Computing
▪ …
Low-latency, memory-constrained workloads
▪ RNN
▪ LSTM
▪ Speech workloads
Delivering AI+ for flexible system-level functionality
▪ AI + I/O ingest
▪ AI + networking
▪ AI + security
▪ AI + pre/post-processing
▪ …

FPGAs: Flexible for Evolving Precision
Throughput and Accuracy for various PE configurations on ResNet Topologies
                         ResNet-34 1x Wide     ResNet-34 2x Wide     ResNet-34 3x Wide
Activation   Weight      Eq TOPS   Top-1 Acc   Eq TOPS   Top-1 Acc   Eq TOPS   Top-1 Acc
FP32         FP32        7         0.7359      NR        NR          NR        NR
8-bit        8-bit       8         0.7093      2         NR          1         NR
8-bit        Ternary     43        0.6919      11        NR          5         NR
8-bit        Binary      52        NR          13        NR          6         NR
4-bit        4-bit       18        0.7033      5         0.7453      2         NR
3-bit        3-bit       51        NR          13                    6         NR
2-bit        2-bit       85        0.6793      21        0.7332      9         NR
2-bit        Ternary     98        0.6793      25        0.7332      11        NR
1-bit        1-bit       267       0.6054      67        0.6985      30        0.7238
▪ Explore precision and accuracy balance
▪ 4X performance gain with the same FPGA
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Copyright © 2017, Intel Corporation
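The precision and accuracy balance on this slide can be illustrated with a minimal sketch of uniform symmetric quantization. This is a generic textbook scheme, not the method behind the table above; the weight values are made up, and the function names `quantize` and `mean_abs_error` are hypothetical.

```python
def quantize(weights, bits):
    """Uniform symmetric quantization of a list of floats to `bits` bits."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    # Number of positive levels, e.g. 127 for 8-bit and 1 for 2-bit
    # (so 2-bit here reduces to three values: -max, 0, +max).
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs else 1.0
    # Snap each weight to the nearest multiple of `scale`.
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(a, b):
    """Average absolute difference between two equally long lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up weights, for illustration only.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.6, -0.12, 0.0]

errors = {bits: mean_abs_error(weights, quantize(weights, bits))
          for bits in (8, 4, 2)}

for bits, err in errors.items():
    print(f"{bits}-bit: mean abs error = {err:.4f}")
```

The quantization error grows as the bit width shrinks, which is the cost side of the trade-off; the benefit side (more equivalent TOPS from the same fabric) is what the table reports.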

FPGAs Solve Memory-Bound Workloads
Mozilla DeepSpeech topology implementation
▪ Intel® Stratix 10 MX can further reduce latency by directly ingesting the speech signal
Stream Length    FPGA (estimated) (16 bit)
1s               0.003s
10s              0.312s
20s              0.624s
40s              1.25s
*Estimations performed by Manjeera Design Systems. Assumption: ~4.4 TOPs of 16b compute (8192 MACs at 266MHz) for Intel Stratix 10 MX
▪ Intel Stratix 10 MX offers 512GBps bandwidth via multiple integrated HBMs
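The footnote's "~4.4 TOPs" assumption can be checked with simple arithmetic. The sketch below assumes each MAC counts as two operations per cycle (one multiply, one add), which is a common convention but is not stated on the slide. It also divides each estimated latency by its stream length, using the table values exactly as given.

```python
MACS = 8192          # multiply-accumulate units, from the slide's footnote
CLOCK_HZ = 266e6     # 266 MHz, from the slide's footnote
OPS_PER_MAC = 2      # assumption: one multiply plus one add per MAC per cycle

# Peak compute in tera-operations per second.
peak_tops = MACS * OPS_PER_MAC * CLOCK_HZ / 1e12

# Estimated FPGA latency (seconds) per stream length (seconds), from the table.
latency_s = {1: 0.003, 10: 0.312, 20: 0.624, 40: 1.25}

# Ratio of processing time to audio duration; below 1 means faster than real time.
real_time_factor = {length: t / length for length, t in latency_s.items()}

print(f"peak compute ~ {peak_tops:.2f} TOPS")
for length, rtf in real_time_factor.items():
    print(f"{length:>2}s stream: latency / stream length = {rtf:.4f}")
```

Under that assumption the product comes to about 4.36 TOPS, consistent with the slide's rounded figure, and every listed stream is processed in a small fraction of its own duration.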

AI + Flexible I/O & Networking
Per-chip performance increases when scaled
AI + I/O & networking unlocks nonlinear performance gains through pooling
2x improvement w/ ResNet-101
[Diagram: three Intel® Xeon® Processors and three Intel® Arria® 10 FPGAs]

AI + Pre/Post-Processing & Direct I/O Provides Low Latency
FPGAs can perform in-line, real-time acceleration on the data ingest and avoid costly data movement within the system
Lower system latency
[Diagram labels: Data Sources, FPGA, Intel® Xeon® Processor, Compute Latency]


How Intel® FPGAs Enable Deep Learning
▪ Millions of reconfigurable logic elements & routing fabric
▪ Thousands of 20Kb memory blocks & MLABs
▪ Thousands of variable precision digital signal processing (DSP) blocks
▪ Hundreds of configurable I/O & high-speed transceivers
▪ Programmable Datapath
▪ Customized Memory structure
▪ Configurable compute
[Diagram labels: I/O]

Adapting to Innovation
Many efforts to improve efficiency
▪ Batching
▪ Reduce bit width
▪ Sparse weights
▪ Sparse activations
▪ Weight sharing
▪ Compact network
SparseCNN [CVPR'15]
Spatially SparseCNN [CIFAR-10 winner '14]
Pruning [NIPS'15]
TernaryConnect [ICLR'16]
BinaryConnect [NIPS'15]
DeepComp [ICLR'16]
HashedNets [ICML'15]
XNORNet
SqueezeNet
LeNet [IEEE]
AlexNet [ILSVRC'12]
VGG [ILSVRC'14]
GoogleNet [ILSVRC'14]
ResNet [ILSVRC'15]
[Diagram: I × W = O, with shared weights]
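Two of the efficiency techniques listed on this slide, sparse weights and weight sharing, can be sketched in a few lines. This is a simplified illustration and not the algorithm from any of the cited papers; the weights, the pruning threshold, and the four-entry codebook are made-up values, and the function names are hypothetical.

```python
def prune(weights, threshold):
    """Magnitude pruning: zero out weights whose absolute value is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def share_weights(weights, codebook):
    """Replace each weight with the index of its nearest codebook value."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - w))
            for w in weights]


# Made-up weights, for illustration only.
weights = [0.91, -0.02, 0.48, 0.03, -0.52, 0.95, -0.01, 0.5]

pruned = prune(weights, threshold=0.1)
sparsity = pruned.count(0.0) / len(pruned)   # fraction of weights that became zero

# Hypothetical codebook: four shared values, so each weight needs a 2-bit index.
codebook = [-0.5, 0.0, 0.5, 1.0]
indices = share_weights(pruned, codebook)
reconstructed = [codebook[i] for i in indices]

print("sparsity:", sparsity)
print("indices:", indices)
print("reconstructed:", reconstructed)
```

Storing small indices instead of full-precision values shrinks the memory footprint, and zeroed weights can be skipped entirely, which is why a programmable datapath and custom memory structure are relevant to these techniques.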

Performance Improvement over Time
SqueezeNet and GoogleNet Performance over Time, Batch=1
Model        Sept-17 (baseline)   Dec-17   Feb-18   Apr-18   Jun-18   Oct-18   Dec-18 (projected)
SqueezeNet   1x                   1.13x    1.75x    2.61x    3.89x    4.33x    4.51x
GoogleNet    1x                   1.13x    1.22x    1.46x    3.55x    4.11x    4.50x
▪ Continually adapting the custom data flow, memory hierarchy and compute enables improved performance with the same power footprint
[Chart: performance (img/s) over time for SqueezeNet and GoogleNet]
Intel® FPGA Deep Learning Acceleration Suite
Pre-compiled Graph Architecture
[Diagram: DDR (x4), Configuration Engine, Memory Reader/Writer, Crossbar, Conv PE Array, Feature Map Cache, PRIM blocks, CUSTOM* block]
*Deeper customization options COMING SOON!
Example Topologies
AlexNet, GoogleNet, Tiny Yolo, SqueezeNet, VGG16, ResNet 18, ResNet 50, ResNet 101, MobileNet, ResNet SSD, SqueezeNet SSD, …*
*More topologies added with every release

OpenVINO™ Toolkit for Intel FPGAs
An all-in-one solution to easily harness the benefits of FPGAs
▪ Enables developers and data scientists to take their prototype application to production
▪ Utilize API-based & direct coding to maximize performance
▪ Deeper customization capabilities coming soon
[Diagram: OpenVINO™ Toolkit, containing the Intel Deep Learning Deployment Toolkit (Model Optimizer, Inference Engine) and the Intel FPGA DL Acceleration Suite; Heterogeneous CPU/FPGA Deployment on Intel Xeon® Processor and Intel FPGA]
Today's Intel FPGA supported deep learning frameworks
Free Download: software.intel.com/openvino-toolkit
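The idea behind heterogeneous CPU/FPGA deployment can be sketched as a priority-based assignment: each layer runs on the first device in a priority list that supports it, and falls back to the next device otherwise. The sketch below does not call OpenVINO and does not reproduce its internals; the function `assign_devices`, the capability table, and the layer list are all hypothetical.

```python
def assign_devices(layers, priority, supported):
    """Give each layer the first device in `priority` that supports its type."""
    plan = {}
    for name, layer_type in layers:
        for device in priority:
            if layer_type in supported[device]:
                plan[name] = device
                break
        else:
            # No device in the priority list can run this layer type.
            raise ValueError(f"no device supports layer {name!r} ({layer_type})")
    return plan


# Hypothetical capability table; not taken from any Intel documentation.
supported = {
    "FPGA": {"Convolution", "ReLU", "Pooling"},
    "CPU": {"Convolution", "ReLU", "Pooling", "SoftMax", "Custom"},
}

# Hypothetical network: (layer name, layer type) in execution order.
layers = [
    ("conv1", "Convolution"),
    ("relu1", "ReLU"),
    ("pool1", "Pooling"),
    ("prob", "SoftMax"),
]

plan = assign_devices(layers, priority=["FPGA", "CPU"], supported=supported)
for name, device in plan.items():
    print(f"{name}: {device}")
```

In this made-up example the convolutional layers land on the FPGA while the final SoftMax falls back to the CPU, which is the kind of split a heterogeneous deployment is meant to handle without manual partitioning.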

Your Application Acceleration with FPGA-Powered Platforms
Software tools: Intel® OpenVINO™ toolkit
Develop NN Model; Deploy across Intel® CPU, GPU, VPU, FPGA; Leverage common algorithms
Supported platforms for FPGA (interface; currently manufactured by*)
▪ Intel Programmable Acceleration Card with Intel Arria 10 (PCIe x8; Intel®)
▪ Intel® Arria® 10 Development Kit (PCIe x8; Intel®)
▪ Mustang F-100 (PCIe x8)
*Please contact Intel representative for complete list of ODM manufacturers. Other names and brands may be claimed as the property of others.

Use Case 1: Search
Solution Search
Looking for a quick path to deploy and accelerate instant reverse image searches of products for retail convenience
Solution Success
Intel® FPGAs offered real-time AI inferencing using OpenVINO™ toolkit. This enabled engineers to map neural networks to FPGA, accelerating image searches with increased throughput and lower latency, all without the need for FPGA programming experience
Real-time AI optimized for performance, power and cost
OpenVINO™ Toolkit
Accelerating workloads, enabling deep learning capabilities for smarter and faster ways to transform data for competitive edge
Intel Programmable Acceleration Card with Intel Arria® 10 FPGA
Deployment-ready PCIe-based card with versatile built-in multifunction acceleration capabilities with low-power dissipation and low-profile form factor
Acceleration stack for Intel® Xeon® CPU with FPGAs
Abstracting programming complexity and maximizing ease of use by hot-swapping accelerators and enabling application portability for Intel FPGA based acceleration solutions

Use Case 2: Microsoft's AI for Earth
Microsoft leverages the multimode capabilities of Intel FPGAs to push through the memory wall to maximize performance
Project Brainwave with Intel® Stratix® 10 gives Performance/$: only $42 of compute*
Land cover mapping for the whole US: 200M images, 20TB, 10+ minutes
*Microsoft's Blog

Summary
Intel FPGAs enable
Delivering AI+ for flexible system-level functionality
First to market to accelerate evolving AI workloads
▪ OpenVINO™ Toolkit is free to download and enables you to deploy on Intel FPGAs directly from TensorFlow or Caffe
▪ Intel's FPGA architecture enables programmable datapath, custom memory structure and configurable compute

Resources
Get started quickly with:
▪ Find out more online at www.intel.com/ai and www.intel.com/fpga
▪ Intel Tech.Decoded online webinars, tool how-tos & quick tips
▪ Hands-on in-person events
Intel FPGA Training
https://www.intel.com/content/www/us/en/programmable/support/training/overview.html
Support
▪ Connect with Intel engineers & AI experts via the public Community Forum
Download
Free OpenVINO™ toolkit
