©2017, Amazon Web Services, Inc. or its affiliates. All rights reserved
FPGAs in the cloud?
Julien Simon, Principal Evangelist, AI/ML
@julsimon
Velocity Conference, NYC, 04/10/2017
Agenda
• The case for non-CPU architectures
• What is an FPGA?
• Using FPGAs on AWS
• Demo: running an FPGA image on AWS
• FPGAs and Deep Learning
• Resources
The case for non-CPU architectures
Source: Intel
Powering AWS instances: Intel Xeon E7 v4
• 7.1 billion transistors
– 456 mm2 (0.7 square inch)
• General-purpose architecture
– SISD with SIMD extension (AVX instruction set)
• Best single-core performance
• Low parallelism
– 24 cores, 48 hyperthreads
– Multi-threaded applications are hard to build
– OS and libraries need to be thread-friendly
• Thermal envelope: 165W
https://ark.intel.com/products/96900/Intel-Xeon-Processor-E7-8894-v4-60M-Cache-2_40-GHz
Case study: Clemson University
1.1 million vCPUs for Natural Language Processing
Optimized cost thanks to Spot Instances
https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/
Moore’s winter is (probably) coming
• « I guess I see Moore’s Law dying here in the next decade or so, but
that’s not surprising », Gordon Moore, 2015
• Technology limits: a Skylake transistor is around 100 atoms across
• New workloads require higher parallelism to achieve good performance
– Genomics
– Financial computing
– Image and video processing
– Deep Learning
• The age of the GPU has come
http://www.economist.com/technology-quarterly/2016-03-12/after-moores-law
https://spectrum.ieee.org/computing/hardware/gordon-moore-the-man-whose-name-means-progress
State of the art GPU: Nvidia V100
• 21.1 billion transistors
– 815 mm2 (1.26 square inch)
• Architecture optimized for floating point
– SIMT (Single Instruction, Multiple Threads)
• Massive parallelism
– 5120 CUDA cores, 640 Tensor cores
– CUDA programming model
– Large, high-bandwidth off-chip memory (DRAM)
• Thermal envelope: 250W
https://www.nvidia.com/en-us/data-center/tesla-v100/
https://devblogs.nvidia.com/parallelforall/inside-volta/
GPUs are not optimal for some applications
• Power consumption and efficiency (TOPS/Watt)
• Strict latency requirements
• Other requirements
– Custom data types, irregular parallelism, divergence
• Building your own ASIC may solve this, but:
– It’s a huge, costly and risky effort
– ASICs can’t be reconfigured
• Time for an FPGA renaissance?
What’s an FPGA?
The FPGA
• First commercial product by Xilinx in 1985
• Field Programmable Gate Array
• Not a CPU (although you could build one with it)
• « Lego » hardware: logic cells, lookup tables, DSP, I/O
• Small amount of very fast on-chip memory
• Build custom logic to accelerate your SW application
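To make the « Lego » idea concrete, here is a minimal Python sketch of a lookup table (LUT), one of the building blocks listed above. It assumes a k-input LUT is simply a truth table of 2^k stored bits. The names `LUT`, `from_function` and `full_adder` are hypothetical, chosen for illustration, and are not part of any FPGA toolchain.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


class LUT:
    """Illustrative k-input lookup table: 2**k stored bits, one per input combination."""

    def __init__(self, k: int, bits: Sequence[int]):
        if len(bits) != 2 ** k:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.k = k
        self.bits = list(bits)

    @classmethod
    def from_function(cls, k: int, fn: Callable[..., int]) -> "LUT":
        # "Programming" the LUT: evaluate fn once per input combination
        # and keep the results as configuration bits.
        bits = [fn(*inputs) & 1 for inputs in product((0, 1), repeat=k)]
        return cls(k, bits)

    def __call__(self, *inputs: int) -> int:
        # Evaluation is an indexed read: the inputs form the address
        # (first input is the most significant address bit).
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.bits[address]


# The same block becomes XOR or majority depending only on its configuration.
xor2 = LUT.from_function(2, lambda a, b: a ^ b)
maj3 = LUT.from_function(3, lambda a, b, c: int(a + b + c >= 2))

# A 1-bit full adder assembled from two 3-input LUTs (sum and carry).
sum_lut = LUT.from_function(3, lambda a, b, cin: a ^ b ^ cin)
carry_lut = maj3


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """Return (sum, carry_out) for one bit position."""
    return sum_lut(a, b, cin), carry_lut(a, b, cin)
```

Changing the stored bits changes the function without changing the hardware, which is the sense in which the gate array is "field programmable".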
FPGA architecture
Sources:
https://www.embedded-vision.com/industry-analysis/technical-articles/fpgas-deep-learning-based-vision-processing
http://www.bober-optosensorik.de/fpga-entwicklung.html
Developing FPGA applications
• Languages
– VHDL, Verilog
– OpenCL (C++)
• Software tools
– Design
– Simulation
– Synthesis
– Routing
• Hardware tools
– Evaluation boards
– Prototypes
Expensive and hard to scale
Using FPGAs on AWS
Amazon EC2 F1 Instances
• Up to 8 Xilinx UltraScale Plus VU9P FPGAs
• Each FPGA includes
– Local 64 GB DDR4 ECC protected memory
– Dedicated PCIe x16 connections
– Up to 400Gbps bidirectional ring connection for high-speed streaming
– Approximately 2.5 million logic elements, and approximately 6,800 DSP engines
The FPGA Developer Amazon Machine Image (AMI)
• Xilinx SDx 2017.1
– Free license for F1 FPGA development
– Supports VHDL, Verilog, OpenCL
• AWS FPGA SDK
– Amazon FPGA Image (AFI) Management Tools
– Linux drivers
– Command line
• AWS FPGA HDK
– Design files and scripts required to build an AFI
– Shell: platform logic to handle external peripherals, PCIe, DRAM, and interrupts
• Run simulation, design, etc. on a C4 to save money!
https://aws.amazon.com/marketplace/pp/B06VVYBLZZ
https://github.com/aws/aws-fpga
FPGA Acceleration Using F1 instances
[Diagram labels: Amazon Machine Image (AMI), Amazon FPGA Image (AFI), AWS Marketplace, F1 instance, CPU, PCIe, FPGA Link, DDR controllers, DDR-4 attached memory (x8)]
Case study: Edico Genome
Highly Efficient
• Algorithms Implemented in Hardware
• Gate-Level Circuit Design
• No Instruction Set Overhead
Massively Parallel
• Massively Parallel Circuits
• Multiple Compute Engines
• Rapid FPGA Reconfigurability
Speeds Analysis of Whole Human Genomes from Hours to Minutes
Unprecedented Low Cost for Compute and Compressed Storage
http://www.edicogenome.com/
https://aws.amazon.com/marketplace/pp/B075JR57J1
Case study: NGCodec
• Provider of UHD video compression technology
• Up to 50x faster vs. software H.265
• Higher quality video than x265 ‘veryslow’ preset
– Same bit rate
– 60+ frames per second
• Lower latency between live stream and end viewing
• Optimized cost
https://ngcodec.com/markets-cloud-transcoding/
https://aws.amazon.com/marketplace/pp/B074W1FPKR
Demo: OpenCL on F1 instance
Building the OpenCL application
git clone https://github.com/aws/aws-fpga.git
cd aws-fpga
source sdk_setup.sh
source hdk_setup.sh
source sdaccel_setup.sh
source $XILINX_SDX/settings64.sh
cd $SDACCEL_DIR/examples/xilinx/getting_started/host/helloworld_ocl/
make clean
make check TARGETS=sw_emu DEVICES=$AWS_PLATFORM all
make check TARGETS=hw_emu DEVICES=$AWS_PLATFORM all
make check TARGETS=hw DEVICES=$AWS_PLATFORM all
Creating Vivado project and starting FPGA synthesis
…
INFO: [XOCC 60-586] Created xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin
Total elapsed time: 2h 31m 7s
$SDACCEL_DIR/tools/create_sdaccel_afi.sh -xclbin=xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin -o=vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0 -s3_bucket=jsimon-fpga -s3_logs_key=logs -s3_dcp_key=dcp
…
Generated manifest file '17_10_02-163912_manifest.txt'
upload: ./17_10_02-163912_Developer_SDAccel_Kernel.tar to s3://jsimon-fpga/dcp/17_10_02-163912_Developer_SDAccel_Kernel.tar
17_10_02-163912_agfi_id.txt
Building the AFI
aws ec2 describe-fpga-images --fpga-image-id afi-056fb17ddb8cedf37
{ "FpgaImages": [{
"UpdateTime": "2017-10-02T16:39:17.000Z",
"Name": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin",
"FpgaImageGlobalId": "agfi-03a8031774fc4773f",
"Public": false,
"State": { "Code": "pending"},
"OwnerId": "6XXXXXXXXXXX",
"FpgaImageId": "afi-056fb17ddb8cedf37",
"CreateTime": "2017-10-02T16:39:17.000Z",
"Description": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin" }]
}
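AFI generation runs in the background, so the state has to be checked repeatedly until it is no longer "pending". The following Python sketch shows one way to automate that check. It only assumes a response shaped like the JSON above; `afi_state` and `wait_for_afi` are hypothetical helper names, and the polling interval and retry count are arbitrary choices.

```python
import time
from typing import Callable, Dict


def afi_state(response: Dict) -> str:
    """Extract the state code from a describe-fpga-images style response."""
    images = response.get("FpgaImages", [])
    if not images:
        raise ValueError("no FPGA image in response")
    return images[0]["State"]["Code"]


def wait_for_afi(describe: Callable[[], Dict],
                 poll_seconds: float = 60.0,
                 max_polls: int = 240,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the AFI leaves the 'pending' state, then return that state.

    Any code other than 'pending' is treated as terminal; the caller decides
    whether the returned state means success.
    """
    for _ in range(max_polls):
        state = afi_state(describe())
        if state != "pending":
            return state
        sleep(poll_seconds)
    raise TimeoutError("AFI still pending after max_polls attempts")


# Simulated responses shaped like the JSON above: one pending poll, then ready.
_fake = iter([
    {"FpgaImages": [{"State": {"Code": "pending"}}]},
    {"FpgaImages": [{"State": {"Code": "ready"}}]},
])
final_state = wait_for_afi(lambda: next(_fake), sleep=lambda s: None)
```

In a real session, `describe` could wrap the same query as the CLI command, for example through the boto3 EC2 client's `describe_fpga_images(FpgaImageIds=[...])` call; that wiring is left out here so the sketch runs without AWS credentials.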
Loading the AFI and running the OpenCL application
aws ec2 describe-fpga-images --fpga-image-id afi-056fb17ddb8cedf37
{ "FpgaImages": [{
"UpdateTime": "2017-10-02T16:39:17.000Z",
"Name": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin",
"FpgaImageGlobalId": "agfi-03a8031774fc4773f",
"Public": false,
"State": { "Code": "ready"},
"OwnerId": "6XXXXXXXXXXX",
"FpgaImageId": "afi-056fb17ddb8cedf37",
"CreateTime": "2017-10-02T16:39:17.000Z",
"Description": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin" }]
}
sudo fpga-load-local-image -S 0 -I agfi-03a8031774fc4773f
sudo fpga-describe-local-image -S 0
sudo sh
source /opt/Xilinx/SDx/2017.1.rte/setup.sh
./helloworld
sudo fpga-clear-local-image -S 0
FPGAs and Deep Learning
A chink in the GPU armor?
• GPUs are great for training, but what about inference?
• Throughput and latency: pick one?
– Using batches increases latency
– Using single samples degrades throughput
• Power and memory requirements
– Floating-point operations are power-hungry
– Floating-point weights need more DRAM, which is power-hungry too
• Neural networks can be implemented on FPGA
© HBO
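The throughput/latency tension can be illustrated with a toy cost model in Python. It assumes that processing a batch costs a fixed overhead plus a per-sample time; the function names and the 4 ms / 0.5 ms figures are hypothetical and not measurements from any device.

```python
def batch_latency_ms(batch_size: int, overhead_ms: float, per_sample_ms: float) -> float:
    """Time until every sample in the batch has its result (queueing ignored)."""
    return overhead_ms + batch_size * per_sample_ms


def throughput_per_s(batch_size: int, overhead_ms: float, per_sample_ms: float) -> float:
    """Samples completed per second when batches run back to back."""
    return 1000.0 * batch_size / batch_latency_ms(batch_size, overhead_ms, per_sample_ms)


# Hypothetical accelerator: 4 ms fixed cost per launch, 0.5 ms per sample.
OVERHEAD_MS, PER_SAMPLE_MS = 4.0, 0.5

for size in (1, 8, 64):
    latency = batch_latency_ms(size, OVERHEAD_MS, PER_SAMPLE_MS)
    rate = throughput_per_s(size, OVERHEAD_MS, PER_SAMPLE_MS)
    print(f"batch={size:3d}  latency={latency:5.1f} ms  throughput={rate:7.1f} samples/s")
```

Under these assumptions, a batch of 64 amortizes the fixed cost and raises throughput about eightfold over single samples, but each request waits 36 ms instead of 4.5 ms.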
Using custom logic to Multiply and Accumulate
Source: « FPGA Implementations of Neural Networks », Springer, 2006
Smaller weights → fewer gates, less data to load into the FPGA
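As a software analogy, the Python sketch below computes a neuron's weighted sum as a chain of multiply-accumulate (MAC) steps, then estimates how wide the accumulator must be for a given weight width. The helper names are hypothetical, and the width formula is a conservative bound for signed integers, not a statement about how any particular FPGA design sizes its datapath.

```python
import math
from typing import Sequence


def mac(weights: Sequence[int], inputs: Sequence[int], bias: int = 0) -> int:
    """Integer dot product computed as a chain of multiply-accumulate steps."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    acc = bias
    for w, x in zip(weights, inputs):
        acc += w * x  # one MAC: multiply, then add into the accumulator
    return acc


def accumulator_bits(weight_bits: int, input_bits: int, n_terms: int) -> int:
    """Conservative accumulator width for n_terms signed products (bias ignored)."""
    if n_terms < 1:
        raise ValueError("n_terms must be at least 1")
    return weight_bits + input_bits + math.ceil(math.log2(n_terms))


result = mac([1, -2, 3], [4, 5, 6])      # 4 - 10 + 18 = 12
wide = accumulator_bits(8, 8, 256)       # 8-bit weights, 8-bit inputs: 24 bits
narrow = accumulator_bits(2, 8, 256)     # 2-bit weights, 8-bit inputs: 18 bits
```

Narrower weights shrink both the multiplier and the accumulator, which is the "fewer gates" effect the slide refers to.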
Optimizing Deep Learning models for FPGAs
• Quantization: using integer weights
– 8/4/2-bit integers instead of 32-bit floats
– Reduces power consumption
– Simplifies the logic needed to implement the model
– Reduces memory usage
• Pruning: removing useless connections
– Increases computation speed
– Reduces memory usage
• Compression: encoding weights
– Reduces model size
On-chip SRAM becomes a viable option
→ More power-efficient than DRAM
→ Faster than off-chip DRAM
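The two first techniques can be sketched in a few lines of Python. This is a deliberately simple illustration: symmetric uniform quantization and magnitude-threshold pruning, which are stand-ins and not the trained quantization or pruning procedure of [Han, 2016]. The function names, the sample weights and the 0.05 threshold are hypothetical.

```python
from typing import List, Tuple


def quantize(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map float weights to signed integers of the given width, plus a scale factor."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed value")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


def prune(weights: List[float], threshold: float) -> List[float]:
    """Zero out connections whose magnitude is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def storage_bits(n_weights: int, bits_per_weight: int) -> int:
    """Raw storage for a dense weight vector."""
    return n_weights * bits_per_weight


weights = [0.4, -1.0, 0.02, 0.7, -0.01]      # hypothetical layer weights
q8, scale = quantize(weights, bits=8)         # [51, -127, 3, 89, -1]
restored = dequantize(q8, scale)
sparse = prune(weights, threshold=0.05)       # [0.4, -1.0, 0.0, 0.7, 0.0]
ratio = storage_bits(len(weights), 32) / storage_bits(len(weights), 8)  # 4.0
```

Going from 32-bit floats to 8-bit integers cuts dense storage by 4x before any pruning or encoding, and the per-weight rounding error stays within half a quantization step.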
Published results
[Han, 2016] Optimizing CNNs on CPU and GPU
• AlexNet 35x smaller, VGG-16 49x smaller
• 3x to 4x speedup, 3x to 7x more energy-efficient
• No loss of accuracy
[Han, 2017] Optimizing LSTM on Xilinx FPGA
• FPGA vs CPU: 43x faster, 40x more energy-efficient
• FPGA vs GPU: 3x faster, 11.5x more energy-efficient
[Nurvitadhi, 2017] Optimizing CNNs on Intel FPGA
• FPGA vs GPU: 60% faster, 2.3x more energy-efficient
• <1% loss of accuracy
Nvidia Hardware for Deep Learning
• Open architecture for DL inference accelerators on IoT devices
– Convolution Core – optimized high-performance convolution engine
– Single Data Processor – single-point lookup engine for activation functions
– Planar Data Processor – planar averaging engine for pooling
– Channel Data Processor – multi-channel averaging engine for normalization functions
– Dedicated Memory and Data Reshape Engines – memory-to-memory transformation acceleration for tensor reshape and copy operations
• Verilog model + test suite
• F1 instances are supported
http://nvdla.org/
https://github.com/nvdla/
Conclusion
• CPU, GPU, FPGA: the battle rages on
• As always, pick the right tool for the job
– Application requirements: performance, power, cost, etc.
– Time to market
– Skills
– The AWS marketplace: the solution may be just a few clicks away!
• AWS offers you many options, please explore them and give us feedback
Resources
https://aws.amazon.com/ec2/instance-types/f1
https://aws.amazon.com/ec2/instance-types/f1/partners/
https://github.com/aws/aws-fpga
[Han, 2016] « Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding » https://arxiv.org/abs/1510.00149
[Han, 2017] « ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA », Best Paper at FPGA’17 https://arxiv.org/abs/1612.00694
« Deep Learning Tutorial and Recent Trends », FPGA’17 http://isfpga.org/slides/D1_S1_Tutorial.pdf
[Nurvitadhi, 2017] « Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? », FPGA’17 http://jaewoong.org/pubs/fpga17-next-generation-dnns.pdf
Thank you!
http://aws.amazon.com/evangelists/julien-simon
@julsimon

More Related Content

ODP
FPGA on the Cloud
jtsagata
 
PDF
Running BSD on AWS
Julien SIMON
 
PPTX
Advanced Scheduling with Amazon ECS (September 2017)
Julien SIMON
 
PPTX
Deep Learning with Apache MXNet (September 2017)
Julien SIMON
 
PPTX
Picking the right AWS backend for your application (September 2017)
Julien SIMON
 
PPTX
ECS for Amazon Deep Learning and Amazon Machine Learning
Amanda Mackay (she/her)
 
PPTX
Build, train, and deploy Machine Learning models at scale (May 2018)
Julien SIMON
 
PDF
Deep Dive on Deep Learning (June 2018)
Julien SIMON
 
FPGA on the Cloud
jtsagata
 
Running BSD on AWS
Julien SIMON
 
Advanced Scheduling with Amazon ECS (September 2017)
Julien SIMON
 
Deep Learning with Apache MXNet (September 2017)
Julien SIMON
 
Picking the right AWS backend for your application (September 2017)
Julien SIMON
 
ECS for Amazon Deep Learning and Amazon Machine Learning
Amanda Mackay (she/her)
 
Build, train, and deploy Machine Learning models at scale (May 2018)
Julien SIMON
 
Deep Dive on Deep Learning (June 2018)
Julien SIMON
 

What's hot (10)

PPTX
Build, train, and deploy Machine Learning models at scale (May 2018)
Julien SIMON
 
PPTX
Machine Learning inference at the Edge
Julien SIMON
 
PDF
Deep Learning with AWS (November 2016)
Julien SIMON
 
PPTX
High Performance Computing (HPC) in cloud
Accubits Technologies
 
PDF
HPC on Azure for Reserach
JĂźrgen Ambrosi
 
PDF
Deep Dive on Amazon EC2 Instances (March 2017)
Julien SIMON
 
PDF
Machine Learning Inference at the Edge
Julien SIMON
 
PDF
AWS meetup「Apache Spark on EMR」
SmartNews, Inc.
 
PDF
deep learning in production cff 2017
Ari Kamlani
 
PDF
Deep Learning for Developers (October 2017)
Julien SIMON
 
Build, train, and deploy Machine Learning models at scale (May 2018)
Julien SIMON
 
Machine Learning inference at the Edge
Julien SIMON
 
Deep Learning with AWS (November 2016)
Julien SIMON
 
High Performance Computing (HPC) in cloud
Accubits Technologies
 
HPC on Azure for Reserach
JĂźrgen Ambrosi
 
Deep Dive on Amazon EC2 Instances (March 2017)
Julien SIMON
 
Machine Learning Inference at the Edge
Julien SIMON
 
AWS meetup「Apache Spark on EMR」
SmartNews, Inc.
 
deep learning in production cff 2017
Ari Kamlani
 
Deep Learning for Developers (October 2017)
Julien SIMON
 
Ad

Similar to FPGAs in the cloud? (October 2017) (20)

PPTX
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Cesar Maciel
 
PDF
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Databricks
 
PDF
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Facultad de InformĂĄtica UCM
 
PPTX
SDAccel Design Contest: SDAccel and F1 Instances
NECST Lab @ Politecnico di Milano
 
PDF
Deep learning with FPGA
Ayush Singh, MS
 
PPTX
QCT Ceph Solution - Design Consideration and Reference Architecture
Patrick McGarry
 
PPTX
QCT Ceph Solution - Design Consideration and Reference Architecture
Ceph Community
 
PDF
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PC Cluster Consortium
 
PDF
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Databricks
 
PDF
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
Ganesan Narayanasamy
 
PPTX
Introduction to HPC & Supercomputing in AI
Tyrone Systems
 
PDF
PCCC24(第24回PCクラスタシンポジウム):筑波大学計算科学研究センター テーマ2「スーパーコンピュータCygnus / Pegasus」
PC Cluster Consortium
 
PDF
FPGA Hardware Accelerator for Machine Learning
Dr. Swaminathan Kathirvel
 
PPTX
Introduction to DPDK
Kernel TLV
 
PDF
ODSA Sub-Project Launch
ODSA Workgroup
 
PDF
ODSA Sub-Project Launch
Netronome
 
PDF
OpenCAPI next generation accelerator
Ganesan Narayanasamy
 
PDF
Ceph Day Beijing - Ceph all-flash array design based on NUMA architecture
Ceph Community
 
PDF
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Danielle Womboldt
 
PDF
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Jim Dowling
 
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Cesar Maciel
 
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
Databricks
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Facultad de InformĂĄtica UCM
 
SDAccel Design Contest: SDAccel and F1 Instances
NECST Lab @ Politecnico di Milano
 
Deep learning with FPGA
Ayush Singh, MS
 
QCT Ceph Solution - Design Consideration and Reference Architecture
Patrick McGarry
 
QCT Ceph Solution - Design Consideration and Reference Architecture
Ceph Community
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PC Cluster Consortium
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Databricks
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
Ganesan Narayanasamy
 
Introduction to HPC & Supercomputing in AI
Tyrone Systems
 
PCCC24(第24回PCクラスタシンポジウム):筑波大学計算科学研究センター テーマ2「スーパーコンピュータCygnus / Pegasus」
PC Cluster Consortium
 
FPGA Hardware Accelerator for Machine Learning
Dr. Swaminathan Kathirvel
 
Introduction to DPDK
Kernel TLV
 
ODSA Sub-Project Launch
ODSA Workgroup
 
ODSA Sub-Project Launch
Netronome
 
OpenCAPI next generation accelerator
Ganesan Narayanasamy
 
Ceph Day Beijing - Ceph all-flash array design based on NUMA architecture
Ceph Community
 
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Danielle Womboldt
 
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Jim Dowling
 
Ad

More from Julien SIMON (20)

PDF
Trying to figure out MCP by actually building an app from scratch with open s...
Julien SIMON
 
PDF
Arcee AI - building and working with small language models (06/25)
Julien SIMON
 
PDF
deep_dive_multihead_latent_attention.pdf
Julien SIMON
 
PDF
Deep Dive: Model Distillation with DistillKit
Julien SIMON
 
PDF
Deep Dive: Parameter-Efficient Model Adaptation with LoRA and Spectrum
Julien SIMON
 
PDF
Building High-Quality Domain-Specific Models with Mergekit
Julien SIMON
 
PDF
Tailoring Small Language Models for Enterprise Use Cases
Julien SIMON
 
PDF
Tailoring Small Language Models for Enterprise Use Cases
Julien SIMON
 
PDF
Julien Simon - Deep Dive: Compiling Deep Learning Models
Julien SIMON
 
PDF
Tailoring Small Language Models for Enterprise Use Cases
Julien SIMON
 
PDF
Julien Simon - Deep Dive - Optimizing LLM Inference
Julien SIMON
 
PDF
Julien Simon - Deep Dive - Accelerating Models with Better Attention Layers
Julien SIMON
 
PDF
Julien Simon - Deep Dive - Quantizing LLMs
Julien SIMON
 
PDF
Julien Simon - Deep Dive - Model Merging
Julien SIMON
 
PDF
An introduction to computer vision with Hugging Face
Julien SIMON
 
PDF
Reinventing Deep Learning
 with Hugging Face Transformers
Julien SIMON
 
PDF
Building NLP applications with Transformers
Julien SIMON
 
PPTX
Building Machine Learning Models Automatically (June 2020)
Julien SIMON
 
PDF
Starting your AI/ML project right (May 2020)
Julien SIMON
 
PPTX
Scale Machine Learning from zero to millions of users (April 2020)
Julien SIMON
 
Trying to figure out MCP by actually building an app from scratch with open s...
Julien SIMON
 
Arcee AI - building and working with small language models (06/25)
Julien SIMON
 
deep_dive_multihead_latent_attention.pdf
Julien SIMON
 
Deep Dive: Model Distillation with DistillKit
Julien SIMON
 
Deep Dive: Parameter-Efficient Model Adaptation with LoRA and Spectrum
Julien SIMON
 
Building High-Quality Domain-Specific Models with Mergekit
Julien SIMON
 
Tailoring Small Language Models for Enterprise Use Cases
Julien SIMON
 
Tailoring Small Language Models for Enterprise Use Cases
Julien SIMON
 
Julien Simon - Deep Dive: Compiling Deep Learning Models
Julien SIMON
 
Tailoring Small Language Models for Enterprise Use Cases
Julien SIMON
 
Julien Simon - Deep Dive - Optimizing LLM Inference
Julien SIMON
 
Julien Simon - Deep Dive - Accelerating Models with Better Attention Layers
Julien SIMON
 
Julien Simon - Deep Dive - Quantizing LLMs
Julien SIMON
 
Julien Simon - Deep Dive - Model Merging
Julien SIMON
 
An introduction to computer vision with Hugging Face
Julien SIMON
 
Reinventing Deep Learning
 with Hugging Face Transformers
Julien SIMON
 
Building NLP applications with Transformers
Julien SIMON
 
Building Machine Learning Models Automatically (June 2020)
Julien SIMON
 
Starting your AI/ML project right (May 2020)
Julien SIMON
 
Scale Machine Learning from zero to millions of users (April 2020)
Julien SIMON
 

Recently uploaded (20)

PDF
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
PPTX
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
PDF
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
PDF
Orbitly Pitch Deck|A Mission-Driven Platform for Side Project Collaboration (...
zz41354899
 
PDF
Software Development Methodologies in 2025
KodekX
 
PDF
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
PDF
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
PPTX
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
PDF
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
PPTX
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
PDF
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
PDF
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PPTX
Agile Chennai 18-19 July 2025 Ideathon | AI Powered Microfinance Literacy Gui...
AgileNetwork
 
PPTX
IT Runs Better with ThousandEyes AI-driven Assurance
ThousandEyes
 
PDF
Peak of Data & AI Encore - Real-Time Insights & Scalable Editing with ArcGIS
Safe Software
 
PDF
A Strategic Analysis of the MVNO Wave in Emerging Markets.pdf
IPLOOK Networks
 
PDF
Oracle AI Vector Search- Getting Started and what's new in 2025- AIOUG Yatra ...
Sandesh Rao
 
PDF
SparkLabs Primer on Artificial Intelligence 2025
SparkLabs Group
 
PDF
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
Orbitly Pitch Deck|A Mission-Driven Platform for Side Project Collaboration (...
zz41354899
 
Software Development Methodologies in 2025
KodekX
 
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
The Future of Mobile Is Context-Aware—Are You Ready?
iProgrammer Solutions Private Limited
 
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
Agile Chennai 18-19 July 2025 Ideathon | AI Powered Microfinance Literacy Gui...
AgileNetwork
 
IT Runs Better with ThousandEyes AI-driven Assurance
ThousandEyes
 
Peak of Data & AI Encore - Real-Time Insights & Scalable Editing with ArcGIS
Safe Software
 
A Strategic Analysis of the MVNO Wave in Emerging Markets.pdf
IPLOOK Networks
 
Oracle AI Vector Search- Getting Started and what's new in 2025- AIOUG Yatra ...
Sandesh Rao
 
SparkLabs Primer on Artificial Intelligence 2025
SparkLabs Group
 
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 

FPGAs in the cloud? (October 2017)

  • 1. Š2017, Amazon Web Services, Inc. or its affiliates. All rights reserved FPGAs in the cloud? Julien Simon, Principal Evangelist, AI/ML @julsimon Velocity Conference, NYC, 04/10/2017
  • 2. Agenda • The case for non-CPU architectures • What is an FPGA? • Using FPGAs on AWS • Demo: running an FPGA image on AWS • FPGAs and Deep Learning • Resources
  • 3. The case for non-CPU architectures
  • 5. Powering AWS instances: Intel Xeon E7 v4 • 7.1 billion transistors – 456 mm2 (0.7 square inch) • General-purpose architecture – SISD with SIMD extension (AVX instruction set) • Best single-core performance • Low parallelism – 24 cores, 48 hyperthreads – Multi-threaded applications are hard to build – OS and librairies need to be thread-friendly • Thermal envelope: 168W •https://ark.intel.com/products/96900/Intel-Xeon-Processor-E7-8894-v4-60M-Cache-2_40-GHz
  • 6. Case study: Clemenson University 1.1 million vCPUs for Natural Language Processing Optimized cost thanks to Spot Instances https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/
  • 7. Moore’s winter is (probably) coming • ÂŤ I guess I see Moore’s Law dying here in the next decade or so, but that’s not surprising Âť, Gordon Moore, 2015 • Technology limits: a Skylake transistor is around 100 atoms across • New workloads require higher parallelism to achieve good performance – Genomics – Financial computing – Image and video processing – Deep Learning • The age of the GPU has come http://www.economist.com/technology-quarterly/2016-03-12/after-moores-law https://spectrum.ieee.org/computing/hardware/gordon-moore-the-man-whose-name-means-progress
  • 8. State of the art GPU: Nvidia V100 • 21.1 billion transistors - 815 mm2 (1.36 square inch) • Architecture optimized for floating point – SIMT (Single Instruction, Multiple Threads) • Massive parallelism – 5120 CUDA cores, 640 Tensor cores – CUDA programming model – Large, high-bandwidth off-chip memory (DRAM) • Thermal envelope: 250W https://www.nvidia.com/en-us/data-center/tesla-v100/ https://devblogs.nvidia.com/parallelforall/inside-volta/
  • 9. GPUs are not optimal for some applications • Power consumption and efficiency (TOPS/Watt) • Strict latency requirements • Other requirements – Custom data types, irregular parallelism, divergence • Building your own ASIC may solve this, but: – It’s a huge, costly and risky effort – ASICs can’t be reconfigured • Time for an FPGA renaissance?
  • 11. The FPGA • First commercial product by Xilink in 1985 • Field Programmable Gate Array • Not a CPU (although you could build one with it) • ÂŤ Lego Âť hardware: logic cells, lookup tables, DSP, I/O • Small amount of very fast on-chip memory • Build custom logic to accelerate your SW application
  • 13. Developing FPGA applications • Languages – VHDL, Verilog – OpenCL (C++) • Software tools – Design – Simulation – Synthesis – Routing • Hardware tools – Evaluation boards – Prototypes Expensive and hard to scale
  • 15. Amazon EC2 F1 Instances • Up to 8 Xilinx UltraScale Plus VU9P FPGAs • Each FPGA includes • Local 64 GB DDR4 ECC protected memory • Dedicated PCIe x16 connections • Up to 400Gbps bidirectional ring connection for high-speed streaming • Approximately 2.5 million logic elements, and approximately 6,800 DSP engines
  • 16. The FPGA Developer Amazon Machine Image (AMI) • Xilinx SDx 2017.1 – Free license for F1 FPGA development – Supports VHDL, Verilog, OpenCL • AWS FPGA SDK – Amazon FPGA Image (AFI) Management Tools – Linux drivers – Command line • AWS FPGA HDK – Design files and scripts required to build an AFI – Shell: platform logic to handle external peripherals, PCIe, DRAM, and interrupts • Run simulation, design, etc. on a C4 to save money! https://aws.amazon.com/marketplace/pp/B06VVYBLZZ https://github.com/aws/aws-fpga
  • 17. Amazon Machine Image (AMI) Amazon FPGA Image (AFI) F1 Instance CPU DDR-4 Attached Memory DDR-4 Attached Memory DDR-4 Attached Memory DDR-4 Attached Memory DDR-4 Attached Memory DDR-4 Attached Memory DDR-4 Attached Memory DDR-4 Attached Memory FPGA Link PCIe DDR Controllers FPGA Acceleration Using F1 instances AWS Marketplac e
  • 18. Case study: Edico Genome Highly Efficient • Algorithms Implemented in Hardware • Gate-Level Circuit Design • No Instruction Set Overhead Massively Parallel • Massively Parallel Circuits • Multiple Compute Engines • Rapid FPGA Reconfigurability Speeds Analysis of Whole Human Genomes from Hours to Minutes Unprecedented Low Cost for Compute and Compressed Storage http://www.edicogenome.com/ https://aws.amazon.com/marketplace/pp/B075JR57J1
  • 19. Case study: NGCodec • Provider of UHD video compression technology • Up to 50x faster vs. software H.265 • Higher quality video than x265 ‘veryslow’ preset – Same bit rate – 60+ frames per second • Lower latency between live stream and end viewing • Optimized cost https://ngcodec.com/markets-cloud-transcoding/ https://aws.amazon.com/marketplace/pp/B074W1FPKR
  • 20. Demo: OpenCL on F1 instance
  • 21. Building the OpenCL application git clone https://github.com/aws/aws-fpga.git cd aws-fpga source sdk_setup.sh source hdk_setup.sh source sdaccel_setup.sh source $XILINX_SDX/settings64.sh cd $SDACCEL_DIR/examples/xilinx/getting_started/host/helloworld_ocl/ make clean make check TARGETS=sw_emu DEVICES=$AWS_PLATFORM all make check TARGETS=hw_emu DEVICES=$AWS_PLATFORM all make check TARGETS=hw DEVICES=$AWS_PLATFORM all Creating Vivado project and starting FPGA synthesis … INFO: [XOCC 60-586] Created xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin Total elapsed time: 2h 31m 7s $(SDACCEL_DIR)/tools/create_sdaccel_afi.sh -xclbin=xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr- 2pr_4_0.xclbin -o=vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0 -s3_bucket=jsimon-fpga - s3_logs_key=logs -s3_dcp_key=dcp … Generated manifest file '17_10_02-163912_manifest.txt’ upload: ./17_10_02-163912_Developer_SDAccel_Kernel.tar to s3://jsimon-fpga/dcp/17_10_02- 163912_Developer_SDAccel_Kernel.tar17_10_02-163912_agfi_id.txt
  • 22. Building the AFI aws ec2 describe-fpga-images --fpga-image-id afi-056fb17ddb8cedf37 { "FpgaImages": [{ "UpdateTime": "2017-10-02T16:39:17.000Z", "Name": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin", "FpgaImageGlobalId": "agfi-03a8031774fc4773f", "Public": false, "State": { "Code": "pending"}, "OwnerId": "6XXXXXXXXXXX", "FpgaImageId": "afi-056fb17ddb8cedf37", "CreateTime": "2017-10-02T16:39:17.000Z", "Description": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin" }] }
  • 23. Loading the AFI and running the OpenCL application aws ec2 describe-fpga-images --fpga-image-id afi-056fb17ddb8cedf37 { "FpgaImages": [{ "UpdateTime": "2017-10-02T16:39:17.000Z", "Name": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin", "FpgaImageGlobalId": "agfi-03a8031774fc4773f", "Public": false, "State": { "Code": "ready"}, "OwnerId": "6XXXXXXXXXXX", "FpgaImageId": "afi-056fb17ddb8cedf37", "CreateTime": "2017-10-02T16:39:17.000Z", "Description": "xclbin/vector_addition.hw.xilinx_aws-vu9p-f1_4ddr-xpr-2pr_4_0.xclbin" }] } sudo fpga-load-local-image -S 0 -I agfi-03a8031774fc4773f sudo fpga-describe-local-image -S 0 sudo sh source /opt/Xilinx/SDx/2017.1.rte/setup.sh ./helloworld sudo fpga-clear-local-image -S 0
  • 24. FPGAs and Deep Learning
  • 25. A chink in the GPU armor? • GPUs are great for training, but what about inference? • Throughput and latency: pick one? – Using batches increases latency – Using single samples degrades throughput • Power and memory requirements – Floating-point operations are power-hungry – Floating-point weights need more DRAM, which is power-hungry too • Neural networks can be implemented on FPGA Š HBO
  • 26. Using custom logic to Multiply and Accumulate Source: « FPGA Implementations of Neural Networks », Springer, 2006 Smaller weights: fewer gates, less data to load into the FPGA
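To make the multiply-and-accumulate operation concrete, here is a minimal Python sketch of an integer MAC, plus a conservative estimate of how wide the accumulator must be. The function names and the bit-width bound are illustrative assumptions, not taken from the cited book; the point is only that narrower weights shrink both the multiplier and the accumulator.

```python
def mac(inputs, weights):
    """Multiply-accumulate: sum of element-wise integer products."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc += x * w
    return acc

def accumulator_bits(input_bits, weight_bits, n_terms):
    """Conservative signed accumulator width for n_terms products.

    A product of two signed values fits in input_bits + weight_bits bits;
    summing n_terms of them adds ceil(log2(n_terms)) carry bits.
    """
    carry_bits = (n_terms - 1).bit_length()  # equals ceil(log2(n_terms)) for n_terms >= 1
    return input_bits + weight_bits + carry_bits

if __name__ == "__main__":
    print(mac([3, -2, 7], [1, 4, -5]))
    print(accumulator_bits(8, 8, 256), accumulator_bits(8, 2, 256))
```

With 8-bit inputs and 256 terms, moving from 8-bit to 2-bit weights narrows the accumulator from 24 to 18 bits under this bound.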
  • 27. Optimizing Deep Learning models for FPGAs • Quantization: using integer weights – 8/4/2-bit integers instead of 32-bit floats – Reduces power consumption – Simplifies the logic needed to implement the model – Reduces memory usage • Pruning: removing useless connections – Increases computation speed – Reduces memory usage • Compression: encoding weights – Reduces model size On-chip SRAM becomes a viable option: more power-efficient than DRAM, faster than off-chip DRAM
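Quantization and pruning can be sketched in a few lines of Python. Note the assumptions: this uses plain symmetric linear quantization and a fixed magnitude threshold, which are simpler stand-ins rather than the trained quantization and retraining-based pruning of the papers cited on the next slide; the function names and sample weights are hypothetical.

```python
def quantize(weights, bits=8):
    """Map float weights to signed integers of the given width (symmetric, linear)."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0      # float value of one integer step
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]

def prune(weights, threshold):
    """Zero out connections whose magnitude is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

if __name__ == "__main__":
    weights = [0.4, -1.0, 0.25, 0.02]
    quantized, scale = quantize(weights, bits=8)
    print(quantized, dequantize(quantized, scale))
    print(prune(weights, threshold=0.1))
    print("storage ratio, 32-bit floats vs 8-bit integers:", 32 // 8)
```

Pruned zeros only save memory and computation when the weights are then stored and processed in a sparse form, which is where the compression step on the slide comes in.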
  • 28. Published results [Han, 2016] Optimizing CNNs on CPU and GPU • AlexNet 35x smaller, VGG-16 49x smaller • 3x to 4x speedup, 3x to 7x more energy-efficient • No loss of accuracy [Han, 2017] Optimizing LSTM on Xilinx FPGA • FPGA vs CPU: 43x faster, 40x more energy-efficient • FPGA vs GPU: 3x faster, 11.5x more energy-efficient [Nurvitadhi, 2017] Optimizing CNNs on Intel FPGA • FPGA vs GPU: 60% faster, 2.3x more energy-efficient • <1% loss of accuracy
  • 29. Nvidia Hardware for Deep Learning • Open architecture for DL inference accelerators on IoT devices – Convolution Core – optimized high-performance convolution engine – Single Data Processor – single-point lookup engine for activation functions – Planar Data Processor – planar averaging engine for pooling – Channel Data Processor – multi-channel averaging engine for normalization functions – Dedicated Memory and Data Reshape Engines – memory-to-memory transformation acceleration for tensor reshape and copy operations • Verilog model + test suite • F1 instances are supported http://nvdla.org/ https://github.com/nvdla/
  • 30. Conclusion • CPU, GPU, FPGA: the battle rages on • As always, pick the right tool for the job – Application requirements: performance, power, cost, etc. – Time to market – Skills – The AWS Marketplace: the solution may be just a few clicks away! • AWS offers you many options: please explore them and give us feedback
  • 31. Resources
      https://aws.amazon.com/ec2/instance-types/f1
      https://aws.amazon.com/ec2/instance-types/f1/partners/
      https://github.com/aws/aws-fpga
      [Han, 2016] « Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding » https://arxiv.org/abs/1510.00149
      [Han, 2017] « ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA », Best Paper at FPGA’17 https://arxiv.org/abs/1612.00694
      « Deep Learning Tutorial and Recent Trends », FPGA’17 http://isfpga.org/slides/D1_S1_Tutorial.pdf
      [Nurvitadhi, 2017] « Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? », FPGA’17 http://jaewoong.org/pubs/fpga17-next-generation-dnns.pdf

Editor's Notes

  • #2: OK
  • #3: OK
  • #4: OK
  • #5: Intel 4004: 15 November 1971, 4-bit architecture, 0.74 MHz, 2,300 transistors, 10 µm process
  • #6: XXX is this Skylake?
  • #10: OK
  • #11: OK
  • #15: Amazon EC2 instances: F1 family; FPGA Developer AMI; AWS SDK and HDK
  • #16: Available in the Northern Virginia, Oregon and Ireland regions
  • #18: An F1 instance can have any number of AFIs. An AFI can be loaded into the FPGA in less than 1 second.
  • #19: Edico Genome ported its genomics platform (DRAGEN) to F1, enabling real-time genomic analysis while cutting costs and dramatically scaling availability. This offering has the potential to be transformative for hospitals, academic institutions, drug developers and sequencing centers, as it enables them to analyze whole-genome data in under an hour, up to a 10x improvement over comparable state-of-the-art algorithms, both on-premises and in the cloud.
  • #20: Off-the-shelf images for customers; a revenue stream for FPGA developers