Proprietary and confidential. Do not distribute.
Nervana and the Future of Computing
26 April 2016
Arjun Bansal
Co-founder & VP Algorithms, Nervana
MAKING MACHINES SMARTER.™
AI on demand using Deep Learning

The Nervana Platform applies deep learning (DL) to image classification, object localization, video indexing, text analysis, and machine translation.
Image classification and video activity detection

Deep learning model:
• Trained on a public dataset1 of 13K videos in 100 categories
• Training was approximately 3x faster than a competitive framework
• Can be extended to scene and object detection, action similarity labeling, video retrieval, and anomaly detection

Potential applications:
• Activity detection and monitoring for security
• Automatic editing of captured moments from video cameras
• Facial recognition and image-based retrieval
• Sense-and-avoid systems for autonomous driving
• Baggage screening at airports and other public venues

1: UCF101 dataset: http://crcv.ucf.edu/data/UCF101.php
Demo: https://www.youtube.com/watch?v=ydnpgUOpdBw
Object localization and recognition
Speech to text
https://youtu.be/NaqZkV_fBIM
Question answering
Story:
Mary journeyed to Texas.
John went to Maryland.
Mary went to Iowa.
John travelled to Florida.

Question: Where is John located?
Answer: Florida
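The toy story above can be answered by tracking each person's most recently mentioned location. A minimal sketch in plain Python (this is deliberately not the neural QA model, whose architecture the deck does not describe; the rule-based parse is purely illustrative):

```python
# Toy illustration of the QA task above: answer "Where is X?" by
# tracking each person's last mentioned location. A learned model
# infers this behavior from data instead of hand-written rules.

MOVE_VERBS = {"journeyed", "went", "travelled", "traveled", "moved"}

def track_locations(story):
    """Return {person: last location} from sentences like 'John went to Florida.'"""
    locations = {}
    for sentence in story:
        words = sentence.rstrip(".").split()
        # Pattern assumed here: <Person> <verb> to <Place>
        if len(words) == 4 and words[1] in MOVE_VERBS and words[2] == "to":
            locations[words[0]] = words[3]
    return locations

def answer(story, person):
    return track_locations(story).get(person)

story = [
    "Mary journeyed to Texas.",
    "John went to Maryland.",
    "Mary went to Iowa.",
    "John travelled to Florida.",
]
print(answer(story, "John"))  # Florida
```

The point of the slide is that the network produces the same answer without any such hand-coded parsing.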
Reinforcement learning
Pong and Breakout demos:
https://youtu.be/KkIf0Ok5GCE
https://youtu.be/0ZlgrQS3krg
Application areas
Healthcare, Agriculture, Finance, Online Services, Automotive, Energy
Nervana is building the future of computing
The Economist, March 12, 2016
At the intersection of cloud computing, custom ASICs, and deep learning / AI.
nervana cloud

Data (images, text, tabular, speech, time series, video) → import → build → train → deploy, all in the cloud.
nervana neon

• Fastest library
• Model support
  • Models: ConvNet, RNN, LSTM, MLP, DQN, NTM
  • Domains: images, video, speech, text, time series
• Cloud integration
• Multiple backends: CPU, GPU, multiple GPUs, parameter server, (Xeon Phi), nervana TPU
• Optimized at assembler level

Running locally:
% python rnn.py                     # or: neon rnn.yaml

Running in nervana cloud:
% ncloud submit --py rnn.py         # or: --yaml rnn.yaml
% ncloud show <model_id>
% ncloud list
% ncloud deploy <model_id>
% ncloud predict <model_id> <data>  # or use the REST API
nervana tensor processing unit (TPU)

• 10-100x gain
• Architecture optimized for:
  • Unprecedented compute density (1 nervana engine = 10 GPUs = 200 CPUs)
  • Scalable distributed architecture
  • Memory near computation (a conventional CPU couples its control logic and ALU to a shared instruction-and-data memory; the nervana design keeps data memory next to the compute units)
  • Learning and inference
  • Exploiting limited precision
  • Power efficiency
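The "exploit limited precision" bullet can be made concrete with a sketch of low-precision arithmetic. The example below uses generic int8 linear quantization with NumPy; this is an assumption for illustration, not Nervana's actual number format (which the deck does not specify):

```python
import numpy as np

# Sketch: quantize two float vectors to int8, do the dot product in
# integer arithmetic, and rescale once at the end. The integer
# multiply-accumulate is the pattern low-precision hardware makes cheap.

def quantize(x, num_bits=8):
    """Linearly quantize a float array so that x ≈ q * scale."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.round(x / scale).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
a = rng.standard_normal(1024).astype(np.float32)
b = rng.standard_normal(1024).astype(np.float32)

qa, sa = quantize(a)
qb, sb = quantize(b)

# Integer accumulate, one float rescale at the end.
approx = (qa * qb).sum() * (sa * sb)
exact = float(a @ b)
# The error stays small relative to the norms of the inputs, which is
# why training and inference can tolerate reduced precision.
err = abs(approx - exact)
```

Deep learning tolerates this kind of rounding noise well, which is what makes the density and power gains on this slide achievable.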
Special purpose computation
1940s: Turing Bombe
Motivation: Automating
calculations, code breaking
General purpose computation
2000s: SoC
Motivation: reduce power
and cost, fungible
computing.
Enabled inexpensive
mobile devices.
Dennard scaling has ended
What business and
technology constraints do
we have now?
Many-core tiled architectures
From the Tile Processor Architecture Overview for the TILEPro Series: the iMesh interconnect provides high-bandwidth, extremely low-latency communication among tiles. The Tile Processor™ integrates external memory and I/O interfaces on chip and is a complete programmable multicore processor; external memory and I/O interfaces are connected to the tiles via the iMesh interconnect. Figure 2-1 shows the 64-core TILEPro64™ processor with details of an individual tile's structure. Each tile is a powerful, full-featured computing system that can independently run an entire operating system, such as Linux. Each tile implements a 32-bit integer processor engine utilizing a three-way Very Long Instruction Word (VLIW) architecture with its own program counter (PC), cache, and DMA subsystem. An individual tile is capable of executing up to three operations per cycle.
[Figure 2-1. Tile Processor Hardware Architecture: an 8×8 grid of tiles, each containing a Processor Engine, Cache Engine, and Switch Engine on six mesh networks (UDN, STN, MDN, IDN, TDN, CDN), with DDR2 memory controllers, PCIe, XAUI (10GbE), RGMII (GbE), and FlexI/O interfaces around the perimeter.]
2010s: multi-core, GPGPU
Motivation: increased performance without clock rate increases or smaller devices.
Requires changes in programming paradigm.
Examples: Tilera, NVIDIA GM204, Intel Xeon Phi (Knights Landing)
FPGA architectures
Altera Arria 10
Motivation: fine-grained parallelism, reconfigurable, lots of I/O, scalable.
Slow clock speed; lacks compute density for machine learning.
Neuromorphic architectures
IBM TrueNorth
[Excerpt from the TrueNorth paper, partially cropped: spike packets encode the address of the target axon and target core; spikes leaving the mesh are tagged with their row (for spikes traveling east-west) or column (for spikes traveling north-south) before being merged onto a shared link; storage includes parameters (31,232 bits), destination addresses (6,656 bits), and axonal delays (1,024 bits); TrueNorth's power density is 20 mW per cm², far below that of a typical CPU.]
Neural network parallelism
• Data parallelism: each of processors 1…n holds the full deep network and trains on its own data chunk (1…n); a parameter server coordinates the parameters.
• Model parallelism: the network itself is partitioned across processors.
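The data-parallel scheme on this slide can be sketched in a few lines of NumPy. The "model" here is plain least-squares regression, chosen only so the example stays self-contained; the averaging step is the part that corresponds to the parameter server:

```python
import numpy as np

# Data parallelism: each worker holds a full copy of the model, computes
# a gradient on its own data chunk, and a "parameter server" averages
# the gradients and applies the update to the shared parameters.

def gradient(w, X, y):
    """Mean-squared-error gradient for a linear model y ≈ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 4))
y = rng.standard_normal(64)

n_workers = 4
w = np.zeros(4)
for step in range(100):
    chunks = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [gradient(w, Xc, yc) for Xc, yc in chunks]  # one per worker
    w -= 0.05 * np.mean(grads, axis=0)                  # server update

# With equal-size chunks, the averaged gradient equals the full-batch
# gradient, so the parallel run reproduces the serial one.
w_serial = np.zeros(4)
for step in range(100):
    w_serial -= 0.05 * gradient(w_serial, X, y)
print(np.allclose(w, w_serial))  # True
```

In model parallelism, by contrast, the layers (or slices of layers) would be split across workers and activations, not gradients of a shared copy, would cross the interconnect.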
Existing computing topologies are lacking

[Diagram, built up over several slides: conventional servers, each with two CPUs, four GPUs behind PCIe switches, an SSD, and InfiniBand / 10GbE links; scaling out adds more such nodes, so GPUs on different nodes can only communicate through PCIe switches, CPUs, and the network.]
nervana compute topology
22
CPU
CPU
S
S
D
IB
10
G
S
S
D
IB
10
G
nn
n n
nn
nn
PCIE SW
PCIE SW
Distributed linear algebra and convolution
SUMMA distributed matrix multiply, C = A*B (Jim Demmel, CS267 lecture notes), for an n x n matmul on a P^(1/2) x P^(1/2) processor grid:
• C[i,j] is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor P_ij
• A[i,k] is an n/P^(1/2) x b submatrix of A
• B[k,j] is a b x n/P^(1/2) submatrix of B
• C[i,j] = C[i,j] + Σ_k A[i,k]*B[k,j], with the summation over submatrices
• The processor grid need not be square

From "Matrix multiplication on multidimensional torus networks" (Edgar Solomonik and James Demmel, Division of Computer Science, University of California at Berkeley): blocked matrix multiplication algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure. A generalized 'Split-Dimensional' version of Cannon's algorithm (SD-Cannon) has a higher-dimensional, bidirectional communication structure, useful for torus interconnects that can achieve more injection bandwidth than single-link bandwidth.
Summary
• Computers are tools for solving problems of their time
• Was: Coding, calculation, graphics, web
• Today: Learning and Inference on data
• Deep learning as a computational paradigm
• Custom architectures can do vastly better
More Related Content

PDF
Startup.Ml: Using neon for NLP and Localization Applications
Intel Nervana
 
PDF
Rethinking computation: A processor architecture for machine intelligence
Intel Nervana
 
PDF
Deep Learning at Scale
Intel Nervana
 
PDF
Urs Köster Presenting at RE-Work DL Summit in Boston
Intel Nervana
 
PDF
Introduction to Deep Learning and neon at Galvanize
Intel Nervana
 
PDF
ODSC West
Intel Nervana
 
PDF
Urs Köster - Convolutional and Recurrent Neural Networks
Intel Nervana
 
PPTX
Intel Nervana Artificial Intelligence Meetup 11/30/16
Intel Nervana
 
Startup.Ml: Using neon for NLP and Localization Applications
Intel Nervana
 
Rethinking computation: A processor architecture for machine intelligence
Intel Nervana
 
Deep Learning at Scale
Intel Nervana
 
Urs Köster Presenting at RE-Work DL Summit in Boston
Intel Nervana
 
Introduction to Deep Learning and neon at Galvanize
Intel Nervana
 
ODSC West
Intel Nervana
 
Urs Köster - Convolutional and Recurrent Neural Networks
Intel Nervana
 
Intel Nervana Artificial Intelligence Meetup 11/30/16
Intel Nervana
 

What's hot (20)

PDF
Introduction to deep learning @ Startup.ML by Andres Rodriguez
Intel Nervana
 
PPTX
Deep Learning for Robotics
Intel Nervana
 
PDF
Introduction to Deep Learning with Will Constable
Intel Nervana
 
PDF
NVIDIA 深度學習教育機構 (DLI): Approaches to object detection
NVIDIA Taiwan
 
PDF
Deep learning on spark
Satyendra Rana
 
PPTX
Nervana Systems
Nand Dalal
 
PDF
Using neon for pattern recognition in audio data
Intel Nervana
 
PDF
RE-Work Deep Learning Summit - September 2016
Intel Nervana
 
PDF
Intel Nervana Artificial Intelligence Meetup 1/31/17
Intel Nervana
 
PPTX
Squeezing Deep Learning Into Mobile Phones
Anirudh Koul
 
PDF
A Platform for Accelerating Machine Learning Applications
NVIDIA Taiwan
 
PDF
Large Scale Deep Learning with TensorFlow
Jen Aman
 
PDF
Improving Hardware Efficiency for DNN Applications
Chester Chen
 
PDF
Moving Toward Deep Learning Algorithms on HPCC Systems
HPCC Systems
 
PDF
NVIDIA深度學習教育機構 (DLI): Object detection with jetson
NVIDIA Taiwan
 
PDF
Introduction to Deep Learning (NVIDIA)
Rakuten Group, Inc.
 
PPTX
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
Data Con LA
 
PDF
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA Taiwan
 
PDF
Deep Learning Computer Build
PetteriTeikariPhD
 
PPTX
Deep learning on mobile
Anirudh Koul
 
Introduction to deep learning @ Startup.ML by Andres Rodriguez
Intel Nervana
 
Deep Learning for Robotics
Intel Nervana
 
Introduction to Deep Learning with Will Constable
Intel Nervana
 
NVIDIA 深度學習教育機構 (DLI): Approaches to object detection
NVIDIA Taiwan
 
Deep learning on spark
Satyendra Rana
 
Nervana Systems
Nand Dalal
 
Using neon for pattern recognition in audio data
Intel Nervana
 
RE-Work Deep Learning Summit - September 2016
Intel Nervana
 
Intel Nervana Artificial Intelligence Meetup 1/31/17
Intel Nervana
 
Squeezing Deep Learning Into Mobile Phones
Anirudh Koul
 
A Platform for Accelerating Machine Learning Applications
NVIDIA Taiwan
 
Large Scale Deep Learning with TensorFlow
Jen Aman
 
Improving Hardware Efficiency for DNN Applications
Chester Chen
 
Moving Toward Deep Learning Algorithms on HPCC Systems
HPCC Systems
 
NVIDIA深度學習教育機構 (DLI): Object detection with jetson
NVIDIA Taiwan
 
Introduction to Deep Learning (NVIDIA)
Rakuten Group, Inc.
 
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
Data Con LA
 
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA Taiwan
 
Deep Learning Computer Build
PetteriTeikariPhD
 
Deep learning on mobile
Anirudh Koul
 
Ad

Viewers also liked (14)

PDF
An Analysis of Convolution for Inference
Intel Nervana
 
PDF
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
Intel Nervana
 
PDF
Google I/O 2016 Highlights That You Should Know
Appinventiv
 
PDF
Video Activity Recognition and NLP Q&A Model Example
Intel Nervana
 
PDF
GPUDirect RDMA and Green Multi-GPU Architectures
inside-BigData.com
 
PDF
Anil Thomas - Object recognition
Intel Nervana
 
PPT
Region Of Interest Extraction
Gopi Krishnan Nambiar
 
PDF
High-Performance GPU Programming for Deep Learning
Intel Nervana
 
PDF
Evolution of Supermicro GPU Server Solution
NVIDIA Taiwan
 
PPTX
Deepcheck, 딥러닝 기반의 얼굴인식 출석체크
지운 배
 
PDF
Deep Learning for Computer Vision: Attention Models (UPC 2016)
Universitat Politècnica de Catalunya
 
PDF
Object Detection and Recognition
Intel Nervana
 
PDF
AWS CLOUD 2017 - AWS 신규 서비스를 통해 본 클라우드의 미래 (김봉환 솔루션즈 아키텍트)
Amazon Web Services Korea
 
PDF
Aeroprobing A.I. Drone with TX1
NVIDIA Taiwan
 
An Analysis of Convolution for Inference
Intel Nervana
 
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
Intel Nervana
 
Google I/O 2016 Highlights That You Should Know
Appinventiv
 
Video Activity Recognition and NLP Q&A Model Example
Intel Nervana
 
GPUDirect RDMA and Green Multi-GPU Architectures
inside-BigData.com
 
Anil Thomas - Object recognition
Intel Nervana
 
Region Of Interest Extraction
Gopi Krishnan Nambiar
 
High-Performance GPU Programming for Deep Learning
Intel Nervana
 
Evolution of Supermicro GPU Server Solution
NVIDIA Taiwan
 
Deepcheck, 딥러닝 기반의 얼굴인식 출석체크
지운 배
 
Deep Learning for Computer Vision: Attention Models (UPC 2016)
Universitat Politècnica de Catalunya
 
Object Detection and Recognition
Intel Nervana
 
AWS CLOUD 2017 - AWS 신규 서비스를 통해 본 클라우드의 미래 (김봉환 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Aeroprobing A.I. Drone with TX1
NVIDIA Taiwan
 
Ad

Similar to Nervana and the Future of Computing (20)

PDF
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision System
AI Frontiers
 
PDF
NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA Taiwan
 
PDF
“Using a Neural Processor for Always-sensing Cameras,” a Presentation from Ex...
Edge AI and Vision Alliance
 
PDF
Possibilities of generative models
Alison B. Lowndes
 
PDF
Nvidia at SEMICon, Munich
Alison B. Lowndes
 
PDF
PowerDRC/LVS 2.0 Overview
Alexander Grudanov
 
PDF
Gömülü Sistemlerde Derin Öğrenme Uygulamaları
Ferhat Kurt
 
PDF
Data Science Week 2016. NVIDIA. "Платформы и инструменты для реализации систе...
Newprolab
 
PDF
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Databricks
 
PDF
QuAI platform
Teddy Kuo
 
PDF
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
PAPIs.io
 
PPTX
Introduction to HPC & Supercomputing in AI
Tyrone Systems
 
PDF
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
PDF
Nextflow on Velsera: a data-driven journey from failure to cutting-edge
Jack DiGiovanna
 
PPTX
Innovation with ai at scale on the edge vt sept 2019 v0
Ganesan Narayanasamy
 
PPTX
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Tyrone Systems
 
PDF
“Making Edge AI Inference Programming Easier and Flexible,” a Presentation fr...
Edge AI and Vision Alliance
 
PDF
Scalable TensorFlow Deep Learning as a Service with Docker, OpenPOWER, and GPUs
Indrajit Poddar
 
PDF
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red_Hat_Storage
 
PDF
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
inside-BigData.com
 
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision System
AI Frontiers
 
NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA Taiwan
 
“Using a Neural Processor for Always-sensing Cameras,” a Presentation from Ex...
Edge AI and Vision Alliance
 
Possibilities of generative models
Alison B. Lowndes
 
Nvidia at SEMICon, Munich
Alison B. Lowndes
 
PowerDRC/LVS 2.0 Overview
Alexander Grudanov
 
Gömülü Sistemlerde Derin Öğrenme Uygulamaları
Ferhat Kurt
 
Data Science Week 2016. NVIDIA. "Платформы и инструменты для реализации систе...
Newprolab
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Databricks
 
QuAI platform
Teddy Kuo
 
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
PAPIs.io
 
Introduction to HPC & Supercomputing in AI
Tyrone Systems
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Nextflow on Velsera: a data-driven journey from failure to cutting-edge
Jack DiGiovanna
 
Innovation with ai at scale on the edge vt sept 2019 v0
Ganesan Narayanasamy
 
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Tyrone Systems
 
“Making Edge AI Inference Programming Easier and Flexible,” a Presentation fr...
Edge AI and Vision Alliance
 
Scalable TensorFlow Deep Learning as a Service with Docker, OpenPOWER, and GPUs
Indrajit Poddar
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red_Hat_Storage
 
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
inside-BigData.com
 

Recently uploaded (20)

PDF
Get More from Fiori Automation - What’s New, What Works, and What’s Next.pdf
Precisely
 
PDF
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
PPTX
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
PPTX
IT Runs Better with ThousandEyes AI-driven Assurance
ThousandEyes
 
PPTX
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
PDF
NewMind AI Weekly Chronicles - July'25 - Week IV
NewMind AI
 
PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PDF
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
PPTX
Simple and concise overview about Quantum computing..pptx
mughal641
 
PPTX
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
PDF
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
PPTX
Agile Chennai 18-19 July 2025 | Emerging patterns in Agentic AI by Bharani Su...
AgileNetwork
 
PDF
Brief History of Internet - Early Days of Internet
sutharharshit158
 
PDF
The Future of Artificial Intelligence (AI)
Mukul
 
PDF
How ETL Control Logic Keeps Your Pipelines Safe and Reliable.pdf
Stryv Solutions Pvt. Ltd.
 
PDF
Trying to figure out MCP by actually building an app from scratch with open s...
Julien SIMON
 
PDF
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
PDF
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
PPTX
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
PDF
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
Get More from Fiori Automation - What’s New, What Works, and What’s Next.pdf
Precisely
 
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
IT Runs Better with ThousandEyes AI-driven Assurance
ThousandEyes
 
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
NewMind AI Weekly Chronicles - July'25 - Week IV
NewMind AI
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
Simple and concise overview about Quantum computing..pptx
mughal641
 
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
Agile Chennai 18-19 July 2025 | Emerging patterns in Agentic AI by Bharani Su...
AgileNetwork
 
Brief History of Internet - Early Days of Internet
sutharharshit158
 
The Future of Artificial Intelligence (AI)
Mukul
 
How ETL Control Logic Keeps Your Pipelines Safe and Reliable.pdf
Stryv Solutions Pvt. Ltd.
 
Trying to figure out MCP by actually building an app from scratch with open s...
Julien SIMON
 
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 

Nervana and the Future of Computing

  • 1. Proprietary and confidential. Do not distribute. Nervana and the Future of Computing 26 April 2016 Arjun Bansal Co-founder & VP Algorithms, Nervana MAKING MACHINES SMARTER.™
  • 2. Proprietary and confidential. Do not distribute. AI on demand using Deep Learning 2 DL Image Classification Object Localization Video Indexing Text Analysis Nervana Platform Machine Translation
  • 3. Proprietary and confidential. Do not distribute. Image classification and video activity detection 3 Deep learning model Potential applications • Trained on a public dataset1 of 13K videos in 100 categories • Training was approximately 3 times faster than competitive framework • Can be extended to perform scene and object detection, action similarity labeling, video retrieval, anomaly detection 1: UCF101 dataset: http://crcv.ucf.edu/data/UCF101.php • Activity detection and monitoring for security • Automatic editing of captured moments from video camera • Facial recognition and image based retrieval • Sense and avoid systems for autonomous driving • Baggage screening at airports and other public venueshttps://www.youtube.com/watch?v=ydnpgUOpdBw
  • 4. Proprietary and confidential. Do not distribute.ner va na Object localization and recognition 4
  • 5. Proprietary and confidential. Do not distribute.ner va na Speech to text 5 https://youtu.be/NaqZkV_fBIM
  • 6. Proprietary and confidential. Do not distribute.ner va na Question answering 6 Stories Mary journeyed to Texas. John went to Maryland. Mary went to Iowa. John travelled to Florida. Questions Answers Where is John located? Florida
  • 7. Proprietary and confidential. Do not distribute.ner va na Reinforcement learning 7 Pong Breakout https://youtu.be/KkIf0Ok5GCEhttps://youtu.be/0ZlgrQS3krg
  • 8. Proprietary and confidential. Do not distribute.ner va na Application areas 8 Healthcare Agriculture Finance Online Services Automotive Energy
  • 9. Proprietary and confidential. Do not distribute. Nervana is building the future of computing 9 The Economist, March 12, 2016 Cloud Computing Custom ASIC Deep Learning / AI
  • 10. Proprietary and confidential. Do not distribute.ner va na nervana cloud 10 Images Text Tabular Speech Time series Video Data import trainbuild deploy Cloud
  • 11. Proprietary and confidential. Do not distribute.ner va na nervana neon 11
  • 12. Proprietary and confidential. Do not distribute.ner va na nervana neon 11 • Fastest library
  • 13. Proprietary and confidential. Do not distribute.ner va na nervana neon 11 • Fastest library
  • 14. Proprietary and confidential. Do not distribute.ner va na nervana neon 11 • Fastest library • Model support Models • Convnet • RNN, LSTM • MLP • DQN • NTM Domains • Images • Video • Speech • Text • Time series
  • 15. Proprietary and confidential. Do not distribute.ner va na Running locally: % python rnn.py # or neon rnn.yaml Running in nervana cloud: % ncloud submit —py rnn.py # or —yaml rnn.yaml % ncloud show <model_id> % ncloud list % ncloud deploy <model_id> % ncloud predict <model_id> <data> # or use REST api nervana neon 11 • Fastest library • Model support • Cloud integration
  • 16. Proprietary and confidential. Do not distribute.ner va na Backends • CPU • GPU • Multiple GPUs • Parameter server • (Xeon Phi) • nervana TPU nervana neon 11 • Fastest library • Model support • Cloud integration • Multiple backends
  • 17. Proprietary and confidential. Do not distribute.ner va na nervana neon 11 • Fastest library • Model support • Cloud integration • Multiple backends • Optimized at assembler level
  • 18. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12
  • 19. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density =1 nervana engine 10 GPUs 200 CPUs
  • 20. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density • Scalable distributed architecture
  • 21. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density • Scalable distributed architecture • Memory near computation Instruction and data memory Ctrl ALU CPU Data Memory Ctrl Nervana
  • 22. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density • Scalable distributed architecture • Memory near computation • Learning and inference
  • 23. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density • Scalable distributed architecture • Memory near computation • Learning and inference • Exploit limited precision
  • 24. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density • Scalable distributed architecture • Memory near computation • Learning and inference • Exploit limited precision • Power efficiency
  • 25. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • 10-100x gain • Architecture optimized for • Unprecedented compute density • Scalable distributed architecture • Memory near computation • Learning and inference • Exploit limited precision • Power efficiency
  • 26. Proprietary and confidential. Do not distribute.ner va na Special purpose computation 13 1940s: Turing Bombe Motivation: Automating calculations, code breaking
  • 27. Proprietary and confidential. Do not distribute.ner va na General purpose computation 14 2000s: SoC Motivation: reduce power and cost, fungible computing. Enabled inexpensive mobile devices.
  • 28. Proprietary and confidential. Do not distribute.ner va na Dennard scaling has ended 15 What business and technology constraints do we have now?
  • 29. Proprietary and confidential. Do not distribute.ner va na Many-core tiled architectures 16 Tile Processor Architecture Overview for the TILEPro Series 5 and provides high bandwidth and extremely low latency communication among tiles. The Tile Processor™ integrates external memory and I/O interfaces on chip and is a complete programma- ble multicore processor. External memory and I/O interfaces are connected to the tiles via the iMesh interconnect. Figure 2-1 shows the 64-core TILEPro64™ Tile processor with details of an individual tile’s structure. Figure 2-1. Tile Processor Hardware Architecture Each tile is a powerful, full-featured computing system that can independently run an entire oper- ating system, such as Linux. Each tile implements a 32-bit integer processor engine utilizing a three-way Very Long Instruction Word (VLIW) architecture with its own program counter (PC), cache, and DMA subsystem. An individual tile is capable of executing up to three operations per cycle. 
CDN TDN IDN MDN STN UDN 1,1 6,1 3,2 4,2 5,2 6,2 7,2 XAUI (10GbE) TDN IDN MDN STN UDN LEGEND: Tile Detail port2 msh0 port0 port2 port1 port0 DDR2 DDR2 port0 msh1 port2 port0 port1 port2 DDR2 DDR2 RGMII (GbE) XAUI (10GbE) FlexI/O PCIe (x4 lane) I2C, JTAG, HPI, UART, SPI ROM FlexI/O PCIe (x4 lane) port1 port1 msh3 msh2 port2 msh0 port0 port2 port1 port0 port0 msh1 port2 port0 port1 port2 port1 port1 msh3 msh2 gpio1 port0 port1 port1 port0 port1 xgbe0 gbe0 xgbe1 port0 gpio1 port1 port0 port1 gbe1 port0 port1 xgbe0 xgbe1 port0 0,3 1,3 2,3 3,3 4,3 5,3 6,3 7,3 0,5 1,5 2,5 3,5 4,5 5,5 6,5 7,5 0,6 1,6 2,6 3,6 4,6 5,6 6,6 7,6 0,7 1,7 2,7 3,7 4,7 5,7 6,7 7,7 7,00,0 1,0 2,0 3,0 4,0 5,0 6,0 0,1 1,1 6,12,1 3,1 4,1 5,1 7,1 3,2 4,2 5,2 6,2 7,20,2 1,2 2,2 0,4 1,4 2,4 3,4 4,4 5,4 6,4 7,4 port0 7,0 port0 pcie0 port0 port1 rshim0 gpio0 pcie1 port0 port1 pcie0 port0 port1 rshim0 gpio0 pcie1 port0 port1 Switch Engine Cache Engine Processor Engine U D N S T N M D N I D N T D N C D N U D N S T N M D N I D N T D N C D N STNSTN TDNTDN IDNIDN MDNMDN UDNUDN CDNCDN 2010s: multi-core, GPGPU Motivation: increased performance without clock rate increase or smaller devices. Requires changes in programming paradigm. NVIDIA GM204Tilera Intel Xeon Phi Knight’s landing
  • 30. FPGA architectures 17 Altera Arria 10. Motivation: fine-grained parallelism, reconfigurable, lots of I/O, scalable. Drawbacks: slow clock speed; lacks the compute density needed for machine learning.
  • 31. Neuromorphic architectures 18 IBM TrueNorth. [Fragment from the TrueNorth paper: spikes leaving the mesh are tagged with their row (for spikes traveling east-west) or column (for spikes traveling north-south) before being merged onto a shared link; per-core memory holds synaptic parameters (31,232 bits), destination addresses (6,656 bits), and axonal delays (1,024 bits). TrueNorth's power density is 20 mW per cm², far below that of a typical central processing unit.]
  • 32. Neural network parallelism 20. Data parallelism: the full deep network is replicated on each of processors 1…n; data chunks 1…n are distributed across the processors, and a parameter server coordinates the parameter updates. Model parallelism: the network itself is partitioned across processors.
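The data-parallel scheme above can be sketched in a few lines of NumPy. This is an illustrative simulation, not Nervana's implementation; `data_parallel_step` and `grad_fn` are hypothetical names, and the "processors" and "parameter server" are ordinary Python loops standing in for real workers.

```python
import numpy as np

def data_parallel_step(w, data_chunks, grad_fn, lr=0.01):
    """One synchronous data-parallel SGD step.

    Each 'processor' holds a full copy of the weights w and computes a
    gradient on its own data chunk; the 'parameter server' averages the
    gradients and broadcasts the updated weights back to all workers.
    """
    grads = [grad_fn(w, chunk) for chunk in data_chunks]  # workers, in parallel
    avg_grad = np.mean(grads, axis=0)                     # parameter server
    return w - lr * avg_grad                              # broadcast new weights

def grad_fn(w, chunk):
    # Toy least-squares gradient on one chunk: grad = 2 X^T (X w - y)
    X, y = chunk
    return 2 * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = X @ np.ones(4)                                 # true weights are all ones
chunks = [(X[i::4], y[i::4]) for i in range(4)]    # 4 data chunks, 4 "processors"
w = np.zeros(4)
for _ in range(200):
    w = data_parallel_step(w, chunks, grad_fn)
print(np.round(w, 3))
```

Because the gradients are averaged every step, this is mathematically equivalent to full-batch gradient descent on the concatenated data; the communication cost is one gradient exchange per step, which is what motivates the topology discussion on the following slides.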
  • 33.–39. Existing computing topologies are lacking 21 [Slides 33–39 progressively build one diagram: dual CPUs with SSD, InfiniBand, and 10 GbE adapters, connected through PCIe switches to groups of four GPUs; GPU-to-GPU and GPU-to-network traffic must traverse the PCIe switches and CPUs.]
  • 40. nervana compute topology 22 [Diagram: nervana processors (n) interconnected directly with one another, with the CPUs, SSDs, InfiniBand, and 10 GbE adapters attached via PCIe switches.]
  • 41. Distributed linear algebra and convolution 23 SUMMA distributed matrix multiply C = A*B on a P^(1/2) × P^(1/2) processor grid (Jim Demmel, CS267 lecture notes): C[i,j] is the n/P^(1/2) × n/P^(1/2) submatrix of C on processor P_ij; A[i,k] is an n/P^(1/2) × b submatrix of A; B[k,j] is a b × n/P^(1/2) submatrix of B; C[i,j] = C[i,j] + Σ_k A[i,k]*B[k,j], with the summation taken over submatrices. The processor grid need not be square. See also "Matrix multiplication on multidimensional torus networks," Edgar Solomonik and James Demmel, UC Berkeley: blocked algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure; the paper introduces a generalized "Split-Dimensional" version of Cannon's algorithm (SD-Cannon) with a higher-dimensional, bidirectional communication structure, useful on torus interconnects that can achieve more injection bandwidth than single-link bandwidth.
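The SUMMA update above can be simulated serially to check the block decomposition. This is a sketch of the algorithm's arithmetic only: the row/column broadcasts of the real distributed algorithm are replaced by direct slicing of shared arrays, and `summa` is an illustrative name.

```python
import numpy as np

def summa(A, B, grid=2, panel=2):
    """Simulate SUMMA C = A*B on a grid x grid processor mesh.

    At panel step k, the owning processor column broadcasts its panel
    A[i,k] along each processor row, the owning row broadcasts B[k,j]
    along each processor column, and every processor (i, j) accumulates
    a local rank-`panel` update C[i,j] += A[i,k] @ B[k,j].
    """
    n = A.shape[0]
    blk = n // grid                       # each processor owns a blk x blk tile
    C = np.zeros((n, n))
    for k in range(0, n, panel):          # loop over b-wide panels
        for i in range(grid):             # processor rows
            for j in range(grid):         # processor columns
                C[i*blk:(i+1)*blk, j*blk:(j+1)*blk] += (
                    A[i*blk:(i+1)*blk, k:k+panel]     # broadcast along row i
                    @ B[k:k+panel, j*blk:(j+1)*blk]   # broadcast along col j
                )
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
print(np.allclose(summa(A, B), A @ B))  # True
```

Because each step moves only a thin panel of A and B, per-processor communication volume is O(n²/P^(1/2)) rather than O(n²), which is why the deck ties matrix-multiply performance to interconnect topology.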
  • 42. Summary 24 • Computers are tools for solving the problems of their time • Was: coding, calculation, graphics, the web • Today: learning and inference on data • Deep learning is a computational paradigm • A custom architecture can do vastly better