Proprietary and confidential. Do not distribute.
Nervana and the Future of Computing
26 April 2016
Arjun Bansal
Co-founder & VP Algorithms, Nervana
MAKING MACHINES SMARTER.™
AI on demand using Deep Learning

The Nervana Platform applies deep learning (DL) to image classification, object localization, video indexing, text analysis, and machine translation.
Image classification and video activity detection

Deep learning model:
• Trained on a public dataset1 of 13K videos in 100 categories
• Training was approximately 3x faster than a competitive framework
• Can be extended to scene and object detection, action similarity labeling, video retrieval, and anomaly detection

Potential applications:
• Activity detection and monitoring for security
• Automatic editing of captured moments from video cameras
• Facial recognition and image-based retrieval
• Sense-and-avoid systems for autonomous driving
• Baggage screening at airports and other public venues

1: UCF101 dataset: http://crcv.ucf.edu/data/UCF101.php
Demo: https://www.youtube.com/watch?v=ydnpgUOpdBw
Object localization and recognition
Speech to text
https://youtu.be/NaqZkV_fBIM
Question answering
Story:
Mary journeyed to Texas.
John went to Maryland.
Mary went to Iowa.
John travelled to Florida.

Question: Where is John located?
Answer: Florida
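The toy story above can be answered by tracking each person's most recently mentioned location. A minimal sketch in plain Python (this is deliberately not the neural QA model, whose architecture the deck does not describe; the rule-based parse is purely illustrative):

```python
# Toy illustration of the QA task above: answer "Where is X?" by
# tracking each person's last mentioned location. A learned model
# infers this behavior from data instead of hand-written rules.

MOVE_VERBS = {"journeyed", "went", "travelled", "traveled", "moved"}

def track_locations(story):
    """Return {person: last location} from sentences like 'John went to Florida.'"""
    locations = {}
    for sentence in story:
        words = sentence.rstrip(".").split()
        # Pattern assumed here: <Person> <verb> to <Place>
        if len(words) == 4 and words[1] in MOVE_VERBS and words[2] == "to":
            locations[words[0]] = words[3]
    return locations

def answer(story, person):
    return track_locations(story).get(person)

story = [
    "Mary journeyed to Texas.",
    "John went to Maryland.",
    "Mary went to Iowa.",
    "John travelled to Florida.",
]
print(answer(story, "John"))  # Florida
```

The point of the slide is that the network produces the same answer without any such hand-coded parsing.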
Reinforcement learning
Pong and Breakout demos:
https://youtu.be/KkIf0Ok5GCE
https://youtu.be/0ZlgrQS3krg
Application areas
Healthcare, Agriculture, Finance, Online Services, Automotive, Energy
Nervana is building the future of computing
The Economist, March 12, 2016
At the intersection of cloud computing, custom ASICs, and deep learning / AI.
nervana cloud

Data (images, text, tabular, speech, time series, video) → import → build → train → deploy, all in the cloud.
nervana neon

• Fastest library
• Model support
  • Models: ConvNet, RNN, LSTM, MLP, DQN, NTM
  • Domains: images, video, speech, text, time series
• Cloud integration
• Multiple backends: CPU, GPU, multiple GPUs, parameter server, (Xeon Phi), nervana TPU
• Optimized at assembler level

Running locally:
% python rnn.py                     # or: neon rnn.yaml

Running in nervana cloud:
% ncloud submit --py rnn.py         # or: --yaml rnn.yaml
% ncloud show <model_id>
% ncloud list
% ncloud deploy <model_id>
% ncloud predict <model_id> <data>  # or use the REST API
nervana tensor processing unit (TPU)

• 10-100x gain
• Architecture optimized for:
  • Unprecedented compute density (1 nervana engine = 10 GPUs = 200 CPUs)
  • Scalable distributed architecture
  • Memory near computation (a conventional CPU couples its control logic and ALU to a shared instruction-and-data memory; the nervana design keeps data memory next to the compute units)
  • Learning and inference
  • Exploiting limited precision
  • Power efficiency
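The "exploit limited precision" bullet can be made concrete with a sketch of low-precision arithmetic. The example below uses generic int8 linear quantization with NumPy; this is an assumption for illustration, not Nervana's actual number format (which the deck does not specify):

```python
import numpy as np

# Sketch: quantize two float vectors to int8, do the dot product in
# integer arithmetic, and rescale once at the end. The integer
# multiply-accumulate is the pattern low-precision hardware makes cheap.

def quantize(x, num_bits=8):
    """Linearly quantize a float array so that x ≈ q * scale."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.round(x / scale).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
a = rng.standard_normal(1024).astype(np.float32)
b = rng.standard_normal(1024).astype(np.float32)

qa, sa = quantize(a)
qb, sb = quantize(b)

# Integer accumulate, one float rescale at the end.
approx = (qa * qb).sum() * (sa * sb)
exact = float(a @ b)
# The error stays small relative to the norms of the inputs, which is
# why training and inference can tolerate reduced precision.
err = abs(approx - exact)
```

Deep learning tolerates this kind of rounding noise well, which is what makes the density and power gains on this slide achievable.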
Special purpose computation
1940s: Turing Bombe
Motivation: Automating
calculations, code breaking
General purpose computation
2000s: SoC
Motivation: reduce power
and cost, fungible
computing.
Enabled inexpensive
mobile devices.
Dennard scaling has ended
What business and
technology constraints do
we have now?
Many-core tiled architectures
From the Tile Processor Architecture Overview for the TILEPro Series: the iMesh interconnect provides high-bandwidth, extremely low-latency communication among tiles. The Tile Processor™ integrates external memory and I/O interfaces on chip and is a complete programmable multicore processor; external memory and I/O interfaces are connected to the tiles via the iMesh interconnect. Figure 2-1 shows the 64-core TILEPro64™ processor with details of an individual tile's structure. Each tile is a powerful, full-featured computing system that can independently run an entire operating system, such as Linux. Each tile implements a 32-bit integer processor engine utilizing a three-way Very Long Instruction Word (VLIW) architecture with its own program counter (PC), cache, and DMA subsystem. An individual tile is capable of executing up to three operations per cycle.
[Figure 2-1. Tile Processor Hardware Architecture: an 8×8 grid of tiles, each containing a Processor Engine, Cache Engine, and Switch Engine on six mesh networks (UDN, STN, MDN, IDN, TDN, CDN), with DDR2 memory controllers, PCIe, XAUI (10GbE), RGMII (GbE), and FlexI/O interfaces around the perimeter.]
2010s: multi-core, GPGPU
Motivation: increased performance without clock rate increases or smaller devices.
Requires changes in programming paradigm.
Examples: Tilera, NVIDIA GM204, Intel Xeon Phi (Knights Landing)
FPGA architectures
Altera Arria 10
Motivation: fine-grained parallelism, reconfigurable, lots of I/O, scalable.
Slow clock speed; lacks compute density for machine learning.
Neuromorphic architectures
IBM TrueNorth
[Excerpt from the TrueNorth paper, partially cropped: spike packets encode the address of the target axon and target core; spikes leaving the mesh are tagged with their row (for spikes traveling east-west) or column (for spikes traveling north-south) before being merged onto a shared link; storage includes parameters (31,232 bits), destination addresses (6,656 bits), and axonal delays (1,024 bits); TrueNorth's power density is 20 mW per cm², far below that of a typical CPU.]
Neural network parallelism
• Data parallelism: each of processors 1…n holds the full deep network and trains on its own data chunk (1…n); a parameter server coordinates the parameters.
• Model parallelism: the network itself is partitioned across processors.
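The data-parallel scheme on this slide can be sketched in a few lines of NumPy. The "model" here is plain least-squares regression, chosen only so the example stays self-contained; the averaging step is the part that corresponds to the parameter server:

```python
import numpy as np

# Data parallelism: each worker holds a full copy of the model, computes
# a gradient on its own data chunk, and a "parameter server" averages
# the gradients and applies the update to the shared parameters.

def gradient(w, X, y):
    """Mean-squared-error gradient for a linear model y ≈ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 4))
y = rng.standard_normal(64)

n_workers = 4
w = np.zeros(4)
for step in range(100):
    chunks = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [gradient(w, Xc, yc) for Xc, yc in chunks]  # one per worker
    w -= 0.05 * np.mean(grads, axis=0)                  # server update

# With equal-size chunks, the averaged gradient equals the full-batch
# gradient, so the parallel run reproduces the serial one.
w_serial = np.zeros(4)
for step in range(100):
    w_serial -= 0.05 * gradient(w_serial, X, y)
print(np.allclose(w, w_serial))  # True
```

In model parallelism, by contrast, the layers (or slices of layers) would be split across workers and activations, not gradients of a shared copy, would cross the interconnect.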
Existing computing topologies are lacking

[Diagram, built up over several slides: conventional servers, each with two CPUs, four GPUs behind PCIe switches, an SSD, and InfiniBand / 10GbE links; scaling out adds more such nodes, so GPUs on different nodes can only communicate through PCIe switches, CPUs, and the network.]
nervana compute topology
22
CPU
CPU
S
S
D
IB
10
G
S
S
D
IB
10
G
nn
n n
nn
nn
PCIE SW
PCIE SW
Distributed linear algebra and convolution
SUMMA distributed matrix multiply, C = A*B (Jim Demmel, CS267 lecture notes), for an n x n matmul on a P^(1/2) x P^(1/2) processor grid:
• C[i,j] is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor P_ij
• A[i,k] is an n/P^(1/2) x b submatrix of A
• B[k,j] is a b x n/P^(1/2) submatrix of B
• C[i,j] = C[i,j] + Σ_k A[i,k]*B[k,j], with the summation over submatrices
• The processor grid need not be square

From "Matrix multiplication on multidimensional torus networks" (Edgar Solomonik and James Demmel, Division of Computer Science, University of California at Berkeley): blocked matrix multiplication algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure. A generalized 'Split-Dimensional' version of Cannon's algorithm (SD-Cannon) has a higher-dimensional, bidirectional communication structure, useful for torus interconnects that can achieve more injection bandwidth than single-link bandwidth.
Summary
• Computers are tools for solving problems of their time
• Was: Coding, calculation, graphics, web
• Today: Learning and Inference on data
• Deep learning as a computational paradigm
• Custom architectures can do vastly better
More Related Content

PDF
Startup.Ml: Using neon for NLP and Localization Applications
Intel Nervana
 
PDF
Rethinking computation: A processor architecture for machine intelligence
Intel Nervana
 
PDF
Deep Learning at Scale
Intel Nervana
 
PDF
Urs Köster Presenting at RE-Work DL Summit in Boston
Intel Nervana
 
PDF
Introduction to Deep Learning and neon at Galvanize
Intel Nervana
 
PDF
ODSC West
Intel Nervana
 
PDF
Urs Köster - Convolutional and Recurrent Neural Networks
Intel Nervana
 
PPTX
Intel Nervana Artificial Intelligence Meetup 11/30/16
Intel Nervana
 
Startup.Ml: Using neon for NLP and Localization Applications
Intel Nervana
 
Rethinking computation: A processor architecture for machine intelligence
Intel Nervana
 
Deep Learning at Scale
Intel Nervana
 
Urs Köster Presenting at RE-Work DL Summit in Boston
Intel Nervana
 
Introduction to Deep Learning and neon at Galvanize
Intel Nervana
 
ODSC West
Intel Nervana
 
Urs Köster - Convolutional and Recurrent Neural Networks
Intel Nervana
 
Intel Nervana Artificial Intelligence Meetup 11/30/16
Intel Nervana
 

What's hot (20)

PDF
Introduction to deep learning @ Startup.ML by Andres Rodriguez
Intel Nervana
 
PPTX
Deep Learning for Robotics
Intel Nervana
 
PDF
Introduction to Deep Learning with Will Constable
Intel Nervana
 
PDF
NVIDIA 深度學習教育機構 (DLI): Approaches to object detection
NVIDIA Taiwan
 
PDF
Deep learning on spark
Satyendra Rana
 
PPTX
Nervana Systems
Nand Dalal
 
PDF
Using neon for pattern recognition in audio data
Intel Nervana
 
PDF
RE-Work Deep Learning Summit - September 2016
Intel Nervana
 
PDF
Intel Nervana Artificial Intelligence Meetup 1/31/17
Intel Nervana
 
PPTX
Squeezing Deep Learning Into Mobile Phones
Anirudh Koul
 
PDF
A Platform for Accelerating Machine Learning Applications
NVIDIA Taiwan
 
PDF
Large Scale Deep Learning with TensorFlow
Jen Aman
 
PDF
Improving Hardware Efficiency for DNN Applications
Chester Chen
 
PDF
Moving Toward Deep Learning Algorithms on HPCC Systems
HPCC Systems
 
PDF
NVIDIA深度學習教育機構 (DLI): Object detection with jetson
NVIDIA Taiwan
 
PDF
Introduction to Deep Learning (NVIDIA)
Rakuten Group, Inc.
 
PPTX
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
Data Con LA
 
PDF
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA Taiwan
 
PDF
Deep Learning Computer Build
PetteriTeikariPhD
 
PPTX
Deep learning on mobile
Anirudh Koul
 
Introduction to deep learning @ Startup.ML by Andres Rodriguez
Intel Nervana
 
Deep Learning for Robotics
Intel Nervana
 
Introduction to Deep Learning with Will Constable
Intel Nervana
 
NVIDIA 深度學習教育機構 (DLI): Approaches to object detection
NVIDIA Taiwan
 
Deep learning on spark
Satyendra Rana
 
Nervana Systems
Nand Dalal
 
Using neon for pattern recognition in audio data
Intel Nervana
 
RE-Work Deep Learning Summit - September 2016
Intel Nervana
 
Intel Nervana Artificial Intelligence Meetup 1/31/17
Intel Nervana
 
Squeezing Deep Learning Into Mobile Phones
Anirudh Koul
 
A Platform for Accelerating Machine Learning Applications
NVIDIA Taiwan
 
Large Scale Deep Learning with TensorFlow
Jen Aman
 
Improving Hardware Efficiency for DNN Applications
Chester Chen
 
Moving Toward Deep Learning Algorithms on HPCC Systems
HPCC Systems
 
NVIDIA深度學習教育機構 (DLI): Object detection with jetson
NVIDIA Taiwan
 
Introduction to Deep Learning (NVIDIA)
Rakuten Group, Inc.
 
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
Data Con LA
 
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA Taiwan
 
Deep Learning Computer Build
PetteriTeikariPhD
 
Deep learning on mobile
Anirudh Koul
 
Ad

Viewers also liked (14)

PDF
An Analysis of Convolution for Inference
Intel Nervana
 
PDF
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
Intel Nervana
 
PDF
Google I/O 2016 Highlights That You Should Know
Appinventiv
 
PDF
Video Activity Recognition and NLP Q&A Model Example
Intel Nervana
 
PDF
GPUDirect RDMA and Green Multi-GPU Architectures
inside-BigData.com
 
PDF
Anil Thomas - Object recognition
Intel Nervana
 
PPT
Region Of Interest Extraction
Gopi Krishnan Nambiar
 
PDF
High-Performance GPU Programming for Deep Learning
Intel Nervana
 
PDF
Evolution of Supermicro GPU Server Solution
NVIDIA Taiwan
 
PPTX
Deepcheck, 딥러닝 기반의 얼굴인식 출석체크
지운 배
 
PDF
Deep Learning for Computer Vision: Attention Models (UPC 2016)
Universitat Politècnica de Catalunya
 
PDF
Object Detection and Recognition
Intel Nervana
 
PDF
AWS CLOUD 2017 - AWS 신규 서비스를 통해 본 클라우드의 미래 (김봉환 솔루션즈 아키텍트)
Amazon Web Services Korea
 
PDF
Aeroprobing A.I. Drone with TX1
NVIDIA Taiwan
 
An Analysis of Convolution for Inference
Intel Nervana
 
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
Intel Nervana
 
Google I/O 2016 Highlights That You Should Know
Appinventiv
 
Video Activity Recognition and NLP Q&A Model Example
Intel Nervana
 
GPUDirect RDMA and Green Multi-GPU Architectures
inside-BigData.com
 
Anil Thomas - Object recognition
Intel Nervana
 
Region Of Interest Extraction
Gopi Krishnan Nambiar
 
High-Performance GPU Programming for Deep Learning
Intel Nervana
 
Evolution of Supermicro GPU Server Solution
NVIDIA Taiwan
 
Deepcheck, 딥러닝 기반의 얼굴인식 출석체크
지운 배
 
Deep Learning for Computer Vision: Attention Models (UPC 2016)
Universitat Politècnica de Catalunya
 
Object Detection and Recognition
Intel Nervana
 
AWS CLOUD 2017 - AWS 신규 서비스를 통해 본 클라우드의 미래 (김봉환 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Aeroprobing A.I. Drone with TX1
NVIDIA Taiwan
 
Ad

Similar to Nervana and the Future of Computing (20)

PDF
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision System
AI Frontiers
 
PDF
NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA Taiwan
 
PDF
“Using a Neural Processor for Always-sensing Cameras,” a Presentation from Ex...
Edge AI and Vision Alliance
 
PDF
Possibilities of generative models
Alison B. Lowndes
 
PDF
Nvidia at SEMICon, Munich
Alison B. Lowndes
 
PDF
PowerDRC/LVS 2.0 Overview
Alexander Grudanov
 
PDF
Gömülü Sistemlerde Derin Öğrenme Uygulamaları
Ferhat Kurt
 
PDF
Data Science Week 2016. NVIDIA. "Платформы и инструменты для реализации систе...
Newprolab
 
PDF
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Databricks
 
PDF
QuAI platform
Teddy Kuo
 
PDF
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
PAPIs.io
 
PPTX
Introduction to HPC & Supercomputing in AI
Tyrone Systems
 
PDF
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
PDF
Nextflow on Velsera: a data-driven journey from failure to cutting-edge
Jack DiGiovanna
 
PPTX
Innovation with ai at scale on the edge vt sept 2019 v0
Ganesan Narayanasamy
 
PPTX
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Tyrone Systems
 
PDF
“Making Edge AI Inference Programming Easier and Flexible,” a Presentation fr...
Edge AI and Vision Alliance
 
PDF
Scalable TensorFlow Deep Learning as a Service with Docker, OpenPOWER, and GPUs
Indrajit Poddar
 
PDF
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red_Hat_Storage
 
PDF
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
inside-BigData.com
 
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision System
AI Frontiers
 
NVIDIA DGX-1 超級電腦與人工智慧及深度學習
NVIDIA Taiwan
 
“Using a Neural Processor for Always-sensing Cameras,” a Presentation from Ex...
Edge AI and Vision Alliance
 
Possibilities of generative models
Alison B. Lowndes
 
Nvidia at SEMICon, Munich
Alison B. Lowndes
 
PowerDRC/LVS 2.0 Overview
Alexander Grudanov
 
Gömülü Sistemlerde Derin Öğrenme Uygulamaları
Ferhat Kurt
 
Data Science Week 2016. NVIDIA. "Платформы и инструменты для реализации систе...
Newprolab
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Databricks
 
QuAI platform
Teddy Kuo
 
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
PAPIs.io
 
Introduction to HPC & Supercomputing in AI
Tyrone Systems
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Nextflow on Velsera: a data-driven journey from failure to cutting-edge
Jack DiGiovanna
 
Innovation with ai at scale on the edge vt sept 2019 v0
Ganesan Narayanasamy
 
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Tyrone Systems
 
“Making Edge AI Inference Programming Easier and Flexible,” a Presentation fr...
Edge AI and Vision Alliance
 
Scalable TensorFlow Deep Learning as a Service with Docker, OpenPOWER, and GPUs
Indrajit Poddar
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red_Hat_Storage
 
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
inside-BigData.com
 

Recently uploaded (20)

PDF
Get More from Fiori Automation - What’s New, What Works, and What’s Next.pdf
Precisely
 
PDF
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
PPTX
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
PPTX
IT Runs Better with ThousandEyes AI-driven Assurance
ThousandEyes
 
PPTX
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
PDF
NewMind AI Weekly Chronicles - July'25 - Week IV
NewMind AI
 
PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PDF
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
PPTX
Simple and concise overview about Quantum computing..pptx
mughal641
 
PPTX
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
PDF
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
PPTX
Agile Chennai 18-19 July 2025 | Emerging patterns in Agentic AI by Bharani Su...
AgileNetwork
 
PDF
Brief History of Internet - Early Days of Internet
sutharharshit158
 
PDF
The Future of Artificial Intelligence (AI)
Mukul
 
PDF
How ETL Control Logic Keeps Your Pipelines Safe and Reliable.pdf
Stryv Solutions Pvt. Ltd.
 
PDF
Trying to figure out MCP by actually building an app from scratch with open s...
Julien SIMON
 
PDF
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
PDF
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
PPTX
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
PDF
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 
Get More from Fiori Automation - What’s New, What Works, and What’s Next.pdf
Precisely
 
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
IT Runs Better with ThousandEyes AI-driven Assurance
ThousandEyes
 
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
NewMind AI Weekly Chronicles - July'25 - Week IV
NewMind AI
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
Simple and concise overview about Quantum computing..pptx
mughal641
 
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
Agile Chennai 18-19 July 2025 | Emerging patterns in Agentic AI by Bharani Su...
AgileNetwork
 
Brief History of Internet - Early Days of Internet
sutharharshit158
 
The Future of Artificial Intelligence (AI)
Mukul
 
How ETL Control Logic Keeps Your Pipelines Safe and Reliable.pdf
Stryv Solutions Pvt. Ltd.
 
Trying to figure out MCP by actually building an app from scratch with open s...
Julien SIMON
 
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
Introduction to Flutter by Ayush Desai.pptx
ayushdesai204
 
Unlocking the Future- AI Agents Meet Oracle Database 23ai - AIOUG Yatra 2025.pdf
Sandesh Rao
 

Nervana and the Future of Computing

  • 1. Proprietary and confidential. Do not distribute. Nervana and the Future of Computing 26 April 2016 Arjun Bansal Co-founder & VP Algorithms, Nervana MAKING MACHINES SMARTER.™
  • 2. Proprietary and confidential. Do not distribute. AI on demand using Deep Learning 2 DL Image Classification Object Localization Video Indexing Text Analysis Nervana Platform Machine Translation
  • 3. Proprietary and confidential. Do not distribute. Image classification and video activity detection 3 Deep learning model Potential applications • Trained on a public dataset1 of 13K videos in 100 categories • Training was approximately 3 times faster than competitive framework • Can be extended to perform scene and object detection, action similarity labeling, video retrieval, anomaly detection 1: UCF101 dataset: http://crcv.ucf.edu/data/UCF101.php • Activity detection and monitoring for security • Automatic editing of captured moments from video camera • Facial recognition and image based retrieval • Sense and avoid systems for autonomous driving • Baggage screening at airports and other public venueshttps://www.youtube.com/watch?v=ydnpgUOpdBw
  • 4. Proprietary and confidential. Do not distribute.ner va na Object localization and recognition 4
  • 5. Proprietary and confidential. Do not distribute.ner va na Speech to text 5 https://youtu.be/NaqZkV_fBIM
  • 6. Proprietary and confidential. Do not distribute.ner va na Question answering 6 Stories Mary journeyed to Texas. John went to Maryland. Mary went to Iowa. John travelled to Florida. Questions Answers Where is John located? Florida
  • 7. Proprietary and confidential. Do not distribute.ner va na Reinforcement learning 7 Pong Breakout https://youtu.be/KkIf0Ok5GCEhttps://youtu.be/0ZlgrQS3krg
  • 8. Proprietary and confidential. Do not distribute.ner va na Application areas 8 Healthcare Agriculture Finance Online Services Automotive Energy
  • 9. Proprietary and confidential. Do not distribute. Nervana is building the future of computing 9 The Economist, March 12, 2016 Cloud Computing Custom ASIC Deep Learning / AI
  • 10. Proprietary and confidential. Do not distribute.ner va na nervana cloud 10 Images Text Tabular Speech Time series Video Data import trainbuild deploy Cloud
  • 11. Proprietary and confidential. Do not distribute.ner va na nervana neon 11
  • 12. Proprietary and confidential. Do not distribute.ner va na nervana neon 11 • Fastest library
  • 13. Proprietary and confidential. Do not distribute.ner va na nervana neon 11 • Fastest library
  • 14. Proprietary and confidential. Do not distribute.ner va na nervana neon 11 • Fastest library • Model support Models • Convnet • RNN, LSTM • MLP • DQN • NTM Domains • Images • Video • Speech • Text • Time series
  • 15. Proprietary and confidential. Do not distribute.ner va na Running locally: % python rnn.py # or neon rnn.yaml Running in nervana cloud: % ncloud submit —py rnn.py # or —yaml rnn.yaml % ncloud show <model_id> % ncloud list % ncloud deploy <model_id> % ncloud predict <model_id> <data> # or use REST api nervana neon 11 • Fastest library • Model support • Cloud integration
  • 16. Proprietary and confidential. Do not distribute.ner va na Backends • CPU • GPU • Multiple GPUs • Parameter server • (Xeon Phi) • nervana TPU nervana neon 11 • Fastest library • Model support • Cloud integration • Multiple backends
  • 17. Proprietary and confidential. Do not distribute.ner va na nervana neon 11 • Fastest library • Model support • Cloud integration • Multiple backends • Optimized at assembler level
  • 18. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12
  • 19. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density =1 nervana engine 10 GPUs 200 CPUs
  • 20. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density • Scalable distributed architecture
  • 21. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density • Scalable distributed architecture • Memory near computation Instruction and data memory Ctrl ALU CPU Data Memory Ctrl Nervana
  • 22. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density • Scalable distributed architecture • Memory near computation • Learning and inference
  • 23. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density • Scalable distributed architecture • Memory near computation • Learning and inference • Exploit limited precision
  • 24. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • Unprecedented compute density • Scalable distributed architecture • Memory near computation • Learning and inference • Exploit limited precision • Power efficiency
  • 25. Proprietary and confidential. Do not distribute.ner va na nervana tensor processing unit (TPU) 12 • 10-100x gain • Architecture optimized for • Unprecedented compute density • Scalable distributed architecture • Memory near computation • Learning and inference • Exploit limited precision • Power efficiency
  • 26. Proprietary and confidential. Do not distribute.ner va na Special purpose computation 13 1940s: Turing Bombe Motivation: Automating calculations, code breaking
  • 27. Proprietary and confidential. Do not distribute.ner va na General purpose computation 14 2000s: SoC Motivation: reduce power and cost, fungible computing. Enabled inexpensive mobile devices.
  • 28. Proprietary and confidential. Do not distribute.ner va na Dennard scaling has ended 15 What business and technology constraints do we have now?
  • 29. Proprietary and confidential. Do not distribute.ner va na Many-core tiled architectures 16 Tile Processor Architecture Overview for the TILEPro Series 5 and provides high bandwidth and extremely low latency communication among tiles. The Tile Processor™ integrates external memory and I/O interfaces on chip and is a complete programma- ble multicore processor. External memory and I/O interfaces are connected to the tiles via the iMesh interconnect. Figure 2-1 shows the 64-core TILEPro64™ Tile processor with details of an individual tile’s structure. Figure 2-1. Tile Processor Hardware Architecture Each tile is a powerful, full-featured computing system that can independently run an entire oper- ating system, such as Linux. Each tile implements a 32-bit integer processor engine utilizing a three-way Very Long Instruction Word (VLIW) architecture with its own program counter (PC), cache, and DMA subsystem. An individual tile is capable of executing up to three operations per cycle. 
CDN TDN IDN MDN STN UDN 1,1 6,1 3,2 4,2 5,2 6,2 7,2 XAUI (10GbE) TDN IDN MDN STN UDN LEGEND: Tile Detail port2 msh0 port0 port2 port1 port0 DDR2 DDR2 port0 msh1 port2 port0 port1 port2 DDR2 DDR2 RGMII (GbE) XAUI (10GbE) FlexI/O PCIe (x4 lane) I2C, JTAG, HPI, UART, SPI ROM FlexI/O PCIe (x4 lane) port1 port1 msh3 msh2 port2 msh0 port0 port2 port1 port0 port0 msh1 port2 port0 port1 port2 port1 port1 msh3 msh2 gpio1 port0 port1 port1 port0 port1 xgbe0 gbe0 xgbe1 port0 gpio1 port1 port0 port1 gbe1 port0 port1 xgbe0 xgbe1 port0 0,3 1,3 2,3 3,3 4,3 5,3 6,3 7,3 0,5 1,5 2,5 3,5 4,5 5,5 6,5 7,5 0,6 1,6 2,6 3,6 4,6 5,6 6,6 7,6 0,7 1,7 2,7 3,7 4,7 5,7 6,7 7,7 7,00,0 1,0 2,0 3,0 4,0 5,0 6,0 0,1 1,1 6,12,1 3,1 4,1 5,1 7,1 3,2 4,2 5,2 6,2 7,20,2 1,2 2,2 0,4 1,4 2,4 3,4 4,4 5,4 6,4 7,4 port0 7,0 port0 pcie0 port0 port1 rshim0 gpio0 pcie1 port0 port1 pcie0 port0 port1 rshim0 gpio0 pcie1 port0 port1 Switch Engine Cache Engine Processor Engine U D N S T N M D N I D N T D N C D N U D N S T N M D N I D N T D N C D N STNSTN TDNTDN IDNIDN MDNMDN UDNUDN CDNCDN 2010s: multi-core, GPGPU Motivation: increased performance without clock rate increase or smaller devices. Requires changes in programming paradigm. NVIDIA GM204Tilera Intel Xeon Phi Knight’s landing
  • 30. FPGA architectures 17 Altera Arria 10. Motivation: fine-grained parallelism, reconfigurable, lots of I/O, scalable. Drawbacks: slow clock speed; lacks the compute density needed for machine learning.
  • 31. Neuromorphic architectures 18 IBM TrueNorth. [Fragment from the TrueNorth paper: spikes leaving the mesh are tagged with their row (for spikes traveling east-west) or column (for spikes traveling north-south) before being merged onto a shared link; per-core memory holds synaptic parameters (31,232 bits), destination addresses (6,656 bits), and axonal delays (1,024 bits). TrueNorth's power density is 20 mW per cm², far below that of a typical central processing unit.]
  • 32. Neural network parallelism 20. Data parallelism: the full deep network is replicated on each of processors 1…n; data chunks 1…n are distributed across the processors, and a parameter server coordinates the parameter updates. Model parallelism: the network itself is partitioned across processors.
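The data-parallel scheme above can be sketched in a few lines of NumPy. This is an illustrative simulation, not Nervana's implementation; `data_parallel_step` and `grad_fn` are hypothetical names, and the "processors" and "parameter server" are ordinary Python loops standing in for real workers.

```python
import numpy as np

def data_parallel_step(w, data_chunks, grad_fn, lr=0.01):
    """One synchronous data-parallel SGD step.

    Each 'processor' holds a full copy of the weights w and computes a
    gradient on its own data chunk; the 'parameter server' averages the
    gradients and broadcasts the updated weights back to all workers.
    """
    grads = [grad_fn(w, chunk) for chunk in data_chunks]  # workers, in parallel
    avg_grad = np.mean(grads, axis=0)                     # parameter server
    return w - lr * avg_grad                              # broadcast new weights

def grad_fn(w, chunk):
    # Toy least-squares gradient on one chunk: grad = 2 X^T (X w - y)
    X, y = chunk
    return 2 * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = X @ np.ones(4)                                 # true weights are all ones
chunks = [(X[i::4], y[i::4]) for i in range(4)]    # 4 data chunks, 4 "processors"
w = np.zeros(4)
for _ in range(200):
    w = data_parallel_step(w, chunks, grad_fn)
print(np.round(w, 3))
```

Because the gradients are averaged every step, this is mathematically equivalent to full-batch gradient descent on the concatenated data; the communication cost is one gradient exchange per step, which is what motivates the topology discussion on the following slides.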
  • 33.–39. Existing computing topologies are lacking 21 [Slides 33–39 progressively build one diagram: dual CPUs with SSD, InfiniBand, and 10 GbE adapters, connected through PCIe switches to groups of four GPUs; GPU-to-GPU and GPU-to-network traffic must traverse the PCIe switches and CPUs.]
  • 40. nervana compute topology 22 [Diagram: nervana processors (n) interconnected directly with one another, with the CPUs, SSDs, InfiniBand, and 10 GbE adapters attached via PCIe switches.]
  • 41. Distributed linear algebra and convolution 23 SUMMA distributed matrix multiply C = A*B on a P^(1/2) × P^(1/2) processor grid (Jim Demmel, CS267 lecture notes): C[i,j] is the n/P^(1/2) × n/P^(1/2) submatrix of C on processor P_ij; A[i,k] is an n/P^(1/2) × b submatrix of A; B[k,j] is a b × n/P^(1/2) submatrix of B; C[i,j] = C[i,j] + Σ_k A[i,k]*B[k,j], with the summation taken over submatrices. The processor grid need not be square. See also "Matrix multiplication on multidimensional torus networks," Edgar Solomonik and James Demmel, UC Berkeley: blocked algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure; the paper introduces a generalized "Split-Dimensional" version of Cannon's algorithm (SD-Cannon) with a higher-dimensional, bidirectional communication structure, useful on torus interconnects that can achieve more injection bandwidth than single-link bandwidth.
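The SUMMA update above can be simulated serially to check the block decomposition. This is a sketch of the algorithm's arithmetic only: the row/column broadcasts of the real distributed algorithm are replaced by direct slicing of shared arrays, and `summa` is an illustrative name.

```python
import numpy as np

def summa(A, B, grid=2, panel=2):
    """Simulate SUMMA C = A*B on a grid x grid processor mesh.

    At panel step k, the owning processor column broadcasts its panel
    A[i,k] along each processor row, the owning row broadcasts B[k,j]
    along each processor column, and every processor (i, j) accumulates
    a local rank-`panel` update C[i,j] += A[i,k] @ B[k,j].
    """
    n = A.shape[0]
    blk = n // grid                       # each processor owns a blk x blk tile
    C = np.zeros((n, n))
    for k in range(0, n, panel):          # loop over b-wide panels
        for i in range(grid):             # processor rows
            for j in range(grid):         # processor columns
                C[i*blk:(i+1)*blk, j*blk:(j+1)*blk] += (
                    A[i*blk:(i+1)*blk, k:k+panel]     # broadcast along row i
                    @ B[k:k+panel, j*blk:(j+1)*blk]   # broadcast along col j
                )
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
print(np.allclose(summa(A, B), A @ B))  # True
```

Because each step moves only a thin panel of A and B, per-processor communication volume is O(n²/P^(1/2)) rather than O(n²), which is why the deck ties matrix-multiply performance to interconnect topology.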
  • 42. Summary 24 • Computers are tools for solving the problems of their time • Was: coding, calculation, graphics, the web • Today: learning and inference on data • Deep learning is a computational paradigm • A custom architecture can do vastly better