Spark Meetup TensorFrames

TensorFrames:
Google Tensorflow on
Apache Spark
Tim Hunter
Spark Summit 2016 - Meetup

How familiar are you with Spark?
1. What is Apache Spark?
2. I have used Spark
3. I am using Spark in production or I contribute to
its development
2

How familiar are you with TensorFlow?
1. What is TensorFlow?
2. I have heard about it
3. I am training my own neural networks
3

Founded by the team who
created ApacheSpark
Offers a hosted service:
- Apache Spark in the cloud
- Notebooks
- Clustermanagement
- Productionenvironment
About Databricks
4

Software engineerat Databricks
Apache Spark contributor
Ph.D. UC Berkeleyin Machine Learning
(and Spark user since Spark 0.2)
About me
5

Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
6

Numerical computing for Data
Science
• Queries are data-heavy
• However algorithmsare computation-heavy
• They operate on simple data types: integers,floats,
doubles, vectors, matrices
7

The case for speed
• Numerical bottlenecksare good targets for
optimization
• Let data scientists get faster results
• Faster turnaroundfor experimentations
• How can we run these numerical algorithmsfaster?
8

Evolution of computing power
9
Failure is not an option:
it is a fact
When you can afford your dedicated chip
GPGPU
Scale out
Scale up

10
NLTK
Theano
Today’s talk:
Spark + TensorFlow

• Processorspeedcannotkeep up with memory and
network improvements
• Accessto the processoris the new bottleneck
• ProjectTungstenin Spark: leverage the processor’s
heuristicsfor executingcode and fetching memory
• Does not accountfor the fact that the problem is
numerical
11

Asynchronous vs. synchronous
• Asynchronousalgorithms performupdates concurrently
• Spark is synchronousmodel, deep learningframeworks
usuallyasynchronous
• A large number of ML computations are synchronous
• Even deep learningmay benefit from synchronousupdates
12

Outline
• The future
13

GPGPUs
14
• Graphics Processing Units for General Purpose
computations
6000
Theoretical peak
throughput
GPU CPU
Theoretical peak
bandwidth
GPU CPU

• Library for writing “machine intelligence”algorithms
• Very popular for deep learning and neural networks
• Can also be used for general purpose numerical
computations
• Interface in C++ and Python
15
Google TensorFlow

Numerical dataflow with
Tensorflow
16
x = tf.placeholder(tf.int32, name=“x”)
y = tf.placeholder(tf.int32, name=“y”)
output = tf.add(x, 3 * y, name=“z”)
session = tf.Session()
output_value = session.run(output,
{x: 3, y: 5})
x:
int32
y:
int32
mul 3
z

Numerical dataflow with Spark
df = sqlContext.createDataFrame(…)
x = tf.placeholder(tf.int32, name=“x”)
y = tf.placeholder(tf.int32, name=“y”)
output = tf.add(x, 3 * y, name=“z”)
output_df = tfs.map_rows(output, df)
output_df.collect()
df: DataFrame[x: int, y: int]
output_df:
DataFrame[x: int, y: int, z: int]
x:
int32
y:
int32
mul 3
z

Outline
• The future
18

19
It is a communication problem
Spark worker process Worker python process
C++
buffer
Python
pickle
Tungsten
binary
format
Python
pickle
Java
object

20
TensorFrames: native embedding
of TensorFlow
Spark worker process
C++
buffer
Tungsten
binary
format
Java
object

• Estimation of distribution
from samples
• Non-parametric
• Unknown bandwidth
parameter
• Can be evaluatedwith
goodness of fit
An example: kernel density scoring
21

• In practice, compute:
with:
• In a nutshell:a complexnumerical function
An example: kernel density scoring
22

23
Speedup
0
60
120
180
Scala UDF Scala UDF (optimized) TensorFrames TensorFrames + GPU
Run time (sec)
def score(x: Double): Double = {
val dis = points.map { z_k =>
- (x - z_k) * (x - z_k) / ( 2 * b * b)
}
val minDis = dis.min
val exps = dis.map(d => math.exp(d - minDis))
minDis - math.log(b * N) + math.log(exps.sum)
}
val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()

24
Speedup
0
60
120
180
Run time (sec)
def score(x: Double): Double = {
val dis = new Array[Double](N)
varidx = 0
while(idx < N) {
val z_k = points(idx)
dis(idx) = - (x - z_k) * (x - z_k) / ( 2 * b * b)
idx += 1
}
val minDis = dis.min
varexpSum = 0.0
idx = 0
while(idx < N) {
expSum += math.exp(dis(idx) - minDis)
idx += 1
}
minDis - math.log(b * N) + math.log(expSum)
}
val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()

25
Speedup
0
60
120
180
Run time (sec)
def cost_fun(block, bandwidth):
distances = - square(constant(X) - sample) / (2 * b * b)
m = reduce_max(distances, 0)
x = log(reduce_sum(exp(distances - m), 0))
return identity(x + m - log(b * N), name="score”)
sample = tfs.block(df, "sample")
score = cost_fun(sample, bandwidth=0.5)
df.agg(sum(tfs.map_blocks(score, df))).collect()

26
Speedup
0
60
120
180
Run time (sec)
def cost_fun(block, bandwidth):
distances = - square(constant(X) - sample) / (2 * b * b)
m = reduce_max(distances, 0)
x = log(reduce_sum(exp(distances - m), 0))
return identity(x + m - log(b * N), name="score”)
with device("/gpu"):
sample = tfs.block(df, "sample")
score = cost_fun(sample, bandwidth=0.5)
df.agg(sum(tfs.map_blocks(score, df))).collect()

Outline
• The future
29

30
Improving communication
Spark worker process
C++
buffer
Tungsten
binary
format
Java
object
Direct memory copy
Columnar
storage

The future
• Integrationwith Tungsten:
• Direct memory copy
• Columnar storage
• Betterintegration with MLlib data types
• GPU instances in Databricks:
Official support coming this summer
31

Recap
• Spark: an efficient framework for running
computations on thousands of computers
• TensorFlow: high-performance numerical
framework
• Get the best of both with TensorFrames:
• Simple API for distributed numerical computing
• Can leveragethe hardwareof the cluster
32

Try these demos yourself
• TensorFrames source code and documentation:
github.com/tjhunter/tensorframes
spark-packages.org/package/tjhunter/tensorframes
• Demo available lateron Databricks
• The official TensorFlow website:
www.tensorflow.org
• More questions and attending the Spark summit?
We will hold office hours at the Databricks booth.
33

Spark Meetup TensorFrames

More Related Content

What's hot (17)

Similar to Spark Meetup TensorFrames (20)

More from Jen Aman (20)

Recently uploaded (20)

Spark Meetup TensorFrames