TensorFrames:
Google TensorFlow on
Apache Spark
Tim Hunter
Spark Summit 2016 - Meetup
How familiar are you with Spark?
1. What is Apache Spark?
2. I have used Spark
3. I am using Spark in production or I contribute to
its development
How familiar are you with TensorFlow?
1. What is TensorFlow?
2. I have heard about it
3. I am training my own neural networks
About Databricks
Founded by the team who created Apache Spark
Offers a hosted service:
- Apache Spark in the cloud
- Notebooks
- Cluster management
- Production environment
About me
Software engineer at Databricks
Apache Spark contributor
Ph.D. in Machine Learning, UC Berkeley
(and Spark user since Spark 0.2)
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
Numerical computing for Data Science
• Queries are data-heavy
• However, algorithms are computation-heavy
• They operate on simple data types: integers, floats, doubles, vectors, matrices
The case for speed
• Numerical bottlenecks are good targets for optimization
• Let data scientists get faster results
• Faster turnaround for experiments
• How can we run these numerical algorithms faster?
Evolution of computing power
[Figure labels: "Scale out", "Scale up", "GPGPU"; annotations: "Failure is not an option: it is a fact", "When you can afford your dedicated chip"]
Evolution of computing power
[Figure labels: NLTK, Theano; "Today’s talk: Spark + TensorFlow"]
Evolution of computing power
• Processor speed cannot keep up with memory and network improvements
• Access to the processor is the new bottleneck
• Project Tungsten in Spark: leverage the processor’s heuristics for executing code and fetching memory
• Does not account for the fact that the problem is numerical
Asynchronous vs. synchronous
• Asynchronous algorithms perform updates concurrently
• Spark is a synchronous model; deep learning frameworks are usually asynchronous
• A large number of ML computations are synchronous
• Even deep learning may benefit from synchronous updates
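A minimal sketch of a synchronous update, assuming a toy one-parameter least-squares problem that is not part of the talk. Each partition stands in for the data held by one worker; the parameter changes only after every partition has reported its gradient, which is the map-then-reduce pattern Spark follows. The function names are hypothetical.

```python
# Hypothetical toy problem: fit w to minimise the mean of (w - t)^2 over all targets t.

def partial_gradient(w, partition):
    """Gradient of the squared error, summed over one partition."""
    return sum(2.0 * (w - t) for t in partition)


def synchronous_step(w, partitions, learning_rate):
    """One synchronous update: every partition reports before w changes."""
    total = sum(partial_gradient(w, p) for p in partitions)  # map, then reduce
    count = sum(len(p) for p in partitions)
    return w - learning_rate * total / count


# Stand-in for data spread over three workers.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0]]

w = 0.0
for _ in range(50):
    w = synchronous_step(w, partitions, learning_rate=0.25)
```

An asynchronous variant would let each worker apply its own partial gradient as soon as it is ready, without waiting for the others.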
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
GPGPUs
• Graphics Processing Units for General-Purpose computations
[Charts: theoretical peak throughput, GPU vs. CPU; theoretical peak bandwidth, GPU vs. CPU]
Google TensorFlow
• Library for writing “machine intelligence” algorithms
• Very popular for deep learning and neural networks
• Can also be used for general-purpose numerical computations
• Interface in C++ and Python
Numerical dataflow with TensorFlow
import tensorflow as tf  # TensorFlow 1.x API
x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")
session = tf.Session()
output_value = session.run(output, {x: 3, y: 5})
[Graph diagram: placeholders x (int32) and y (int32); y is multiplied by 3 and added to x to give z]
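A plain-Python sketch of the dataflow idea, for readers without TensorFlow at hand. The `Node` class is hypothetical and far simpler than a real graph; it only shows that building the graph computes nothing, and that values flow through it when it is evaluated with a feed dictionary.

```python
class Node:
    """A deferred computation: nothing runs until evaluate() is called."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

    def evaluate(self, feed):
        if self.op == "placeholder":
            return feed[self.name]      # value supplied at run time
        if self.op == "const":
            return self.inputs[0]       # value fixed at graph-building time
        args = [node.evaluate(feed) for node in self.inputs]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError("unknown op: " + self.op)


# Build the graph z = x + 3 * y without computing anything yet.
x = Node("placeholder", name="x")
y = Node("placeholder", name="y")
three = Node("const", inputs=(3,))
z = Node("add", inputs=(x, Node("mul", inputs=(three, y))), name="z")

# Run the graph with concrete inputs, as session.run does with its feed dictionary.
result = z.evaluate({"x": 3, "y": 5})
```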
Numerical dataflow with Spark
import tensorflow as tf  # TensorFlow 1.x API
import tensorframes as tfs
df = sqlContext.createDataFrame(…)
x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")
output_df = tfs.map_rows(output, df)
output_df.collect()
df: DataFrame[x: int, y: int]
output_df: DataFrame[x: int, y: int, z: int]
[Graph diagram: placeholders x (int32) and y (int32); y is multiplied by 3 and added to x to give z]
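A sketch of what a row-wise map does conceptually, with rows modelled as plain dictionaries. `map_rows_sketch` is a hypothetical helper written for illustration; it is not the TensorFrames API and runs on a single machine.

```python
def map_rows_sketch(fn, rows, output_name):
    """Apply fn to each row and append its result as a new column."""
    return [dict(row, **{output_name: fn(row)}) for row in rows]


# Stand-in for DataFrame[x: int, y: int].
rows = [{"x": 1, "y": 2}, {"x": 3, "y": 5}]

# Same computation as the graph above: z = x + 3 * y, evaluated once per row.
output_rows = map_rows_sketch(lambda r: r["x"] + 3 * r["y"], rows, "z")
```

The input rows are left unchanged; each output row carries the original columns plus the new one, matching the schema change from df to output_df.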
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
It is a communication problem
[Diagram: a Spark worker process and a worker Python process; data representations shown: Tungsten binary format, Java object, Python pickle (on both sides), C++ buffer]
TensorFrames: native embedding of TensorFlow
[Diagram: a single Spark worker process; data representations shown: Tungsten binary format, Java object, C++ buffer]
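A small illustration of the two kinds of hand-off, using only the Python standard library. It is an analogy under stated assumptions, not a measurement and not how Spark or TensorFrames move data: pickling stands in for generic object serialization between processes, and a flat array of doubles stands in for a buffer that numerical code can read in place.

```python
import pickle
from array import array

values = [0.5, 1.5, 2.5, 3.5]

# Path 1: generic object serialization, as when rows cross a process boundary.
pickled = pickle.dumps(values)
restored = pickle.loads(pickled)        # rebuilds a new list of Python floats

# Path 2: a flat buffer of doubles, the layout a numerical kernel reads directly.
raw = array("d", values).tobytes()      # 8 bytes per value, no per-object overhead
view = memoryview(raw).cast("d")        # reinterpret the bytes without copying them

total_from_pickle = sum(restored)
total_from_buffer = sum(view)
```

Both paths give the same numbers; the difference is how many conversions sit between the stored data and the code that consumes it.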
An example: kernel density scoring
• Estimation of distribution from samples
• Non-parametric
• Unknown bandwidth parameter
• Can be evaluated with goodness of fit
An example: kernel density scoring
• In practice, compute: [formula image not recovered]
  with: [formula image not recovered]
• In a nutshell: a complex numerical function
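A pure-Python sketch of the score, read off the code on the following slides rather than from the slide's own formula: for a sample x, points z_1..z_N and bandwidth b, it evaluates log((1 / (b * N)) * sum_k exp(-(x - z_k)^2 / (2 * b^2))). The function name and the test points are assumptions. The largest exponent is subtracted before exponentiating (the log-sum-exp trick), as the TensorFlow version of the slide code does with reduce_max.

```python
import math


def kde_log_score(x, points, bandwidth):
    """Log of the Gaussian kernel density estimate at x.

    The constant normalising factor is omitted, as in the slide code.
    """
    exponents = [-(x - z) ** 2 / (2.0 * bandwidth ** 2) for z in points]
    shift = max(exponents)  # log-sum-exp: keeps exp() within floating-point range
    total = sum(math.exp(e - shift) for e in exponents)
    return shift + math.log(total) - math.log(bandwidth * len(points))


points = [0.0, 1.0, 2.0]  # hypothetical reference points
score = kde_log_score(1.0, points, bandwidth=0.5)
```

Without the shift, a sample far from every point would make each exponential underflow to zero and the logarithm fail; with it, the score stays finite.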
Speedup: Scala UDF
[Bar chart: run time (sec), axis 0 to 180, for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]
def score(x: Double): Double = {
  val dis = points.map { z_k =>
    - (x - z_k) * (x - z_k) / (2 * b * b)
  }
  val minDis = dis.min
  val exps = dis.map(d => math.exp(d - minDis))
  minDis - math.log(b * N) + math.log(exps.sum)
}
val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()
Speedup: Scala UDF (optimized)
[Bar chart: run time (sec), axis 0 to 180, for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]
def score(x: Double): Double = {
  val dis = new Array[Double](N)
  var idx = 0
  while (idx < N) {
    val z_k = points(idx)
    dis(idx) = - (x - z_k) * (x - z_k) / (2 * b * b)
    idx += 1
  }
  val minDis = dis.min
  var expSum = 0.0
  idx = 0
  while (idx < N) {
    expSum += math.exp(dis(idx) - minDis)
    idx += 1
  }
  minDis - math.log(b * N) + math.log(expSum)
}
val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()
Speedup: TensorFrames
[Bar chart: run time (sec), axis 0 to 180, for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]
def cost_fun(block, bandwidth):
    # X: the N reference points; TensorFlow ops assumed imported (e.g. from tensorflow import *)
    distances = - square(constant(X) - block) / (2 * bandwidth * bandwidth)
    m = reduce_max(distances, 0)
    x = log(reduce_sum(exp(distances - m), 0))
    return identity(x + m - log(bandwidth * N), name="score")
sample = tfs.block(df, "sample")
score = cost_fun(sample, bandwidth=0.5)
df.agg(sum(tfs.map_blocks(score, df))).collect()
Speedup: TensorFrames + GPU
[Bar chart: run time (sec), axis 0 to 180, for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]
def cost_fun(block, bandwidth):
    # X: the N reference points; TensorFlow ops assumed imported (e.g. from tensorflow import *)
    distances = - square(constant(X) - block) / (2 * bandwidth * bandwidth)
    m = reduce_max(distances, 0)
    x = log(reduce_sum(exp(distances - m), 0))
    return identity(x + m - log(bandwidth * N), name="score")
with device("/gpu"):
    sample = tfs.block(df, "sample")
    score = cost_fun(sample, bandwidth=0.5)
df.agg(sum(tfs.map_blocks(score, df))).collect()
Demo: Deep dreams
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
Improving communication
[Diagram: Spark worker process; data representations shown: Tungsten binary format, Java object, C++ buffer; additional labels: direct memory copy, columnar storage]
The future
• Integration with Tungsten:
  • Direct memory copy
  • Columnar storage
• Better integration with MLlib data types
• GPU instances in Databricks: official support coming this summer
Recap
• Spark: an efficient framework for running computations on thousands of computers
• TensorFlow: a high-performance numerical framework
• Get the best of both with TensorFrames:
  • Simple API for distributed numerical computing
  • Can leverage the hardware of the cluster
Try these demos yourself
• TensorFrames source code and documentation:
github.com/tjhunter/tensorframes
spark-packages.org/package/tjhunter/tensorframes
• Demo available later on Databricks
• The official TensorFlow website:
www.tensorflow.org
• More questions and attending the Spark Summit?
We will hold office hours at the Databricks booth.
Thank you.
