TensorFrames: Google TensorFlow on Apache Spark
Tim Hunter
Meetup 08/2016 - Salesforce
How familiar are you with Spark?
1. What is Apache Spark?
2. I have used Spark
3. I am using Spark in production or I contribute to its development
2
How familiar are you with TensorFlow?
1. What is TensorFlow?
2. I have heard about it
3. I am training my own neural networks
3
About Databricks
Founded by the team who created Apache Spark
Offers a hosted service:
- Apache Spark in the cloud
- Notebooks
- Cluster management
- Production environment
4
About me
Software engineer at Databricks
Apache Spark contributor
Ph.D. in Machine Learning, UC Berkeley
(and Spark user since Spark 0.5)
5
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
6
Numerical computing for Data Science
• Queries are data-heavy
• However, algorithms are computation-heavy
• They operate on simple data types: integers, floats, doubles, vectors, matrices
7
The case for speed
• Numerical bottlenecks are good targets for optimization
• Let data scientists get faster results
• Faster turnaround for experimentation
• How can we run these numerical algorithms faster?
8
Evolution of computing power
9
[Figure: diagram with the labels "Scale up", "Scale out", "GPGPU", "Failure is not an option: it is a fact", and "When you can afford your dedicated chip".]
Evolution of computing power
10
[Figure: diagram labels: NLTK, Theano, "Today's talk: Spark + TensorFlow".]
Evolution of computing power
• Processor speed cannot keep up with memory and network improvements
• Access to the processor is the new bottleneck
• Project Tungsten in Spark: leverage the processor's heuristics for executing code and fetching memory
• Tungsten does not account for the fact that the problem is numerical
11
Asynchronous vs. synchronous
• Asynchronous algorithms perform updates concurrently
• Spark uses a synchronous model; deep learning frameworks are usually asynchronous
• A large number of ML computations are synchronous
• Even deep learning may benefit from synchronous updates
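To make the distinction concrete, here is a minimal pure-Python sketch (not Spark or TensorFlow code) of a synchronous update: every partition computes its gradient against the same parameter value, and the parameter changes only once all gradients have been combined. The data, the least-squares objective, and the names `partition_gradient` and `synchronous_step` are assumptions made for illustration.

```python
def partition_gradient(w, partition):
    """Gradient of sum((w * x - y)^2) / 2 over one partition of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in partition)


def synchronous_step(w, partitions, learning_rate):
    """One synchronous update: all partitions read the same w, then gradients are averaged."""
    gradients = [partition_gradient(w, p) for p in partitions]  # "map" phase
    n_points = sum(len(p) for p in partitions)
    return w - learning_rate * sum(gradients) / n_points        # "reduce" phase


# Hypothetical data following y = 2 * x, split across two partitions.
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

w = 0.0
for _ in range(200):
    w = synchronous_step(w, partitions, learning_rate=0.05)
print(w)
```

An asynchronous scheme would instead let each worker apply its gradient as soon as it is ready, possibly against a stale value of `w`.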
12
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
13
GPGPUs
14
• Graphics Processing Units for general-purpose computations
[Figure: two bar charts comparing GPU and CPU, one for theoretical peak throughput and one for theoretical peak bandwidth.]
Google TensorFlow
• Library for writing "machine intelligence" algorithms
• Very popular for deep learning and neural networks
• Can also be used for general-purpose numerical computations
• Interface in C++ and Python
15
Numerical dataflow with TensorFlow
16
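import tensorflow as tf  # TensorFlow 1.x graph API

# Placeholders are the inputs of the dataflow graph
x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
# z = x + 3 * y
output = tf.add(x, 3 * y, name="z")
session = tf.Session()
# Feed x = 3 and y = 5; the result is 18
output_value = session.run(output, {x: 3, y: 5})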
[Figure: dataflow graph with inputs x (int32) and y (int32), a "mul 3" node, and output z.]
Numerical dataflow with Spark
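import tensorflow as tf
import tensorframes as tfs

df = sqlContext.createDataFrame(…)
x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")
# Run the graph on every row; the result is appended as column "z"
output_df = tfs.map_rows(output, df)
output_df.collect()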
df: DataFrame[x: int, y: int]
output_df: DataFrame[x: int, y: int, z: int]
[Figure: dataflow graph with inputs x (int32) and y (int32), a "mul 3" node, and output z.]
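As a mental model only, the effect of `tfs.map_rows` on this example can be imitated in plain Python: each row supplies the values for the placeholders `x` and `y`, and the result is appended as a new column `z`. The helper `map_rows_sketch` below is hypothetical and says nothing about how TensorFrames is implemented; it only reproduces the input and output schemas shown on the slide.

```python
def map_rows_sketch(fn, rows, output_name):
    """Apply fn to the fields of each row and append the result under output_name."""
    output = []
    for row in rows:
        new_row = dict(row)               # keep the original columns
        new_row[output_name] = fn(**row)  # feed columns by name, like placeholders
        output.append(new_row)
    return output


# Stand-in for df: DataFrame[x: int, y: int]
rows = [{"x": 3, "y": 5}, {"x": 1, "y": 0}]

# Same computation as the graph: z = x + 3 * y
output_rows = map_rows_sketch(lambda x, y: x + 3 * y, rows, "z")
print(output_rows)
```

The input rows are left untouched, which mirrors the slide: `output_df` has the columns of `df` plus `z`.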
Demo
18
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
19
20
It is a communication problem
[Figure: data conversions between the Spark worker process and the worker Python process. Boxes: Tungsten binary format, Java object, Python pickle (on each side of the process boundary), C++ buffer.]
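The cost in this picture comes from converting the same numbers between representations at each hop. A small sketch using only Python's standard library contrasts the two styles: a pickle round trip that rebuilds Python objects, and a packed buffer of doubles that native code could read directly. It illustrates the idea only; the sample values and variable names are assumptions, and it measures nothing about Spark itself.

```python
import pickle
from array import array

values = [0.5 * i for i in range(1000)]  # hypothetical column of doubles

# Path 1: generic object serialization, the "Python pickle" boxes in the diagram.
pickled = pickle.dumps(values)
restored = pickle.loads(pickled)   # rebuilds 1000 Python float objects

# Path 2: one contiguous buffer of 8-byte doubles.
packed = array("d", values)
raw_bytes = packed.tobytes()       # exactly 8 bytes per value
view = memoryview(packed)          # view over the same memory, no copy

print(len(pickled), len(raw_bytes))
```

The second path is closer in spirit to handing a binary buffer straight to native code, which is what the next slide describes.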
21
TensorFrames: native embedding of TensorFlow
[Figure: everything happens inside the Spark worker process. Boxes: Tungsten binary format, Java object, C++ buffer.]
An example: kernel density scoring
• Estimation of distribution from samples
• Non-parametric
• Unknown bandwidth parameter
• Can be evaluated with goodness of fit
22
An example: kernel density scoring
• In practice, compute: [equation not preserved]
  with: [equation not preserved]
• In a nutshell: a complex numerical function
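Judging from the code on the following slides, the quantity being computed for a sample $x$ appears to be the log of a Gaussian-kernel density estimate, up to a normalizing constant. The symbols follow the code ($z_k$ for the $N$ points, $b$ for the bandwidth); the exact form shown on the original slide is an assumption.

```latex
\mathrm{score}(x) = \log\left( \frac{1}{bN} \sum_{k=1}^{N} \exp(d_k) \right),
\qquad d_k = -\frac{(x - z_k)^2}{2 b^2}

% Equivalent shifted form used in the code, valid for any shift m
% (the TensorFlow version takes m = \max_k d_k):
\mathrm{score}(x) = m - \log(bN) + \log \sum_{k=1}^{N} \exp(d_k - m)
```

The second form is the log-sum-exp rewriting: factoring $e^{m}$ out of the sum leaves the value unchanged while keeping the exponentials in a safe numerical range.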
23
24
Speedup
[Figure: bar chart of runtime (sec), axis from 0 to 180, comparing Scala UDF, Scala UDF (optimized), TensorFrames, and TensorFrames + GPU.]
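// Assumes points: Array[Double] (the N kernel centers) and b (the bandwidth) are in scope
def score(x: Double): Double = {
  val dis = points.map { z_k =>
    - (x - z_k) * (x - z_k) / (2 * b * b)
  }
  // Log-sum-exp: shift by the largest exponent so math.exp cannot overflow
  val maxDis = dis.max
  val exps = dis.map(d => math.exp(d - maxDis))
  maxDis - math.log(b * N) + math.log(exps.sum)
}
// Register the Scala function so it can be called from SQL
val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sqlContext.sql("select sum(scoreUDF(sample)) from samples").collect()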
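For readers who do not use Scala, the same scoring function can be written as a short pure-Python sketch. It uses the log-sum-exp trick, shifting by the largest exponent so that `math.exp` cannot overflow. The sample points and the bandwidth value are made up for illustration, and `score_naive` exists only to check the result.

```python
import math


def score(x, points, b):
    """Log of a Gaussian-kernel density estimate at x (up to a constant), computed stably."""
    dis = [-(x - z_k) ** 2 / (2 * b * b) for z_k in points]
    shift = max(dis)  # largest exponent; every exp() argument below is <= 0
    exp_sum = sum(math.exp(d - shift) for d in dis)
    return shift - math.log(b * len(points)) + math.log(exp_sum)


def score_naive(x, points, b):
    """Direct formula; fails with log(0) when x is far from every point."""
    total = sum(math.exp(-(x - z_k) ** 2 / (2 * b * b)) for z_k in points)
    return math.log(total / (b * len(points)))


points = [0.0, 1.0, 2.5]  # hypothetical kernel centers
b = 0.5                   # hypothetical bandwidth

print(score(1.2, points, b), score_naive(1.2, points, b))
print(score(100.0, points, b))  # still finite; score_naive would raise here
```

Both functions agree for samples near the points; only the shifted version stays usable far from them, which is why the shift is worth the extra pass over the data.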
25
Speedup
[Figure: bar chart of runtime (sec), axis from 0 to 180, comparing Scala UDF, Scala UDF (optimized), TensorFrames, and TensorFrames + GPU.]
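// Same computation with a preallocated array and while loops instead of map
// Assumes points: Array[Double] (the N kernel centers) and b (the bandwidth) are in scope
def score(x: Double): Double = {
  val dis = new Array[Double](N)
  var idx = 0
  while (idx < N) {
    val z_k = points(idx)
    dis(idx) = - (x - z_k) * (x - z_k) / (2 * b * b)
    idx += 1
  }
  // Log-sum-exp: shift by the largest exponent so math.exp cannot overflow
  val maxDis = dis.max
  var expSum = 0.0
  idx = 0
  while (idx < N) {
    expSum += math.exp(dis(idx) - maxDis)
    idx += 1
  }
  maxDis - math.log(b * N) + math.log(expSum)
}
val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sqlContext.sql("select sum(scoreUDF(sample)) from samples").collect()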
26
Speedup
[Figure: bar chart of runtime (sec), axis from 0 to 180, comparing Scala UDF, Scala UDF (optimized), TensorFrames, and TensorFrames + GPU.]
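import math
import tensorflow as tf
import tensorframes as tfs

# Assumes X (the N kernel centers, shaped [N, 1] so it broadcasts against a block) and N are defined
def cost_fun(block, bandwidth):
    distances = - tf.square(tf.constant(X) - block) / (2 * bandwidth * bandwidth)
    m = tf.reduce_max(distances, 0)
    x = tf.log(tf.reduce_sum(tf.exp(distances - m), 0))
    return tf.identity(x + m - math.log(bandwidth * N), name="score")

sample = tfs.block(df, "sample")
score = cost_fun(sample, bandwidth=0.5)
# map_blocks appends a "score" column, which is then summed
tfs.map_blocks(score, df).agg({"score": "sum"}).collect()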
27
Speedup
[Figure: bar chart of runtime (sec), axis from 0 to 180, comparing Scala UDF, Scala UDF (optimized), TensorFrames, and TensorFrames + GPU.]
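import math
import tensorflow as tf
import tensorframes as tfs

# Assumes X (the N kernel centers, shaped [N, 1] so it broadcasts against a block) and N are defined
def cost_fun(block, bandwidth):
    distances = - tf.square(tf.constant(X) - block) / (2 * bandwidth * bandwidth)
    m = tf.reduce_max(distances, 0)
    x = tf.log(tf.reduce_sum(tf.exp(distances - m), 0))
    return tf.identity(x + m - math.log(bandwidth * N), name="score")

# Only change from the CPU version: build the graph under a GPU device scope
with tf.device("/gpu:0"):
    sample = tfs.block(df, "sample")
    score = cost_fun(sample, bandwidth=0.5)
tfs.map_blocks(score, df).agg({"score": "sum"}).collect()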
Demo: Deep dreams
28
Demo: Deep dreams
29
Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
30
31
Improving communication
[Figure: Spark worker process with boxes for the Tungsten binary format, Java object, and C++ buffer, annotated with "Direct memory copy" and "Columnar storage".]
The future
• Integration with Tungsten:
  - Direct memory copy
  - Columnar storage
• Better integration with MLlib data types
• GPU instances in Databricks: official support coming this fall
32
Recap
• Spark: an efficient framework for running computations on thousands of computers
• TensorFlow: high-performance numerical framework
• Get the best of both with TensorFrames:
  - Simple API for distributed numerical computing
  - Can leverage the hardware of the cluster
33
Try these demos yourself
• TensorFrames source code and documentation:
github.com/databricks/tensorframes
spark-packages.org/package/databricks/tensorframes
• Demo notebooks available on Databricks
• The official TensorFlow website:
www.tensorflow.org
34
Spark Summit EU 2016
15% Discount Code: DatabricksEU16
35
Thank you.


Editor's Notes

  • #4: Explain that TensorFlow is a library for deep learning.
  • #8: List a few algorithms: deep learning, clustering, classification, etc. Business logic and analysis are usually more concerned with complex structures: text, lists, and associations such as dictionaries. The bread and butter of data science can be told in three words: integers, floats, and doubles. Slicing and dicing data: matrices, vectors, reals.
  • #9: Not everybody is a Fortran or C++ programmer. There is considerable friction in writing optimized algorithms. How can we lower the barrier?
  • #10: Scale up or scale out. The Holy Grail: a large number of specialized processors. You have two options: better computers or more computers.
  • #11: For all these hardware configurations, there are even more frameworks and libraries to access them, and each has strengths and weaknesses. The classics for single-machine use. The distributed frameworks: Spark, Mahout, MapReduce. The libraries to access specialized hardware: CUDA and OpenCL for parallel programming. In the middle, MPI: it is hard to program and not very resilient to hardware failures. Then the frameworks built on top of these in recent years for deep learning and computer vision. The trend is to have multiple graphics cards communicate.
  • #23: MLlib has KDE, but how about making it work for other data types like floats, or for other kernels?
  • #24: My PhD adviser used to tell me that you always have to include one equation to show that you mean serious business.
  • #25: Do not talk about UDFs; simply say you can wrap a Scala function inside the SQL engine. UDF: it is a Scala function that you can run inside a SQL query.
  • #29: Start from login, homepage. Disable the debug menu. Go more slowly for the demo.