Deep Learning and Streaming in Apache Spark 2.x
Matei Zaharia
@matei_zaharia
Welcome to Spark Summit Europe
Our largest European summit yet
102 talks
1200 attendees
11 tracks
What’s New in Spark?
Cost-based optimizer (Spark 2.2)
Python and R improvements
• PyPI & CRAN packages (Spark 2.2)
• Python ML plugins (Spark 2.3)
• Vectorized Pandas UDFs (Spark 2.3)
Kubernetes support (targeting 2.3)
[Chart: run time in seconds, Spark 2.2 vs. Vectorized UDFs]
[Chart: run time in seconds for Q1 and Q2, Pandas vs. Spark]
Spark: The Definitive Guide
To be released this winter
Free preview chapters and
code on Databricks website:
dbricks.co/spark-guide
Two Fast-Growing Workloads
Both are important but complex with current tools
We think we can simplify both with Apache Spark!
Streaming & Deep Learning
Why are Streaming and DL Hard?
Similar to early big data tools!
Tremendous potential, but very hard to use at first:
• Low-level APIs (MapReduce)
• Separate systems for each task (SQL, ETL, ML, etc.)
Spark’s Approach
1) Composable, high-level APIs
• Build apps from components
2) Unified engine
• Run complete, end-to-end apps
SQL, Streaming, ML, Graph, …
Expanding Spark to New Areas
1. Structured Streaming
2. Deep Learning
Structured Streaming
Streaming today requires separate APIs & systems
Structured Streaming is a high-level, end-to-end API
• Simple interface: run any DataFrame or SQL code incrementally
• Complete apps: combine with batch & interactive queries
• End-to-end reliability: exactly-once processing
Became GA in Apache Spark 2.2
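The exactly-once bullet above can be illustrated with a small, framework-free sketch. It assumes the usual recipe of a replayable source, a committed offset, and an idempotent sink keyed by batch ID. `IdempotentSink` and `run_batch` are hypothetical names for illustration only and are not Spark APIs.

```python
class IdempotentSink:
    """Hypothetical sink that ignores a batch ID it has already committed."""

    def __init__(self):
        self.committed = {}  # batch_id -> rows written for that batch

    def write(self, batch_id, rows):
        if batch_id in self.committed:
            return False  # duplicate delivery after a retry: skip it
        self.committed[batch_id] = list(rows)
        return True


def run_batch(source, start, end, batch_id, sink, fail_after_write=False):
    """Read source[start:end], write it, and return the new committed offset."""
    rows = source[start:end]  # replayable source: the same range can be re-read
    sink.write(batch_id, rows)
    if fail_after_write:
        # Simulate a crash after the write but before the offset is recorded.
        raise RuntimeError("crash before offset commit")
    return end


source = list(range(10))
sink = IdempotentSink()
offset = 0

try:
    offset = run_batch(source, offset, 5, 0, sink, fail_after_write=True)
except RuntimeError:
    pass  # offset is still 0, so batch 0 will be replayed

offset = run_batch(source, offset, 5, 0, sink)   # retry: the sink drops the duplicate
offset = run_batch(source, offset, 10, 1, sink)  # next batch proceeds normally

all_rows = [row for b in sorted(sink.committed) for row in sink.committed[b]]
```

Even though batch 0 is delivered twice, each record appears in the sink exactly once, because the retry re-reads the same offset range and the sink deduplicates on the batch ID.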
Structured Streaming Use Cases
Monitor quality of live video streaming
Anomaly detection on millions of WiFi hotspots
100s of customer apps in production on Databricks
Largest apps process tens of trillions of records per month
Real-time game analytics at scale
Example: Benchmark
Kafka Streams:
// Campaign table: parse each JSON value into a CampaignAd
KTable<String, String> kCampaigns = builder.table("campaigns", "cmp-state");
KTable<String, CampaignAd> deserCampaigns = kCampaigns.mapValues((value) -> {
    Map<String, String> campMap = Json.parser.readValue(value);
    return new CampaignAd(campMap.get("ad_id"), campMap.get("campaign_id"));
});
// Keep only "view" events and project the fields needed downstream
KStream<String, ProjectedEvent> filtered = kEvents.filter((key, value) -> {
    return value.event_type.equals("view");
}).mapValues((value) -> {
    return new ProjectedEvent(value.ad_id, value.event_time);
});
// Join events with campaigns, emitting the campaign id
KStream<String, String> joined =
    filtered.join(deserCampaigns, (value1, value2) -> {
        return value2.campaign_id;
    },
    Serdes.String(), Serdes.serdeFrom(new ProjectedEventSerializer(),
        new ProjectedEventDeserializer()));
// Re-key by campaign id and count per 10-second window
KStream<String, String> keyedData = joined.selectKey((key, value) -> value);
KTable<Windowed<String>, Long> counts = keyedData.groupByKey()
    .count(TimeWindows.of(10000), "time-windows");
DataFrames:
events
.where("event_type = 'view'")
.join(table("campaigns"), "ad_id")
.groupBy(
window('event_time, "10 seconds"),
'campaign_id)
.count()
Batch Plan: Scan Files → Aggregate → Write to Sink
Incremental Plan (automatic transformation): Scan New Files → Stateful Agg. → Update Sink
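To make the batch plan versus incremental plan distinction concrete, here is a plain-Python sketch of the same windowed count computed both ways. It mirrors the filter, join, and 10-second window of the DataFrame query above, but it is only a conceptual model under simplifying assumptions (in-memory data, integer timestamps in seconds, no late-data handling), not how Spark implements it. The names `batch_counts` and `IncrementalCounts` are hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 10


def window_start(event_time):
    """Start of the 10-second tumbling window that contains event_time."""
    return event_time - event_time % WINDOW_SECONDS


def batch_counts(events, campaigns):
    """Batch plan: scan all events, filter, join, and aggregate in one pass."""
    counts = Counter()
    for e in events:
        if e["event_type"] == "view" and e["ad_id"] in campaigns:
            key = (window_start(e["event_time"]), campaigns[e["ad_id"]])
            counts[key] += 1
    return counts


class IncrementalCounts:
    """Incremental plan: scan only new events and fold them into kept state."""

    def __init__(self, campaigns):
        self.campaigns = campaigns
        self.state = Counter()  # the stateful aggregate carried between runs

    def process(self, new_events):
        # Counter.update adds the new counts onto the existing ones.
        self.state.update(batch_counts(new_events, self.campaigns))
        return self.state


campaigns = {"ad1": "c1", "ad2": "c2"}
events = [
    {"ad_id": "ad1", "event_type": "view", "event_time": 1},
    {"ad_id": "ad1", "event_type": "click", "event_time": 2},
    {"ad_id": "ad2", "event_type": "view", "event_time": 12},
    {"ad_id": "ad1", "event_type": "view", "event_time": 9},
    {"ad_id": "ad1", "event_type": "view", "event_time": 15},
]

full = batch_counts(events, campaigns)

inc = IncrementalCounts(campaigns)
inc.process(events[:2])  # first arrival of data
inc.process(events[2:])  # later arrival of data
```

Feeding the events in two chunks leaves `inc.state` equal to `full`, which is the property the automatic transformation has to preserve: the user writes the batch query once, and the engine maintains the state needed to update the result as new data arrives.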
Performance: Benchmark
System throughput (records/s): Kafka Streams 700K, Apache Flink 15M, Structured Streaming 65M
Structured Streaming reuses the Spark SQL Optimizer and Tungsten Engine.
4x lower cost
4x fewer nodes
https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark
What About Latency?
Continuous processing mode to run without microbatches
• <1 ms latency (same as per-record streaming systems)
• No changes to user code
• Proposal in SPARK-20928
Key idea: same API can target both streaming & batch
Find out more in today’s deep dive
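The "no changes to user code" point can be sketched without any framework: the same per-record query function is handed to two different drivers, one that processes records in microbatches and one that processes them one at a time. This is a toy illustration with hypothetical function names; it says nothing about how the continuous mode proposed in SPARK-20928 is actually built.

```python
def query(record):
    """User logic: keep 'view' events and project two fields."""
    if record["event_type"] != "view":
        return None
    return (record["ad_id"], record["event_time"])


def run_microbatch(records, batch_size):
    """Driver 1: collect records into small batches, then run the query on each batch."""
    out = []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        out.extend(r for r in map(query, batch) if r is not None)
    return out


def run_continuous(records):
    """Driver 2: run the query on each record as soon as it arrives."""
    out = []
    for record in records:
        result = query(record)
        if result is not None:
            out.append(result)
    return out


records = [
    {"ad_id": "ad1", "event_type": "view", "event_time": 1},
    {"ad_id": "ad2", "event_type": "click", "event_time": 2},
    {"ad_id": "ad3", "event_type": "view", "event_time": 3},
]

microbatch_result = run_microbatch(records, batch_size=2)
continuous_result = run_continuous(records)
```

Both drivers return the same output for the same input. The difference is only in when each result becomes available, which is the latency trade-off the slide refers to.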
Expanding Spark to New Areas
1. Structured Streaming
2. Deep Learning
Deep Learning has Huge Potential
Unprecedented ability to work with unstructured data such as images and text
But Deep Learning is Hard to Use
Current APIs (TensorFlow, Keras, etc.) are low-level
• Build a computation graph from scratch
Scale-out requires manual parallelization
Hard to use models in larger applications
Very similar to early big data APIs
Deep Learning on Spark
Image support in MLlib: SPARK-21866 (Spark 2.3)
DL framework integrations: TensorFlowOnSpark, MMLSpark, Intel BigDL
Higher-level APIs: Deep Learning Pipelines
New in TensorFlowOnSpark
Library to run distributed TF on Spark clusters & data
• Built at Yahoo!, where it powers photos, videos & more
Yahoo! and Databricks collaborated to add:
• ML pipeline APIs
• Support for non-YARN and AWS clusters
github.com/yahoo/TensorFlowOnSpark
Talk tomorrow at 17:00
Deep Learning Pipelines
Low-level DL frameworks are powerful, but common use cases should be much simpler to build
Goal: Enable an order of magnitude more users to build production apps using deep learning
Deep Learning Pipelines
Key idea: High-level API built on ML Pipelines model
• Common use cases are just a few lines of code
• All operators automatically scale over Spark
• Expose models in batch, streaming & SQL apps
Uses existing DL engines (TensorFlow, Keras, etc)
Example: Using Existing Model
predictor = DeepImagePredictor(inputCol="image",
                               outputCol="labels",
                               modelName="InceptionV3")
predictions_df = predictor.transform(image_df)
SELECT image, my_predictor(image) AS labels
FROM uploaded_images
Example: Model Search
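est = KerasImageFileEstimator()
# Parentheses let the builder chain span several lines
grid = (ParamGridBuilder()
        .addGrid(est.modelFile, ["InceptionV3", "ResNet50"])
        .addGrid(est.kerasParams, [{'batch': 32}, {'batch': 64}])
        .build())
# eval is an evaluator defined elsewhere (not shown on the slide);
# PySpark's CrossValidator takes keyword arguments
CrossValidator(estimator=est, estimatorParamMaps=grid,
               evaluator=eval).fit(image_df)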
[Diagram: Spark Driver with four candidates: InceptionV3 (batch size 32), ResNet50 (batch size 32), InceptionV3 (batch size 64), ResNet50 (batch size 64)]
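The four candidates in the model search come from the cross product of the two parameter lists. As a minimal sketch of that expansion, the helper below builds the same combinations as plain dictionaries. `build_grid` is a hypothetical stand-in written for illustration, not the `ParamGridBuilder` implementation, and the plain string keys are a simplification.

```python
from itertools import product


def build_grid(param_values):
    """Expand {name: [values...]} into one dict per combination of values."""
    names = list(param_values)
    value_lists = [param_values[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


grid = build_grid({
    "modelFile": ["InceptionV3", "ResNet50"],
    "batch": [32, 64],
})

for candidate in grid:
    print(candidate)
```

Two models times two batch sizes gives four independent candidates. Because each one can be trained and evaluated without the others, the search can be spread across a cluster.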
Deep Learning Pipelines Demo
Sue Ann Hong


Editor's Notes

  • #10: Make this more about how easy it is.
  • #13: Comparable latency to Flink
  • #14: We’ve been experimenting with this at DB and we’re excited to contribute it back.