Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Tim Hunter, Databricks
Spark Meetup London, March 2018
About Me
Tim Hunter
• Software engineer @ Databricks
• Ph.D. from UC Berkeley in Machine Learning
• Very early Spark user (Spark 0.0.2)
• Co-creator of GraphFrames, TensorFrames,
Joint work with Sue Ann Hong
About Databricks
TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
Try for free today: databricks.com
This talk
• Deep Learning at scale: current state
• Deep Learning Pipelines: the vision
• End-to-end workflow with DL Pipelines
• Future
Deep Learning at Scale: current state
What is Deep Learning?
• A set of machine learning techniques that use layers that transform numerical inputs
• Classification
• Regression
• Arbitrary mapping
• Popular in the 1980s as Neural Networks
• Recently came back thanks to advances in data collection, computation techniques, and hardware
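To make "layers that transform numerical inputs" concrete, here is a minimal sketch in plain Python. The weights are arbitrary illustrative values (an assumption, not a trained model), and the helper names `dense`, `relu`, and `forward` are hypothetical.

```python
# Minimal sketch: two dense layers applied to a numeric input vector.
# All weights below are made-up illustrative values, not a trained model.

def dense(inputs, weights, biases):
    """Affine layer: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def forward(inputs):
    """Pass the input through a hidden layer, then an output layer."""
    hidden = relu(dense(inputs, W1, B1))
    return dense(hidden, W2, B2)

W1 = [[1.0, -1.0], [0.5, 0.5]]  # hidden layer: 2 inputs -> 2 units
B1 = [0.0, -1.0]
W2 = [[2.0, 1.0]]               # output layer: 2 units -> 1 output
B2 = [0.5]

output = forward([3.0, 1.0])
```

Training consists of adjusting `W1`, `B1`, `W2`, and `B2` so that the outputs match labelled examples; real networks stack many more layers.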
Success of Deep Learning
Tremendous success for applications with complex data
• AlphaGo
• Image interpretation
• Automatic translation
• Speech recognition
But requires a lot of effort
• No exact science around deep learning
• Success requires many engineer-hours
• Low level APIs with steep learning curve
• Not well integrated with other enterprise tools
• Tedious to distribute computations
What does Spark offer?
Very little in Apache Spark MLlib itself (multilayer perceptron)
Many Spark packages
Integrations with existing DL libraries
• Deep Learning Pipelines (from Databricks)
• Caffe (CaffeOnSpark)
• Keras (Elephas)
• MXNet
• Paddle
• TensorFlow (TensorFlow on Spark, TensorFrames)
• CNTK (mmlspark)
Implementations of DL on Spark
• BigDL
• DeepDist
• DeepLearning4J
• MLlib
• SparkCL
• SparkNet
Deep Learning in industry
• Currently limited adoption
• Huge potential beyond the industrial giants
• How do we accelerate the road to massive availability?
Deep Learning Pipelines
Deep Learning Pipelines:
Deep Learning with Simplicity
• Open-source Databricks library
• Focuses on ease of use and integration, without sacrificing performance
• Primary language: Python
• Uses Apache Spark for scaling out common tasks
• Integrates with MLlib Pipelines to capture the ML workflow concisely
A typical Deep Learning workflow
• Load data (images, text, time series, …)
• Interactive work
• Train
• Select an architecture for a neural network
• Optimize the weights of the NN
• Evaluate results, potentially re-train
• Apply:
• Pass the data through the NN to produce new features or output
A typical Deep Learning workflow
Load data, Interactive work, Train, Evaluate, Apply
• Image loading in Spark
• Distributed batch prediction
• Deploying models in SQL
• Transfer learning
• Distributed tuning
• Pre-trained models
End-to-End Workflow with Deep Learning Pipelines
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Built-in support in Spark
• In Spark 2.3
• Collaboration with Microsoft
• ImageSchema, reader, conversion functions to/from numpy arrays
• Most of the tools we’ll describe work on ImageSchema columns
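# In Spark 2.3 the reader is exposed as ImageSchema.readImages
from pyspark.ml.image import ImageSchema
images = ImageSchema.readImages(img_dir,
                                recursive=True,
                                sampleRatio=0.1)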
Applying popular models
• Popular pre-trained models accessible through MLlib Transformers
from sparkdl import DeepImagePredictor
predictor = DeepImagePredictor(inputCol="image",
                               outputCol="predicted_labels",
                               modelName="InceptionV3")
predictions_df = predictor.transform(image_df)
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Hyperparameter tuning
Transfer learning
Transfer learning
• Pre-trained models may not be directly applicable
• New domain, e.g. shoes
• Training from scratch requires
• Enormous amounts of data
• A lot of compute resources & time
• Idea: intermediate representations learned for one task may be useful for other related tasks
Transfer Learning
[Figure: a pre-trained network ends in a SoftMax layer that outputs class probabilities (giant panda 0.9, raccoon 0.05, red panda 0.01, …). For transfer learning, that final layer is replaced by a new classifier (rose 0.7, daisy 0.3).]
MLlib Pipelines primer
• MLlib: the machine learning library included with Spark
• Transformer
• Takes in a Spark dataframe
• Returns a Spark dataframe with new column(s) containing “transformed” data
• e.g. a Model is a Transformer
• Estimator
• A learning algorithm, e.g. lr = LogisticRegression()
• Produces a Model via lr.fit()
• Pipeline: a sequence of Transformers and Estimators
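The Transformer / Estimator / Pipeline contract can be sketched without Spark. The classes below are plain-Python stand-ins written for illustration only (they are not MLlib classes), and a "dataframe" is simplified to a list of dicts.

```python
# Plain-Python stand-ins for the MLlib concepts (hypothetical classes, not MLlib).
# A "dataframe" here is just a list of dicts.

class Doubler:
    """Transformer: returns rows with a new column derived from an existing one."""
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        return [{**r, self.output_col: 2 * r[self.input_col]} for r in rows]


class ThresholdModel:
    """Model: the result of fitting an Estimator; it is itself a Transformer."""
    def __init__(self, input_col, threshold):
        self.input_col, self.threshold = input_col, threshold

    def transform(self, rows):
        return [{**r, "prediction": int(r[self.input_col] > self.threshold)} for r in rows]


class MeanThresholdEstimator:
    """Estimator: learns a threshold (the column mean) and produces a Model."""
    def __init__(self, input_col):
        self.input_col = input_col

    def fit(self, rows):
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return ThresholdModel(self.input_col, mean)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order; fitting replaces each Estimator by its Model."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


train = [{"x": 1}, {"x": 2}, {"x": 3}]
model = Pipeline([Doubler("x", "x2"), MeanThresholdEstimator("x2")]).fit(train)
predictions = model.transform([{"x": 1}, {"x": 5}])
```

The point of the design is that a fitted pipeline exposes the same `transform` interface as any single stage, so it can be applied, saved, or nested like one.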
Transfer Learning as a Pipeline
[Figure: an MLlib Pipeline in which image loading and preprocessing feed DeepImageFeaturizer, followed by a logistic regression classifier that outputs Rose / Daisy.]
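from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer
# Pre-trained network as a fixed feature extractor, followed by a trainable classifier
featurizer = DeepImageFeaturizer(inputCol="image",
                                 outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(labelCol="label")
p = Pipeline(stages=[featurizer, lr])
p_model = p.fit(train_df)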
Transfer Learning
• Usually for classification tasks
• Similar task, new domain
• But other forms of learning leveraging learned representations can be loosely considered transfer learning
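The idea can be sketched in plain Python under strong simplifying assumptions: `frozen_featurizer` is a hypothetical stand-in for a pre-trained network with its last layer removed (here just mean pixel intensity), and only the small logistic regression on top is trained. The toy data and all names are illustrative.

```python
import math

def frozen_featurizer(image):
    """Stand-in for a pre-trained network without its final layer.
    It is never updated during training. Here: mean pixel intensity."""
    return [sum(image) / len(image)]

def train_logistic_regression(features, labels, steps=5000, lr=1.0):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights, bias = [0.0] * len(features[0]), 0.0
    n = len(features)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(image, weights, bias):
    """Featurize with the frozen extractor, then apply the trained classifier."""
    x = frozen_featurizer(image)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return int(z > 0)

# Toy "new domain": dark images are class 0, bright images are class 1.
train_images = [[0.1] * 4, [0.2] * 4, [0.8] * 4, [0.9] * 4]
train_labels = [0, 0, 1, 1]

train_features = [frozen_featurizer(img) for img in train_images]
weights, bias = train_logistic_regression(train_features, train_labels)
train_predictions = [predict(img, weights, bias) for img in train_images]
```

Because only the classifier's few parameters are learned, a handful of labelled examples is enough here, which is the practical appeal of transfer learning.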
Featurization for similarity-based ML
[Figure: image loading and preprocessing feed DeepImageFeaturizer. Its features can feed logistic regression, clustering (KMeans, GaussianMixture), nearest neighbor search (KNN, LSH), or distance computation.]
• Duplicate detection
• Recommendation
• Anomaly detection
• Search result diversification
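As a rough illustration of similarity-based use of features, the sketch below finds the nearest neighbor of an image by cosine similarity. The feature vectors and file names are made up for the example; in practice they would come from a featurizer, and a large collection would call for an approximate method such as LSH rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbor(query_id, features):
    """Return the id of the item most similar to `query_id` (brute force)."""
    query = features[query_id]
    return max(
        (item for item in features if item != query_id),
        key=lambda item: cosine_similarity(query, features[item]),
    )

# Hypothetical 3-dimensional features; real featurizers emit far more dimensions.
features = {
    "rose_1.jpg": [0.9, 0.1, 0.0],
    "rose_2.jpg": [0.8, 0.2, 0.1],
    "daisy_1.jpg": [0.1, 0.9, 0.2],
}
closest = nearest_neighbor("rose_1.jpg", features)
```

Duplicate detection follows the same pattern with a similarity threshold instead of a maximum.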
Keras
• A popular, declarative interface to build DL models
• High-level, expressive API in Python
• Executes on TensorFlow, Theano, CNTK
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Activation('relu'))
Keras Estimator
model = Sequential()
model.add(...)
model.save(model_filename)
estimator = KerasImageFileEstimator(
    kerasOptimizer="adam",
    kerasLoss="categorical_crossentropy",
    kerasFitParams={"batch_size": 100},
    modelFile=model_filename)
# fit() is called on the estimator and returns the fitted model
model = estimator.fit(dataframe)
Keras Estimator in Model Selection
estimator = KerasImageFileEstimator(
    kerasOptimizer="adam",
    kerasLoss="categorical_crossentropy")
# addGrid takes a Param and a list of values; build() returns the param maps
paramGrid = (ParamGridBuilder()
    .addGrid(estimator.kerasFitParams, [{"batch_size": 100}, {"batch_size": 200}])
    .addGrid(estimator.modelFile, [model1, model2])
    .build())
cv = CrossValidator(estimator=estimator,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
best_model = cv.fit(train_df)
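What the grid and the folds amount to can be sketched in plain Python. The helpers `build_param_grid` and `k_fold_indices` are hypothetical names written for this example, the file names are placeholders, and the round-robin fold assignment is a simplification (an assumption for determinism; a real cross-validator typically splits at random).

```python
from itertools import product

def build_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold, round-robin."""
    for fold in range(num_folds):
        validation = [i for i in range(n_rows) if i % num_folds == fold]
        train = [i for i in range(n_rows) if i % num_folds != fold]
        yield train, validation

param_maps = build_param_grid({
    "batch_size": [100, 200],
    "model_file": ["model1.h5", "model2.h5"],  # hypothetical file names
})
folds = list(k_fold_indices(n_rows=6, num_folds=3))

# Every parameter combination is trained and evaluated once per fold.
num_fits = len(param_maps) * len(folds)
```

The fits are independent of one another, which is why this kind of model selection is a natural task to distribute across a cluster.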
Deep Learning Pipelines
• Load data
• Interactive work
• Train
• Evaluate model
• Apply
Spark SQL
Batch prediction
Batch prediction as an MLlib Transformer
• Recall a model is a Transformer in MLlib
predictor = XXTransformer(inputCol="image",
                          outputCol="predictions",
                          modelSpecification={…})
predictions = predictor.transform(test_df)
Shipping predictors in SQL
Take a trained model / Pipeline, register a SQL UDF usable by anyone in the organization
registerKerasUDF("my_object_recognition_function",
                 keras_model_file="/mymodels/007model.h5")
In Spark SQL:
select image, my_object_recognition_function(image) as objects
from traffic_imgs
This means you can apply deep learning models in streaming!
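The register-then-query pattern can be tried without a Spark cluster. The sketch below is only an analogy: it uses Python's built-in `sqlite3` module in place of Spark SQL, and the function body is a placeholder that returns a fake label rather than calling a real model.

```python
import sqlite3

def my_object_recognition_function(image):
    """Placeholder for a model's predict call; returns a fake label."""
    return "car" if image.startswith(b"\x01") else "unknown"

conn = sqlite3.connect(":memory:")

# Register the Python function under a SQL-visible name (1 = number of arguments).
conn.create_function("my_object_recognition_function", 1, my_object_recognition_function)

conn.execute("CREATE TABLE traffic_imgs (image BLOB)")
conn.executemany(
    "INSERT INTO traffic_imgs VALUES (?)",
    [(b"\x01abc",), (b"\x02xyz",)],  # made-up bytes standing in for image data
)

# Anyone who can write SQL can now call the function like a built-in.
rows = conn.execute(
    "SELECT my_object_recognition_function(image) AS objects "
    "FROM traffic_imgs ORDER BY rowid"
).fetchall()
```

The design benefit is the same in both systems: the person writing the query needs no knowledge of how the model was built or loaded.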
Deep Learning Pipelines : Future
In progress
• Scala API for DeepImageFeaturizer
• Text featurization (embeddings)
• TFTransformer for arbitrary vectors
Future
• Distributed training
• Support for more backends, e.g. MXNet, PyTorch, BigDL
Deep Learning without Deep Pockets
• Simple API for Deep Learning, integrated with MLlib
• Scales common tasks with transformers and estimators
• Embeds Deep Learning models in MLlib and SparkSQL
• Check out https://github.com/databricks/spark-deep-learning
Resources
Blog posts & webinars (http://databricks.com/blog)
• Deep Learning Pipelines
• GPU acceleration in Databricks
• BigDL on Databricks
• Deep Learning and Apache Spark
Docs for Deep Learning on Databricks (http://docs.databricks.com)
• Getting started
• Deep Learning Pipelines Example
• Spark integration
WWW.DATABRICKS.COM/SPARKAISUMMIT
DATE: June 4-6, 2018
LOCATION: San Francisco - Moscone
TRACKS: Artificial Intelligence, Spark Use Cases, Enterprise, Productionizing ML, Deep Learning, Hardware in the Cloud
ATTENDEES: 4000+ Data Scientists, Data Engineers, Analysts, & VP/CxOs
https://databricks.com/company/careers
Thank You!
Questions?
Happy Sparking & Deep Learning!
