Deploying MLlib for Scoring in
Structured Streaming
Joseph Bradley
June 5, 2018
Spark + AI Summit
About me
Joseph Bradley
• Software engineer at Databricks
• Apache Spark committer & PMC member
TEAM
About Databricks
Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple
Try for free today.
databricks.com
App: monitoring web sessions for bots
Web Activity Logs
Compute Features
Kill User’s Login
Session
Run Prediction
API Check
Cached
Predictions
streaming web app
Productionizing Machine Learning
Data Science / ML
Prediction Servers
models results
Serialize
Deserialize
Make
predictions
End Users
Challenge: teams & environments
Data Science / ML
Prediction Servers
models results
Serialize
Deserialize
Make
predictions
End Users
Challenge: featurization logic
Data Science / ML
Prediction Servers
models results
Serialize
Deserialize
Make
predictions
Feature
Logic
↓
Feature
Logic
↓
Feature
Logic
↓
Model
End Users
Challenges in productionizing ML
Sharing models across teams
and across systems & environments
while maintaining identical behavior
both now and in the future
In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
ML Pipelines in Apache Spark
Original
dataset
Text                     Label
I bought the game...     4
Do NOT bother try...     1
this shirt is aweso...   5
never got it. Seller...  1
I ordered this to...     3
ML Pipelines: featurization
Feature
extraction
Original
dataset
Text                     Label  Words               Features
I bought the game...     4      “i”, “bought”,...   [1, 0, 3, 9, ...]
Do NOT bother try...     1      “do”, “not”,...     [0, 0, 11, 0, ...]
this shirt is aweso...   5      “this”, “shirt”     [0, 2, 3, 1, ...]
never got it. Seller...  1      “never”, “got”      [1, 2, 0, 0, ...]
I ordered this to...     3      “i”, “ordered”      [1, 0, 0, 3, ...]
ML Pipelines: model
Text                     Label  Words               Features            Prediction  Probability
I bought the game...     4      “i”, “bought”,...   [1, 0, 3, 9, ...]   4           0.8
Do NOT bother try...     1      “do”, “not”,...     [0, 0, 11, 0, ...]  2           0.6
this shirt is aweso...   5      “this”, “shirt”     [0, 2, 3, 1, ...]   5           0.9
never got it. Seller...  1      “never”, “got”      [1, 2, 0, 0, ...]   1           0.7
I ordered this to...     3      “i”, “ordered”      [1, 0, 0, 3, ...]   4           0.7
Feature
extraction
Original
dataset
Predictive
model
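The three slides above describe a Pipeline as a chain of stages, each adding a column to every row: text to words, words to a feature vector, features to a prediction. A minimal plain-Python sketch of that idea follows. It is an illustration only and does not use Spark's API; the stage functions, NUM_FEATURES, and WEIGHTS are hypothetical names and values chosen for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Stage = Callable[[Row], Row]

NUM_FEATURES = 8                 # hypothetical feature-vector length
WEIGHTS = [0.5] * NUM_FEATURES   # hypothetical "trained" model weights


def tokenize(row: Row) -> Row:
    """Add a 'words' column by lowercasing and splitting the text."""
    return {**row, "words": str(row["text"]).lower().split()}


def count_features(row: Row) -> Row:
    """Add a fixed-length 'features' column of word counts.

    Each word is mapped to a slot by summing its character codes, which is
    deterministic across runs (Python's built-in hash() of str is not).
    """
    features = [0] * NUM_FEATURES
    for word in row["words"]:
        features[sum(map(ord, word)) % NUM_FEATURES] += 1
    return {**row, "features": features}


def predict(row: Row) -> Row:
    """Add a 'prediction' column: a toy linear score over the features."""
    score = sum(w * x for w, x in zip(WEIGHTS, row["features"]))
    return {**row, "prediction": score}


def run_pipeline(stages: List[Stage], row: Row) -> Row:
    """Apply every stage in order; each stage returns an extended row."""
    for stage in stages:
        row = stage(row)
    return row


pipeline = [tokenize, count_features, predict]
scored = run_pipeline(pipeline, {"text": "I bought the game", "label": 4})
```

Because the featurization lives inside the same chain as the model, whoever runs the pipeline gets the same feature logic the model was built with.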
ML Pipelines: successes
• Apache Spark integration simplifies
• Deployment
• ETL
• Integration into complete analytics pipelines with SQL (& streaming!)
• Scalability & speed
• Pipelines for featurization, modeling & tuning
ML Pipelines: adoption
• 1000s of commits
• 100s of contributors
• 10,000s of users
(on Databricks alone)
• Many production use cases
Structured Streaming
One single API (Dataset / DataFrame) for batch & streaming
End-to-end exactly-once guarantees
• The guarantees extend into the sources/sinks, e.g. MySQL, S3
Understands external event-time
• Handles late-arriving data
• Supports sessionization based on event time
Challenges in productionizing ML
Sharing models across teams
and across systems & environments
while maintaining identical behavior
both now and in the future
ML Pipeline
Persistence
Apache Spark
deployments
Featurization in
Pipelines
Backwards
compatibility
In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
2-pass Transformers
Algorithmic pattern
• Scan data to collect stats
• Collect stats to driver
• Scan data to apply transform
(using stats)
VectorAssembler
• Find lengths of Vector cols
• Compute total # features
• Create new Vector column
(of length # features)
Scan-collect-scan pattern fails with Structured Streaming.
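To make the scan-collect-scan problem concrete, here is a small plain-Python analogy. A one-shot generator stands in for a stream that cannot be scanned ahead of time; the function name and row layout are hypothetical, and how Spark itself reports the failure is not modeled here.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def assemble_two_pass(rows: Iterable[Row]) -> List[List[float]]:
    """Scan-collect-scan: pass 1 gathers vector lengths, pass 2 assembles."""
    # Pass 1: scan all rows and collect the stats (the vector lengths).
    lengths = {len(row["vec"]) for row in rows}
    if len(lengths) > 1:
        raise ValueError(f"inconsistent vector lengths: {sorted(lengths)}")
    # Pass 2: scan the rows again and build one flat feature vector per row.
    return [[row["x"], *row["vec"]] for row in rows]


def stream() -> Iterator[Row]:
    """A one-shot source: once consumed, the rows are gone."""
    yield {"x": 1.0, "vec": [0.0, 2.0]}
    yield {"x": 3.0, "vec": [4.0, 5.0]}


batch = list(stream())                     # materialized data can be re-scanned
batch_out = assemble_two_pass(batch)       # both passes see every row

stream_out = assemble_two_pass(stream())   # pass 1 exhausts the source
```

On the materialized list both passes see the data. On the one-shot source the first pass consumes everything, so the second pass has nothing left to transform.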
Handling invalid values
Invalid values include:
• NaN and null values
• Out-of-bounds values (e.g., for Bucketizer)
• Incorrect Vector lengths (e.g., for VectorAssembler)
Robust deployments must handle invalid data.
ML Pipelines use the handleInvalid Param
with options “skip” / “keep” / “error”
— but have only partial coverage.
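The three handleInvalid options can be illustrated with a plain-Python bucketizer. This is a sketch of the semantics only, not Spark's Bucketizer: the function and its parameters are hypothetical, and it treats null, NaN, and out-of-bounds values uniformly, which real Transformers may not do.

```python
import math
from bisect import bisect_right
from typing import List, Optional, Sequence


def bucketize(values: Sequence[Optional[float]],
              splits: List[float],
              handle_invalid: str = "error") -> List[int]:
    """Map each value to the index of the bucket [splits[i], splits[i+1])."""
    if handle_invalid not in ("skip", "keep", "error"):
        raise ValueError(f"unknown handle_invalid option: {handle_invalid}")
    num_buckets = len(splits) - 1
    out: List[int] = []
    for v in values:
        invalid = (v is None or math.isnan(v)
                   or not (splits[0] <= v <= splits[-1]))
        if not invalid:
            # The last bucket also includes its upper bound.
            out.append(min(bisect_right(splits, v) - 1, num_buckets - 1))
        elif handle_invalid == "keep":
            out.append(num_buckets)   # extra bucket reserved for invalid values
        elif handle_invalid == "error":
            raise ValueError(f"invalid value: {v!r}")
        # "skip": the row is silently dropped from the output
    return out


SPLITS = [0.0, 10.0, 20.0]
VALUES = [5.0, float("nan"), 15.0, None, 25.0]

skipped = bucketize(VALUES, SPLITS, "skip")   # invalid rows disappear
kept = bucketize(VALUES, SPLITS, "keep")      # invalid rows land in bucket 2
```

Note that "skip" changes the number of output rows, which matters when predictions must be joined back to the incoming records.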
In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
Most Transformers & Models “just work”
As of Apache Spark 2.3, batch & streaming scoring/transform
are basically identical:
• PipelineModel.transform() works on Streaming
Datasets and DataFrames.
• New unit test framework covers batch & streaming tests.
Fixes & tests tracked in SPARK-21926 & SPARK-22644.
Fixes for 2-pass Transformers
VectorAssembler
• Assemble multiple columns
into 1 feature Vector
• Needs lengths of Vector
columns
• Extract from metadata (added
by, e.g., OneHotEncoder)
• Compute from data
Fails with Structured
Streaming
VectorSizeHint
• Manually adds Vector
length to column metadata
• Required only for
Structured Streaming
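The idea behind VectorSizeHint, declaring vector lengths up front so that no scan of the data is needed, can be sketched in plain Python. This is not Spark code: the function names, the size_hints dictionary, and the row layout are hypothetical stand-ins for column metadata.

```python
from typing import Dict, Iterable, Iterator, List, Sequence

Row = Dict[str, object]


def output_size(columns: Sequence[str], size_hints: Dict[str, int]) -> int:
    """Total feature count, known before any data arrives (scalars count 1)."""
    return sum(size_hints.get(column, 1) for column in columns)


def assemble_with_hints(rows: Iterable[Row],
                        columns: Sequence[str],
                        size_hints: Dict[str, int]) -> Iterator[List[float]]:
    """Single pass: concatenate columns, checking each against its hint."""
    for row in rows:
        assembled: List[float] = []
        for column in columns:
            value = row[column]
            items = list(value) if isinstance(value, (list, tuple)) else [value]
            expected = size_hints.get(column, 1)
            if len(items) != expected:
                raise ValueError(
                    f"column {column!r}: expected length {expected}, "
                    f"got {len(items)}")
            assembled.extend(items)
        yield assembled


def stream() -> Iterator[Row]:
    """A one-shot source standing in for a streaming input."""
    yield {"x": 1.0, "vec": [0.0, 2.0]}
    yield {"x": 3.0, "vec": [4.0, 5.0]}


HINTS = {"vec": 2}   # declared manually, like a size hint in column metadata
width = output_size(["x", "vec"], HINTS)
assembled = list(assemble_with_hints(stream(), ["x", "vec"], HINTS))
```

The declared size doubles as a validity check: a row whose vector does not match the hint is detected immediately rather than producing a malformed feature vector.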
Fixes for 2-pass Transformers
OneHotEncoder
• Transform categorical column to 0/1 Vector
• Needs # categories:
  • Extract from metadata (added by, e.g., StringIndexer)
  • Compute from data: bug if train & test data have different categories (state)
OneHotEncoderEstimator
• fit() stores categories for use in transform()
• Match behavior at training & test time
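The difference between computing the number of categories from whatever data is being transformed and storing it at fit() time can be sketched in plain Python. The class and function names below are hypothetical, and details of Spark's encoders (such as dropping the last category) are deliberately left out.

```python
from typing import List, Sequence


def one_hot_from_data(indices: Sequence[int]) -> List[List[int]]:
    """Stateless encoding: the vector width is inferred from the input itself."""
    width = max(indices) + 1
    return [[1 if i == j else 0 for j in range(width)] for i in indices]


class OneHotModel:
    """Encoder whose width was fixed when it was fitted."""

    def __init__(self, num_categories: int) -> None:
        self.num_categories = num_categories

    def transform(self, index: int) -> List[int]:
        if not 0 <= index < self.num_categories:
            raise ValueError(f"unseen category index: {index}")
        vector = [0] * self.num_categories
        vector[index] = 1
        return vector


def fit_one_hot(indices: Sequence[int]) -> OneHotModel:
    """Scan the training data once and store the number of categories."""
    return OneHotModel(max(indices) + 1)


train = [0, 1, 2]
test = [0, 1]                      # category 2 never shows up at test time

train_width = len(one_hot_from_data(train)[0])   # inferred from training data
test_width = len(one_hot_from_data(test)[0])     # inferred from test data

model = fit_one_hot(train)
fitted_test = [model.transform(i) for i in test]
```

With the stateless version the test vectors come out narrower than the training vectors, so a downstream model would see mismatched features. The fitted model produces the same width at training and test time, and it works row by row, which is what streaming needs.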
Handling invalid values
Improvements in Spark 2.3
• VectorIndexer, StringIndexer, OneHotEncoderEstimator
• Bucketizer, QuantileDiscretizer
• RFormula
• Most coverage handles NaN. Some handles null.
Fixes targeted for Spark 2.4
• VectorAssembler
• RFormula: Pass handleInvalid to all sub-stages
Demo: Streaming Scoring in 2.3
In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
Cheat sheet: fixing your Pipeline to work
with Structured Streaming
• Update uses of OneHotEncoder, VectorAssembler.
(RFormula should be OK).
• Check how invalid values are handled.
• Beware using handleInvalid=“skip”, which drops invalid Rows.
• Test!
• In custom logic (custom SQL, Transformers, Models),
beware of 2-pass Transformers (hidden state).
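One way to act on the "Test!" item is to check that a transform gives the same output on the full dataset as on the same data fed in small chunks. The plain-Python sketch below shows the shape of such a check; the helper name and the two toy transforms are hypothetical, and a real test would run the actual Pipeline in batch and streaming modes.

```python
from typing import Callable, List

Transform = Callable[[List[float]], List[float]]


def double(values: List[float]) -> List[float]:
    """Row-wise transform: each output depends only on its own input."""
    return [2 * v for v in values]


def scale_by_max(values: List[float]) -> List[float]:
    """Hidden 2-pass transform: scans for the max, then rescales every row."""
    largest = max(values)
    return [v / largest for v in values]


def matches_batch(transform: Transform,
                  values: List[float],
                  chunk_size: int) -> bool:
    """Compare one batch run against the same data processed chunk by chunk."""
    batch_out = transform(values)
    chunked_out: List[float] = []
    for start in range(0, len(values), chunk_size):
        chunked_out.extend(transform(values[start:start + chunk_size]))
    return batch_out == chunked_out


DATA = [1.0, 2.0, 4.0, 8.0]
row_wise_ok = matches_batch(double, DATA, chunk_size=2)
two_pass_ok = matches_batch(scale_by_max, DATA, chunk_size=2)
```

The row-wise transform passes, while the transform with hidden state fails because its statistics depend on which rows happen to arrive together.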
Remaining work
• Locality Sensitive Hashing (LSH) Models do not work
(SPARK-24465)
• Require Spark SQL to support nested UDTs (SPARK-12878)
• VectorAssemblerEstimator: nicer API than VectorSizeHint
(SPARK-24467)
• Handling invalid values
• Expanded support
• Better defaults for handleInvalid Param
Beyond this talk
This talk:
Deployment
in streaming
Deployment
outside of
Spark
Deployment
in batch jobs
Model
management
Feature
management
Experiment
management
Monitoring A/B testing Serving APIs
Resources
Overview of productionizing Apache Spark ML models
Webinar with Richard Garris: http://go.databricks.com/apache-spark-mllib-2.x-how-to-productionize-your-machine-learning-models
Batch scoring
Apache Spark docs: https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and-loading-pipelines
Streaming scoring
Guide and example notebook: https://docs.databricks.com/spark/latest/mllib/mllib-pipelines-and-stuctured-streaming.html
Sub-second scoring
Webinar with Sue Ann Hong: https://www.brighttalk.com/webcast/12891/268455/productionizing-apache-spark-mllib-models-for-real-time-prediction-serving
https://databricks.com/careers
Thank You!
Questions?
Shout out to Bago Amirbekian,
Weichen Xu, and to the many
other contributors on this work.
Office hours today @ 3:50pm at
Databricks booth
