Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley

Deploying MLlib for Scoring in Structured Streaming
Joseph Bradley
June 5, 2018
Spark + AI Summit

About me
Joseph Bradley
• Software engineer at Databricks
• Apache Spark committer & PMC member

About Databricks
TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
Try for free today: databricks.com

App: monitoring web sessions for bots
[Diagram labels: Web Activity Logs; Compute Features; Run Prediction; Cached Predictions; API Check; Kill User’s Login Session; streaming web app]

Productionizing Machine Learning
[Diagram: Data Science / ML serializes models; Prediction Servers deserialize them and make predictions; results go to End Users]

Challenge: teams & environments
[Diagram: Data Science / ML serializes models; Prediction Servers deserialize them and make predictions; results go to End Users]

Challenge: featurization logic
[Diagram: Data Science / ML serializes models; Prediction Servers deserialize them and make predictions; results go to End Users. Inset chain: Feature Logic ↓ Feature Logic ↓ Feature Logic ↓ Model]

Challenges in productionizing ML
Sharing models across teams
and across systems & environments
while maintaining identical behavior
both now and in the future

In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources

ML Pipelines in Apache Spark
Original dataset:
Text                      Label
I bought the game...      4
Do NOT bother try...      1
this shirt is aweso...    5
never got it. Seller...   1
I ordered this to...      3

ML Pipelines: featurization
Original dataset, after feature extraction:
Text                      Label  Words               Features
I bought the game...      4      “i”, “bought”,...   [1, 0, 3, 9, ...]
Do NOT bother try...      1      “do”, “not”,...     [0, 0, 11, 0, ...]
this shirt is aweso...    5      “this”, “shirt”     [0, 2, 3, 1, ...]
never got it. Seller...   1      “never”, “got”      [1, 2, 0, 0, ...]
I ordered this to...      3      “i”, “ordered”      [1, 0, 0, 3, ...]

ML Pipelines: model
Original dataset, after feature extraction and the predictive model:
Text                      Label  Words               Features            Prediction  Probability
I bought the game...      4      “i”, “bought”,...   [1, 0, 3, 9, ...]   4           0.8
Do NOT bother try...      1      “do”, “not”,...     [0, 0, 11, 0, ...]  2           0.6
this shirt is aweso...    5      “this”, “shirt”     [0, 2, 3, 1, ...]   5           0.9
never got it. Seller...   1      “never”, “got”      [1, 2, 0, 0, ...]   1           0.7
I ordered this to...      3      “i”, “ordered”      [1, 0, 0, 3, ...]   4           0.7

ML Pipelines: successes
• Apache Spark integration simplifies:
  • Deployment
  • ETL
  • Integration into complete analytics pipelines with SQL (& streaming!)
• Scalability & speed
• Pipelines for featurization, modeling & tuning

ML Pipelines: adoption
• 1000s of commits
• 100s of contributors
• 10,000s of users (on Databricks alone)
• Many production use cases

Structured Streaming
One single API: Dataset / DataFrame for batch & streaming
End-to-end exactly-once guarantees
• The guarantees extend into the sources/sinks, e.g. MySQL, S3
Understands external event-time
• Handles late-arriving data
• Supports sessionization based on event time

Challenges in productionizing ML
Sharing models across teams → ML Pipeline Persistence
and across systems & environments → Apache Spark deployments
while maintaining identical behavior → Featurization in Pipelines
both now and in the future → Backwards compatibility

2-pass Transformers
Algorithmic pattern
• Scan data to collect stats
• Collect stats to driver
• Scan data to apply transform (using stats)
VectorAssembler
• Find lengths of Vector cols
• Compute total # features
• Create new Vector column (of length # features)
Scan-collect-scan pattern fails with Structured Streaming.
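
The scan-collect-scan pattern can be sketched in plain Python as an analogy. This is not Spark code: `two_pass_assemble` and `one_shot` are hypothetical names, and a one-shot generator stands in for a stream that cannot be scanned twice. On a list the two passes work; on the generator, pass 1 consumes the data and pass 2 sees nothing.

```python
from typing import Iterable, Iterator, List

Row = List[List[float]]  # one row holds several vector-valued columns


def two_pass_assemble(rows: Iterable[Row]) -> List[List[float]]:
    """Hypothetical scan-collect-scan assembler (analogy only, not Spark's code)."""
    # Pass 1: scan all rows to collect the length of each vector column.
    sizes: List[int] = []
    for row in rows:
        sizes = [len(col) for col in row]
    total = sum(sizes)  # the collected "stats": total number of features

    # Pass 2: scan the rows again and build vectors of the known length.
    out: List[List[float]] = []
    for row in rows:
        flat = [x for col in row for x in col]
        assert len(flat) == total
        out.append(flat)
    return out


def one_shot(rows: List[Row]) -> Iterator[Row]:
    """Stand-in for a stream: each row can be read only once."""
    for row in rows:
        yield row


data = [[[1.0, 0.0], [3.0]], [[0.0, 2.0], [9.0]]]
batch_result = two_pass_assemble(data)             # re-scannable input
stream_result = two_pass_assemble(one_shot(data))  # exhausted after pass 1
print(batch_result)
print(stream_result)
```

With re-scannable input the sketch returns one assembled vector per row. With the one-shot input it returns an empty list, because the second scan has nothing left to read.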

Handling invalid values
Invalid values include:
• NaN and null values
• Out-of-bounds values (e.g., for Bucketizer)
• Incorrect Vector lengths (e.g., for VectorAssembler)
Robust deployments must handle invalid data.
ML Pipelines use the handleInvalid Param with options “skip” / “keep” / “error”, but have only partial coverage.
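
To make the three options concrete, here is a minimal pure-Python sketch of a bucketizer that treats NaN as the invalid value. The function `bucketize` and its `handle_invalid` argument are hypothetical illustrations of the “skip” / “keep” / “error” semantics, not MLlib's implementation. The sketch assumes that “keep” places invalid values in one extra bucket.

```python
import bisect
import math
from typing import List


def bucketize(values: List[float], splits: List[float],
              handle_invalid: str = "error") -> List[int]:
    """Hypothetical bucketizer: bucket i covers [splits[i], splits[i + 1])."""
    if handle_invalid not in ("skip", "keep", "error"):
        raise ValueError(f"unknown handle_invalid option: {handle_invalid!r}")
    n_buckets = len(splits) - 1
    out: List[int] = []
    for v in values:
        if math.isnan(v):
            if handle_invalid == "skip":
                continue               # drop the invalid row entirely
            if handle_invalid == "keep":
                out.append(n_buckets)  # extra bucket reserved for invalid values
                continue
            raise ValueError("NaN encountered with handle_invalid='error'")
        idx = bisect.bisect_right(splits, v) - 1
        out.append(min(idx, n_buckets - 1))  # top split belongs to the last bucket
    return out


splits = [float("-inf"), 0.0, 10.0, float("inf")]
values = [-5.0, 3.0, float("nan"), 12.0]
skipped = bucketize(values, splits, "skip")  # output is shorter than the input
kept = bucketize(values, splits, "keep")     # NaN lands in the extra bucket
print(skipped, kept)
```

In this sketch “skip” silently changes the number of output rows, which matters when every incoming event is expected to yield a prediction.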

Most Transformers & Models “just work”
As of Apache Spark 2.3, batch & streaming scoring/transform are basically identical:
• PipelineModel.transform() works on streaming Datasets and DataFrames.
• New unit test framework covers batch & streaming tests.
Fixes & tests tracked in SPARK-21926 & SPARK-22644.

Fixes for 2-pass Transformers
VectorAssembler
• Assemble multiple columns into 1 feature Vector
• Needs lengths of Vector columns:
  • Extract from metadata (added by, e.g., OneHotEncoder)
  • Compute from data (fails with Structured Streaming)
VectorSizeHint
• Manually adds Vector length to column metadata

Fixes for 2-pass Transformers
OneHotEncoder
• Transform categorical column to 0/1 Vector
• Needs # categories:
  • Extract from metadata (added by, e.g., StringIndexer)
  • Compute from data (bug if train & test data have different categories: hidden state)
OneHotEncoderEstimator
• fit() stores categories for use in transform()
• Match behavior at training & test time
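
The difference between computing categories from the data and storing them at fit time can be shown with a small pure-Python sketch. `OneHotSketch` and `encode_from_data` are hypothetical names, and the sketch ignores details of the real encoders such as dropping the last category. It shows only that a fitted category count gives the same vector length at training and test time.

```python
from typing import List


class OneHotSketch:
    """Hypothetical estimator: fit() stores the category count for transform()."""

    def __init__(self) -> None:
        self.num_categories = 0

    def fit(self, indices: List[int]) -> "OneHotSketch":
        self.num_categories = max(indices) + 1  # state fixed at training time
        return self

    def transform(self, indices: List[int]) -> List[List[float]]:
        rows: List[List[float]] = []
        for i in indices:
            if not 0 <= i < self.num_categories:
                raise ValueError(f"category index {i} not seen during fit")
            vec = [0.0] * self.num_categories
            vec[i] = 1.0
            rows.append(vec)
        return rows


def encode_from_data(indices: List[int]) -> List[List[float]]:
    """Data-dependent encoding: vector length depends on the data being scored."""
    n = max(indices) + 1
    return [[1.0 if j == i else 0.0 for j in range(n)] for i in indices]


train = [0, 1, 2, 1]
test = [0, 1]  # category 2 is absent from the test data
stable = OneHotSketch().fit(train).transform(test)  # length-3 vectors
drifting = encode_from_data(test)                   # length-2 vectors
print(stable)
print(drifting)
```

The data-dependent version produces vectors whose length no longer matches what a downstream model saw during training, which is the mismatch the slide calls out.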

Handling invalid values
Improvements in Spark 2.3
• VectorIndexer, StringIndexer, OneHotEncoderEstimator
• Bucketizer, QuantileDiscretizer
• RFormula
• Most coverage handles NaN. Some handles null.
Fixes targeted for Spark 2.4
• VectorAssembler
• RFormula: Pass handleInvalid to all sub-stages

Demo: Streaming Scoring in 2.3

Cheat sheet: fixing your Pipeline to work with Structured Streaming
• Update uses of OneHotEncoder, VectorAssembler. (RFormula should be OK.)
• Check how invalid values are handled.
• Beware using handleInvalid=“skip”, which drops invalid Rows.
• Test!
• In custom logic (custom SQL, Transformers, Models), beware of 2-pass Transformers (hidden state).

Remaining work
• Locality Sensitive Hashing (LSH) Models do not work (SPARK-24465)
  • Require Spark SQL to support nested UDTs (SPARK-12878)
• VectorAssemblerEstimator: nicer API than VectorSizeHint (SPARK-24467)
• Handling invalid values
  • Expanded support
  • Better defaults for handleInvalid Param

Beyond this talk
This talk: Deployment in streaming
Other topics:
• Deployment outside of Spark
• Deployment in batch jobs
• Model management
• Feature management
• Experiment management
• Monitoring
• A/B testing
• Serving APIs

Resources
Overview of productionizing Apache Spark ML models
Webinar with Richard Garris: http://go.databricks.com/apache-spark-mllib-2.x-how-to-productionize-your-machine-learning-models
Batch scoring
Apache Spark docs: https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and-loading-pipelines
Streaming scoring
Guide and example notebook: https://docs.databricks.com/spark/latest/mllib/mllib-pipelines-and-stuctured-streaming.html
Sub-second scoring
Webinar with Sue Ann Hong: https://www.brighttalk.com/webcast/12891/268455/productionizing-apache-spark-mllib-models-for-real-time-prediction-serving

https://databricks.com/careers

Thank You!
Questions?
Shout out to Bago Amirbekian, Weichen Xu, and to the many other contributors on this work.
Office hours today @ 3:50pm at Databricks booth
