Apache Spark™
Model Deployment
Bay Area Spark Meetup – June 30, 2016
Richard Garris – Big Data Solution Architect Focused on Advanced Analytics
About Me
Richard L Garris
• rlgarris@databricks.com
• @rlgarris [Twitter]
Big Data Solutions Architect @ Databricks
12+ years designing Enterprise Data Solutions for everyone from
startups to Global 2000
Prior Work Experience PwC, Google, Skytree
Ohio State Buckeye and CMU Alumni
2
About Apache Spark MLlib
Started at Berkeley AMPLab
(Apache Spark 0.8)
Now (Apache Spark 2.0)
• Contributions from 75+ orgs, ~250 individuals
• Development driven by Databricks: roadmap + 50% of
PRs
• Growing coverage of distributed algorithms
Spark
SparkSQL Streaming MLlib GraphFrames
3
MLlib Goals
General Machine Learning library for big data
• Scalable & robust
• Coverage of common algorithms
• Leverages Apache Spark
Tools for practical workflows
Integration with existing data science tools
4
Apache Spark MLlib
• spark.mllib
• Pre-dates Spark 1.4
• spark.mllib is a lower-level
library built on
Spark RDDs
• Uses LabeledPoint,
Vectors and tuples
• In maintenance mode
as of Spark 2.x
// Load and parse the data
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate the model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
Apache Spark – ML Pipelines
• spark.ml
• Spark > 1.4
• spark.ml pipelines –
able to create more
complex models
• Integrated with
DataFrames
// Initialize our linear regression learner
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()

// Set the parameters for the method
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE")
  .setMaxIter(100)
  .setRegParam(0.1)

// We use the spark.ml pipeline API. If you have worked
// with scikit-learn this will be very familiar.
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))

// First train on the entire dataset to see what we get
val lrModel = lrPipeline.fit(trainingSet)
The Agile Modeling Process
Set Business
Goals
Understand
Your Data
Create
Hypothesis
Devise
Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure /
Evaluate Results
The Agile Modeling Process
Set Business
Goals
Understand
Your Data
Create
Hypothesis
Devise
Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure /
Evaluate Results
Focus of this
talk
What is a Model?
But What Really is a Model?
A model is a complex pipeline of components
• Data Sources
• Joins
• Featurization Logic
• Algorithm(s)
• Transformers
• Estimators
• Tuning Parameters
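The point that a "model" is really the whole pipeline can be sketched in plain Scala (a framework-free illustration, not the spark.ml API; the field names and weights are hypothetical): each component is a function from one representation to the next, and the deployable model is the composition.

```scala
object PipelineSketch {
  // Featurization logic: raw record -> feature vector
  def featurize(record: Map[String, String]): Array[Double] =
    Array(record.getOrElse("age", "0").toDouble,
          record.getOrElse("income", "0").toDouble)

  // Trained algorithm: feature vector -> score (weights assumed already fit)
  val weights = Array(0.1, 0.00002)
  def score(features: Array[Double]): Double =
    features.zip(weights).map { case (f, w) => f * w }.sum

  // The deployable "model" is the composition, not just `score`
  val model: Map[String, String] => Double = (featurize _).andThen(score _)
}
```

Deploying only `score` while re-implementing `featurize` by hand is exactly the synchronization problem ML persistence avoids.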
ML Pipelines
11
Train model
Evaluate
Load data
Extract features
A very simple pipeline
ML Pipelines
12
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 3
Extract features
Extract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
A real pipeline
Why ML persistence?
13
Data
Science
Software
Engineering
Prototype (Python/R)
Create model
Re-implement model for
production (Java)
Deploy model
Why ML persistence?
14
Data
Science
Software
Engineering
Prototype (Python/R)
Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to
make prediction
• Extra implementation work
• Different code paths
• Synchronization overhead
Re-implement Pipeline for
production (Java)
Deploy Pipeline
With ML persistence...
15
Data
Science
Software
Engineering
Prototype (Python/R)
Create Pipeline
Persist model or Pipeline:
model.save("s3n://...")
Load Pipeline (Scala/Java)
PipelineModel.load("s3n://…")
Deploy in production
Demo
Model Serialization in Apache Spark 2.0 using Parquet
What are the Requirements
for a Robust Model
Deployment System?
Customer SLAs
• Response time
• Throughput (predictions per second)
• Uptime / Reliability
Tech Stack
• C / C++
• Legacy (mainframe)
• Java
• Docker
Your Model Scoring Environment
Offline
• Internal Use (batch)
• Emails, Notifications
(batch)
• Offline – schedule based or
event trigger based
Model Scoring Offline vs Online
Online
• Customer Waiting on the
Response (human real-time)
• Super low-latency with fixed
response window
(transactional fraud, ad
bidding)
Not All Models Return a Yes / No
Model Scoring Considerations
Example: Login Bot Detector
Different behavior depending on
probability score
0.0-0.4 ☞ Allow login
0.4-0.6 ☞ Challenge Question
0.6 to 0.75 ☞ Send SMS
0.75 to 0.9 ☞ Refer to Agent
0.9 - 1.0 ☞ Block
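The banded decisions above can be sketched as a plain Scala function (the thresholds come from the slide; the action names are illustrative):

```scala
object BotDetectorPolicy {
  // Map a bot-probability score to an action using the bands above.
  def action(score: Double): String = score match {
    case s if s < 0.4  => "allow"      // 0.0 - 0.4: allow login
    case s if s < 0.6  => "challenge"  // 0.4 - 0.6: challenge question
    case s if s < 0.75 => "sms"        // 0.6 - 0.75: send SMS
    case s if s < 0.9  => "agent"      // 0.75 - 0.9: refer to agent
    case _             => "block"      // 0.9 - 1.0: block
  }
}
```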
Example: Item Recommendations
Output is a ranking of the top n items
API – send user ID + number of items
Return sorted set of items to recommend
Optional – pass context sensitive information
to tailor results
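The core of such an API can be sketched in a few lines, assuming per-item relevance scores have already been computed for the user (names are hypothetical):

```scala
object Recommender {
  // userScores: item id -> predicted relevance for one user.
  // Returns the top n item ids, highest score first.
  def topN(userScores: Map[String, Double], n: Int): Seq[String] =
    userScores.toSeq.sortBy { case (_, s) => -s }.take(n).map(_._1)
}
```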
Model Updates and Versioning
• Model Update Frequency
(nightly, weekly, monthly, quarterly)
• Model Version Tracking
• Model Release Process
• Dev ‣ Test ‣ Staging ‣ Production
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Big Bang
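A phase-in rollout like the "20% traffic" option can be sketched with a deterministic hash split, so a given user consistently sees the same model version during the rollout (a sketch; the hashing scheme is an assumption):

```scala
object PhaseIn {
  // Route a stable fraction of users to the challenger model.
  // Deterministic: the same userId always gets the same answer.
  def useChallenger(userId: String, fraction: Double): Boolean = {
    val bucket = (userId.hashCode & Int.MaxValue) % 100
    bucket < (fraction * 100).toInt
  }
}
```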
• Models can have both reward and risk to the business
– Well designed models prevent fraud, reduce churn, increase sales
– Poorly designed models increase fraud, could impact the company’s brand,
cause compliance violations or other risks
• Models should be governed by the company's policies and procedures,
laws and regulations and the organization's management goals
Model Governance
Considerations
• Models have to be transparent, explainable, traceable and interpretable for
auditors / regulators
• Models may need reason codes for rejections (e.g. if I decline someone credit why?)
• Models should have an approval and release process
• Models also cannot violate any discrimination laws or use features that could be
traced to religion, gender, or ethnicity
Model A/B Testing
Set Business
Goals
Understand
Your Data
Create
Hypothesis
Devise
Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure /
Evaluate
Results
• A/B testing – comparing two
versions to see what performs
better
• Historical data works for
evaluating models in testing, but
production experiments are
required to validate a model's hypothesis
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Big Bang
A/B Framework should support these steps
• Monitoring is the process of
observing the model's
performance, logging its
behavior and alerting when the
model degrades
• Logging should capture exactly the
data fed into the model at the
time of scoring
• Model alerting is critical to
detect unusual or unexpected
behaviors
Model Monitoring
Open Loop vs Closed Loop
• Open Loop – human being involved
• Closed Loop – no human involved
Model Scoring – almost always closed loop, some models alert
agents or customer service
Model Training – usually open loop with a data scientist in the
loop to update the model
Online Learning
• Closed-loop, entirely machine-driven modeling is
risky
• Need proper model monitoring and
safeguards to prevent abuse / sensitivity to noise
• MLlib supports online learning through streaming models (k-
means and logistic regression support online updates)
• Alternative – use a more complex model to better fit
new data rather than using online learning
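The safeguard point can be illustrated as a single online SGD step with a clipped update, so one noisy or abusive example cannot swing the weights (a pure-Scala sketch; the clipping bound is an assumption, not MLlib's streaming API):

```scala
object OnlineUpdate {
  // One online SGD step for linear regression, with the per-weight
  // change clipped to [-maxDelta, maxDelta] as a noise safeguard.
  def step(weights: Array[Double], x: Array[Double], y: Double,
           lr: Double, maxDelta: Double): Array[Double] = {
    val pred  = weights.zip(x).map { case (w, xi) => w * xi }.sum
    val error = pred - y
    weights.zip(x).map { case (w, xi) =>
      val delta = lr * error * xi
      w - math.max(-maxDelta, math.min(maxDelta, delta)) // clip the update
    }
  }
}
```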
Model Deployment
Architectures
Architecture #1
Offline Recommendations
Train ALS Model Send Offers to Customers
Save Offers to NoSQL
Ranked Offers
Display Ranked Offers in
Web / Mobile
Nightly Batch
Architecture #2
Precomputed Features with Streaming
Web Logs
Pre-compute Features
Features
Kill User’s Login Session
Spark Streaming
Architecture #3
Local Apache Spark™
Train Model in Spark Save Model to S3 / HDFS
New Data
Copy
Model to
Production
Predictions
Run Spark Local
Demo
• Example of Offline Recommendations using ALS and
Redis as a NoSQL Cache
Try Databricks Community Edition
2016 Apache Spark Survey
33
Spark Summit EU
Brussels
October 25-27
The CFP closes at 11:59pm on July 1st
For more information and to submit:
https://spark-summit.org/eu-2016/
34