Apache Spark™
Model Deployment
Bay Area Spark Meetup – June 30, 2016
Richard Garris – Big Data Solution Architect Focused on Advanced Analytics
About Me
Richard L Garris
• rlgarris@databricks.com
• @rlgarris [Twitter]
Big Data Solutions Architect @ Databricks
12+ years designing Enterprise Data Solutions for everyone from
startups to Global 2000
Prior Work Experience PwC, Google, Skytree
Ohio State Buckeye and CMU Alumni
2
About Apache Spark MLlib
Started at Berkeley AMPLab
(Apache Spark 0.8)
Now (Apache Spark 2.0)
• Contributions from 75+ orgs, ~250 individuals
• Development driven by Databricks: roadmap + 50% of
PRs
• Growing coverage of distributed algorithms
Spark
SparkSQL Streaming MLlib GraphFrames
3
MLlib Goals
General Machine Learning library for big data
• Scalable & robust
• Coverage of common algorithms
• Leverages Apache Spark
Tools for practical workflows
Integration with existing data science tools
4
Apache Spark MLlib
• spark.mllib
• Pre-dates Spark 1.4
• spark.mllib is a lower-level
library built on
Spark RDDs
• Uses LabeledPoint,
Vectors and tuples
• In maintenance mode
as of Spark 2.x
// Load and parse the data
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate the model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
Apache Spark – ML Pipelines
• spark.ml
• Spark > 1.4
• spark.ml pipelines –
able to create more
complex models
• Integrated with
DataFrames
// Initialize our linear regression learner
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()

// Set the parameters for the method
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE")
  .setMaxIter(100)
  .setRegParam(0.1)

// We use the spark.ml pipeline API. If you have worked
// with scikit-learn this will be very familiar.
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))

// First train on the entire dataset to see what we get
val lrModel = lrPipeline.fit(trainingSet)
The Agile Modeling Process
Set Business
Goals
Understand
Your Data
Create
Hypothesis
Devise
Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure /
Evaluate Results
The Agile Modeling Process
Set Business
Goals
Understand
Your Data
Create
Hypothesis
Devise
Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure /
Evaluate Results
Focus of this
talk
What is a Model?
But What Really is a Model?
A model is a complex pipeline of components
• Data Sources
• Joins
• Featurization Logic
• Algorithm(s)
• Transformers
• Estimators
• Tuning Parameters
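The point that a "model" is really the whole pipeline can be sketched in plain Scala (a framework-free illustration, not the spark.ml API; the field names and weights are hypothetical): each component is a function from one representation to the next, and the deployable model is the composition.

```scala
object PipelineSketch {
  // Featurization logic: raw record -> feature vector
  def featurize(record: Map[String, String]): Array[Double] =
    Array(record.getOrElse("age", "0").toDouble,
          record.getOrElse("income", "0").toDouble)

  // Trained algorithm: feature vector -> score (weights assumed already fit)
  val weights = Array(0.1, 0.00002)
  def score(features: Array[Double]): Double =
    features.zip(weights).map { case (f, w) => f * w }.sum

  // The deployable "model" is the composition, not just `score`
  val model: Map[String, String] => Double = (featurize _).andThen(score _)
}
```

Deploying only `score` while re-implementing `featurize` by hand is exactly the synchronization problem ML persistence avoids.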
ML Pipelines
11
Train model
Evaluate
Load data
Extract features
A very simple pipeline
ML Pipelines
12
Train model 1
Evaluate
Datasource 1
Datasource 2
Datasource 3
Extract features
Extract features
Feature transform 1
Feature transform 2
Feature transform 3
Train model 2
Ensemble
A real pipeline
Why ML persistence?
13
Data
Science
Software
Engineering
Prototype (Python/R)
Create model
Re-implement model for
production (Java)
Deploy model
Why ML persistence?
14
Data
Science
Software
Engineering
Prototype (Python/R)
Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to
make prediction
• Extra implementation work
• Different code paths
• Synchronization overhead
Re-implement Pipeline for
production (Java)
Deploy Pipeline
With ML persistence...
15
Data
Science
Software
Engineering
Prototype (Python/R)
Create Pipeline
Persist model or Pipeline:
model.save("s3n://...")
Load Pipeline (Scala/Java)
PipelineModel.load("s3n://…")
Deploy in production
Demo
Model Serialization in Apache Spark 2.0 using Parquet
What are the Requirements
for a Robust Model
Deployment System?
Customer SLAs
• Response time
• Throughput (predictions per second)
• Uptime / Reliability
Tech Stack
• C / C++
• Legacy (mainframe)
• Java
• Docker
Your Model Scoring Environment
Offline
• Internal Use (batch)
• Emails, Notifications
(batch)
• Offline – schedule based or
event trigger based
Model Scoring Offline vs Online
Online
• Customer Waiting on the
Response (human real-time)
• Super low-latency with fixed
response window
(transactional fraud, ad
bidding)
Not All Models Return a Yes / No
Model Scoring Considerations
Example: Login Bot Detector
Different behavior depending on
probability score
0.0-0.4 ☞ Allow login
0.4-0.6 ☞ Challenge Question
0.6 to 0.75 ☞ Send SMS
0.75 to 0.9 ☞ Refer to Agent
0.9 - 1.0 ☞ Block
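The banded decisions above can be sketched as a plain Scala function (the thresholds come from the slide; the action names are illustrative):

```scala
object BotDetectorPolicy {
  // Map a bot-probability score to an action using the bands above.
  def action(score: Double): String = score match {
    case s if s < 0.4  => "allow"      // 0.0 - 0.4: allow login
    case s if s < 0.6  => "challenge"  // 0.4 - 0.6: challenge question
    case s if s < 0.75 => "sms"        // 0.6 - 0.75: send SMS
    case s if s < 0.9  => "agent"      // 0.75 - 0.9: refer to agent
    case _             => "block"      // 0.9 - 1.0: block
  }
}
```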
Example: Item Recommendations
Output is a ranking of the top n items
API – send user ID + number of items
Return sorted set of items to recommend
Optional – pass context sensitive information
to tailor results
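The core of such an API can be sketched in a few lines, assuming per-item relevance scores have already been computed for the user (names are hypothetical):

```scala
object Recommender {
  // userScores: item id -> predicted relevance for one user.
  // Returns the top n item ids, highest score first.
  def topN(userScores: Map[String, Double], n: Int): Seq[String] =
    userScores.toSeq.sortBy { case (_, s) => -s }.take(n).map(_._1)
}
```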
Model Updates and Versioning
• Model Update Frequency
(nightly, weekly, monthly, quarterly)
• Model Version Tracking
• Model Release Process
• Dev ‣ Test ‣ Staging ‣ Production
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Big Bang
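A phase-in rollout like the "20% traffic" option can be sketched with a deterministic hash split, so a given user consistently sees the same model version during the rollout (a sketch; the hashing scheme is an assumption):

```scala
object PhaseIn {
  // Route a stable fraction of users to the challenger model.
  // Deterministic: the same userId always gets the same answer.
  def useChallenger(userId: String, fraction: Double): Boolean = {
    val bucket = (userId.hashCode & Int.MaxValue) % 100
    bucket < (fraction * 100).toInt
  }
}
```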
• Models can have both reward and risk to the business
– Well designed models prevent fraud, reduce churn, increase sales
– Poorly designed models increase fraud, could impact the company’s brand,
cause compliance violations or other risks
• Models should be governed by the company's policies and procedures,
laws and regulations and the organization's management goals
Model Governance
Considerations
• Models have to be transparent, explainable, traceable and interpretable for
auditors / regulators
• Models may need reason codes for rejections (e.g. if I decline someone credit why?)
• Models should have an approval and release process
• Models also cannot violate any discrimination laws or use features that could be
traced to religion, gender, or ethnicity
Model A/B Testing
Set Business
Goals
Understand
Your Data
Create
Hypothesis
Devise
Experiment
Prepare Data
Train-Tune-Test
Model
Deploy Model
Measure /
Evaluate
Results
• A/B testing – comparing two
versions to see what performs
better
• Historical data works for
evaluating models in testing, but
production experiments are
required to validate a model's hypothesis
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Big Bang
A/B Framework should support these steps
• Monitoring is the process of
observing the model's
performance, logging its
behavior and alerting when the
model degrades
• Logging should capture exactly the
data fed into the model at the
time of scoring
• Model alerting is critical to
detect unusual or unexpected
behaviors
Model Monitoring
Open Loop vs Closed Loop
• Open Loop – human being involved
• Closed Loop – no human involved
Model Scoring – almost always closed loop, some models alert
agents or customer service
Model Training – usually open loop with a data scientist in the
loop to update the model
Online Learning
• Closed-loop, entirely machine-driven modeling is
risky
• Need proper model monitoring and
safeguards to prevent abuse / sensitivity to noise
• MLlib supports online learning through streaming models (k-
means and logistic regression support online updates)
• Alternative – use a more complex model to better fit
new data rather than using online learning
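The safeguard point can be illustrated as a single online SGD step with a clipped update, so one noisy or abusive example cannot swing the weights (a pure-Scala sketch; the clipping bound is an assumption, not MLlib's streaming API):

```scala
object OnlineUpdate {
  // One online SGD step for linear regression, with the per-weight
  // change clipped to [-maxDelta, maxDelta] as a noise safeguard.
  def step(weights: Array[Double], x: Array[Double], y: Double,
           lr: Double, maxDelta: Double): Array[Double] = {
    val pred  = weights.zip(x).map { case (w, xi) => w * xi }.sum
    val error = pred - y
    weights.zip(x).map { case (w, xi) =>
      val delta = lr * error * xi
      w - math.max(-maxDelta, math.min(maxDelta, delta)) // clip the update
    }
  }
}
```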
Model Deployment
Architectures
Architecture #1
Offline Recommendations
Train ALS Model Send Offers to Customers
Save Offers to NoSQL
Ranked Offers
Display Ranked Offers in
Web / Mobile
Nightly Batch
Architecture #2
Precomputed Features with Streaming
Web Logs
Pre-compute Features
Features
Kill User’s Login Session
Spark Streaming
Architecture #3
Local Apache Spark™
Train Model in Spark Save Model to S3 / HDFS
New Data
Copy
Model to
Production
Predictions
Run Spark Local
Demo
• Example of Offline Recommendations using ALS and
Redis as a NoSQL Cache
Try Databricks Community Edition
2016 Apache Spark Survey
33
Spark Summit EU
Brussels
October 25-27
The CFP closes at 11:59pm on July 1st
For more information and to submit:
https://spark-summit.org/eu-2016/
34