Deploying Python Machine Learning Models with Apache Spark with Brandon Hamric and Alex Meyer

Brandon Hamric & Alex Meyer, Eventbrite
#SAISDS2

About Eventbrite
• Global ticketing and event technology platform that provides creators of
events of all shapes and sizes with tools and resources to seamlessly plan,
promote, and produce live experiences around the world
• Accessible online or via mobile apps; scales from basic registration
and ticketing to a fully featured event management platform
• 203 million tickets processed in 2017
• Powered 3 million events in 170+ countries in 2017
• 700k creators supported in 2017

About Us
• Eventbrite
• We're data engineers
• We ship models for Eventbrite data scientists
• Started out at Eventbrite on Discovery - Event Recommendations
• Built new data infrastructure to support all business needs
• Our team creates, maintains, and supports the data infrastructure,
tasks, and pipelines that serve other engineers, business insights,
and product

Brandon Hamric - bhamric@eventbrite.com
• Principal Data Engineer/Architect @ Eventbrite
• Co-founded Rescue Forensics (YC W15)
• 10 years experience in data engineering
• Worked with Spark since 2014

Alex Meyer - alexm@eventbrite.com
• Senior Data Engineer @ Eventbrite
• MS in Computer Science - Distributed Systems
(Vanderbilt University)
• 4 years experience in data engineering
• Worked with Spark since 2014

Structured Predictors

Common Predictor Workflow
• High coupling between engineers and data scientists
• Mostly serial workflow
• High barrier to entry
• Too many contributors
• Code duplication

Improved Predictor Workflow
• Low coupling between engineers and data scientists
• Independent workflows
• Data scientists own their models end-to-end
• Data Engineering isn't a bottleneck

Predictor Code
• Data prep and cleanup
– Training and prediction code can be inconsistent
– Sample data prep can be different than prod data prep
• Feature Extraction
– Training and prediction code can be inconsistent
– Mostly written for vertical scaling
• Model Management
– Version management is hard
– Can use a lot of memory
– It can be hard to switch between models
• Prediction
– Bulk vs single-item
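
One way to keep training and prediction code from drifting apart is to write feature extraction once, in bulk form, and make the single-item path a thin wrapper around it. The sketch below is only an illustration of that idea: the function names and the bag-of-words features are assumptions, not code from the talk.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

TOKEN_RE = re.compile(r"[a-z0-9]+")


def extract_features_bulk(descriptions: Iterable[str]) -> List[Dict[str, int]]:
    """Bulk path: one bag-of-words dict per input description (hypothetical features)."""
    return [dict(Counter(TOKEN_RE.findall(text.lower()))) for text in descriptions]


def extract_features(description: str) -> Dict[str, int]:
    """Single-item path: delegates to the bulk path so the two cannot diverge."""
    return extract_features_bulk([description])[0]


# Training and prediction both call the same extraction code.
training_rows = extract_features_bulk(["Live Jazz Night", "Jazz and Blues Festival"])
serving_row = extract_features("Live jazz night")
```

Because the single-item function only wraps the bulk one, a change to tokenization is picked up by notebook, batch, and streaming callers at the same time.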

Predictor Deployment Problems
• Shared code between dev, batch, and streaming is an afterthought
• Most models are written for vertical scaling first
• Deployment is ad-hoc without a common structure
• Model iteration is slow because of lack of automation
• Model versioning isn't consistent without a library

Predictor Structure
• Data prep and cleanup
– Notebook: Query to a local CSV
– Offline prediction: Convert to incremental query
– Streaming prediction: Convert to read stream
• Feature Extraction
– Notebook: Pandas dataframes and Python functions
– Offline prediction: Convert to Spark dataframe or RDD operations
– Streaming prediction: Convert to dataframe operations or foreachBatch in Spark 2.4
• Load Model
– Notebook: Load from a local pickle
– Offline prediction: Load from S3 or HDFS onto executors
– Streaming prediction: Load from S3 or HDFS onto executors
• Predict
– Notebook: Mixed into scoring logic
– Offline prediction: Mapper or UDF on features
– Streaming prediction: UDF on feature rows
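
The "load onto executors" and "mapper on features" steps can be sketched without a cluster. The example below assumes a toy model (a pickled dict of keyword weights) and a local temp file standing in for S3 or HDFS; `make_partition_predictor` and `predict_partition` are hypothetical names. The point it illustrates is loading the model once per partition rather than once per row.

```python
import os
import pickle
import tempfile
from typing import Callable, Dict, Iterator, Tuple

Row = Tuple[str, Dict[str, int]]  # (event_id, feature dict), an assumed row shape


def make_partition_predictor(model_path: str) -> Callable[[Iterator[Row]], Iterator[Tuple[str, float]]]:
    """Build a per-partition mapper that loads the model a single time."""

    def predict_partition(rows: Iterator[Row]) -> Iterator[Tuple[str, float]]:
        # Runs where the partition lives: one model load per partition.
        with open(model_path, "rb") as fh:
            weights = pickle.load(fh)
        for event_id, features in rows:
            score = sum(weights.get(tok, 0.0) * n for tok, n in features.items())
            yield event_id, score

    return predict_partition


# Stand-in for a model file that would live on S3 or HDFS.
model_path = os.path.join(tempfile.mkdtemp(), "model-v1.pkl")
with open(model_path, "wb") as fh:
    pickle.dump({"jazz": 2.0, "blues": 1.5}, fh)

predict_partition = make_partition_predictor(model_path)
partition = iter([("e1", {"jazz": 1, "night": 1}), ("e2", {"jazz": 1, "blues": 2})])
scores = list(predict_partition(partition))
```

A function of this shape (iterator in, iterator out) is what Spark's `RDD.mapPartitions` accepts, so the same mapper could be reused for offline prediction; how the talk's library wires this up is not shown in the slides.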

Predictor Class
• Manages Model
– Versioning
– Storage
– Loading
• Outlines structure
– Data loading
– Feature extraction
– Prediction
• Batch and streaming
• Enables Automation
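
The slides do not show the predictor library itself, so the following is a minimal sketch of what a class with these responsibilities might look like. Every name here (`Predictor`, `model_path`, `save_model`, `load_model`, `run`, `TitleLengthPredictor`) and the directory layout are assumptions made for illustration, with a local temp directory standing in for real model storage.

```python
import os
import pickle
import tempfile
from typing import Any, Iterable, List, Optional


class Predictor:
    """Hypothetical base class: manages the model and fixes the pipeline steps."""

    name = "base"

    def __init__(self, model_dir: str, version: str) -> None:
        self.model_dir = model_dir
        self.version = version
        self._model: Any = None

    # Model management: versioning, storage, loading.
    def model_path(self) -> str:
        return os.path.join(self.model_dir, self.name, self.version, "model.pkl")

    def save_model(self, model: Any) -> None:
        os.makedirs(os.path.dirname(self.model_path()), exist_ok=True)
        with open(self.model_path(), "wb") as fh:
            pickle.dump(model, fh)

    def load_model(self) -> Any:
        if self._model is None:  # cache so repeated predictions reuse the model
            with open(self.model_path(), "rb") as fh:
                self._model = pickle.load(fh)
        return self._model

    # Structure: subclasses fill in these three steps.
    def load_data(self) -> Iterable[Any]:
        raise NotImplementedError

    def extract_features(self, records: Iterable[Any]) -> List[Any]:
        raise NotImplementedError

    def predict(self, features: List[Any]) -> List[Any]:
        raise NotImplementedError

    def run(self, records: Optional[Iterable[Any]] = None) -> List[Any]:
        """Batch callers omit records; streaming callers pass each micro-batch."""
        records = self.load_data() if records is None else records
        return self.predict(self.extract_features(records))


class TitleLengthPredictor(Predictor):
    """Toy predictor: flags titles with at least `min_words` words."""

    name = "title_length"

    def load_data(self) -> Iterable[str]:
        return ["Live Jazz Night", "Yoga"]

    def extract_features(self, records: Iterable[str]) -> List[int]:
        return [len(r.split()) for r in records]

    def predict(self, features: List[int]) -> List[bool]:
        threshold = self.load_model()["min_words"]
        return [n >= threshold for n in features]


model_dir = tempfile.mkdtemp()
TitleLengthPredictor(model_dir, "v1").save_model({"min_words": 2})
TitleLengthPredictor(model_dir, "v2").save_model({"min_words": 4})

batch_v1 = TitleLengthPredictor(model_dir, "v1").run()
batch_v2 = TitleLengthPredictor(model_dir, "v2").run()
streamed = TitleLengthPredictor(model_dir, "v1").run(["Blues and Jazz Festival"])
```

Switching model versions is a constructor argument here, and the batch and streaming paths share `extract_features` and `predict`, which is the property the slide attributes to the predictor class.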

Example Predictor and Demo

Demo - Latent Dirichlet Allocation (LDA)
• Generate topics on Eventbrite's event description corpus
• Get topic probabilities per event
• We can use topics to improve search, browse, and personalization
• LDA Wiki
• LDA Scikit Learn Model
<open notebook>
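
The demo notebook is not part of the slides. As a rough idea of the "topic probabilities per event" step, here is a small sketch using scikit-learn's `LatentDirichletAllocation`; the four descriptions, the choice of two topics, and the variable names are made up for illustration and are not Eventbrite data or settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up event descriptions standing in for the real corpus.
descriptions = [
    "live jazz concert with local bands",
    "blues and jazz music festival downtown",
    "startup founders networking and pitch night",
    "tech startup meetup for founders and investors",
]

# Bag-of-words counts are the usual input for LDA.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(descriptions)

# Fit a 2-topic model and get one topic distribution per description.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_probs = lda.fit_transform(counts)

# Score an unseen description with the already-fitted vectorizer and model.
new_event_probs = lda.transform(vectorizer.transform(["jazz night downtown"]))
```

Each row of `topic_probs` is a probability distribution over topics for one event, which is the per-event output that could feed search, browse, or personalization.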

Takeaways
• A consistent predictor structure makes distributed
prediction easy and lets deployment be automated
• Streaming and batch prediction can share code
• Use bulk feature extraction and prediction often
• We may open-source our predictor library
• We're hiring!

Thanks!
Questions? Feel free to reach out!
