Brandon Hamric & Alex Meyer, Eventbrite
Deploying Python Machine Learning Models with Apache Spark
#SAISDS2
Introduction
About Eventbrite
• Global ticketing and event technology platform that provides creators of
events of all shapes and sizes with tools and resources to seamlessly plan,
promote, and produce live experiences around the world
• Accessible online or via mobile apps; scales from basic registration
and ticketing to a fully featured event management platform
• 203 million tickets processed in 2017
• Powered 3 million events in 170+ countries in 2017
• 700k creators supported in 2017
About Us
• Eventbrite
• We're data engineers
• We ship models for Eventbrite data scientists
• Started out at Eventbrite on Discovery - Event Recommendations
• Built new data infrastructure to support all business needs
• Our team creates, maintains, and supports the data infrastructure,
tasks, and pipelines that serve other engineers, business insights,
and product
Brandon Hamric - bhamric@eventbrite.com
• Principal Data Engineer/Architect @ Eventbrite
• Co-founded Rescue Forensics (YC W15)
• 10 years experience in data engineering
• Worked with Spark since 2014
Alex Meyer - alexm@eventbrite.com
• Senior Data Engineer @ Eventbrite
• MS in Computer Science - Distributed Systems (Vanderbilt University)
• 4 years experience in data engineering
• Worked with Spark since 2014
Structured Predictors
Common Predictor Workflow
• High coupling between engineers and data scientists
• Mostly serial workflow
• High barrier to entry
• Too many contributors
• Code duplication
Improved Predictor Workflow
• Low coupling between engineers and data scientists
• Independent workflows
• Data scientists own their models end-to-end
• Data Engineering isn't a bottleneck
Predictor Code
Data prep and cleanup
● Training and prediction code can be inconsistent
● Sample data prep can be different than prod data prep
Feature Extraction
● Training and prediction code can be inconsistent
● Mostly written for vertical scaling
Model Management
● Version management is hard
● Can use a lot of memory
● It can be hard to switch between models
Prediction
● Bulk vs single-item
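One way to keep training and prediction code consistent, and to cover both the bulk and single-item cases, is to route everything through one shared bulk function. The sketch below is illustrative only: the record shape, the two toy features, and the function names are assumptions, not code from the talk.

```python
from typing import Dict, Iterable, List

# Hypothetical raw record: a dict with an optional "description" field.
Record = Dict[str, str]


def clean(record: Record) -> Record:
    """Data prep shared by training and prediction, so the two cannot drift."""
    return {"description": record.get("description", "").strip().lower()}


def extract_features_bulk(records: Iterable[Record]) -> List[List[float]]:
    """Bulk feature extraction: one call handles many records."""
    features = []
    for record in records:
        text = clean(record)["description"]
        # Toy features: character count and word count of the cleaned text.
        features.append([float(len(text)), float(len(text.split()))])
    return features


def extract_features_single(record: Record) -> List[float]:
    """Single-item path as a thin wrapper over the bulk path."""
    return extract_features_bulk([record])[0]
```

Because the single-item function delegates to the bulk one, a notebook, a batch job, and a streaming job can all import the same module and get identical features.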
Predictor Deployment Problems
• Shared code between dev, batch, and streaming is an afterthought
• Most models are written for vertical scaling first
• Deployment is ad-hoc without a common structure
• Model iteration is slow because of lack of automation
• Model versioning isn't consistent without a library
Predictor Structure
Data prep and cleanup
• Notebook: Query to a local CSV
• Offline Prediction: Convert to incremental query
• Streaming Prediction: Convert to read stream
Feature Extraction
• Notebook: pandas DataFrames and Python functions
• Offline Prediction: Convert to Spark DataFrame or RDD operations
• Streaming Prediction: Convert to DataFrame operations or foreachBatch in Spark 2.4
Load Model
• Notebook: Load from a local pickle
• Offline Prediction: Load from S3 or HDFS onto executors
• Streaming Prediction: Load from S3 or HDFS onto executors
Predict
• Notebook: Mixed into scoring logic
• Offline Prediction: Mapper or UDF on features
• Streaming Prediction: UDF on feature rows
Predictor Class
• Manages Model
– Versioning
– Storage
– Loading
• Outlines structure
– Data loading
– Feature extraction
– Prediction
• Batch and streaming
• Enables Automation
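A predictor class along these lines might look like the sketch below. It is not Eventbrite's library: the class and method names, the storage layout (root/name/version/model.pkl), and the toy subclass are assumptions made for illustration, and a local directory stands in for S3 or HDFS.

```python
import os
import pickle
import tempfile
from abc import ABC, abstractmethod
from typing import Any, Iterable, List


class Predictor(ABC):
    """Hypothetical base class: manages the model and outlines the pipeline."""

    name = "predictor"
    version = "1"

    def __init__(self, storage_root: str):
        self.storage_root = storage_root
        self._model = None

    # Model management: versioning, storage, loading.
    def model_path(self) -> str:
        return os.path.join(self.storage_root, self.name, self.version, "model.pkl")

    def save_model(self, model: Any) -> str:
        path = self.model_path()
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as handle:
            pickle.dump(model, handle)
        return path

    def load_model(self) -> Any:
        if self._model is None:
            with open(self.model_path(), "rb") as handle:
                self._model = pickle.load(handle)
        return self._model

    # Structure that every predictor fills in.
    @abstractmethod
    def load_data(self) -> Iterable[Any]: ...

    @abstractmethod
    def extract_features(self, records: List[Any]) -> List[Any]: ...

    @abstractmethod
    def predict_features(self, features: List[Any]) -> List[Any]: ...

    def predict(self, records=None) -> List[Any]:
        """Run the pipeline on the given records, or on load_data() if none are given."""
        records = list(self.load_data() if records is None else records)
        return self.predict_features(self.extract_features(records))


class LongDescriptionPredictor(Predictor):
    """Toy example: flags descriptions longer than a stored threshold."""

    name = "long_description"
    version = "2"

    def load_data(self):
        return ["Jazz night", "A three-day outdoor music festival"]

    def extract_features(self, records):
        return [len(text) for text in records]

    def predict_features(self, features):
        threshold = self.load_model()["threshold"]
        return [length > threshold for length in features]


if __name__ == "__main__":
    predictor = LongDescriptionPredictor(tempfile.mkdtemp())
    predictor.save_model({"threshold": 20})
    print(predictor.predict())
```

Since every predictor exposes the same methods and the same storage layout, deployment tooling can treat them uniformly, which is what makes automation practical.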
Example Predictor and Demo
Demo - Latent Dirichlet Allocation (LDA)
• Generate topics on Eventbrite's event description corpus
• Get topic probabilities per event
• We can use topics to improve search, browse, and personalization
• LDA Wiki
• LDA Scikit Learn Model
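The demo notebook itself is not part of these slides. As a rough stand-in, the sketch below fits scikit-learn's LatentDirichletAllocation on a few made-up event descriptions and returns topic probabilities per event; the corpus, the topic count, and the helper name topics_for are assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up descriptions standing in for the event description corpus.
descriptions = [
    "live jazz concert with local bands and music",
    "outdoor music festival with bands and food trucks",
    "startup networking meetup for founders and investors",
    "business conference on startup funding and investors",
]

# Bag-of-words counts are the usual input for LDA.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(descriptions)

# Two topics is an arbitrary choice for this tiny corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_probs = lda.fit_transform(counts)  # shape: (n_events, n_topics)


def topics_for(texts):
    """Topic probabilities for new descriptions, using the fitted vectorizer and model."""
    return lda.transform(vectorizer.transform(texts))
```

Each row of the result is a probability distribution over topics for one event, which is the per-event signal the slide proposes feeding into search, browse, and personalization.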
<open notebook>
Takeaways
• A consistent predictor structure makes deploying distributed
prediction easy to automate
• Streaming and batch prediction can share code
• Use bulk feature extraction and prediction often
• We may open source our predictor library
• We're hiring!
Thanks!
Questions? Feel free to reach out!
