CONTENT-BASED PERSONALIZATION
Taking the Pain out of Data Science
RecSys Machine Learning Framework Over Spark
Sonya Liberman
Personalization Team Lead
Outbrain Recommendations Group
What Is The Lighthouse?
Help people discover content they can trust to be interesting, relevant and timely for them
Distribution Partners
554 MILLION MONTHLY UNIQUE USERS GLOBALLY
275 BILLION RECOMMENDATIONS SERVED PER MONTH
35 MILLION RECOMMENDATIONS CLICKED PER DAY
Agenda
• Defining our machine learning challenge
• Distributed Spark framework for machine learning
• Deploying models to production
Know Your Reader
Outbrain’s NLP Engine
OVER 3 MILLION NEW ARTICLES
What is a Document About?
User Semantic Profile
The Machine Learning Challenge
Features Vector: User Profile, Past Interactions, Current Context
Supervision: Click / No Click
Supervised Machine Learning Algorithms
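The setup above (a features vector supervised by click / no click) can be sketched in a few lines. This is only an illustration, not Outbrain's implementation: the feature names, the `build_example` helper and the plain logistic regression are all assumptions made for the example.

```python
import math

def build_example(user_profile, past_interactions, context, clicked):
    """Flatten the three feature groups into one sparse feature dict plus a label."""
    features = {}
    for topic, weight in user_profile.items():
        features[f"profile:{topic}"] = weight
    for name, value in past_interactions.items():
        features[f"past:{name}"] = value
    for name, value in context.items():
        # Categorical context values become one-hot features.
        features[f"ctx:{name}={value}"] = 1.0
    return features, 1 if clicked else 0

def predict(weights, features):
    """Logistic regression: probability of a click."""
    z = sum(weights.get(k, 0.0) * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Plain stochastic gradient descent on the log loss."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, features) - label
            for k, v in features.items():
                weights[k] = weights.get(k, 0.0) - lr * error * v
    return weights

# Hypothetical toy data: one clicked and one non-clicked recommendation.
examples = [
    build_example({"tech": 0.9}, {"clicks_last_week": 1.0}, {"platform": "mobile"}, True),
    build_example({"tech": 0.1}, {"clicks_last_week": 0.0}, {"platform": "desktop"}, False),
]
weights = train(examples)
```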
Predictive Models
1. Content Based Models
Recommends content based on semantic similarity with user interests
Example topics: Music, Tech, Travel
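A minimal sketch of content-based scoring, assuming that both the user semantic profile and each document are sparse topic-weight vectors and that similarity is measured with cosine similarity. The slides do not say which similarity measure is used, so that choice, the document IDs and the weights are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse topic-weight dicts."""
    dot = sum(w * b.get(topic, 0.0) for topic, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(user_profile, documents):
    """Return document ids ordered from most to least similar to the profile."""
    scored = [
        (cosine_similarity(user_profile, topics), doc_id)
        for doc_id, topics in documents.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Hypothetical profile and documents.
user_profile = {"music": 0.7, "tech": 0.3}
documents = {
    "doc_guitars": {"music": 1.0},
    "doc_laptops": {"tech": 1.0},
    "doc_flights": {"travel": 1.0},
}
ranking = rank_documents(user_profile, documents)
```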
2. Behavioural Models
Finding behavioural patterns beyond a semantic connection
Example topics: Retirement, Investing, Health
3. Collaborative Models
Recommend content that readers with your reading patterns like
Matrix Factorization
Factorization Machines
Feature Embedding with Deep Learning
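As an illustration of the first technique listed, here is a small matrix factorization sketch trained with stochastic gradient descent on (user, item, click) triples. It is a textbook formulation under assumed hyperparameters and toy data, not the production model described in the talk.

```python
import random

def factorize(interactions, n_users, n_items, n_factors=2,
              epochs=1000, lr=0.05, reg=0.01, seed=0):
    """Learn user and item vectors whose dot product approximates each observed value."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    items = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, value in interactions:
            pred = sum(pu * qi for pu, qi in zip(users[u], items[i]))
            err = value - pred
            for f in range(n_factors):
                pu, qi = users[u][f], items[i][f]
                # Gradient step on squared error with L2 regularization.
                users[u][f] += lr * (err * qi - reg * pu)
                items[i][f] += lr * (err * pu - reg * qi)
    return users, items

def score(users, items, u, i):
    """Predicted affinity of user u for item i."""
    return sum(pu * qi for pu, qi in zip(users[u], items[i]))

# Hypothetical data: users 0 and 1 share reading patterns, user 2 differs.
interactions = [
    (0, 0, 1.0), (0, 1, 1.0), (0, 2, 0.0),
    (1, 0, 1.0), (1, 1, 1.0),
    (2, 0, 0.0), (2, 2, 1.0),
]
users, items = factorize(interactions, n_users=3, n_items=3)
```

User 1 never saw item 2, so `score(users, items, 1, 2)` is a prediction borrowed from readers with similar patterns, which is the collaborative idea in a nutshell.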
Data Processing and Distributed Machine Learning Framework
Data Processing
3 Data Centers
300 Machines in each cluster
7 petabytes of data
5 terabytes of compressed new data daily
Distributed Machine Learning Framework
1. Data Collection
2. Feature Engineering
3. Model Training
4. Offline Evaluation & Simulation
5. Model Deployment
Distributed Machine Learning Framework
2. Feature Engineering
3. Model Training
Used for Research
Used for Production
Agile Development of new Models
Preparing datasets – hard work!
Data Collection challenges
• Many output tables
Requests, served recommendations, clicks, user profiles, document profiles.
• Multiple data stores
Hive, MySQL, Cassandra
• Scale
Huge tables, queries take a long time
• Irrelevant data
We only need recommendations served on specific traffic variants
Silos and partitioning
• Ability to train models on different silos
Enabling fast retrieval of data for specific silos
• Partitioning
Adding a list of partitions for each listing for fast access
• Partitions
Platform, country, language, user activity level
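A rough sketch of how partition keys can give fast access to a silo. The four keys come from the slide above; the in-memory index, the `select_silo` helper and the sample listings are assumptions made for illustration and stand in for whatever partitioned storage is actually used.

```python
from collections import defaultdict

# Partition keys taken from the slide; values below are made up.
PARTITION_KEYS = ("platform", "country", "language", "activity_level")

def partition_key(listing):
    """Tuple of partition values that identifies the listing's partition."""
    return tuple(listing[k] for k in PARTITION_KEYS)

def build_index(listings):
    """Group listings by partition so a silo never needs a full scan of the rows."""
    index = defaultdict(list)
    for listing in listings:
        index[partition_key(listing)].append(listing)
    return index

def select_silo(index, **wanted):
    """Return listings from every partition matching the requested key=value pairs."""
    rows = []
    for key, bucket in index.items():
        values = dict(zip(PARTITION_KEYS, key))
        if all(values[k] == v for k, v in wanted.items()):
            rows.extend(bucket)
    return rows

listings = [
    {"id": 1, "platform": "mobile", "country": "US", "language": "en", "activity_level": "heavy"},
    {"id": 2, "platform": "desktop", "country": "US", "language": "en", "activity_level": "light"},
    {"id": 3, "platform": "mobile", "country": "DE", "language": "de", "activity_level": "heavy"},
]
index = build_index(listings)
mobile_us = select_silo(index, platform="mobile", country="US")
```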
Data Collection - Required datasets
• Building Models
Recs from specific variants; clicks + sample of no clicks
• Simulation – Business metrics
Clicked widgets only; all served recs
Solution – automatic data collection
• Hourly triggered job
Generate an hourly dataset with only the relevant data
• Join tables and partition
Join the input tables and add partitions to enable silos
• Split between model and simulation
Create different datasets for modeling and simulation
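The hourly job described above could look roughly like this when reduced to plain Python. The field names (`rec_id`, `widget_id`, `variant`), the sampling rate and the function name are hypothetical; the real job runs on Spark over much larger tables.

```python
import random

def build_hourly_datasets(served, clicks, model_variants, negative_rate=0.1, seed=0):
    """Join served recommendations with clicks, then split for modeling and simulation."""
    rng = random.Random(seed)
    clicked_ids = {c["rec_id"] for c in clicks}
    # Join: label every served recommendation with its click outcome.
    joined = [dict(rec, clicked=rec["rec_id"] in clicked_ids) for rec in served]
    # Modeling set: only the chosen traffic variants, all clicks plus a sample of no-clicks.
    modeling = [
        rec for rec in joined
        if rec["variant"] in model_variants
        and (rec["clicked"] or rng.random() < negative_rate)
    ]
    # Simulation set: every rec served in widgets that received at least one click.
    clicked_widgets = {rec["widget_id"] for rec in joined if rec["clicked"]}
    simulation = [rec for rec in joined if rec["widget_id"] in clicked_widgets]
    return modeling, simulation

served = [
    {"rec_id": 1, "widget_id": "w1", "variant": "A"},
    {"rec_id": 2, "widget_id": "w1", "variant": "A"},
    {"rec_id": 3, "widget_id": "w2", "variant": "A"},
    {"rec_id": 4, "widget_id": "w3", "variant": "B"},
]
clicks = [{"rec_id": 1}]
# negative_rate=0.0 keeps only clicks, which makes this toy run deterministic.
modeling, simulation = build_hourly_datasets(served, clicks, {"A"}, negative_rate=0.0)
```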
Monitoring
Modeling boilerplate – time consuming
Modeling challenges and boilerplate
• Handling input
Reading, train-test split
• Getting test metrics
Run model on test data, calculate test metrics
• Checking business metrics
Simulate production use cases
Estimate business metrics
ModelFramework steps: Read, Fit, Save, Test, Simulation
Simple model interface
Data scientists and algorithm engineers only need to implement their model’s logic.
Spark.ML packages, open source implementations and “home made” algorithms can be used.
Everything else - out-of-the-box
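A sketch of what such an interface might look like. The class and method names are invented for this example and are not the framework's real API; the simulation step is left out for brevity, and "save" is reduced to serializing the model to JSON.

```python
import json
import random
from abc import ABC, abstractmethod

class Model(ABC):
    """What a data scientist implements: only the model's own logic."""

    @abstractmethod
    def fit(self, rows): ...

    @abstractmethod
    def predict(self, row): ...

    @abstractmethod
    def to_dict(self): ...

class GlobalCtrModel(Model):
    """Toy model: predicts the training click-through rate for every row."""

    def __init__(self):
        self.ctr = 0.0

    def fit(self, rows):
        self.ctr = sum(r["clicked"] for r in rows) / len(rows)

    def predict(self, row):
        return self.ctr

    def to_dict(self):
        return {"ctr": self.ctr}

class ModelFramework:
    """Boilerplate shared by all models: read and split, fit, save, test."""

    def __init__(self, model, test_fraction=0.25, seed=0):
        self.model = model
        self.test_fraction = test_fraction
        self.seed = seed

    def read(self, rows):
        rows = list(rows)
        random.Random(self.seed).shuffle(rows)
        cut = int(len(rows) * (1 - self.test_fraction))
        return rows[:cut], rows[cut:]

    def test(self, rows):
        # Mean squared error between predicted click probability and the label.
        return sum((self.model.predict(r) - r["clicked"]) ** 2 for r in rows) / len(rows)

    def run(self, rows):
        train, test = self.read(rows)
        self.model.fit(train)
        saved = json.dumps(self.model.to_dict())
        return {"saved_model": saved, "test_mse": self.test(test)}

rows = [{"clicked": i % 2} for i in range(8)]
result = ModelFramework(GlobalCtrModel()).run(rows)
```

Swapping in a different model only requires another `Model` subclass; the split, save and test steps stay untouched.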
Ensemble modeling
• Implementing multiple models
Models that use different features or algorithms
• Combine the models
Run the models one after the other, or combine their output scores
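Both ways of combining models can be sketched briefly. The `cascade` and `blend` helpers, the weights and the candidate scores below are illustrative assumptions, not the framework's actual ensemble logic.

```python
def cascade(candidates, first_score, second_score, keep):
    """Run models one after the other: shortlist with the first, rank with the second."""
    shortlisted = sorted(candidates, key=first_score, reverse=True)[:keep]
    return sorted(shortlisted, key=second_score, reverse=True)

def blend(candidates, scorers, weights):
    """Combine output scores with a weighted sum and rank by the result."""
    def combined(candidate):
        return sum(w * s(candidate) for s, w in zip(scorers, weights))
    return sorted(candidates, key=combined, reverse=True)

def content_score(candidate):
    return candidate["content"]

def behaviour_score(candidate):
    return candidate["behaviour"]

# Hypothetical candidates with precomputed per-model scores.
candidates = [
    {"id": "a", "content": 0.9, "behaviour": 0.1},
    {"id": "b", "content": 0.6, "behaviour": 0.8},
    {"id": "c", "content": 0.2, "behaviour": 0.9},
]
cascaded = [c["id"] for c in cascade(candidates, content_score, behaviour_score, keep=2)]
blended = [c["id"] for c in blend(candidates, [content_score, behaviour_score], [0.5, 0.5])]
```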
Bridging research and production
Traffic Split: Control / Treatment
Evaluation Criteria: engagement & monetization metrics
Statistical tests
Models productization
Fast and easy A/B testing
Promising offline metrics lead to a new A/B test variant using the model.
The Spark job ends by updating the A/B test configuration in the DB
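Two building blocks of an A/B test can be sketched as follows: a deterministic traffic split and a significance check on click-through rates. The hash-based bucketing and the two-proportion z-test are common choices assumed here for illustration; the slides do not name the statistical test that is actually used.

```python
import hashlib
import math

def assign_variant(user_id, treatment_share=0.5):
    """Deterministic traffic split: the same user always lands in the same variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference in click-through rate between two variants."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: control (A) versus treatment (B).
z, p_value = two_proportion_z(1000, 100000, 1100, 100000)
```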
Models productization
Automatic coefficients tuning
After a model has proven value, we want to update it regularly and keep learning based on new data.
• Periodic model creation
• Business metrics validation
• Production configuration update
Verifying positive KPIs
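The three steps above (periodic model creation, business metrics validation, production configuration update) suggest a simple gate: a freshly trained model replaces the serving one only if its simulated KPIs hold up. The sketch below assumes higher-is-better KPIs with positive baselines; the function name, config fields and metric names are hypothetical.

```python
def validate_and_promote(config, candidate_id, candidate_kpis, baseline_kpis, min_lift=0.0):
    """Return an updated config only if every baseline KPI is matched or beaten."""
    for name, baseline in baseline_kpis.items():
        candidate = candidate_kpis.get(name)
        if candidate is None or candidate < baseline * (1.0 + min_lift):
            return config  # validation failed: keep serving the current model
    return dict(config, model_id=candidate_id)

config = {"variant": "B", "model_id": "model-week-1"}
baseline = {"ctr": 0.010, "rpm": 2.0}
better = validate_and_promote(config, "model-week-2", {"ctr": 0.011, "rpm": 2.1}, baseline)
worse = validate_and_promote(config, "model-week-2", {"ctr": 0.012, "rpm": 1.8}, baseline)
```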
Takeaways
High scale modeling – prepare your data well!
Effective research cycle – requires good Big Data and ML infra
Connect research and production - easy A/B testing and model updates
Outbrain Tech Blog - Taking the pain out of Data Science
Thank You
