1
XGBoost @ Fyber
From Theory to Production
About Us
Fyber at a Glance
San Francisco | New York | London | Berlin | Tel Aviv | Beijing
Publicly Traded
FBEN Frankfurt
350+ Employees
50% of employees are R&D & Product
7 Offices
Berlin | Tel Aviv | San Francisco | New York | London | Beijing | Korea
How big is our Big Data?
10B Auctions Per Day
150M DAU
250B Bid Requests Per Day
10K+ Apps
300TB Generated Monthly
300 User-Level Dimensions
80 Reported Dimensions (on real-time reporting)
65 Reported Metrics
5
The Goal: Maximize the value of our data
■ Technologies used (Spark, Druid, Presto, and more...)
■ Analysis on new & existing data products and algorithms
■ Implementing product releases and A/B testing on main products (i.e. “groups” by users)
■ Creating dashboards for existing and new products (both business & tech oriented)
■ Product research & POCs (e.g. MLeap)
■ Algorithms Development
A taste of Data Science @ Fyber
6
Our main Use-Cases for this session
Two main use-cases which XGBoost was implemented for:
Audience Vault Reach
(Fyber Marketplace)
CTR Prediction
(full model based on
Criteo’s AI Lab use
case, Offerwall)
7
Why XGBoost?
■ State of the art results on data competitions
■ Works great with tabular data
■ Works great with Spark and big data
■ Combines several ML optimization methodologies (boosting, bagging)
■ Good time doing feature engineering 😎
Decision Trees
9
Decision Trees - Definition
■ A decision tree builds classification or regression models in the form of a tree structure
■ A decision tree breaks down a dataset into smaller and smaller subsets
■ A decision tree can be viewed as a “divide and conquer” algorithm
10
Decision Trees - Steps
■ Take the entire data set as input
■ Search for a split that maximizes the “separation” of the classes, using information gain or the Gini index
■ Apply the split
■ Again, search for a beneficial split
■ Stop when you meet some stopping criterion
■ Evaluate the tree on test data
11
Decision Trees - Information Gain
■ We want to determine which attribute in a given set of training feature vectors is most useful for
discriminating between the classes to be learned
■ Information gain tells us how important a given attribute of the feature vector is
Information gain = Entropy (parent) – Average Entropy (children)
https://homes.cs.washington.edu/~shapiro/EE596/notes/InfoGain.pdf
12
Decision Trees - Entropy (Example#1)
“Lack of order or predictability” (Google)
A common way to measure impurity in a group of samples.
Values range from 0 to 1 - 0: pure, 1: not pure
Entropy = -∑ pi * log2(pi)
Example:
■ 16/30 are green circles
■ 14/30 are pink crosses
■ “Greens” log2 calculation: log2(16/30) ≈ -0.9
■ “Pinks” log2 calculation: log2(14/30) ≈ -1.1
■ Entropy = -(16/30)*(-0.9) - (14/30)*(-1.1) ≈ 0.99
https://homes.cs.washington.edu/~shapiro/EE596/notes/InfoGain.pdf
13
Decision Trees - Entropy (Example #2)
A common way to measure impurity in a group of samples.
Entropy = -∑ pi * log2(pi)
Example:
■ 0/30 are green circles
■ 30/30 are pink crosses
■ “Greens” term: log2(0/30) is undefined, but the 0 * log2(0) term is taken as 0
■ “Pinks” log2 calculation: log2(30/30) = log2(1) = 0
■ Entropy = -(30/30)*(0) = 0
https://homes.cs.washington.edu/~shapiro/EE596/notes/InfoGain.pdf
14
Decision Trees - Information Gain
Information gain =
Entropy (parent) – Average Entropy (children)
https://homes.cs.washington.edu/~shapiro/EE596/notes/InfoGain.pdf
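As a quick illustration (not part of the original deck), here is a minimal Scala sketch that computes entropy and information gain from class counts; the child split counts below are hypothetical:

def entropy(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * (math.log(p) / math.log(2))   // -pi * log2(pi)
  }.sum
}

// Information gain = Entropy(parent) - weighted average Entropy(children)
def infoGain(parent: Seq[Int], children: Seq[Seq[Int]]): Double = {
  val total = parent.sum.toDouble
  val avgChildEntropy = children.map(c => (c.sum / total) * entropy(c)).sum
  entropy(parent) - avgChildEntropy
}

entropy(Seq(16, 14))                                 // ≈ 0.99 (Example #1)
entropy(Seq(0, 30))                                  // = 0.0  (Example #2: a pure node)
infoGain(Seq(16, 14), Seq(Seq(12, 1), Seq(4, 13)))   // gain for a hypothetical split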
15
Decision Trees - Special Parameters
■ maxDepth: Maximum depth of the tree
■ minInfoGain: Minimum gain for a split to be
considered at a tree node
■ minInstancesPerNode: Minimum number of
instances each child must have after split
■ algo: type of decision tree (Classification /
Regression)
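These parameter names match Spark MLlib's decision tree estimators; a minimal sketch, assuming the DataFrame API (the “algo” choice corresponds to picking DecisionTreeClassifier vs. DecisionTreeRegressor):

import org.apache.spark.ml.classification.DecisionTreeClassifier

val dt = new DecisionTreeClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setImpurity("entropy")        // split criterion: "entropy" (info gain) or "gini"
  .setMaxDepth(5)                // maxDepth
  .setMinInfoGain(0.01)          // minInfoGain
  .setMinInstancesPerNode(10)    // minInstancesPerNode

// val model = dt.fit(trainingDF)   // trainingDF is an assumed DataFrame with "features" / "label"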
16
Decision Tree - main problems
The tree is too complicated -
Variance / Overfitting
The tree is too basic -
Bias / Underfitting
17
The tree is too basic - Underfitting / Bias
The tree is too complicated - Overfitting / Variance
Introducing XGBoost
Introducing XGBoost
18
XGBoost
■ Random Forest
■ Bagging
■ Gradient boosting
19
Solution 1: Random Forest
Aimed to reduce variance
20
Solution 2: Bagging
Aimed to reduce variance
21
Solution 3: Gradient Boosting
Aimed to reduce bias
■ Build a very basic model using the mean
■ Calculate the error (residual) for every data point
■ Use the calculated error as the new label
■ Try to predict the new label using a tree (a learning rate is a must)
■ Use the learning rate in order to avoid overshooting
■ Repeat the process
22
Solution 3:
Gradient
Boosting
Aimed to reduce bias
Height Food Gender Weight Kg
192 Pizza Male 88
166 Pasta Female 76
182 Pasta Male 80
175 Pizza Male 73
160 Pizza Female 77
165 Pizza Female 57
23
Solution 3 : Gradient Boosting
Aimed to reduce bias
Height Food Gender Weight Kg Error
192 Pizza Male 88 88-71.2 =16.8
166 Pasta Female 76 76-71.2 = 4.8
182 Pasta Male 80 80-71.2 = 8.8
175 Pizza Male 73 73-71.2 = 1.8
160 Pizza Female 77 77-71.2 = 5.8
165 Pizza Female 57 57-71.2 =-14.2
Average : 71.2 Kg
This is our basic model
24
Solution 3 : Gradient Boosting
Aimed to reduce bias
The error is our new label, so our
model will try to predict the error
25
Solution 3 : Gradient Boosting
Aimed to reduce bias
Starting point: the basic model (mean = 71.2)
First tree: Height >= 175 → 15, Height < 175 → 2
26
Height | Weight | Old model | Old Error (new label) | New model | New Error
192 | 88 | 71.2 | 16.8 | 71.2 + 15*0.1 = 72.7 | 15.3
166 | 76 | 71.2 | 4.8 | 71.2 + 2*0.1 = 71.4 | 4.6
182 | 80 | 71.2 | 8.8 | 71.2 + 15*0.1 = 72.7 | 7.3
170 | 73 | 71.2 | 1.8 | 71.2 + 2*0.1 = 71.4 | 1.6
160 | 77 | 71.2 | 5.8 | 71.2 + 2*0.1 = 71.4 | 5.6
165 | 57 | 71.2 | -14.2 | 71.2 + 2*0.1 = 71.4 | -14.4
First tree split (starting point: mean = 71.2): Height >= 175 → 15, Height < 175 → 2
Based on Learning Rate = 0.1
Solution 3 : Gradient Boosting
Aimed to reduce bias
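The same update rule can be written as a few lines of Scala; this is a minimal sketch of one boosting step using the toy data and the single split shown above (71.2 is the base prediction used in the slides):

val data = Seq((192.0, 88.0), (166.0, 76.0), (182.0, 80.0),
               (170.0, 73.0), (160.0, 77.0), (165.0, 57.0))   // (height, weight)

val eta  = 0.1    // learning rate
val base = 71.2   // the basic model: the mean prediction used in the slides

// First tree, fitted on the residuals: Height >= 175 -> 15, Height < 175 -> 2
def treePredict(height: Double): Double = if (height >= 175) 15.0 else 2.0

val step = data.map { case (height, weight) =>
  val oldError      = weight - base                        // residual = the new label
  val newPrediction = base + eta * treePredict(height)     // old model + eta * tree
  val newError      = weight - newPrediction
  (height, weight, oldError, newPrediction, newError)
}
// First row ≈ (192.0, 88.0, 16.8, 72.7, 15.3), matching the table above.
// Repeating this step with more small trees keeps shrinking the residuals (reducing bias).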
27
XGBoost - Examples of Hyper Parameters
■ maxDepth: Maximum depth of the tree
■ numRound: the number of boosting rounds (trees) that are built
■ objective: what kind of prediction we want to perform (classification, regression, ranking)
■ eta: Learning Rate
■ colsample_bytree: the ratio of columns (features) that will be used by every tree
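For reference, a hedged sketch of how these hyperparameters map onto XGBoost4J-Spark (names follow the XGBoost parameter convention; values are illustrative, not Fyber's settings):

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val xgbParams = Map(
  "max_depth"        -> 6,                  // maxDepth
  "num_round"        -> 100,                // numRound: number of boosting rounds (trees)
  "objective"        -> "binary:logistic",  // the learning task (classification here)
  "eta"              -> 0.1,                // learning rate
  "colsample_bytree" -> 0.8                 // ratio of columns sampled per tree
)

val xgb = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features")
  .setLabelCol("label")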
Fyber’s XGBoost
Use Cases
29
Our main Use-Cases
Two main use-cases which XGBoost was implemented for:
Audience Vault Reach
(Fyber Marketplace)
CTR Prediction
(full model based on
Criteo’s AI Lab use
case, Offerwall)
30
A word about XGBoost with Spark
■ XGBoost4J Latest stable release - May 2019
■ Allows processing of huge amounts of data
■ The project is constantly being updated, stabilized and many features are being added
■ Supports Java, Scala (Spark)
■ Soon: XGBoost with PySpark
■ Spark ML framework (MLlib) functionality integrates smoothly with XGBoost. It contains:
○ String Indexer
○ One Hot Encoding
○ Vector Assembler
○ Tokenizer
○ And many more...
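A minimal sketch of those building blocks (column names are illustrative; depending on your Spark version the one-hot step may be OneHotEncoderEstimator instead):

import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}

val indexer = new StringIndexer()
  .setInputCol("country")            // string category -> numeric index
  .setOutputCol("countryIndex")

val encoder = new OneHotEncoder()
  .setInputCol("countryIndex")       // numeric index -> one-hot vector
  .setOutputCol("countryVec")

val assembler = new VectorAssembler()
  .setInputCols(Array("countryVec", "requests7Days"))
  .setOutputCol("features")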
31
Audience
Vault
32
Old Situation:
■ The Audience Vault presented only past data; the audience reach estimate was not accurate at all
New Solution:
■ A model that integrates easily into the Audience Vault and presents the estimated audience reach for the next 14 days
■ Audience reach needs to be easily presented regardless of which filter is chosen (countless possibilities)
XGBoost with Spark enabled us to perform predictions on hundreds of millions of users
Audience Vault Reach
33
Account Managers
Need to sell audiences to customers, so they MUST know the relevant audience size
External Clients
Would like to target audiences through the Fyber Marketplace, and therefore MUST know the audience size
Target Audience
34
How was it done? Pipeline Preparation
Data Preparation → Feature Engineering → Vector Assembler → Model Preparation → Training → Transformation
35
Data Pre-processing
Load Relevant Data: Historical Data (30 days) + Label (14 days)
Feature Engineering: “Active per ${X} days” features, “Requests per ${X} days” features
36
How was it done?
Data Pre-processing
Device Id sumRequests30Days sumRequests14Days sumRequests7Days sumActive30Days sumActive14Days sumActive7Days
1 1 0 0 1 0 0
2 102 89 89 7 3 3
3 1 1 0 1 1 0
4 23 7 2 12 4 1
5 26 8 0 6 2 0
6 214 117 15 8 6 3
37
Model preparation
Vector Assembler
VectorAssembler is a Spark transformer that combines a given list of columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile", "userFeatures"))
  .setOutputCol("features")
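Applying it is then a single transform call (sketch; df is an assumed input DataFrame with the columns above):

val assembled = assembler.transform(df)   // adds the "features" vector column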
38
XGBoostRegressor
An instance of the XGBoost object which is used for Regression and Classification tasks
39
XGBoostRegressor
Going through a few of the parameters that were used as part of the “tweaking” (aka “hyperparameter tuning”):
■ eta - learning rate of the model (usually: 0 < eta < 1) - step size shrinkage for gradient boosting
■ max_depth - maximum depth per tree (as part of the whole model). Deeper trees are more prone to overfitting
■ subsample - ratio (0 to 1) of training instances sampled per tree, meant to prevent overfitting (encourages variance between trees)
■ colsample_bytree - subsample ratio (0 to 1) of columns (features) used when building each tree
■ objective - the “learning task” of the model. In our case, logistic regression probability (0 <= P(X) <= 1) was used, with a
label of 0 / 1 (0 - the user wasn’t active during the “next” 14 days, 1 - the user was active during the “next” 14 days)
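A hedged configuration sketch of the above in XGBoost4J-Spark (values are illustrative, not the production settings):

import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

val reachRegressor = new XGBoostRegressor()
  .setEta(0.05)                      // learning rate / shrinkage
  .setMaxDepth(6)                    // deeper trees are more prone to overfitting
  .setSubsample(0.8)                 // fraction of rows sampled per tree
  .setColsampleBytree(0.8)           // fraction of columns sampled per tree
  .setObjective("binary:logistic")   // outputs P(user is active in the next 14 days)
  .setNumRound(200)
  .setFeaturesCol("features")
  .setLabelCol("label")              // 1 = active in the next 14 days, 0 = not active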
40
ML Pipelines
The goal: Combine multiple steps / algorithms into a single pipeline / workflow
■ A Pipeline chains multiple actions together to specify an ML workflow
■ Pipeline stages are specified in an ordered way
■ Persistence - we can save and load entire pipelines for future use
Putting it all together
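A minimal sketch, assuming the assembler and regressor from the earlier slides as stages (paths and DataFrames are placeholders):

import org.apache.spark.ml.{Pipeline, PipelineModel}

val pipeline = new Pipeline().setStages(Array(assembler, reachRegressor))

val pipelineModel = pipeline.fit(trainDF)             // Training
val predictions   = pipelineModel.transform(newDF)    // Transformation

pipelineModel.write.overwrite().save("/models/audience-reach")   // Persistence
val reloaded = PipelineModel.load("/models/audience-reach")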
41
CTR Prediction
(Offer Wall)
42
Old Situation:
● Offer scores and ranks were set by old manual parameters and configurations, and therefore did not reflect the real performance
New Solution:
● An ML model that automatically ranks the relevant offers based on their attributes, and is therefore able to estimate the relevant “target” a lot better (i.e. “best in geo XYZ”, “best in application 123”, etc.)
CTR Prediction
(Offer Wall)
43
Pipeline Preparation
How was it done?
Data Preparation → Feature Engineering → String Indexing → One Hot Encoding → Vector Assembler → Model Preparation → Training → Transformation
44
Data Pre-processing
● Raw data - the raw events are the most important, as they help you become familiar with the data, the trends and the outliers, and they contain tons of “magic”
Timestamp Advertisement_Id Application_Id User_country_code Event
2019-09-03 2:24:59 1296881 40647 US Impression
2019-09-03 2:25:03 1303994 40647 US Impression
2019-09-03 2:25:38 1288117 110226 US Impression
2019-09-03 2:25:39 1303946 119548 SA Impression
2019-09-03 2:25:58 1252617 106241 KG Impression
45
● Insights from the raw data -
○ As with every “live-traffic” product, there are tons of “long tail” applications and advertisements that don’t generate significant data - should focus (“OOV”, for example)
○ Timestamp is our “master” for time-series data, as we can analyze what the KPIs were at a specific point in time - should focus
○ As we’re dealing with CTR (i.e. clicks / imps), we don’t focus on outliers (i.e. when clicks > imps) - should investigate and choose next steps
○ Special cases where dimensions that are not relevant to the product are ingested into the raw data - should investigate and choose next steps
46
● Feature Engineering
○ Window Functions - used to calculate KPIs (specifically impressions and clicks) over several time windows within the training dataset
○ GroupBy queries - in order to have “different” insights on the label (CTR) - CTR by n dimensions is somehow related to CTR by n - 1, n - 2, …, dimensions
○ Normalization - this technique was used because our main features didn’t share the same scale of values. Some values ranged 100-1000, others 0-1, and we had to scale them down. The goal was to have a value between 0 and 1. Normalization was done using MinMax normalization, though there are other techniques as well (using the standard deviation, mean, ...)
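A hedged sketch of these three techniques in Spark (events is an assumed DataFrame with per-day impression / click counts; column names and window sizes are illustrative):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.ml.feature.{VectorAssembler, MinMaxScaler}

// Window functions: rolling KPIs per advertisement over the last 7 "days"
val w = Window.partitionBy("Advertisement_Id").orderBy("day").rowsBetween(-6, 0)
val withKpis = events
  .withColumn("imps7d",   sum("impressions").over(w))
  .withColumn("clicks7d", sum("clicks").over(w))

// GroupBy: a coarser view of the label, e.g. CTR per country
val ctrByGeo = withKpis.groupBy("User_country_code")
  .agg((sum("clicks") / sum("impressions")).as("geoCtr"))

// MinMax normalization: scale unevenly ranged features into [0, 1]
val rawVec = new VectorAssembler()
  .setInputCols(Array("imps7d", "clicks7d"))
  .setOutputCol("rawFeatures")
val scaler = new MinMaxScaler()
  .setInputCol("rawFeatures")
  .setOutputCol("scaledFeatures")
// val assembled = rawVec.transform(withKpis)
// val scaled    = scaler.fit(assembled).transform(assembled)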
47
Model preparation
How was it done?
XGBoostRegressor
48
XGBoostRegressor
● “Missing” - this is a representation for different strategies to handle missing values as part of your
dataset
More info can be found here, and official PDF here
● Ignore missing values
● “setHandleInvalid = skip” in VectorAssembler
● Handle missing values
● Change data accordingly (fill.na = $someNumber)
● “setHandleInvalid = keep” in VectorAssembler
● Add “missing = $someNumber” in XGBoostRegressor
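A minimal sketch of both strategies (featureCols, rawDF and the sentinel value are assumed placeholders):

import org.apache.spark.ml.feature.VectorAssembler
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

val featureCols = Array("imps7d", "clicks7d", "geoCtr")   // illustrative

// Option 1 - ignore rows with missing values
val assemblerSkip = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .setHandleInvalid("skip")            // drop rows containing null / NaN

// Option 2 - keep missing values and tell XGBoost how they are encoded
val filled = rawDF.na.fill(-999.0, featureCols)   // fill nulls with a sentinel value
val assemblerKeep = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .setHandleInvalid("keep")
val regressorWithMissing = new XGBoostRegressor()
  .setMissing(-999.0f)                 // the value XGBoost should treat as "missing"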
49
Model
Scores
50
XGBoost Feature Importance (via Information Gain)
Understand how good or bad your features are, and act accordingly
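A sketch of how the gain-based importance can be pulled out of a trained model; this assumes an xgboost4j version whose native Booster exposes getScore, and that featureCols holds the column names in VectorAssembler order (both are assumptions, not the deck's code):

// xgbModel is an assumed trained XGBoost4J-Spark model (regression or classification)
val gainByIndex = xgbModel.nativeBooster.getScore("", "gain")   // e.g. Map("f0" -> 123.4, ...)

// Map the native "f<index>" names back to the assembled column order
val featureCols = Array("sumRequests30Days", "sumRequests14Days", "sumActive30Days")   // illustrative
val gainByName  = gainByIndex.map { case (f, gain) => featureCols(f.drop(1).toInt) -> gain }

gainByName.toSeq.sortBy(-_._2).foreach { case (name, gain) => println(s"$name -> $gain") }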
Code Session
■ GitHub Repo