Why Neural Net Field Aware Factorization Machines are able to break ground in digital behaviours prediction
Presenter: Gunjan Sharma
Co-Author: Varun Kumar Modi
About the Authors
Presenter: Gunjan Sharma
System Architect @ InMobi (3 years)
SE @Facebook (2.5 Years)
DPE @Google (1 year)
Twitter Handle: @gunjan_1409
LinkedIn: https://www.linkedin.com/in/gunjan-sharma-a6794414/
Co-author: Varun Kumar Modi
Sr Research Scientist @ InMobi(5 years)
LinkedIn: https://www.linkedin.com/in/varun-modi-33800652/
Content
1) The problem and context
2) The Motivation
3) Building the model theory: piece by piece
4) Results of the 2 use cases
5) Understanding exactly why it works
6) Implementation at InMobi scale
InMobi is one of the largest advertising platforms at scale globally
InMobi reaches >2 billion MAU across the world and specialises in mobile in-app advertising
[Map: InMobi's regional presence: North America, Latin America, EMEA, Africa, India + SEA, China, Japan, Korea, ANZ]
Map callouts (regions labelled on the slide: North America (only video), APAC, China):
● Consolidation has taken place to clean up the ecosystem; few advertising platforms at scale exist
● Very limited number of players have a presence in Asia; InMobi is dominating
● Few players control each component of the chain; no presence of global players, except InMobi
Problem statement and why it matters
● What are the problems:
Use case 1 - Conversion ratio (CVR) prediction:
- CVR = install rate of users = probability of an install given a click
- Usage: CPM = CTR * CVR * CPI
Use case 2 - Video completion rate (VCR) prediction:
- Video completion rate of users watching advertising videos, given a click
● Why are they important:
○ Performance business, based on arbitrage, so the model directly determines the margin/profit of the
business and the ability of the campaign to achieve significant scale => multi-million dollar
businesses!
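As a worked example of the usage formula above, the sketch below multiplies the three factors. All numbers are made up for illustration and the function name `expected_value_per_impression` is hypothetical. CTR * CVR * CPI is the expected payout of a single impression; scaling by 1000 to quote it per thousand impressions is a common convention that is assumed here, not something the slide states.

```python
def expected_value_per_impression(ctr: float, cvr: float, cpi: float) -> float:
    """P(click) * P(install | click) * payout per install."""
    return ctr * cvr * cpi

# Illustrative numbers only (not InMobi data).
ctr = 0.02   # 2% of impressions are clicked
cvr = 0.01   # 1% of clicks lead to an install
cpi = 2.50   # payout per install

per_impression = expected_value_per_impression(ctr, cvr, cpi)
per_thousand = 1000 * per_impression  # assumed per-mille convention
print(f"{per_impression:.6f} per impression, {per_thousand:.2f} per 1000 impressions")
```

Because the three factors multiply, a relative error in the predicted CVR carries straight through to the bid value, which is why the prediction quality maps directly onto margin.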
Existing context and challenges
● Models traditionally used Linear/Logistic Regression and Tree-based models
● Both have their strengths and weaknesses when used in production
● What we need is an awesome model that sits somewhere in the middle and
can bring in the best of both worlds
LR | Tree Based
Generalises to unseen combinations | Could not generalise to unseen combinations in our use cases
Can underfit at times | Can overfit at times
Requires less RAM | Can bloat RAM usage at times, especially with high cardinality features
Why think of NN for CVR/VCR prediction
● Using cross features in LR wasn’t cutting it for us.
● Plus at some point it starts to become cumbersome both at training and
prediction time.
● All the major predictions noted here follow a complex curve
● LR left much to be desired compared to Tree based models, for example because
interaction terms are limited
● We tried a couple of awesome models that were also not able to beat Tree
based models
We all agreed that Neural Nets are a suitable technology to find higher order
interactions between our features.
At the same time they have the power of generalising to unseen combinations.
Challenges Involved
● Traditionally NNs are more utilized for classification problems
● We want to model our predictions as a regression problem
● Most of the features are categorical, which means we need to use one-hot
encoding
● This causes NNs to give very bad results, as they need a lot of data to train
efficiently.
● Plus the cardinality of some features is very high, which makes life more troublesome.
● Model should be easy to productionise for both training and serving
● Spark isn’t suited for custom NNs.
● Model should be as debuggable as possible, to be able to explain the
business changes
● The resistance to using NNs for a long time came from the lack of
understanding of their internals
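To make the one-hot and cardinality point concrete, here is a minimal sketch in plain Python. The vocabularies are borrowed from the dummy dataset that appears later in this deck, and the helper name `one_hot` is hypothetical. The encoded row has one position per distinct feature value, so its length grows with the total cardinality of all fields.

```python
# Vocabularies borrowed from the deck's dummy dataset (illustrative only).
vocab = {
    "publisher": ["ESPN", "CNBC", "Sony"],
    "advertiser": ["Adidas", "Nike", "Coke", "P&G"],
    "gender": ["Male", "Female"],
}

def one_hot(row: dict) -> list:
    """Concatenate one indicator block per field; length = sum of cardinalities."""
    encoded = []
    for field, values in vocab.items():
        encoded.extend(1 if row[field] == v else 0 for v in values)
    return encoded

x = one_hot({"publisher": "ESPN", "advertiser": "Nike", "gender": "Male"})
print(x)  # [1, 0, 0, 0, 1, 0, 0, 1, 0]
```

With three small fields the row has 9 entries; a field with millions of distinct values would add millions of mostly-zero inputs, which is the problem the bullets above describe.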
Consider the following dummy dataset
Publisher Advertiser Gender CVR
ESPN Nike Male 0.01
CNBC Nike Male 0.0004
ESPN Adidas Female 0.008
Sony Coke Female 0.0005
Sony P&G Male 0.002
Factorization Machine (FM) - What are those
One-hot columns: ESPN CNBC SONY | Adi Nike Coke P&G | Male Female
Publisher Advertiser Gender CVR
ESPN Nike Male 0.01
One-hot row: 1 0 0 | 0 1 0 0 | 1 0
Publisher Latent Vector (PV) = (X0, X1, X2)
Advertiser Latent Vector (AV) = (Y0, Y1, Y2)
Gender Latent Vector (GV) = (Z0, Z1, Z2)
$\mathrm{PV}^{T}\mathrm{AV} + \mathrm{AV}^{T}\mathrm{GV} + \mathrm{GV}^{T}\mathrm{PV} = \mathrm{pCVR}$
NOTE: All vectors are K dimensional; K is a hyperparameter of the algorithm
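The FM formula on this slide can be sketched in a few lines of plain Python. The latent vector values below are made up (K = 3), and the helper names `dot` and `fm_second_order` are hypothetical; in a trained model the vectors would be learned from data.

```python
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fm_second_order(vectors):
    """Sum of dot products over every pair of active latent vectors."""
    return sum(dot(a, b) for a, b in combinations(vectors, 2))

# Made-up K=3 latent vectors for the active feature values (illustrative only).
PV = [0.10, -0.20, 0.05]   # publisher = ESPN
AV = [0.30, 0.10, -0.10]   # advertiser = Nike
GV = [0.05, 0.02, 0.10]    # gender = Male

p_cvr = fm_second_order([PV, AV, GV])  # PV.AV + PV.GV + AV.GV
print(round(p_cvr, 6))
```

Only the vectors of the feature values present in the row take part, so a prediction touches three K-dimensional vectors regardless of how many distinct publishers or advertisers exist.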
Factorization Machine (FM) - What are those
● K dimensional representation for every feature value
● Captures second order interactions across all the features ($A^{T}B = |A| \cdot |B| \cdot \cos\Theta$)
● Essentially a combination of hyperbolas summed up to form the final
prediction
● Works better than LR but tree based models are still more powerful.
● EG: Predict movie’s revenue:
Features
Movie
City
Gender
Latent Features
Horror
Comedy
Action
Romance
Second Order Intuition
● For every latent feature
● For every pair of original features
● How much does this latent feature affect
revenue when considering this pair?
Final predicted revenue is a linear sum over
all latent features
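For reference, the general second-order FM model from the FM paper listed in the references can be written as below. The deck's three-term formula corresponds to the pairwise sum for a row with exactly three active one-hot features, with the bias $w_0$ and the linear terms left out; the symbols $n$ (number of one-hot columns) and $v_i$ (latent vector of column $i$) are notation introduced here, not taken from the slides.

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j,
\qquad
\langle v_i, v_j \rangle = \sum_{f=1}^{K} v_{i,f} \, v_{j,f}
```

Each inner product sums over the K latent dimensions, which matches the "linear sum over all latent features" intuition above.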
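Field aware Factorization Machine (FFM)
One-hot columns: ESPN CNBC SONY | Adi Nike Coke P&G | Male Female
Publisher Advertiser Gender CVR
ESPN Nike Male 0.01
One-hot row: 1 0 0 | 0 1 0 0 | 1 0
$\mathrm{PV}_{A} = (X^{A}_{0}, X^{A}_{1}, X^{A}_{2})$: Publisher latent vector for the Advertiser field
$\mathrm{PV}_{G} = (X^{G}_{0}, X^{G}_{1}, X^{G}_{2})$: Publisher latent vector for the Gender field
$\mathrm{AV}_{P} = (Y^{P}_{0}, Y^{P}_{1}, Y^{P}_{2})$: Advertiser latent vector for the Publisher field
$\mathrm{AV}_{G} = (Y^{G}_{0}, Y^{G}_{1}, Y^{G}_{2})$: Advertiser latent vector for the Gender field
$\mathrm{GV}_{P} = (Z^{P}_{0}, Z^{P}_{1}, Z^{P}_{2})$: Gender latent vector for the Publisher field
$\mathrm{GV}_{A} = (Z^{A}_{0}, Z^{A}_{1}, Z^{A}_{2})$: Gender latent vector for the Advertiser field
$\mathrm{PV}_{A}^{T}\mathrm{AV}_{P} + \mathrm{AV}_{G}^{T}\mathrm{GV}_{A} + \mathrm{GV}_{P}^{T}\mathrm{PV}_{G} = \mathrm{pCVR}$
NOTE: All vectors are K dimensional; K is a hyperparameter of the algorithm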
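The field-aware formula on this slide can be sketched as follows. The dictionary layout `latent[(field, value, other_field)]`, the helper names, and all numeric values are assumptions made for illustration; the only thing taken from the slide is that each feature value keeps a separate K-dimensional vector per other field, and that each pair of fields contributes one dot product.

```python
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ffm_second_order(row, latent):
    """row: {field: value}; latent[(field, value, other_field)] -> K-dim list."""
    total = 0.0
    for f, g in combinations(row, 2):
        v_fg = latent[(f, row[f], g)]  # f's value, as seen by field g
        v_gf = latent[(g, row[g], f)]  # g's value, as seen by field f
        total += dot(v_fg, v_gf)
    return total

row = {"publisher": "ESPN", "advertiser": "Nike", "gender": "Male"}
# Made-up K=3 vectors (illustrative only).
latent = {
    ("publisher", "ESPN", "advertiser"): [0.1, 0.2, 0.0],  # PVA
    ("publisher", "ESPN", "gender"):     [0.0, 0.1, 0.1],  # PVG
    ("advertiser", "Nike", "publisher"): [0.3, 0.1, 0.5],  # AVP
    ("advertiser", "Nike", "gender"):    [0.1, 0.0, 0.2],  # AVG
    ("gender", "Male", "publisher"):     [0.2, 0.1, 0.0],  # GVP
    ("gender", "Male", "advertiser"):    [0.1, 0.3, 0.1],  # GVA
}
p_cvr = ffm_second_order(row, latent)  # PVA.AVP + PVG.GVP + AVG.GVA
print(round(p_cvr, 6))
```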
Field aware Factorization Machine (FFM)
● We have a K dimensional vector for every feature value for every other feature
type
● Still second order interactions but with more degrees of freedom than FM
● Intuition: Latent features interact with every other cross feature differently
Works significantly better than FM, but at certain cuts was still not able to beat
Tree based model
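The "more degrees of freedom" point can be made concrete by counting latent parameters. This is a rough sketch: it counts only latent vectors (no bias or linear weights), it follows the deck's description of one vector per feature value per other field, and the function names are hypothetical.

```python
def fm_param_count(cardinalities, k):
    """One K-dim vector per feature value."""
    return sum(cardinalities) * k

def ffm_param_count(cardinalities, k):
    """One K-dim vector per feature value per other field."""
    other_fields = len(cardinalities) - 1
    return sum(cardinalities) * other_fields * k

# Dummy dataset: 3 publishers, 4 advertisers, 2 genders; K = 3.
cardinalities = [3, 4, 2]
k = 3
print(fm_param_count(cardinalities, k), ffm_param_count(cardinalities, k))  # 27 54
```

The FFM count grows with the number of fields as well as with cardinality, which is the price paid for the extra flexibility.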
Deep neural-net with Factorisation Machine: DeepFM
Sigmoid(FM + NeuralNet(PV :+ AV :+ GV)) = pCVR
DeepFM
● Now we are entering the neural net world
● This model is a combination of FM and NN and the final prediction is sum of
the output from the 2 models
● Here we optimize the entire graph together.
● It performs better than using the latent vectors from FM and then running
them through neural net as a secondary optimization (FNN)
● It performs better than FM but not better than FFM
● Intuition: FM finds the second order interactions while neural net uses the
latent vectors to find the higher order nonlinear interactions.
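A forward pass for the DeepFM combination described above might look like the sketch below. It is written in plain Python to stay dependency-free and is not the authors' TensorFlow code. Reading `:+` as vector concatenation is an assumption, and the layer sizes, weights, and helper names (`dense`, `deepfm_forward`) are made up for illustration.

```python
import math
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense(x, weights, biases, relu=True):
    """One fully connected layer; weights holds one row per output unit."""
    out = [dot(row, x) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def deepfm_forward(vectors, hidden, output):
    """Sigmoid(FM + NeuralNet(concatenated latent vectors))."""
    fm = sum(dot(a, b) for a, b in combinations(vectors, 2))  # second-order part
    x = [v for vec in vectors for v in vec]                   # assumed meaning of PV :+ AV :+ GV
    deep = dense(dense(x, *hidden), *output, relu=False)[0]   # higher-order part
    return 1.0 / (1.0 + math.exp(-(fm + deep)))

# Made-up K=3 latent vectors and tiny made-up weights (illustrative only).
PV, AV, GV = [0.10, -0.20, 0.05], [0.30, 0.10, -0.10], [0.05, 0.02, 0.10]
hidden = ([[0.1] * 9, [-0.1] * 9], [0.0, 0.0])  # 9 inputs -> 2 hidden units
output = ([[0.5, -0.5]], [0.0])                 # 2 hidden units -> 1 output
p_cvr = deepfm_forward([PV, AV, GV], hidden, output)
print(p_cvr)
```

Both parts read the same latent vectors, so when the whole graph is optimized together the gradients from the FM term and from the network both update them.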
Neural Factorization Machine: NFM
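$\mathrm{NeuralNet}\big((\mathrm{PV} \odot \mathrm{AV} + \mathrm{AV} \odot \mathrm{GV} + \mathrm{GV} \odot \mathrm{PV})^{T}\big) = \mathrm{pCVR}$, where $\odot$ is the element-wise product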
NFM
● In this architecture you only run the second order features through NN instead
of the raw latent vectors
● Intuition: The neural net takes the second order interactions and uses them to
find the higher order nonlinear interactions
● Performs better than DeepFM, which we attribute mostly to two facts:
○ The size of the net is smaller, hence it converges faster.
○ The neural net can take the second order interactions and convert them easily to higher order
interactions.
● Still not better than FFM
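The NFM idea of feeding pooled second-order interactions into the network can be sketched as follows, again in plain Python with made-up vectors and weights. The helper names are hypothetical, and no output activation is applied because the slide's formula does not show one.

```python
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense(x, weights, biases, relu=True):
    """One fully connected layer; weights holds one row per output unit."""
    out = [dot(row, x) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def bi_interaction(vectors):
    """Element-wise products of every pair of vectors, summed into one K-dim vector."""
    k = len(vectors[0])
    pooled = [0.0] * k
    for a, b in combinations(vectors, 2):
        for d in range(k):
            pooled[d] += a[d] * b[d]
    return pooled

def nfm_forward(vectors, hidden, output):
    return dense(dense(bi_interaction(vectors), *hidden), *output, relu=False)[0]

# Made-up K=3 latent vectors and weights (illustrative only).
PV, AV, GV = [0.10, -0.20, 0.05], [0.30, 0.10, -0.10], [0.05, 0.02, 0.10]
pooled = bi_interaction([PV, AV, GV])
hidden = ([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0])  # 3 inputs -> 2 hidden units
output = ([[1.0, 1.0]], [0.0])                             # 2 hidden units -> 1 output
p_cvr = nfm_forward([PV, AV, GV], hidden, output)
print(pooled, p_cvr)
```

The network input has only K entries instead of K per field, which is the "smaller net" point above; summing the pooled vector recovers the plain FM score.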
InMobi Spec: DeepFFM
[Architecture diagram: Sparse Features (Feature1, Feature2, Feature3); Dense Embeddings (F2E and F3E for Feature1, F1E and F3E for Feature2, F1E and F2E for Feature3); Hidden Layers; FF Machine; Act; Ypred]
InMobi Spec: DeepFFM
● A simple upgrade to DeepFM
● Performs better than both DeepFM and FFM
● Training is slower
● The FFM part does the majority of the prediction heavy lifting, evidently
due to faster gradient convergence.
● Intuition: Take the latent vectors, run them through the NN for higher order
interactions, and use FFM for second order interactions.
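A possible reading of the DeepFFM combination is sketched below in plain Python. Several things are assumptions rather than facts from the deck: the output activation ("Act") is taken to be a sigmoid, the network input is taken to be the concatenation of all field-aware embeddings, and the dictionary layout, weights, and helper names are made up.

```python
import math
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense(x, weights, biases, relu=True):
    """One fully connected layer; weights holds one row per output unit."""
    out = [dot(row, x) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def deepffm_forward(row, latent, hidden, output):
    fields = list(row)
    # Second-order part: one dot product per pair of fields (FFM).
    ffm = sum(dot(latent[(f, row[f], g)], latent[(g, row[g], f)])
              for f, g in combinations(fields, 2))
    # Higher-order part: all field-aware embeddings, concatenated (assumption).
    x = [v for f in fields for g in fields if g != f for v in latent[(f, row[f], g)]]
    deep = dense(dense(x, *hidden), *output, relu=False)[0]
    return 1.0 / (1.0 + math.exp(-(ffm + deep)))  # assumed sigmoid for "Act"

row = {"publisher": "ESPN", "advertiser": "Nike", "gender": "Male"}
latent = {  # made-up K=3 vectors (illustrative only)
    ("publisher", "ESPN", "advertiser"): [0.1, 0.2, 0.0],
    ("publisher", "ESPN", "gender"):     [0.0, 0.1, 0.1],
    ("advertiser", "Nike", "publisher"): [0.3, 0.1, 0.5],
    ("advertiser", "Nike", "gender"):    [0.1, 0.0, 0.2],
    ("gender", "Male", "publisher"):     [0.2, 0.1, 0.0],
    ("gender", "Male", "advertiser"):    [0.1, 0.3, 0.1],
}
hidden = ([[0.1] * 18], [0.0])  # 18 inputs -> 1 hidden unit
output = ([[1.0]], [0.0])       # 1 hidden unit -> 1 output
y_pred = deepffm_forward(row, latent, hidden, output)
print(y_pred)
```

With three fields and K = 3 the network already sees 18 inputs; that input width grows with the square of the number of fields, which is consistent with the slower training noted above.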
InMobi Spec: NFFM
[Architecture diagram: Sparse Features (Feature1, Feature2, Feature3); Dense Embeddings (F2E and F3E for Feature1, F1E and F3E for Feature2, F1E and F2E for Feature3); FF Machine; K inputs; Hidden Layers; Ypred]
InMobi Spec: NFFM
● A simple upgrade to NFM
● Performs significantly better than all the other models.
● Converges faster than DeepFFM
● Intuition: Take the second order interactions from FFM and run them through
neural net to find higher order nonlinear interactions.
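The NFFM intuition can be sketched in the same style. Following the "K inputs" label in the architecture diagram, this sketch sums the element-wise products of the field-aware vector pairs into a single K-dimensional vector before the hidden layers; that pooling choice, the missing output activation, the dictionary layout, and all numbers are assumptions made for illustration.

```python
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense(x, weights, biases, relu=True):
    """One fully connected layer; weights holds one row per output unit."""
    out = [dot(row, x) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def ffm_pooled(row, latent, k):
    """Element-wise FFM interactions summed over field pairs -> K inputs."""
    pooled = [0.0] * k
    for f, g in combinations(row, 2):
        v_fg, v_gf = latent[(f, row[f], g)], latent[(g, row[g], f)]
        for d in range(k):
            pooled[d] += v_fg[d] * v_gf[d]
    return pooled

def nffm_forward(row, latent, k, hidden, output):
    return dense(dense(ffm_pooled(row, latent, k), *hidden), *output, relu=False)[0]

row = {"publisher": "ESPN", "advertiser": "Nike", "gender": "Male"}
latent = {  # made-up K=3 vectors (illustrative only)
    ("publisher", "ESPN", "advertiser"): [0.1, 0.2, 0.0],
    ("publisher", "ESPN", "gender"):     [0.0, 0.1, 0.1],
    ("advertiser", "Nike", "publisher"): [0.3, 0.1, 0.5],
    ("advertiser", "Nike", "gender"):    [0.1, 0.0, 0.2],
    ("gender", "Male", "publisher"):     [0.2, 0.1, 0.0],
    ("gender", "Male", "advertiser"):    [0.1, 0.3, 0.1],
}
pooled = ffm_pooled(row, latent, 3)
hidden = ([[1.0, 1.0, 1.0]], [0.0])  # 3 inputs -> 1 hidden unit
output = ([[2.0]], [0.0])            # 1 hidden unit -> 1 output
y_pred = nffm_forward(row, latent, 3, hidden, output)
print(pooled, y_pred)
```

Compared with the DeepFFM sketch, the network input shrinks from one K-block per field pair member to just K values, which fits the faster convergence claimed above.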
Use case 1 - Results CVR
Accuracy function: $\frac{\sum_i W_i \cdot |Y^{act}_i - Y^{pred}_i|}{\sum_i W_i}$
Accuracy % improvement over Linear model (small DS):
Model | FFM | DeepFM | DeepFFM | NFFM
Improvement | 44% | 35% | 48% | 64%
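The accuracy function on this slide is a weighted mean absolute error and can be computed as in the sketch below. The sample values and the function name `weighted_abs_error` are made up; what the weights W represent is not stated on the slide.

```python
def weighted_abs_error(y_act, y_pred, weights):
    """sum_i W_i * |Yact_i - Ypred_i| / sum_i W_i"""
    num = sum(w * abs(a - p) for a, p, w in zip(y_act, y_pred, weights))
    return num / sum(weights)

# Made-up rows: actual CVR, predicted CVR, weight.
y_act = [0.01, 0.0004]
y_pred = [0.008, 0.001]
weights = [100, 300]
err = weighted_abs_error(y_act, y_pred, weights)
print(round(err, 6))
```

Lower is better for this quantity, so the percentages in the table are best read as reductions in this error relative to the linear model.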
Use case 1 - Results CVR
Training Data Dates | Test Date | Accuracy % Improvement over Linear Model
T1-T7 | T7 | 21%
T1-T7 | T8 | 14%
T2-T8 | T8 | 20%
T2-T8 | T9 | 14%
Cut | % Improvement over Tree model
Cut1 | 21.7%
Cut2 | 18.5%
Use case 2 - Results VCR
Error function (AEPV - Absolute Error Per View): $\frac{\sum_i \left[ (Views_i - Cmpltd_i) \cdot |Y^{pred}_i| + Cmpltd_i \cdot |1 - Y^{pred}_i| \right]}{\sum_i Views_i}$
% AEPV improvement by Country OS cut over the last-7-day average model:
Cut | Logistic Reg | Logistic Reg (2nd order autoregressive features) | LR (GBT based feature engineering) | NFFM
Cut1 | -3.71% | 2.30% | 2.51% | 3.00%
Cut2 | -2.16% | 3.05% | 4.48% | 28.83%
Cut3 | -0.31% | -0.56% | 5.65% | 12.47%
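The AEPV error function can be computed as in the sketch below. Each row is an aggregate with a view count, a completed-view count, and one predicted completion rate; the sample numbers and the function name `aepv` are made up for illustration.

```python
def aepv(views, completed, y_pred):
    """Absolute Error Per View over aggregated rows.

    Non-completed views contribute |0 - p| each, completed views |1 - p| each.
    """
    num = sum((v - c) * abs(p) + c * abs(1 - p)
              for v, c, p in zip(views, completed, y_pred))
    return num / sum(views)

# Made-up aggregates (illustrative only).
views = [100, 50]
completed = [80, 10]
y_pred = [0.7, 0.3]
err = aepv(views, completed, y_pred)
print(round(err, 6))
```

Working on aggregates this way gives the same result as scoring every individual view against a 0/1 completion label.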
Use case 2 - Results VCR
● LR with L2 regularisation
● 2nd order features were selected based on Information Gain criteria
● GBT package in Spark MLlib was used (numTrees = 400, maxDepth = 8,
sampling = 0.5, minInstancesPerNode = 10).
○ Training process was too slow, even with large enough resources.
○ XGBoost with Spark (tried later) was faster, and resulted in further improvements
● NFFM: Increasing the number of layers up to 3 resulted in a further 20%
improvement in the validation errors; no significant improvement after that
Building the full intuition
Factorisation machine:
● Handling categorical features and sparse data matrix
● Extracting latent variables, e.g., identifying non-explicit segment profiles in the population
Field-aware:
● Dimensionality reduction (high cardinality features to K dimension representation)
● Increases degrees of freedom (compared to FM, in terms of field-specific values) to enable an exhaustive
set of second-order interactions
Neural network:
● Explores and weights higher order interactions - went up to 3 layers of interaction successfully
● Generates numerical prediction
● Training the factors based on performance of both FM and Neural Nets (instead of training
them separately, which causes latent vectors to be limited by the power of FM)
Implementation details
● Hyperparameters are k, lambda, number of layers, number of nodes per layer, activation
functions
● Implemented in TensorFlow
● Adam optimizer
● L2 regularization. No dropout
● No batch normalization
● 1 layer with 100 nodes performs well enough and saves compute
● ReLU activations (converge faster)
● k=16 (try with powers of 2)
● Weighted RMSE as loss function for both use cases
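The deck does not spell out the exact form of the weighted RMSE loss. One common definition, assumed here, divides the weighted squared errors by the total weight before taking the square root; the function name and sample values are made up.

```python
import math

def weighted_rmse(y_true, y_pred, weights):
    """sqrt( sum_i w_i * (y_i - p_i)^2 / sum_i w_i )  (assumed definition)."""
    num = sum(w * (y - p) ** 2 for y, p, w in zip(y_true, y_pred, weights))
    return math.sqrt(num / sum(weights))

loss = weighted_rmse([0.0, 1.0], [0.5, 0.5], [1.0, 3.0])
print(loss)  # 0.5
```

In training, the L2 penalty scaled by lambda would be added on top of this loss; it is left out of the sketch to keep it short.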
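Predicting for unseen feature values
● Avg latent feature interactions per feature for unknown values
Publisher | Latent vector for Advertiser field | Latent vector for Gender field
ESPN | $(X^{A}_{0}, X^{A}_{1}, X^{A}_{2})$ | $(X^{G}_{0}, X^{G}_{1}, X^{G}_{2})$
CNBC | $(Y^{A}_{0}, Y^{A}_{1}, Y^{A}_{2})$ | $(Y^{G}_{0}, Y^{G}_{1}, Y^{G}_{2})$
SONY | $(Z^{A}_{0}, Z^{A}_{1}, Z^{A}_{2})$ | $(Z^{G}_{0}, Z^{G}_{1}, Z^{G}_{2})$
UNKNOWN | $\left(\frac{X^{A}_{0}+Y^{A}_{0}+Z^{A}_{0}}{3}, \frac{X^{A}_{1}+Y^{A}_{1}+Z^{A}_{1}}{3}, \frac{X^{A}_{2}+Y^{A}_{2}+Z^{A}_{2}}{3}\right)$ | $\left(\frac{X^{G}_{0}+Y^{G}_{0}+Z^{G}_{0}}{3}, \frac{X^{G}_{1}+Y^{G}_{1}+Z^{G}_{1}}{3}, \frac{X^{G}_{2}+Y^{G}_{2}+Z^{G}_{2}}{3}\right)$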
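The averaging rule for unknown values can be sketched as below: the fallback vector is the per-dimension mean of the known values' vectors for the same field pairing. The vector values and the helper names `mean_vector` and `lookup` are made up for illustration.

```python
def mean_vector(vectors):
    """Per-dimension average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Made-up publisher vectors for the Advertiser field (illustrative only).
publisher_vs_advertiser = {
    "ESPN": [0.3, 0.0, 0.6],
    "CNBC": [0.0, 0.3, 0.0],
    "SONY": [0.3, 0.3, 0.0],
}
unknown = mean_vector(list(publisher_vs_advertiser.values()))

def lookup(value):
    """Return the learned vector, or the average vector for unseen values."""
    return publisher_vs_advertiser.get(value, unknown)

print(lookup("ESPN"), lookup("BBC"))
```

The same average would be computed once per field pairing after training, so serving an unseen publisher costs no more than serving a known one.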
Implementing @ low-latency, high-scale
● MLeap: The MLeap framework provides support for models trained in both Spark
and TensorFlow. It lets us train Tree based models in Spark and
NN based models in TF
● Offline training and challenges: We cannot train TF models on the YARN cluster,
hence we use a GPU machine as a gateway to pull data from HDFS and
train on GPU
● Online serving challenges: TF Serving had pretty low throughput and wasn’t
scaling for our QPS. Hence we are using a local LRU cache with a decent TTL to
scale TF Serving
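The caching idea in the last bullet can be sketched as a small LRU cache with a TTL placed in front of the model call. This is an illustration in Python, not InMobi's serving code: the class `TTLLRUCache`, the function `cached_predict`, and the `predict_fn` callback standing in for the call to the model server are all hypothetical.

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """LRU cache whose entries also expire after ttl_seconds."""

    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self._data = OrderedDict()
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:   # stale entry: drop it
            del self._data[key]
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used

def cached_predict(cache, features, predict_fn):
    """Serve from the cache when possible, otherwise call the model."""
    key = tuple(sorted(features.items()))
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = predict_fn(features)
    cache.put(key, value)
    return value

cache = TTLLRUCache(max_size=10000, ttl_seconds=300.0)
print(cached_predict(cache, {"publisher": "ESPN", "advertiser": "Nike"}, lambda f: 0.01))
```

This works because the features are categorical, so the same feature combination recurs across requests; the TTL bounds how long a prediction from an older model can keep being served.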
Future research that we are currently pursuing...
● Hybrid Binning NFFM
● Distributed training and serving
● Dropouts & Batch Normalization
● Methods to interpret the latent-vector (Using methods like t-Distributed
Stochastic Neighbour Embedding (t-SNE) etc)
References
FM: https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
FFM: http://research.criteo.com/ctr-prediction-linear-model-field-aware-factorization-machines/
DeepFM: https://arxiv.org/pdf/1703.04247.pdf
NFM: https://arxiv.org/pdf/1708.05027.pdf
GBT Based Feature Engg: http://quinonero.net/Publications/predicting-clicks-facebook.pdf
Thank You!

More Related Content

PPTX
Reward Innovation for long-term member satisfaction
PDF
Music Recommendations at Scale with Spark
PDF
Intro to Factorization Machines
PDF
[2014널리세미나] 시맨틱한 HTML5 마크업 구조 설계, 어떻게 할까?
PDF
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [보야져 팀] : 기업연계프로젝트 3종세트 [마케팅시각화/서비스기획/분석시스템 구축]
PPTX
Recommendation Modeling with Impression Data at Netflix
PDF
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [쇼미더뮤직 팀] : 텍스트 감정추출을 통한 노래 추천
PDF
제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [중고책나라] : 실시간 데이터를 이용한 Elasticsearch 클러스터 최적화
Reward Innovation for long-term member satisfaction
Music Recommendations at Scale with Spark
Intro to Factorization Machines
[2014널리세미나] 시맨틱한 HTML5 마크업 구조 설계, 어떻게 할까?
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [보야져 팀] : 기업연계프로젝트 3종세트 [마케팅시각화/서비스기획/분석시스템 구축]
Recommendation Modeling with Impression Data at Netflix
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [쇼미더뮤직 팀] : 텍스트 감정추출을 통한 노래 추천
제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [중고책나라] : 실시간 데이터를 이용한 Elasticsearch 클러스터 최적화

What's hot (20)

PPTX
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Indus2ry 팀] : 2022산업동향- 편의점 & OTT 완벽 분석
PDF
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [하둡메이트 팀] : 하둡 설정 고도화 및 맵리듀스 모니터링
PDF
제1화 추천 시스템 이란.ppt
PDF
新入生勧誘プレゼン2014
PDF
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Secret X 팀] : XAI를 활용한 수능 영어영역 문제풀이
PDF
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Hands-on 팀] : 수어 번역을 통한 위험 상황 속 의사소통 시스템 구축
PDF
Session-Based Recommender Systems
PDF
제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [SiZoAH] : 리뷰 기반 의류 사이즈 추천시스템
PDF
Bpr bayesian personalized ranking from implicit feedback
PDF
ML+Hadoop at NYC Predictive Analytics
PDF
CF Models for Music Recommendations At Spotify
PPTX
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [YouPlace 팀] : 카프카와 스파크를 활용한 유튜브 영상 속 제주 명소 검색
PDF
제 14회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [BICS팀] : Boaz Industry Classification Standard
PDF
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
PDF
潜在ディリクレ配分法
PDF
R実践 機械学習による異常検知 02
PPTX
顔認識アルゴリズム:Constrained local model を調べてみた
PDF
CVPR2019読み会 "A Theory of Fermat Paths for Non-Line-of-Sight Shape Reconstruc...
PPTX
名のあるフラクタルたち
PPTX
Neural Text Embeddings for Information Retrieval (WSDM 2017)
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Indus2ry 팀] : 2022산업동향- 편의점 & OTT 완벽 분석
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [하둡메이트 팀] : 하둡 설정 고도화 및 맵리듀스 모니터링
제1화 추천 시스템 이란.ppt
新入生勧誘プレゼン2014
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Secret X 팀] : XAI를 활용한 수능 영어영역 문제풀이
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Hands-on 팀] : 수어 번역을 통한 위험 상황 속 의사소통 시스템 구축
Session-Based Recommender Systems
제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [SiZoAH] : 리뷰 기반 의류 사이즈 추천시스템
Bpr bayesian personalized ranking from implicit feedback
ML+Hadoop at NYC Predictive Analytics
CF Models for Music Recommendations At Spotify
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [YouPlace 팀] : 카프카와 스파크를 활용한 유튜브 영상 속 제주 명소 검색
제 14회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [BICS팀] : Boaz Industry Classification Standard
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
潜在ディリクレ配分法
R実践 機械学習による異常検知 02
顔認識アルゴリズム:Constrained local model を調べてみた
CVPR2019読み会 "A Theory of Fermat Paths for Non-Line-of-Sight Shape Reconstruc...
名のあるフラクタルたち
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Ad

Similar to Neural Field aware Factorization Machine (20)

PDF
Using Deep Learning in Production Pipelines to Predict Consumers’ Interest wi...
PDF
CUSTOMER CHURN PREDICTION
PPTX
"AI in the browser: predicting user actions in real time with TensorflowJS", ...
PDF
RESUME SCREENING USING LSTM
PDF
Frequently Bought Together Recommendations Based on Embeddings
PDF
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
PDF
MDEC Data Matters Series: machine learning and Deep Learning, A Primer
PPTX
Building High Available and Scalable Machine Learning Applications
PPTX
Machine Learning AND Deep Learning for OpenPOWER
PDF
Netflix Recommendations - Beyond the 5 Stars
PDF
copy for Gary Chin.
PDF
Big data 2.0, deep learning and financial Usecases
DOCX
Boosting conversion rates on ecommerce using deep learning algorithms
PDF
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
PDF
DSRLab seminar Introduction to deep learning
PPTX
Tsinghua invited talk_zhou_xing_v2r0
PDF
BIG2016- Lessons Learned from building real-life user-focused Big Data systems
PPTX
Deep learning Tutorial - Part II
PPTX
Data Science in business World
PPTX
Automatic Attendace using convolutional neural network Face Recognition
Using Deep Learning in Production Pipelines to Predict Consumers’ Interest wi...
CUSTOMER CHURN PREDICTION
"AI in the browser: predicting user actions in real time with TensorflowJS", ...
RESUME SCREENING USING LSTM
Frequently Bought Together Recommendations Based on Embeddings
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
MDEC Data Matters Series: machine learning and Deep Learning, A Primer
Building High Available and Scalable Machine Learning Applications
Machine Learning AND Deep Learning for OpenPOWER
Netflix Recommendations - Beyond the 5 Stars
copy for Gary Chin.
Big data 2.0, deep learning and financial Usecases
Boosting conversion rates on ecommerce using deep learning algorithms
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
DSRLab seminar Introduction to deep learning
Tsinghua invited talk_zhou_xing_v2r0
BIG2016- Lessons Learned from building real-life user-focused Big Data systems
Deep learning Tutorial - Part II
Data Science in business World
Automatic Attendace using convolutional neural network Face Recognition
Ad

More from InMobi (20)

PDF
Responding to Coronavirus: How marketers can leverage digital responsibly
PPTX
2020: Celebrating the Era of the Connected Consumer
PPTX
Winning the Indian Festive Shopper in 2019
PPTX
The Changing Face of the Indian Mobile User
PPTX
Unlocking the True Potential of Data on Mobile
PDF
InMobi State of Mobile Video Advertising Report 2018
PPTX
The Essential Mediation Toolkit - Korean
PDF
A Comprehensive Guide for App Marketers
PPTX
A Cure for Ad-Fraud: Turning Fraud Detection into Fraud Prevention
PDF
[Webinar] driving accountability in mobile advertising
PDF
The Brand Marketer's Guide to Mobile Video Viewability
PDF
Top 2017 Mobile Advertising Trends in Indonesia
PPTX
Mobile marketing strategy guide
PDF
InMobi Yearbook 2016
PPTX
Boost Retention on Mobile and Keep Users Coming Back for More!
PPTX
Building Mobile Creatives that Deliver Real Results
PPTX
Everything you need to know about mobile video ads in india and apac
PDF
The Golden Age of Mobile Video Advertising | Global
PPTX
Everything a developer needs to know about the mobile video ads
PPTX
Programmatically Speaking with InMobi and Rubicon Project
Responding to Coronavirus: How marketers can leverage digital responsibly
2020: Celebrating the Era of the Connected Consumer
Winning the Indian Festive Shopper in 2019
The Changing Face of the Indian Mobile User
Unlocking the True Potential of Data on Mobile
InMobi State of Mobile Video Advertising Report 2018
The Essential Mediation Toolkit - Korean
A Comprehensive Guide for App Marketers
A Cure for Ad-Fraud: Turning Fraud Detection into Fraud Prevention
[Webinar] driving accountability in mobile advertising
The Brand Marketer's Guide to Mobile Video Viewability
Top 2017 Mobile Advertising Trends in Indonesia
Mobile marketing strategy guide
InMobi Yearbook 2016
Boost Retention on Mobile and Keep Users Coming Back for More!
Building Mobile Creatives that Deliver Real Results
Everything you need to know about mobile video ads in india and apac
The Golden Age of Mobile Video Advertising | Global
Everything a developer needs to know about the mobile video ads
Programmatically Speaking with InMobi and Rubicon Project

Recently uploaded (20)

PDF
Hikvision-IR-PPT---EN.pdfSADASDASSAAAAAAAAAAAAAAA
PDF
book-34714 (2).pdfhjkkljgfdssawtjiiiiiujj
PDF
Delhi c@ll girl# cute girls in delhi with travel girls in delhi call now
PDF
REPORT CARD OF GRADE 2 2025-2026 MATATAG
PPTX
Capstone Presentation a.pptx on data sci
PPTX
inbound6529290805104538764.pptxmmmmmmmmm
PPT
2011 HCRP presentation-final.pptjrirrififfi
PPTX
Sheep Seg. Marketing Plan_C2 2025 (1).pptx
PPTX
Basic Statistical Analysis for experimental data.pptx
PDF
Buddhism presentation about world religion
PPTX
9 Bioterrorism.pptxnsbhsjdgdhdvkdbebrkndbd
PDF
Mcdonald's : a half century growth . pdf
PPTX
Stats annual compiled ipd opd ot br 2024
PPT
Technicalities in writing workshops indigenous language
PPTX
ch20 Database System Architecture by Rizvee
PDF
9 FinOps Tools That Simplify Cloud Cost Reporting.pdf
PPTX
Bussiness Plan S Group of college 2020-23 Final
PPTX
cyber row.pptx for cyber proffesionals and hackers
PPTX
PPT for Diseases (1)-2, types of diseases.pptx
PDF
Teal Blue Futuristic Metaverse Presentation.pdf
Hikvision-IR-PPT---EN.pdfSADASDASSAAAAAAAAAAAAAAA
book-34714 (2).pdfhjkkljgfdssawtjiiiiiujj
Delhi c@ll girl# cute girls in delhi with travel girls in delhi call now
REPORT CARD OF GRADE 2 2025-2026 MATATAG
Capstone Presentation a.pptx on data sci
inbound6529290805104538764.pptxmmmmmmmmm
2011 HCRP presentation-final.pptjrirrififfi
Sheep Seg. Marketing Plan_C2 2025 (1).pptx
Basic Statistical Analysis for experimental data.pptx
Buddhism presentation about world religion
9 Bioterrorism.pptxnsbhsjdgdhdvkdbebrkndbd
Mcdonald's : a half century growth . pdf
Stats annual compiled ipd opd ot br 2024
Technicalities in writing workshops indigenous language
ch20 Database System Architecture by Rizvee
9 FinOps Tools That Simplify Cloud Cost Reporting.pdf
Bussiness Plan S Group of college 2020-23 Final
cyber row.pptx for cyber proffesionals and hackers
PPT for Diseases (1)-2, types of diseases.pptx
Teal Blue Futuristic Metaverse Presentation.pdf

Neural Field aware Factorization Machine

  • 1. Why Neural Net Field Aware Factorization Machines are able to break ground in digital behaviours prediction Presenter: Gunjan Sharma Co-Author: Varun Kumar Modi
  • 2. About the Authors Presenter: Gunjan Sharma System Architect @ InMobi (3 years) SE @Facebook (2.5 Years) DPE @Google (1 year) Twitter Handle: @gunjan_1409 LinkedIn: https://www.linkedin.com/in/gunjan- sharma-a6794414/ Co-author: Varun Kumar Modi Sr Research Scientist @ InMobi(5 years) LinkedIn: https://www.linkedin.com/in/varun- modi-33800652/
  • 3. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  • 4. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  • 5. InMobi is one of the largest advertising platform at scale globally InMobi reaches >2 billion MAU across the world - specialised in mobile In-app advertising JAPA N INDIA+ SEA CHINA Afri ca ANZ NORTH AMERICA KOREA EMEA Latin America LATIN AMERICA Afri ca AfricaAFRICA China APAC Consolidation has taken place to clean up the ecosystem few advertising platforms at scale exist North America (only Video) Very limited number of players have presence in Asia, InMobi is dominating Few players control each component of the chain; No presence of global players, except InMobi
  • 6. Problem stmt and why it matters ● What are the problems: Use case 1 - Conversion ratio (CVR) prediction: - CVR = Install rate of users = Probability of a install given a click - Usage: CPM = CTR * CVR * CPI Use case 2 - Video completion rate (VCR) prediction: - Video completion rate of users watching advertising videos given click ● Why are they important: ○ Performance business - based on arbitrage, so the model directly determines the margin/profit of the business and the ability of the campaign to achieve significant scale = > multi-million dollar businesses!
  • 7. Existing context and challenges ● Models traditionally used Linear/Logistic Regression and Tree-based models ● Both have their strengths and weaknesses when used in production ● What we need is an awesome model that sits somewhere in the middle and can bring in the best of both worlds LR Tree Based Generalise for unseen combinations Our use cases could not Potentially Underfit at times Potentially can overfit at times Requires lesser RAM Can at times bloat RAM usage specially with high cardinality features
  • 8. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  • 9. Why think of NN for CVR/VCR prediction ● Using cross features in LR wasn’t cutting it for us. ● Plus at some point it starts to become cumbersome both at training and prediction time. ● All the major predictions noted here follow a complex curve ● LR left much to desire compared to Tree based models for example because interaction-terms are limited ● We tried couple of awesome models that were also not able to beat Tree based models We all agreed that Neural Nets are a suitable technology to find higher order interactions between our features At the same time they have the power of generalising to unseen combinations.
  • 10. Challenges Involved ● Traditionally NNs are more utilized for Classification problems ● We want to model our predictions as regression problem ● Most of the features are categorical which means we need to use one-hot encoding ● This causes NN to spew very bad results as they need a lot of data to train efficiently. ● Plus cardinality of some features is very high and it makes life more troublesome. ● Model should be easy to productionised both for training and serving ● Spark isn’t suited for custom NN networks. ● Model should be debuggable as much as possible to be able to explain the Business changes ● The resistance to using NN for a long time came because of the lack of understanding into their internals
  • 11. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  • 12. Consider the following dummy dataset Publisher Advertiser Gender CVR ESPN Nike Male 0.01 CNBC Nike Male 0.0004 ESPN Adidas Female 0.008 Sony Coke Female 0.0005 Sony P&G Male 0.002
  • 13. Factorization Machine (FM) - What are those ESPN CNBC SONY Adi Nike Coke P&G Male Female X0 X1 X2 Y0 Y1 Y2 Z0 Z1 Z2 Publisher Advertiser Gender CVR ESPN Nike Male 0.01 1 0 0 0 1 0 0 1 0 = Publisher Latent Vector (PV) = Advertiser Latent Vector (AV) = Gender Latent Vector (GV) PVT*AV + AVT*GV + GVT*PV = pCVR NOTE: All vectors are K dimensional which is hyper parameter for the algorithm
  • 14. Factorization Machine (FM) - What are those ● K dimensional representation for every feature value ● Captures second order interactions across all the features (ATB = |A|*|B|*cos(Θ)) ● Essentially a combination of hyperbolas summed up to form the final prediction ● Works better than LR but tree based models are still more powerful. ● EG: Predict movie’s revenue: Features Movie City Gender Latent Features Horror Comedy Action Romance Second Order Intuition ● For every latent feature ● For every pair of original feature ● How much this latent feature affect revenue when considering these pair Final predicted revenue is linear sum over all latent features
  • 15. Field aware Factorization Machine (FFM) ESPN CNBC SONY Adi Nike Coke P&G Male Female XA 0 XA 1 XA 2 Publisher Advertiser Gender CVR ESPN Nike Male 0.01 1 0 0 0 1 0 0 1 0 PVA PVA T*AVP + AVG T*GVA + GVP T*PVG = pCVR NOTE: All vectors are K dimensional which is hyper parameter for the algorithm XG 0 XG 1 XG 2 PVG YP 0 YP 1 YP 2 AVP YG 0 YG 1 YG 2 AVG ZP 0 ZP 1 ZP 2 GVP ZA 0 ZA 1 ZA 2 GVA
  • 16. Field aware Factorization Machine (FFM) ● We have a K dimensional vector for every feature value for every other feature type ● Still second order interactions but with more degrees of freedom than FM ● Intuition: Latent features interact with every other cross feature differently Works significantly better than FM, but at certain cuts was still not able to beat Tree based model
  • 17. Deep neural-net with Factorisation Machine: DeepFM Sigmoid(FM + NeuralNet(PV :+ AV :+ GV)) = pCVR
  • 18. DeepFM ● Now we are entering the neural net world ● This model is a combination of FM and NN and the final prediction is sum of the output from the 2 models ● Here we optimize the entire graph together. ● It performs better than using the latent vectors from FM and then running them through neural net as a secondary optimization (FNN) ● It performs better than FM but not better than FFM ● Intuition: FM finds the second order interactions while neural net uses the latent vectors to find the higher order nonlinear interactions.
  • 19. Neural Factorization Machine: NFM NeuralNet((PV.*AV .+ AV.*GV .+ GV.*PV)T) = pCVR
  • 20. NFM ● In this architecture you only run the second order features through NN instead of the raw latent vectors ● Intuition: The neural net takes the second order interactions and uses them to find the higher order nonlinear interactions ● Performs better than DeepFM mostly attributed to the 2 facts ○ The size of the net is smaller hence converges faster. ○ The neural net can take the second order interactions and convert them easily to higher order interactions. ● Results were better than DeepFM as well. But still not better than FFM
  • 21. InMobi Spec: DeepFFM Feature1 F2E Dense Embeddings F3E F1E F3E F1E F2E Hidden Layers Act FF Machine Ypred Feature2 Feature3 Spare Features
  • 22. InMobi Spec: DeepFFM ● A simple upgrade to deepFM ● Performs better than both DeepFM and FFM ● Training is slower ● FFM part of things does the majority of the prediction heavy lifting. Evidently due to faster gradient convergence. ● Intuition: Take the latent vectors run them through NN for higher order interactions and use FFM for second order interactions.
  • 23. InMobi Spec: NFFM Feature1 F2E Dense Embeddings F3E F1E F3E F1E F2E Feature2 Feature3 Sparse Features FF Machine Hidden Layers ….... K inputs Ypred
  • 24. InMobi Spec: NFFM ● A simple upgrade to NFM ● Does better than everyone significantly. ● Converges faster than DeepFFM ● Intuition: Take the second order interactions from FFM and run them through neural net to find higher order nonlinear interactions.
  • 25. Content 1) The problem and context 2) The Motivation 3) Building the model theory: piece by piece 4) Results of the 2 use cases 5) Understanding exactly why it works 6) Implementation at InMobi scale
  • 26. Use case 1 - Results CVR Accuracy function: (ΣWᵢ * abs(Yactᵢ - Ypredᵢ)) ΣWᵢ Model FFM DeepFM DeepFFM NFFM Accuracy % Improvement over Linear model (small DS) 44% 35% 48% 64%
  • 27. Use case 1 - Results CVR ● Accuracy % improvement over linear model (training data dates → test date): T1-T7 → T7: 21%; T1-T7 → T8: 14%; T2-T8 → T8: 20%; T2-T8 → T9: 14% ● % improvement over tree model: Cut1 21.7%; Cut2 18.5%
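  • 28. Use case 2 - Results VCR ● Error function (AEPV - Absolute Error Per View): Σᵢ((Viewsᵢ - Cmpltdᵢ) * abs(Ypredᵢ) + Cmpltdᵢ * abs(1 - Ypredᵢ)) / ΣᵢViewsᵢ ● % AEPV improvement by Country OS cut over the last-7-day average model: ● Cut1: Logistic Reg -3.71%; Logistic Reg (2nd order autoregressive features) 2.30%; LR (GBT based feature engineering) 2.51%; NFFM 3.00% ● Cut2: Logistic Reg -2.16%; Logistic Reg (2nd order autoregressive features) 3.05%; LR (GBT based feature engineering) 4.48%; NFFM 28.83% ● Cut3: Logistic Reg -0.31%; Logistic Reg (2nd order autoregressive features) -0.56%; LR (GBT based feature engineering) 5.65%; NFFM 12.47%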
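The AEPV error function can be written directly from its definition. This sketch assumes each row aggregates the views and completed views that share one prediction; the numbers are made up for illustration.

```python
import numpy as np

def aepv(views, completed, y_pred):
    """Absolute Error Per View: non-completed views are charged abs(Ypred),
    completed views are charged abs(1 - Ypred), normalised by total views."""
    views = np.asarray(views, dtype=float)
    completed = np.asarray(completed, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = (views - completed) * np.abs(y_pred) + completed * np.abs(1.0 - y_pred)
    return float(errors.sum() / views.sum())

# Made-up example: 10 views with 4 completions, 20 views with 15 completions.
value = aepv(views=[10, 20], completed=[4, 15], y_pred=[0.5, 0.7])
```

For the example, the first row contributes 6 * 0.5 + 4 * 0.5 = 5 and the second 5 * 0.7 + 15 * 0.3 = 8, so the result is 13 / 30.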
  • 29. Use case 2 - Results VCR ● LR with L2 regularisation ● 2nd order features were selected based on the Information Gain criterion ● The GBT package in Spark MLlib was used (numTrees = 400, maxDepth = 8, sampling = 0.5, minInstancesPerNode = 10). ○ The training process was too slow, even with large enough resources. ○ XGBoost with Spark (tried later) was faster and resulted in further improvements ● NFFM: increasing the number of layers up to 3 gave a further 20% improvement in validation error; there was no significant improvement beyond that
  • 31. Building the full intuition Factorisation machine: ● Handles categorical features and a sparse data matrix ● Extracts latent variables, e.g., identifying non-explicit segment profiles in the population Field-aware: ● Dimensionality reduction (high cardinality features to a K-dimensional representation) ● Increases the degrees of freedom (compared to FM, in terms of field-specific values) to enable an exhaustive set of second-order interactions Neural network: ● Explores and weights higher order interactions - went up to 3 layers of interaction successfully ● Generates the numerical prediction ● Trains the factors on the performance of both the FM and the neural net (training them separately would limit the latent vectors to the power of the FM alone)
  • 33. Implementation details ● Hyperparameters are k, lambda, number of layers, number of nodes per layer, and activation functions ● Implemented in TensorFlow ● Adam optimizer ● L2 regularization. No dropout ● No batch normalization ● 1 layer with 100 nodes performs well enough and saves compute ● ReLU activations (converge faster) ● k=16 (try powers of 2) ● Weighted RMSE as the loss function for both use cases
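The slides do not spell out the weighted RMSE, so the sketch below assumes the common definition: the square root of the weighted mean of squared errors. It is written in NumPy for readability rather than TensorFlow, and the example values are made up.

```python
import numpy as np

def weighted_rmse(weights, y_true, y_pred):
    """sqrt( sum(w_i * (y_i - yhat_i)^2) / sum(w_i) ), an assumed definition."""
    w = np.asarray(weights, dtype=float)
    sq_err = np.square(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(np.sqrt((w * sq_err).sum() / w.sum()))

# Made-up example: both rows are off by 0.5, so the loss is 0.5 whatever the weights.
loss = weighted_rmse([1, 3], [0.0, 1.0], [0.5, 0.5])
```

With L2 regularization as listed above, the training objective would add a term such as `lambda * sum(param ** 2)` over the model parameters to this loss; which parameters are penalised is not stated in the slides.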
  • 34. Predicting for unseen feature values ● Avg latent feature interactions per feature for unknown values ● ESPN: XA0 XA1 XA2 | XG0 XG1 XG2 ● CNBC: YA0 YA1 YA2 | YG0 YG1 YG2 ● SONY: ZA0 ZA1 ZA2 | ZG0 ZG1 ZG2 ● UNKNOWN: (XA0+YA0+ZA0)/3 (XA1+YA1+ZA1)/3 (XA2+YA2+ZA2)/3 | (XG0+YG0+ZG0)/3 (XG1+YG1+ZG1)/3 (XG2+YG2+ZG2)/3
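The averaging rule for unseen values can be sketched as a lookup with a fallback. The helper name `build_lookup`, the two-field layout (A and G, three dimensions each) and the numeric values are hypothetical; only the averaging itself comes from the slide.

```python
import numpy as np

def build_lookup(latents):
    """latents maps a known feature value to its field-aware latent vectors,
    an array of shape (fields, k). Unknown values get the element-wise mean."""
    unknown = np.mean(np.stack(list(latents.values())), axis=0)

    def lookup(value):
        return latents.get(value, unknown)

    return lookup

# Made-up vectors: row 0 is used against field A, row 1 against field G.
latents = {
    "ESPN": np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    "CNBC": np.array([[2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]),
    "SONY": np.array([[3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]),
}
lookup = build_lookup(latents)
unseen = lookup("SOME_NEW_CHANNEL")
```

Computing the mean once when the lookup is built keeps the serving path to a single dictionary access.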
  • 35. Implementing @ low-latency, high-scale ● MLeap: the MLeap framework supports models trained in both Spark and TensorFlow. This lets us train tree based models in Spark and NN based models in TF ● Offline training and challenges: we cannot train TF models on the YARN cluster, hence we use a GPU machine as a gateway to pull data from HDFS and train on the GPU ● Online serving challenges: TF Serving has pretty low throughput and wasn't scaling for our QPS. Hence we use a local LRU cache with a decent TTL to scale TF Serving
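A local LRU cache with a TTL in front of a slow model call can be sketched as below. This is an illustration of the caching idea only, not InMobi's serving code: the class name `TTLLRUCache`, the sizes, and the stand-in `slow_model_call` function are all hypothetical, and the sketch is not thread-safe.

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Keeps at most max_size entries; each entry expires ttl_seconds after it is computed."""

    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value), oldest first

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self._entries.move_to_end(key)      # mark as most recently used
            return entry[1]
        value = compute(key)                    # cache miss or expired entry
        self._entries[key] = (now + self._ttl, value)
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)   # evict the least recently used entry
        return value

def slow_model_call(feature_key):
    """Stand-in for a request to the model server (hypothetical)."""
    return 0.01 * len(feature_key)

cache = TTLLRUCache(max_size=10_000, ttl_seconds=300)
prediction = cache.get_or_compute(("ESPN", "android", "IN"), slow_model_call)
```

The cache key has to cover every feature that affects the prediction, so the hit rate depends on how coarse the feature combinations are; the TTL bounds how stale a cached prediction can be after a model update.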
  • 36. Future research that we are currently pursuing... ● Hybrid Binning NFFM ● Distributed training and serving ● Dropout & Batch Normalization ● Methods to interpret the latent vectors, e.g. t-Distributed Stochastic Neighbour Embedding (t-SNE)
  • 37. References ● FM: https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf ● FFM: http://research.criteo.com/ctr-prediction-linear-model-field-aware-factorization-machines/ ● DeepFM: https://arxiv.org/pdf/1703.04247.pdf ● NFM: https://arxiv.org/pdf/1708.05027.pdf ● GBT based feature engineering: http://quinonero.net/Publications/predicting-clicks-facebook.pdf