1
A COMBINATION OF SIMPLE MODELS BY FORWARD
PREDICTOR SELECTION FOR JOB RECOMMENDATION
Dávid Zibriczky, PhD (DaveXster)
Budapest University of Technology and Economics,
Budapest, Hungary
2
The Dataset – Data preparation
• Events (interactions, impressions)
› Target format: (time,user_id,item_id,type,value)
› Interactions → Format OK
› Impressions:
• Generating unique (time,user_id,item_id) triples
• Value → count of their occurrence
• Time → 12pm on Thursday of the week
• Type → 5
• Catalog (items, users)
› Target format: (id,key1,key2,…,keyN)
› Items and users → Format OK
› Unknown "0" values → empty values
› Inconsistency: Geo-location vs. country/region → Metadata enhancement based on geo-location
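The impression preparation above can be sketched in a few lines of Python. This is only an illustration: the input layout `(user_id, item_id, iso_year, iso_week)`, the use of ISO week numbering, and the function name `aggregate_impressions` are assumptions, not details given on the slide.

```python
from collections import Counter
from datetime import date, datetime, time

IMPRESSION_TYPE = 5  # event type assigned to impressions on the slide


def aggregate_impressions(rows):
    """Collapse raw impression rows into (time, user_id, item_id, type, value) events.

    rows: iterable of (user_id, item_id, iso_year, iso_week) tuples (assumed layout).
    """
    counts = Counter(rows)  # value = number of occurrences of each unique row
    events = []
    for (user_id, item_id, year, week), n in sorted(counts.items()):
        thursday = date.fromisocalendar(year, week, 4)  # ISO weekday 4 = Thursday
        ts = datetime.combine(thursday, time(12, 0))    # 12pm on that Thursday
        events.append((ts, user_id, item_id, IMPRESSION_TYPE, n))
    return events


raw = [(1, 10, 2016, 5), (1, 10, 2016, 5), (2, 11, 2016, 6)]
events = aggregate_impressions(raw)
```

Each output tuple follows the target format (time,user_id,item_id,type,value), so impressions and interactions can be handled by the same event pipeline.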
3
The Dataset – Basic statistics
Size of training set
• 211M events, 2.8M users, 1.3M items
• Effect: huge and very sparse matrix
Distribution
• 95% of events are impressions
• 72% of the users have impressions only
• Item support for interactions is low (~9)
• Effect: weak collaboration using interactions
Target users
• 150K users
• 73% active, 16% inactive, 12% new
• Effect: user cold start and warm-up problem
Data source        #events       #users      #items
Interactions       8,826,678     784,687     1,029,480
Impressions        201,872,093   2,755,167   846,814
All events         210,698,777   2,792,405   1,257,422
Catalog            -             1,367,057   1,358,098
Catalog OR Events  -             2,829,563   1,362,890
4
Methods – Concept
Terminology
• Method: A technique of estimating the relevance of an item for a user (p-Value)
• Predictor/model: An instance of a method with a specified parameter setting
• Combination: Linear combination of prediction values for a user-item pair
Approach
1. Exploring the properties of the data set
2. Definition of "simple" methods with different functionality (time-decay is commonly used)*
3. Finding a set of relevant predictors and their optimal combination
4. Top-N ranking of the available event-supported items with non-zero p-Values (~200K)
* Equations of the methods can be found in the paper
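To make the terminology concrete, here is a minimal sketch of a linear combination followed by a top-N ranking over items with non-zero p-Values. The data layout (nested dictionaries) and the names `combine` and `top_n` are illustrative assumptions.

```python
import heapq


def combine(predictions, weights):
    """Linear combination of per-predictor p-values for one user.

    predictions: {predictor_name: {item_id: p_value}}
    weights: {predictor_name: weight}
    """
    combined = {}
    for name, scores in predictions.items():
        w = weights.get(name, 0.0)
        for item_id, p in scores.items():
            combined[item_id] = combined.get(item_id, 0.0) + w * p
    return combined


def top_n(combined, n):
    """Return the n highest-scoring items, skipping items with a zero p-value."""
    nonzero = ((p, item) for item, p in combined.items() if p != 0.0)
    return [item for p, item in heapq.nlargest(n, nonzero)]
```

Using a heap keeps the ranking step at roughly O(M log N) for M candidate items, which matters when about 200K items have to be ranked per user.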
5
Methods – Item-kNN
• Observation: Very sparse user-item matrix (0.005%), 211M events
• Goal: Next best items to click, estimating recommendations of Xing
• Method: Standard Item-based kNN with special features
› Input-output event types
› Controlling popularity factor
› Similarity of the same item is 0
› Efficient implementation
• Notation: IKNN(I,O)
› I: input event type
› O: output event type
• Comment: No improvement combining other CF algorithms (MF, FM, User-kNN)
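The exact equations are in the paper, so the following is only a sketch of what an item-based kNN with separate input and output event types might look like. The popularity exponent `alpha`, the neighbourhood size `k`, and the normalization formula are assumptions chosen for illustration.

```python
from collections import defaultdict


def train_iknn(events, in_type, out_type, alpha=0.5, k=50):
    """Build item-to-item similarities from (user_id, item_id, type) events.

    alpha is a hypothetical popularity-control exponent; alpha=0.5 gives
    cosine similarity, larger values penalize popular output items more.
    """
    in_items, out_items = defaultdict(set), defaultdict(set)
    for user, item, etype in events:
        if etype == in_type:
            in_items[user].add(item)
        if etype == out_type:
            out_items[user].add(item)

    in_supp, out_supp = defaultdict(int), defaultdict(int)
    co = defaultdict(lambda: defaultdict(int))  # co[i][j]: users with i as input, j as output
    for user, items in in_items.items():
        for i in items:
            in_supp[i] += 1
            for j in out_items.get(user, ()):
                if i != j:  # similarity of an item to itself stays 0
                    co[i][j] += 1
    for items in out_items.values():
        for j in items:
            out_supp[j] += 1

    sim = {}
    for i, row in co.items():
        scored = {j: c / (in_supp[i] ** (1 - alpha) * out_supp[j] ** alpha)
                  for j, c in row.items()}
        # keep only the k most similar neighbours of item i
        sim[i] = dict(sorted(scored.items(), key=lambda kv: -kv[1])[:k])
    return sim


def predict_iknn(sim, user_in_items):
    """Score output items by summing similarities over the user's input items."""
    scores = defaultdict(float)
    for i in user_in_items:
        for j, s in sim.get(i, {}).items():
            scores[j] += s
    return dict(scores)
```

With this layout, a predictor written IKNN(I,O) on the slide would correspond to `train_iknn(events, in_type=I, out_type=O)`.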
6
Methods – Recalling recommendations
• Chart: The distribution of impression events by the number of weeks in which the same item has already been shown
• Observation: 38% of recommendations are recurring items
• Goal: Reverse engineering, recalling recommendations
• Method:
› Recommendation of already shown items
› Weighted by expected CTR
• Notation: RCTR
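One way to read the RCTR method is sketched below. The slide only says that already shown items are weighted by expected CTR; estimating that CTR from the number of weeks an item has been shown is an assumption, as are the function names.

```python
from collections import defaultdict


def fit_expected_ctr(history):
    """Estimate click-through rate per number of weeks an item has been shown.

    history: iterable of (weeks_shown, clicked) pairs from the training period.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for weeks, was_clicked in history:
        shown[weeks] += 1
        clicked[weeks] += int(was_clicked)
    return {w: clicked[w] / shown[w] for w in shown}


def predict_rctr(user_impressions, expected_ctr):
    """Score already shown items by their expected CTR.

    user_impressions: {item_id: number of weeks the item was shown to the user}.
    """
    return {item: expected_ctr.get(weeks, 0.0)
            for item, weeks in user_impressions.items()}
```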
7
Methods – Already seen items
• Chart: The probability of returning to an already seen item after interacting with other items
• Observation: Significant probability of re-clicking on an already clicked item
• Goal: Capturing the re-clicking phenomenon
• Method: Recommendation of already clicked items
• Notation: AS(I)
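A minimal sketch of an AS(I)-style predictor follows. The exponential time decay and its half-life are assumptions (the deck only notes that time-decay is commonly used), and timestamps are taken to be plain numbers measured in days.

```python
def predict_as(user_events, in_type, now, half_life_days=7.0):
    """Score items the user already interacted with, decayed by event age.

    user_events: iterable of (timestamp_in_days, item_id, type) for one user.
    in_type: the input event type I of AS(I).
    """
    scores = {}
    for t, item, etype in user_events:
        if etype != in_type:
            continue
        weight = 0.5 ** ((now - t) / half_life_days)  # halves every half_life_days
        scores[item] = scores.get(item, 0.0) + weight
    return scores
```

Under this reading, AS(1), AS(3) and AS(4) in the evaluation table would be three instances of the same function with different `in_type` values.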
8
Methods – User metadata-based popularity
• Observation:
› Significant number of passive and new users
› All target users have metadata
• Goal:
› Semi-personalized recommendations for new users
› Improving accuracy on inactive users
• Method:
1. Item model: Expected popularity of an item in each user group
2. Prediction: Average popularity of an item for a user
› Applied keys: jobroles, edu_fieldofstudies
• Notation: UPOP
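The two-step UPOP idea can be sketched as follows. Treating each metadata value (for example a job role) as a user group and normalizing popularity by the group's total event count are assumptions made for the example.

```python
from collections import defaultdict


def train_upop(events, user_keys):
    """Item model: relative popularity of each item within each user group.

    events: iterable of (user_id, item_id).
    user_keys: {user_id: set of metadata values, e.g. job roles}.
    """
    group_item = defaultdict(lambda: defaultdict(int))
    group_total = defaultdict(int)
    for user, item in events:
        for key in user_keys.get(user, ()):
            group_item[key][item] += 1
            group_total[key] += 1
    return {key: {item: c / group_total[key] for item, c in items.items()}
            for key, items in group_item.items()}


def predict_upop(model, keys):
    """Prediction: average group popularity of each item over the user's keys."""
    keys = list(keys)
    if not keys:
        return {}
    scores = defaultdict(float)
    for key in keys:
        for item, p in model.get(key, {}).items():
            scores[item] += p / len(keys)
    return dict(scores)
```

Because the prediction needs only the user's metadata, it also works for users with no events at all, which is the cold-start case the slide targets.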
9
Methods – MS: Meta cosine similarity
• Observation:
› Item cold-start problem, many low-supported items
› Almost all items have metadata
• Goal:
› Model building for new items
› Improving the model of low-supported items
• Method:
1. Item model: Meta-data representation, tf-idf
2. User model: Meta-words of items seen by the user
3. Prediction: Average cosine similarity between user-item models
› Keys: tags, title, industry_id, geo_country, geo_region, discipline_id
• Notation: MS
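A small pure-Python sketch of the three MS steps is given below. The smoothed idf variant `log(1 + n/df)` and the choice to average cosine similarities between the candidate and each item the user has seen are assumptions; the paper may define both differently.

```python
import math
from collections import Counter


def tfidf_vectors(item_tokens):
    """Item model: L2-normalized tf-idf vectors from {item_id: list of meta-words}."""
    n = len(item_tokens)
    df = Counter(tok for toks in item_tokens.values() for tok in set(toks))
    vectors = {}
    for item, toks in item_tokens.items():
        tf = Counter(toks)
        vec = {t: c * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors[item] = {t: v / norm for t, v in vec.items()} if norm else {}
    return vectors


def cosine(u, v):
    """Dot product of two already normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def predict_ms(vectors, seen_items, candidate):
    """Average cosine similarity between the candidate and the user's seen items."""
    seen = [i for i in seen_items if i in vectors]
    if not seen or candidate not in vectors:
        return 0.0
    return sum(cosine(vectors[i], vectors[candidate]) for i in seen) / len(seen)
```

Since the vectors depend only on item metadata, a brand-new item with no events still gets a usable model.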
10
Methods – AP: Age-based popularity change
• Observation: Significant drop in the popularity of items at ~30 and ~60 days of age
• Goal: Scoring these items lower
• Method: Expected ratio of the popularity in the next week
• Notation: AP
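The AP idea can be illustrated with a small estimator of the next-week popularity ratio per item age. The input layout and the day-level age granularity are assumptions; how the ratio enters the final combination is not specified on the slide and is left out here.

```python
from collections import defaultdict


def fit_age_ratio(weekly_counts):
    """Expected ratio of next-week to current-week popularity, per item age.

    weekly_counts: iterable of (age_in_days, events_this_week, events_next_week).
    """
    this_week = defaultdict(int)
    next_week = defaultdict(int)
    for age, current, following in weekly_counts:
        this_week[age] += current
        next_week[age] += following
    return {age: next_week[age] / this_week[age]
            for age in this_week if this_week[age] > 0}
```

A ratio well below 1 at a given age would flag items whose popularity is about to drop, such as those near ~30 or ~60 days.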
11
Methods – OM: The omit method
• Observation: Unwanted items in recommendation lists
• Goal: Omitting poorly modelled items of a predictor or combination
• Method:
1. Sub-train-test split
2. Retrain a new combination
3. Generating top-N recommendations
4. Measuring how the total evaluation would change by omitting items
5. Omitting worst K items on the original combination
• Notation: OM
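Steps 4 and 5 of the omit method can be sketched as below. A plain hit count stands in for the challenge's evaluation metric, and lists are backfilled with the next-ranked items after an omission; both are simplifying assumptions, as are the function names.

```python
def total_hits(ranked, relevant, n, omitted=frozenset()):
    """Sum of hits in each user's top-n list after dropping omitted items.

    ranked: {user_id: ranked list of item_ids, longer than n so it can backfill}.
    relevant: {user_id: set of item_ids the user interacted with in the test set}.
    """
    total = 0
    for user, items in ranked.items():
        top = [i for i in items if i not in omitted][:n]
        total += len(set(top) & relevant.get(user, set()))
    return total


def worst_items(ranked, relevant, n, k):
    """Return up to k items whose omission improves the total score the most."""
    base = total_hits(ranked, relevant, n)
    candidates = {i for items in ranked.values() for i in items[:n]}
    gains = {i: total_hits(ranked, relevant, n, frozenset([i])) - base
             for i in candidates}
    harmful = sorted((i for i in gains if gains[i] > 0),
                     key=lambda i: (-gains[i], i))
    return harmful[:k]
```

The returned items would be measured on the sub-train-test split and then omitted from the recommendations of the original combination.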
12
Methods – Optimization
1. Time-based train-test split (test set: last week)
2. Coordinate gradient descent optimization of various methods → candidate predictor set
3. Support-based distinct user groups (new users, inactive users, 10 equal-sized groups of active users)
4. Forward Predictor Selection
   1. Initialization:
      1. Selected predictor set: the predictors chosen from the candidate set for the final combination
      2. The selected predictor set is empty at the beginning
   2. Loop:
      1. Calculate the accuracy of the selected predictor set
      2. For each remaining candidate predictor, calculate the gain in accuracy that moving it to the selected set would give
      3. Move the best one to the selected set and recalculate the combination weights
      4. Repeat the loop while there is an improvement and candidate predictors remain
   3. Return: the set of selected predictors and the corresponding weights
5. Retrain selected predictors on the full data set
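The Forward Predictor Selection loop can be sketched as a greedy search. The `evaluate` callable (weights in, accuracy out) is a hypothetical interface, and the coordinate-wise grid search in `fit_weights` is a simple stand-in for whatever weight optimizer the author actually used.

```python
def fit_weights(names, evaluate, grid=(0.0, 0.25, 0.5, 1.0, 2.0, 4.0), sweeps=3):
    """Coordinate-wise grid search over combination weights (stand-in optimizer)."""
    weights = {name: 1.0 for name in names}
    best = evaluate(weights)
    for _ in range(sweeps):
        for name in names:
            for w in grid:
                trial = dict(weights)
                trial[name] = w
                score = evaluate(trial)
                if score > best:
                    best, weights = score, trial
    return weights, best


def forward_selection(candidates, evaluate):
    """Greedy forward predictor selection.

    candidates: list of predictor names.
    evaluate: callable mapping {name: weight} to an accuracy score (higher is better).
    """
    selected, weights = [], {}
    best_score = evaluate(weights)  # accuracy of the (initially empty) selected set
    remaining = list(candidates)
    while remaining:
        trials = []
        for name in remaining:
            w, s = fit_weights(selected + [name], evaluate)
            trials.append((s, name, w))
        score, name, w = max(trials, key=lambda t: t[0])
        if score <= best_score:  # no candidate improves the combination
            break
        selected.append(name)
        remaining.remove(name)
        weights, best_score = w, score
    return selected, weights, best_score
```

Running this once per user group would give group-specific weights, in the spirit of the support-based user groups in step 3.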
13
… let’s put it together and see how it performs!
19
Evaluation – Forward Predictor Selection
• Best single model
› Item-kNN trained on positive interactions
› 2.5 min training time
› 7 ms prediction time
• Sub-combinations
› 4 models: 600K+ score (w/o item metadata)
› 5 models: 3rd place
› 6 models: 95% of final score
› 10 models: 650K+ score (<30 mins. training time)
• Final combination
› 3rd place
› ~666K leaderboard score
› 11 instances
› user-support-based weighting
› 3h+ training time, 200 ms prediction time
#   Predictor      tTR(s)*   tPR(ms)*   Score     Rank
1   IKNN(C,C)      148       7          450,046   24
2   +RCTR          208       15         548,338   9
3   +AS(1)         237       17         590,526   6
4   +UPOP          247       50         614,674   5
5   +MS            364       122        623,909   3
6   +IKNN(R,R)     1,150     168        635,278   3
7   +AS(3)         1,205     178        636,498   3
8   +IKNN(R,C)     1,557     197        643,145   3
9   +AS(4)         1,582     202        644,710   3
10  +AP            1,621     207        652,802   3
    SUPP_C(1-10)   1,639     194        661,359   3
11  +OM            11,790    199        665,592   3
* tTR: training time, tPR: prediction time. Java-based framework, 8-core 3.4 GHz CPU, 32 GB memory
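The marginal contribution of each step, and claims such as "6 models: 95% of final score", can be checked directly from the table. The snippet below only re-derives figures from the scores listed above; the variable names are arbitrary.

```python
# (label, leaderboard score) rows copied from the table above
steps = [
    ("IKNN(C,C)", 450_046), ("+RCTR", 548_338), ("+AS(1)", 590_526),
    ("+UPOP", 614_674), ("+MS", 623_909), ("+IKNN(R,R)", 635_278),
    ("+AS(3)", 636_498), ("+IKNN(R,C)", 643_145), ("+AS(4)", 644_710),
    ("+AP", 652_802), ("SUPP_C(1-10)", 661_359), ("+OM", 665_592),
]

final = steps[-1][1]
report = []  # (label, gain over previous step, share of final score)
prev = 0
for label, score in steps:
    report.append((label, score - prev, score / final))
    prev = score

for label, gain, share in report:
    print(f"{label:<13} gain={gain:>7,} share={share:.1%}")
```

The first four predictors account for most of the score, while the later ones each add comparatively small gains at a growing training-time cost.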
20
Evaluation – Timeline
• Chart: Leaderboard rank and leaderboard score (thousands) by date, Apr-25 to Jun-27, across three phases: initial setup, model design and implementation, final sprint
• The plotted leaderboard score rises from 115.4 to 665.6 (thousands) over the period
21
Lessons learnt
• Exploiting the specificity of the dataset
• Using Item-kNN over factorization in a very sparse dataset
• Paying attention to recurrence
• Forward Predictor Selection is effective
• Different optimization for different user groups
• Underscoring/omitting weak items
• Ranking 200K items is slow
• Keep it simple and transparent!
22
Presenter
Contact
Thank you for your attention!
Dávid Zibriczky, PhD
david.zibriczky@gmail.com
