WSDM Challenge
Recommender Systems
Kenneth Emeka Odoh
25 Jan, 2018 ( Kaggle Meetup | SFU Ventures Labs )
Vancouver, BC
Talk Plan
● Bio
● Competition
● Winning Solutions
● My solution
● Conclusions
● References
Bio
Education
● Master's in Computer Science (University of Regina), 2013 - 2016
Work
● Research assistant in the field of Visual Analytics 2013 - 2016
○ ( https://scholar.google.com/citations?user=3G2ncgwAAAAJ&hl=en )
● Software engineer at an IoT startup (current position), 2017 -
Competition
● Competition rule
○ Only 5 submissions per day
● Data Exploration
○ Data visualization
● Evaluation
● Primer on recommender systems
Competition: Data Exploration

Nature of Data
● Data on listening habits.
● Data are chronologically ordered.
○ Implicit time-series data without timestamps.
○ Timestamps were excluded to avoid data leakage.
● Balanced data set.

What to predict
● Given a data set of users' listening habits.
● If a song has been listened to in the recent past, predict whether it will be listened to again within a month.
● The target is 1 if the song is listened to again by the same user within the month, and 0 otherwise.
Competition: Background
● URL ( https://www.kaggle.com/c/kkbox-music-recommendation-challenge )
● Sister competition on churn prediction by the same company.
● Total prize: $5,000
Class of Problem
● Classification
● Sequence prediction?
Competition: Data files
● train.csv
○ msno, song_id, source_system_tab, source_screen_name, source_type,
target
● songs.csv
○ song_id, song_length, genre_ids, artist_name, composer, lyricist,
language
● members.csv
○ msno, city, bd, gender, registered_via, registration_init_time,
expiration_date
● song_extra_info.csv
○ song_id, song name, isrc
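A minimal sketch of loading and joining these files with pandas; the file paths assume the competition data sits in the working directory:

import pandas as pd

train = pd.read_csv("train.csv")
songs = pd.read_csv("songs.csv")
members = pd.read_csv("members.csv")
song_extra = pd.read_csv("song_extra_info.csv")

# join song, member, and extra song metadata onto the training interactions
df = (train.merge(songs, on="song_id", how="left")
           .merge(members, on="msno", how="left")
           .merge(song_extra, on="song_id", how="left"))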
Competition: Visualization of Data [jeru666]

Figure 1: Count vs Source type
Figure 2: Count vs Source type grouped by labels
Figure 3: Source screen names vs Count
Figure 4: Count vs Source screen names grouped by label
Figure 5: Count vs Source system tab
Figure 6: Proportion by Sex
Figure 7: Proportion by Sex grouped by label
Figure 8: Proportion by Language
Figure 9: City vs Count
Competition: Evaluation (AUC)
Image credit: Walber & http://gim.unmc.edu/dxtests/roc2.htm
A curve dominates in ROC space if and only if it dominates in PR space [Davis et al., 2006].
Evaluation Metric
● AUC per user vs AUC per group ?
● What about GAUC?
○ calculate the average of the AUC scores of each user.
■ Cold start problem
● AUC vs F-score vs RMSE ?
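A minimal sketch of GAUC, assuming a pandas DataFrame with hypothetical columns msno (user id), target (label), and pred (model score). Users whose labels are all 0 or all 1 have an undefined AUC and must be skipped, which is one face of the cold-start problem:

import numpy as np
from sklearn.metrics import roc_auc_score

def gauc(df):
    # average of per-user AUCs, weighted by each user's number of rows
    scores, weights = [], []
    for _, g in df.groupby("msno"):
        if g["target"].nunique() < 2:   # AUC undefined: skip (cold-start users)
            continue
        scores.append(roc_auc_score(g["target"], g["pred"]))
        weights.append(len(g))
    return np.average(scores, weights=weights)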
Competition: Recommender Systems
U: set of users
I: set of items
R: set of ratings of users on items
I_u: set of items rated by a user u
f: U × I → R

The rating matrix is usually sparse, so some form of data imputation is needed.

The most common forms of recommender systems:
● Best-item recommendation
● Top-N recommendation

For a user u_a, the best item i* is the one with the highest estimated value:

i* = argmax_{j ∈ I \ I_u} f(u_a, j)
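A minimal numpy sketch of the two recommendation modes, assuming a hypothetical dense matrix F of estimated scores f(u, j) and a boolean mask rated marking the items already in I_u:

import numpy as np

def best_item(F, rated, u):
    # i* = argmax over j in I \ I_u of f(u, j)
    scores = np.where(rated[u], -np.inf, F[u])
    return int(np.argmax(scores))

def top_n(F, rated, u, n=10):
    # the n unrated items with the highest estimated scores
    scores = np.where(rated[u], -np.inf, F[u])
    return np.argsort(scores)[::-1][:n]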
WINNING SOLUTIONS
1st Place Solution
● Similar to conversion rate (CVR) prediction.
○ Click-through rate (CTR) is the ratio of users who click on a specific link to the total number of users who accessed the resource.
● Relistening is like purchasing, and first listening is like clicking.
● Latent-factor and neighborhood-based collaborative filtering are not the best fit for this problem.
● Missing not at random.
● Data is sparse, and each user-song pair is very informative.
● Data is time sensitive.
Ensembling
0.6 * LightGBMs and 0.4 * NNs
● A single LightGBM scores 0.74440 on the public LB with a learning rate of 0.1.
● A single LightGBM scores 0.74460 on the public LB with a learning rate of 0.05.
● A single LightGBM with bagging scores 0.74584 on the public LB with a learning rate of 0.05.
For NNs, my best 30-ensemble on the same feature set can get 0.74215, and a bagging ensemble of NNs on different feature sets can get 0.74333.
The predictions of LightGBMs and NNs are quite uncorrelated (correlation coefficient is about 0.92). The
ensemble of single LightGBM and single 30-ensemble of NN can get 0.7489+. The ensemble of ensembles
can reach 0.7498+.
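A sketch of the blend described above, assuming hypothetical arrays lgb_preds and nn_preds holding the per-model test predictions for each family:

import numpy as np

# average within each family (bagging), then 0.6 * LightGBMs + 0.4 * NNs
final_pred = 0.6 * np.mean(lgb_preds, axis=0) + 0.4 * np.mean(nn_preds, axis=0)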
Data preprocessing
https://kaggle2.blob.core.windows.net/forum-message-attachments/259372/8131/model.png
Analyzing Missing Data
Missing-data mechanisms
● Missing completely at random: the probability of missingness is the same for all cases.
● Missing at random: the probability of missingness depends only on observed information.
● Missingness that depends on unobserved predictors.
● Missingness that depends on the missing value itself.
Handling missing data
● Exclude missing data
● Mean / median imputation
● Last value carried forward
● Using information from related observations
[http://www.stat.columbia.edu/~gelman/arm/missing.pdf]
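A toy pandas illustration of three of these strategies; the frame and its values are invented for the example:

import numpy as np
import pandas as pd

df = pd.DataFrame({"bd": [24.0, np.nan, 31.0, np.nan],
                   "song_length": [210.0, np.nan, 185.0, 240.0]})

listwise = df.dropna()                            # exclude missing data
mean_imp = df.fillna(df.mean(numeric_only=True))  # mean imputation
lvcf     = df.ffill()                             # last value carried forward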
3rd Place Solution
Identified how to generate features from the future: whether the same user listens to the same artist again later in the data.
Added matrix factorization as a feature, and used xgboost and catboost.
● Last 35% of the data for future features
● Earlier 65% for history
The size of X_train, X_val was about 20% of the data.
The history features can be generated using the target. The other features use all of the data. The features were generated in the context of categorical features (see the sketch after this section):
1) mean of target from history by categorical features (pairs and triples)
2) count – the number of observations in all data by categorical features (pairs and triples)
3) regression – linear regression on the target by categorical features (msno, msno + genre_ids, msno + composer, and so on)
4) time_from_prev_heard – time since last heard by categorical features (msno, msno + genre_ids, msno + composer, and so on)
5) time_to_next_heard – time to next heard by categorical features (pairs and triples which include msno)
6) last_time_diff – time to the last heard by categorical features
7) part_of_unique_song – the share of the artist's unique songs heard by the user, out of all the artist's unique songs
8) matrix_factorization – LightFM. I use msno as the user in different contexts (source_type), and msno was also a feature for the user.
I use the last 35% for fitting. I also drop the last 5% and again use 35%. This procedure could be repeated many times and greatly improved the score (I did it 6 times on validation, which gave an increase of 0.005, but in the end I did not have time to do everything and did it only 2 times on test).
As a result, I fit xgboost and catboost on two parts of the data, with and without the matrix_factorization feature. Finally, I blend all 8 predictions.
The code is available here:
https://github.com/VasiliyRubtsov/wsdm_music_recommendations/blob/master/pipeline.ipynb
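Below is a hedged sketch of a few of the features above, not the author's pipeline. It assumes pandas frames train and history with columns msno, artist_name, and target; the idx ordering column, the user_idx/song_idx index arrays, the n_users/n_songs counts, and the LightFM hyperparameters are all names introduced here for illustration:

import pandas as pd
from lightfm import LightFM
from scipy.sparse import coo_matrix

# (1) mean of target from history, by the categorical pair (msno, artist_name)
hist_mean = (history.groupby(["msno", "artist_name"])["target"]
                    .mean().rename("msno_artist_mean").reset_index())
train = train.merge(hist_mean, on=["msno", "artist_name"], how="left")

# (4)/(5) rows since the previous / until the next play of the same pair;
# row order stands in for time because real timestamps are absent
train["idx"] = range(len(train))
g = train.groupby(["msno", "artist_name"])["idx"]
train["time_from_prev_heard"] = train["idx"] - g.shift(1)
train["time_to_next_heard"] = g.shift(-1) - train["idx"]

# (8) matrix factorization with LightFM on a sparse user-song matrix
interactions = coo_matrix((train["target"].values, (user_idx, song_idx)),
                          shape=(n_users, n_songs))
mf = LightFM(no_components=30, loss="warp")   # hyperparameters assumed
mf.fit(interactions, epochs=10, num_threads=4)
train["matrix_factorization"] = mf.predict(user_idx, song_idx)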
Factorization Machine
● Factors pairwise feature interactions into a number of latent vectors.
● Lends itself to kernels.
For a deeper dive into factorization, see:
https://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf
Observed characteristics of factorization machines
● Ideal for handling missing data
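For reference, the standard second-order factorization machine model (Rendle, 2010), where each feature i carries a k-dimensional latent vector v_i:

ŷ(x) = w_0 + Σ_{i=1..n} w_i x_i + Σ_{i=1..n} Σ_{j=i+1..n} ⟨v_i, v_j⟩ x_i x_j

The factored pairwise term lets the model estimate interactions between features that never co-occur in training, which is why it copes well with sparse data.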
6th Place Solution
I use an ensemble (average) of LightGBM models. The best single model scored 0.726 on the public LB.
Feature Engineering
1) raw features for song, user, and context
2) SVD matrix-factorization latent space of the user-song interaction matrix
3) SVD matrix-factorization score of the user-song interaction matrix
4) collaborative filtering (CF) score (top 100 neighborhoods) of user-song interactions
5) similarity of artist, lyricist, composer, genre, language, source_system_tab, source_type, source_screen_name within each user
6) source_system_tab, source_type, source_screen_name categorical-variable rate for each user
7) word embeddings of artist, lyricist, composer, genre
I use 10-fold cross-validation, where the train/validation sets are split randomly.
SVD
[https://intoli.com/blog/pca-and-svd/]
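A minimal sketch of features (2) and (3) via truncated SVD of the interaction matrix; the rows/cols index arrays, the n_users/n_songs counts, and the rank k=48 are assumptions:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# binary user-song interaction matrix: 1 where the user played the song
M = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n_users, n_songs))

U, s, Vt = svds(M.astype(np.float64), k=48)
user_vecs = U * s          # latent-space features for users   (feature 2)
song_vecs = Vt.T           # latent-space features for songs

def svd_score(u, i):
    # reconstructed affinity of user u for song i              (feature 3)
    return float(user_vecs[u] @ song_vecs[i])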
Word Embeddings
[https://deeplearning4j.org/word2vec.html]
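A sketch of feature (7), treating each user's chronological sequence of artists as a "sentence"; the column names are assumptions, and the call uses the current gensim API (parameter names differ in older gensim versions):

from gensim.models import Word2Vec

# one sentence per user: the sequence of artists they played
sentences = train.groupby("msno")["artist_name"].apply(list).tolist()
w2v = Word2Vec(sentences, vector_size=32, window=5, min_count=1, workers=4)
vec = w2v.wv["some artist"]   # 32-d embedding used as a model feature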
My Solution
● Primer on LSTM architectures.
○ Bidirectional LSTM
○ Batch normalization
● Intuition / a guide for building a reasonable neural architecture.
● Tricks to avoid overfitting.
● Optimizing the loss function using AUC.
● Cross-validation of time-series data.
● Failures in my traditional ML models.
○ Trying the same model with different custom loss functions
LSTM
● An LSTM is an RNN trained using BPTT; its gating mitigates the vanishing-gradient problem.
● Captures long-term temporal dependencies.
● A memory block is used in place of a neuron. There are three types of gates in a memory block:
○ Forget gate: decides what information to remove from the unit.
○ Input gate: decides what information to use to update the memory state.
○ Output gate: decides what information to output, based on the input and the memory state.
[Goodfellow et al., Deep Learning, 2016]
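For reference, the standard LSTM update equations, with σ the logistic sigmoid and ⊙ elementwise multiplication:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (input gate)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (output gate)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · [h_{t-1}, x_t] + b_c)
h_t = o_t ⊙ tanh(c_t)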
[Two slides of figures reproduced from Fei-Fei et al., 2017]
LSTM Architecture
Many-to-many model (boundary detection problem) [Fei-Fei et al., 2017; Google images]
See the tutorial on the subject: https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/
Bidirectional LSTM
● Bidirectional LSTMs are trained using both the past and the future of a time frame.
● Bidirectional LSTMs can improve performance on a number of sequence-classification problems.
● A bidirectional LSTM is two LSTMs: the first accepts the input as-is, and the second accepts the input in reverse order.
[Yu, Zhou et al.]
Batch Normalization
● Normalization inside the neural architecture.
● Reduces internal covariate shift, thereby improving training speed.
● Controls gradient flow in the network, thereby reducing saturation in non-ReLU activations.
● Aligns the distributions of the training and test sets, thereby enhancing generalization.
● Implicit regularization.
[Ioffe et al., 2015]
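The transform applied per activation over a mini-batch B of size m, with learned scale γ and shift β [Ioffe et al., 2015]:

μ_B = (1/m) Σ_i x_i,   σ_B² = (1/m) Σ_i (x_i − μ_B)²
x̂_i = (x_i − μ_B) / √(σ_B² + ε)
y_i = γ x̂_i + β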
Overfitting
● Implicit Regularization
○ Batch normalization
○ Early stopping [https://en.wikipedia.org/wiki/Early_stopping]
● Dropout
[Srivastava et al., 2014]
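A minimal early-stopping sketch, assuming a compiled Keras model and hypothetical X_train/X_val arrays (argument names follow the current Keras API):

from keras.callbacks import EarlyStopping

# stop once validation loss has not improved for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5)
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100, callbacks=[early_stop])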
Cross-Validation of Time-Series
● Don’t trust the leaderboard.
● Trust local CV for model comparison.
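For chronologically ordered data like this, a walk-forward split keeps validation honest. A sketch using scikit-learn's TimeSeriesSplit (a swapped-in utility, not necessarily what was used in the competition); X and y are hypothetical feature/label arrays in chronological order:

from sklearn.model_selection import TimeSeriesSplit

# every fold trains strictly on the past and validates on the future
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    assert train_idx.max() < val_idx.min()  # no future rows leak into training
    # fit on X[train_idx], y[train_idx]; evaluate on X[val_idx], y[val_idx]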
My Solution: Code
# Keras 1.x-era model. The imports and the hyperparameter values below are
# added here for completeness and are assumptions; see the repository linked
# at the end for the exact configuration.
import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.models import Sequential
from keras.layers import (Embedding, Convolution1D, MaxPooling1D, Bidirectional,
                          LSTM, Dense, Dropout, Activation, BatchNormalization)
from keras.layers.advanced_activations import LeakyReLU

inputDim = 64          # input vector dimension (assumed)
nb_filter = 64         # number of convolution filters (assumed)
filter_length = 3      # convolution kernel size (assumed)
pool_length = 2        # max-pooling window (assumed)

def model_relu6():
    model = Sequential()
    model.add(Embedding(inputDim*inputDim*inputDim, inputDim, dropout=0.2))
    # 1-D convolution + pooling in front of the recurrent stack
    model.add(Convolution1D(nb_filter=nb_filter, filter_length=filter_length,
                            border_mode='valid', activation='relu',
                            subsample_length=1))
    model.add(MaxPooling1D(pool_length=pool_length))
    # four bidirectional LSTM blocks, each followed by batch norm and dropout
    model.add(Bidirectional(LSTM(1024, return_sequences=True)))
    model.add(LeakyReLU())
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Bidirectional(LSTM(2048, return_sequences=True)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Bidirectional(LSTM(512, return_sequences=True)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Bidirectional(LSTM(256)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=[jacek_auc, discussion41015_auc])
    return model

Code: https://github.com/kenluck2001/KaggleKenneth/tree/master/wsdm_Kaggle
# Full code here: https://www.kaggle.com/rspadim/gini-keras-callback-earlystopping-validation
# FROM https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/41108
def jacek_auc(y_true, y_pred):
    score, up_opt = tf.metrics.auc(y_true, y_pred)
    #score, up_opt = tf.contrib.metrics.streaming_auc(y_pred, y_true)
    K.get_session().run(tf.local_variables_initializer())
    with tf.control_dependencies([up_opt]):
        score = tf.identity(score)
    return score

# FROM https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/41015
# AUC for a binary classifier
def discussion41015_auc(y_true, y_pred):
    ptas = tf.stack([binary_PTA(y_true, y_pred, k) for k in np.linspace(0, 1, 1000)], axis=0)
    pfas = tf.stack([binary_PFA(y_true, y_pred, k) for k in np.linspace(0, 1, 1000)], axis=0)
    pfas = tf.concat([tf.ones((1,)), pfas], axis=0)
    binSizes = -(pfas[1:] - pfas[:-1])
    s = ptas * binSizes
    return K.sum(s, axis=0)

# PFA, prob false alert for binary classifier
def binary_PFA(y_true, y_pred, threshold=K.variable(value=0.5)):
    y_pred = K.cast(y_pred >= threshold, 'float32')
    # N = total number of negative labels
    N = K.sum(1 - y_true)
    # FP = total number of false alerts, alerts from the negative class labels
    FP = K.sum(y_pred - y_pred * y_true)
    return FP / N

# P_TA prob true alerts for binary classifier
def binary_PTA(y_true, y_pred, threshold=K.variable(value=0.5)):
    y_pred = K.cast(y_pred >= threshold, 'float32')
    # P = total number of positive labels
    P = K.sum(y_true)
    # TP = total number of correct alerts, alerts from the positive class labels
    TP = K.sum(y_pred * y_true)
    return TP / P
Private leaderboard: 0.59
Alternatives
● Trees (CatBoost, LightGBM)
Others that could work with careful effort:
● HMM
● MARS
● ANFIS
● Logistic regression
● Naive Bayes
Conclusions
● Building models versus building infrastructure: a trade-off.
● Debugging a failing model versus trying out a new model.
● Feature engineering versus a better algorithm.
● Neural depth versus training size.
● “Ensembling works when the conditions are right.”
○ Provide empirical justification.
● The place of magic numbers.
● “Leakages may not actually be a bad thing”.
Thanks for listening
@kenluck2001
https://www.linkedin.com/in/kenluck2001/
kenneth.odoh@gmail.com
Kenneth Emeka Odoh
References
Yu, Zhou et al. “Using bidirectional lstm recurrent neural networks to learn high-level abstractions of sequential features for automated scoring of
non-native spontaneous speech.” ASRU (2015).
Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd
International Conference on Machine Learning, in PMLR 37:448-456 (2015).
Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor. 2010. Recommender Systems Handbook (1st ed.). Springer-Verlag New York, Inc.,
New York, NY, USA.
Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press.
Jeru666, Visualization script, https://www.kaggle.com/jeru666/data-exploration-and-analysis .
Davis, J. & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. ICML '06: Proceedings of the 23rd international
conference on Machine learning 233--240, New York, NY, USA
Fei-Fei Li & Justin Johnson & Serena Yeung, (2017), http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929-1958.