Dictionary Learning for
Massive Matrix Factorization
Arthur Mensch, Julien Mairal
Gaël Varoquaux, Bertrand Thirion
Inria Parietal, Inria Thoth
October 6, 2016
Introduction
Why am I here ?
Inria Parietal: machine learning for neuro-imaging
(fMRI data)
Matrix factorization: major ingredient in fMRI analysis
Very large datasets (2 TB): we designed faster algorithms
These algorithms can be used in collaborative filtering
[Figure: fMRI data matrix X (voxels × time) factorized as D (k spatial maps) times A (k × time)]
Work presented at ICML 2016
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 28
Outline
1 Matrix factorization for recommender systems
Collaborative filtering
Matrix factorization formulation
Existing methods
2 Subsampled online dictionary learning
Dictionary learning – existing methods
Handling missing values efficiently
New algorithm
3 Results
Setting
Benchmarks
Parameter setting
Collaborative filtering
Collaborative platform
n users rate a fraction of
p items
e.g. movies, restaurants
Estimate ratings for
recommendation
Use the ratings of other users for recommendation
How to predict ratings?
Credit: [Bell and Koren, 2007]
Joe likes We Were Soldiers and Black Hawk Down.
Bob and Alice like the same films, and also like Saving Private Ryan.
Joe should watch Saving Private Ryan, because all of them evidently like war films.
Need to uncover topics in items
Predicting ratings with scalar products
Embeddings to model the existence of genres/categories/topics
Representative vectors for users and items:
$(\alpha_j)_{1 \le j \le n},\ (d_i)_{1 \le i \le p} \in \mathbb{R}^k$
q-th coefficient of $d_i$, $\alpha_j$ = affinity with the "topic" q
Ratings $x_{ij}$ (item i, user j):
$x_{ij} = d_i^\top \alpha_j$ (+ biases)
= common affinity for topics
[Figure: rating $x_{ij}$ as the scalar product of item vector $d_i$ and user vector $\alpha_j$ over k topics]
Learning problem: estimate D and A with known ratings
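As a minimal sketch of the prediction rule above (biases omitted), the snippet below stores the item vectors $d_i$ as rows of D and the user vectors $\alpha_j$ as columns of A. The sizes, the random values and the helper name `predict_rating` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 5, 4, 3            # items, users, topics (toy sizes, assumed)
D = rng.normal(size=(p, k))  # row i is the item embedding d_i
A = rng.normal(size=(k, n))  # column j is the user embedding alpha_j


def predict_rating(D, A, i, j):
    """Predicted rating of item i by user j: scalar product d_i^T alpha_j."""
    return float(D[i] @ A[:, j])


# All p x n predictions at once are the matrix product DA.
X_hat = D @ A
```

Each entry of `X_hat` equals the corresponding call to `predict_rating`, which is why the learning problem can be written as a matrix factorization.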
Matrix factorization
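[Figure: masked rating matrix $1_\Omega \odot X$ (p × n) approximated by the product of D (p × k) and A (k × n)]
$X \in \mathbb{R}^{p \times n} \approx DA$, with $D \in \mathbb{R}^{p \times k}$ and $A \in \mathbb{R}^{k \times n}$
Constraints / penalty on factors D and A
We only observe $1_\Omega \odot X$, where Ω is the set of ratings provided by users
Recommender systems: millions of users, millions of items
How to scale matrix factorization to very large datasets?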
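Formalism
Finding representations in $\mathbb{R}^k$ for items and users:
$$\min_{D \in \mathbb{R}^{p \times k},\ A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \big( \|D\|_F^2 + \|A\|_F^2 \big)$$
$$= \| 1_\Omega \odot (X - DA) \|_2^2 + \lambda \big( \|D\|_F^2 + \|A\|_F^2 \big)$$
$1_\Omega$: indicator of the set of known ratings
$\ell_2$ reconstruction loss, $\ell_2$ penalty for generalization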
Existing methods
Alternated minimization
Stochastic gradient descent
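The masked objective from the Formalism slide can be evaluated in a few lines. This is a sketch under the assumption that the set Ω is given as a 0/1 array `mask` of the same shape as X; the function name and the toy values are hypothetical.

```python
import numpy as np


def masked_objective(X, mask, D, A, lam):
    """Squared error on observed entries plus Frobenius penalties on D and A."""
    residual = mask * (X - D @ A)  # unobserved entries contribute zero
    return float(np.sum(residual ** 2) + lam * (np.sum(D ** 2) + np.sum(A ** 2)))


# Toy check: 3 items, 2 users, k = 1, two observed ratings.
D = np.array([[1.0], [2.0], [0.0]])
A = np.array([[1.0, 3.0]])
X = np.array([[1.0, 0.0], [0.0, 5.0], [0.0, 0.0]])
mask = np.array([[1, 0], [0, 1], [0, 0]])
loss = masked_objective(X, mask, D, A, lam=0.1)  # 1.0 + 0.1 * 15 = 2.5
```

Only the two observed entries enter the data term; the penalty is computed on the full factors.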
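Existing methods
Alternated minimization
Minimize over A, D: alternate between
$$D = \operatorname*{argmin}_{D \in \mathbb{R}^{p \times k}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|D\|_F^2$$
$$A = \operatorname*{argmin}_{A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|A\|_F^2$$
No hyperparameters
Slow and memory expensive: uses all ratings at each iteration
a.k.a. coordinate descent (variation in parameter update order)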
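One way to picture the alternation is the sketch below, where each sub-problem splits into independent ridge regressions, one per user and one per item. This is an illustrative implementation under the assumption of a boolean `mask` for Ω; it is not the solver benchmarked in the talk, and the toy data are random.

```python
import numpy as np


def objective(X, mask, D, A, lam):
    """Masked squared error plus Frobenius penalties."""
    return float(np.sum((mask * (X - D @ A)) ** 2)
                 + lam * (np.sum(D ** 2) + np.sum(A ** 2)))


def als_sweep(X, mask, D, A, lam):
    """One sweep of alternated minimization (mask is a boolean p x n array)."""
    reg = lam * np.eye(D.shape[1])
    # Update each user code alpha_j with D fixed: ridge regression on rated items.
    for j in range(X.shape[1]):
        rows = mask[:, j]
        D_j = D[rows]
        A[:, j] = np.linalg.solve(D_j.T @ D_j + reg, D_j.T @ X[rows, j])
    # Update each item embedding d_i with A fixed: ridge regression on its raters.
    for i in range(X.shape[0]):
        cols = mask[i]
        A_i = A[:, cols]
        D[i] = np.linalg.solve(A_i @ A_i.T + reg, A_i @ X[i, cols])
    return D, A


rng = np.random.default_rng(0)
p, n, k, lam = 8, 10, 2, 0.1
X = rng.normal(size=(p, n))
mask = rng.random((p, n)) < 0.5
D = rng.normal(size=(p, k))
A = rng.normal(size=(k, n))
history = [objective(X, mask, D, A, lam)]
for _ in range(5):
    D, A = als_sweep(X, mask, D, A, lam)
    history.append(objective(X, mask, D, A, lam))
```

Every sweep visits all observed ratings, which illustrates why the approach is slow and memory expensive on large datasets, while the objective in `history` never increases.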
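Existing methods
Stochastic gradient descent
$$\min_{A, D} \sum_{(i,j) \in \Omega} f_{ij}(A, D), \qquad f_{ij}(A, D) \overset{\text{def}}{=} (x_{ij} - d_i^\top \alpha_j)^2 + \frac{1}{c_j} \lambda \|\alpha_j\|_2^2 + \frac{1}{c_i} \lambda \|d_i\|_2^2$$
Gradient step for each rating:
$$(A_t, D_t) \leftarrow (A_{t-1}, D_{t-1}) - \frac{1}{c_t} \nabla_{(A,D)} f_{ij}(A_{t-1}, D_{t-1})$$
Fast and memory efficient – won the Netflix prize
Very sensitive to step sizes ($c_t$) – need to cross-validate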
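A single gradient step on $f_{ij}$ might look like the sketch below. The slides do not define $c_i$ and $c_j$; here they are assumed to be the number of ratings of item i and of user j, and `step` plays the role of $1/c_t$. All names and toy values are illustrative.

```python
import numpy as np


def f_ij(D, A, x_ij, i, j, lam, c_i, c_j):
    """Per-rating loss with penalties scaled by the (assumed) rating counts."""
    err = x_ij - D[i] @ A[:, j]
    return float(err ** 2
                 + lam / c_j * (A[:, j] @ A[:, j])
                 + lam / c_i * (D[i] @ D[i]))


def sgd_step(D, A, x_ij, i, j, lam, c_i, c_j, step):
    """Update d_i and alpha_j in place with one gradient step on f_ij."""
    d_i, a_j = D[i].copy(), A[:, j].copy()
    err = x_ij - d_i @ a_j
    D[i] = d_i - step * (-2.0 * err * a_j + 2.0 * lam / c_i * d_i)
    A[:, j] = a_j - step * (-2.0 * err * d_i + 2.0 * lam / c_j * a_j)


rng = np.random.default_rng(0)
D = rng.normal(size=(4, 3))
A = rng.normal(size=(3, 5))
before = f_ij(D, A, 4.0, 1, 2, lam=0.1, c_i=3, c_j=2)
sgd_step(D, A, 4.0, 1, 2, lam=0.1, c_i=3, c_j=2, step=1e-3)
after = f_ij(D, A, 4.0, 1, 2, lam=0.1, c_i=3, c_j=2)
```

Only one row of D and one column of A are touched per rating, which is what makes the method memory efficient; the price is the choice of `step`.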
Towards a new algorithm
Best of both worlds ?
Fast and memory efficient algorithm
Low sensitivity to hyperparameter settings
Subsampled online dictionary learning
Builds upon the online dictionary learning algorithm
popular in computer vision and interpretable learning (fMRI)
Adapt it to handle missing values efficiently
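Dictionary learning
Recall: recommender system formalism
Non-masked matrix factorization with $\ell_2$ penalty:
$$\min_{D \in \mathbb{R}^{p \times k},\ A \in \mathbb{R}^{k \times n}} \sum_{j=1}^{n} \|x_j - D\alpha_j\|_2^2 + \lambda \big( \|D\|_F^2 + \|A\|_F^2 \big)$$
Penalties can be richer, and made into constraints
Dictionary learning
Learn the left side factor [Olshausen and Field, 1997]
$$\min_{D \in \mathcal{C}} \sum_{j=1}^{n} \|x_j - D\alpha_j\|_2^2 + \lambda \Omega(\alpha_j), \qquad \alpha_j = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_j - D\alpha\|_2^2 + \lambda \Omega(\alpha)$$
Naive approach: alternated minimization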
Online dictionary learning [Mairal et al., 2010]
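At iteration t, select $x_t$ in $\{x_j\}_j$ (user ratings), improve D
Single iteration complexity ∝ sample dimension: O(p)
$(D_t)_t$ converges in a few epochs (one for large n)
[Figure: samples $x_t$ streamed from the p × n data matrix, each approximated by $D\alpha_t$ with D of size p × k]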
Very efficient in computer vision / networks / fMRI /
hyperspectral images
Can we use it efficiently for recommender systems ?
In short: Handling missing values
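[Figure: two panels. Batch → online: handle large n by streaming the columns $x_t$ of the p × n matrix X. Online → online + partial: handle missing values by streaming masked samples $M_t x_t$ and ignoring unknown, unaccessed entries.]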
Leverage streaming + partial access to samples
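In detail: online dictionary learning
Objective function involves latent codes (right side factor)
$$\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \frac{1}{2} \|x_i - D\alpha_i^*(D)\|_2^2, \qquad \alpha_i^*(D) = \operatorname*{argmin}_{\alpha} \frac{1}{2} \|x_i - D\alpha\|_2^2 + \lambda \Omega(\alpha)$$
Replace latent codes by codes computed with old dictionaries
Build an upper-bounding surrogate function
$$\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \frac{1}{2} \|x_i - D\alpha_i\|_2^2, \qquad \alpha_i = \operatorname*{argmin}_{\alpha} \frac{1}{2} \|x_i - D_{i-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$$
Minimize the surrogate, which can be updated online at low cost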
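A short derivation, offered as a sketch, of why the surrogate is cheap to update: expanding the squared norms shows that it depends on past samples only through two running averages, written $A_t$ and $B_t$ in the algorithm outline. The factor 1/2 follows the scaling used in the code computation.

```latex
\begin{aligned}
\frac{1}{t}\sum_{i=1}^{t} \frac{1}{2}\|x_i - D\alpha_i\|_2^2
&= \frac{1}{2}\operatorname{Tr}\Big(D^\top D \,\frac{1}{t}\sum_{i=1}^{t}\alpha_i\alpha_i^\top\Big)
 - \operatorname{Tr}\Big(D^\top \frac{1}{t}\sum_{i=1}^{t} x_i\alpha_i^\top\Big)
 + \frac{1}{2t}\sum_{i=1}^{t}\|x_i\|_2^2 \\
&= \operatorname{Tr}\Big(\frac{1}{2} D^\top D A_t - D^\top B_t\Big) + \text{const},
\qquad A_t = \frac{1}{t}\sum_{i=1}^{t}\alpha_i\alpha_i^\top,\quad
B_t = \frac{1}{t}\sum_{i=1}^{t} x_i\alpha_i^\top .
\end{aligned}
```

The constant does not depend on D, so only the k × k matrix $A_t$ and the p × k matrix $B_t$ need to be stored, whatever the number of samples seen so far.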
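In detail: online dictionary learning
Algorithm outline
1 Compute code (uses the full sample $x_t$, so complexity depends on p)
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$$
2 Update the surrogate function (complexity in O(p))
$$g_t(D) = \frac{1}{t} \sum_{i=1}^{t} \frac{1}{2} \|x_i - D\alpha_i\|_2^2 = \operatorname{Tr}\Big( \frac{1}{2} D^\top D A_t - D^\top B_t \Big) + \text{const}$$
$$A_t = \Big(1 - \frac{1}{t}\Big) A_{t-1} + \frac{1}{t} \alpha_t \alpha_t^\top, \qquad B_t = \Big(1 - \frac{1}{t}\Big) B_{t-1} + \frac{1}{t} x_t \alpha_t^\top$$
3 Minimize surrogate (complexity in O(p))
$$D_t = \operatorname*{argmin}_{D \in \mathcal{C}} g_t(D), \qquad \nabla g_t(D) = D A_t - B_t$$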
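The three steps can be sketched as follows. Several choices here are assumptions made for illustration only: the penalty Ω is taken to be the squared $\ell_2$ norm (so the code has a closed form), the constraint set is the unit $\ell_2$ ball for each column of D, and the surrogate is minimized approximately with one pass of block coordinate descent. This is not the reference implementation of [Mairal et al., 2010].

```python
import numpy as np


def ridge_code(x, D, lam):
    """argmin_a 0.5 * ||x - D a||^2 + lam * ||a||^2 (assumed penalty)."""
    k = D.shape[1]
    return np.linalg.solve(D.T @ D + 2.0 * lam * np.eye(k), D.T @ x)


def online_dictionary_learning(X, k, lam, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.normal(size=(p, k))
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # start inside the unit balls
    A = np.zeros((k, k))
    B = np.zeros((p, k))
    for t in range(1, n_iter + 1):
        x = X[:, rng.integers(n)]              # draw one sample from the stream
        alpha = ridge_code(x, D, lam)          # 1. compute code with D_{t-1}
        A += (np.outer(alpha, alpha) - A) / t  # 2. update surrogate statistics
        B += (np.outer(x, alpha) - B) / t
        for j in range(k):                     # 3. one pass over the columns of D
            if A[j, j] > 1e-12:
                D[:, j] -= (D @ A[:, j] - B[:, j]) / A[j, j]
                D[:, j] /= max(np.linalg.norm(D[:, j]), 1.0)  # project on unit ball
    return D


rng = np.random.default_rng(0)
X = rng.normal(size=(12, 200))
D = online_dictionary_learning(X, k=4, lam=0.1, n_iter=100)
```

Every step manipulates vectors or matrices with p rows, which is the O(p) cost per iteration noted in the outline.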
Specification for a new algorithm
[Figure: masked samples $M_t x_t$ streamed from the p × n matrix; unobserved entries are ignored]
Constrained: use only known ratings from Ω
Efficient: single iteration in O(s), where s is the number of ratings provided by user t
Principled: follows the online matrix factorization algorithm as much as possible
Missing values in practice
Data stream: $(x_t)_t$ → masked $(M_t x_t)_t$ = ratings from user t
Dimension: p (all items) → s (rated items)
Use only $M_t x_t$ in algorithm computation → complexity in O(s)
[Figure: masked samples $M_t x_t$ streamed from the p × n matrix; unobserved entries are ignored]
Adaptation to make
Modify all parts of the algorithm to obtain O(s) complexity:
1 Code computation
2 Surrogate update
3 Surrogate minimization
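Subsampled online dictionary learning
Check out the paper!
Original online MF
1 Code computation
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$$
2 Surrogate aggregation
$$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top, \qquad B_t = B_{t-1} + \frac{1}{t} \big( x_t \alpha_t^\top - B_{t-1} \big)$$
3 Surrogate minimization
$$D^j \leftarrow p^\perp_{\mathcal{C}^r_j} \Big( D^j - \frac{1}{(A_t)_{j,j}} \big( D A_t^j - B_t^j \big) \Big)$$
Our algorithm
1 Code computation: masked loss
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|M_t (x_t - D_{t-1}\alpha)\|_2^2 + \lambda \frac{\operatorname{rk} M_t}{p} \Omega(\alpha)$$
2 Surrogate aggregation
$$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top, \qquad B_t = B_{t-1} + \frac{1}{\sum_{i=1}^{t} M_i} \big( M_t x_t \alpha_t^\top - M_t B_{t-1} \big)$$
3 Surrogate minimization
$$M_t D^j \leftarrow p^\perp_{\mathcal{C}_j} \Big( M_t D^j - \frac{1}{(A_t)_{j,j}} M_t \big( D A_t^j - B_t^j \big) \Big)$$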
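The masked variant can be sketched so that one iteration only reads and writes the s rows of D and B that correspond to rated items. As assumptions for illustration: Ω is the squared $\ell_2$ norm, `counts` stores the diagonal of $\sum_i M_i$, and the projection is replaced by a simplified rule (the updated sub-column is rescaled so that its norm does not grow), which is a stand-in and not the exact constraint set of the paper. The function name `sodl_step` is hypothetical.

```python
import numpy as np


def sodl_step(x_obs, rows, D, A, B, counts, t, lam):
    """One iteration that only touches the s = len(rows) rated items."""
    p, k = D.shape
    s = len(rows)
    D_obs = D[rows]  # s x k block of the dictionary
    # 1. Code computation on the masked loss, penalty rescaled by s / p.
    alpha = np.linalg.solve(
        D_obs.T @ D_obs + 2.0 * lam * (s / p) * np.eye(k), D_obs.T @ x_obs
    )
    # 2. Surrogate aggregation: A is averaged over iterations, each row of B
    #    is averaged over the number of times that row was observed.
    A += (np.outer(alpha, alpha) - A) / t
    counts[rows] += 1
    B[rows] += (np.outer(x_obs, alpha) - B[rows]) / counts[rows][:, None]
    # 3. Surrogate minimization restricted to the observed rows.
    for j in range(k):
        if A[j, j] <= 1e-12:
            continue
        old_norm = np.linalg.norm(D[rows, j])
        new = D[rows, j] - (D[rows] @ A[:, j] - B[rows, j]) / A[j, j]
        new_norm = np.linalg.norm(new)
        if new_norm > old_norm:  # simplified projection (assumption)
            new *= old_norm / new_norm
        D[rows, j] = new
    return alpha


rng = np.random.default_rng(0)
p, k = 20, 3
D = rng.normal(size=(p, k))
D /= np.linalg.norm(D, axis=0)  # start with unit-norm columns
D_init = D.copy()
A, B, counts = np.zeros((k, k)), np.zeros((p, k)), np.zeros(p)
for t in range(1, 51):
    rows = rng.choice(10, size=4, replace=False)  # only items 0..9 are ever rated
    sodl_step(rng.normal(size=4), rows, D, A, B, counts, t, lam=0.1)
```

In this toy run the rows of D for items that are never rated stay untouched, and no operation has a cost that depends on p.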
Experiments
Validation: test RMSE (rating prediction) vs CPU time
Baseline: coordinate descent solver [Yu et al., 2012] for
$$\min_{D \in \mathbb{R}^{p \times k},\ A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \big( \|D\|_F^2 + \|A\|_F^2 \big)$$
Fastest solver available apart from SGD, which has hyperparameters to tune
Our method has a learning rate, but it has little influence
Datasets: MovieLens, Netflix
Publicly available
Larger ones exist in industry...
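The validation metric can be computed as in the sketch below, assuming the held-out ratings are given as a matrix `X_test` together with a boolean `mask_test` marking which entries are test ratings. The helper name and the toy values are illustrative.

```python
import numpy as np


def rmse_on_test_set(X_test, mask_test, D, A):
    """Root mean squared error of the predictions DA on held-out ratings."""
    errors = (X_test - D @ A)[mask_test]
    return float(np.sqrt(np.mean(errors ** 2)))


D = np.array([[1.0], [2.0]])
A = np.array([[1.0, 2.0]])
X_test = np.array([[2.0, 2.0], [2.0, 7.0]])
mask_test = np.array([[True, False], [False, True]])
score = rmse_on_test_set(X_test, mask_test, D, A)  # errors 1 and 3, RMSE = sqrt(5)
```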
Results
Scalable algorithm: speed-up improves with size
Performance
Dataset      Test RMSE        Convergence time    Speed-up
             CD      SODL     CD        SODL
ML 1M        0.872   0.866    6 s       8 s       ×0.75
ML 10M       0.802   0.799    223 s     60 s      ×3.7
NF (140M)    0.938   0.934    1714 s    256 s     ×6.8
Outperforms coordinate descent beyond 10M ratings
Same prediction performance
Speed-up of 6.8× on Netflix
Simple model: RMSE is not state-of-the-art
Robustness to learning rate
Learning rate in algorithm to be set in [0.75, 1] (← theory)
In practice: just set it in [0.8, 1]
[Figure: RMSE on test set vs. epoch for learning rates β from 0.75 to 1.00; panels: MovieLens 10M and Netflix]
Conclusion
Take-home message
Online matrix factorization can be adapted to handle missing values efficiently, with very good performance in recommender systems
[Figure: masked samples $M_t x_t$ streamed from the p × n matrix; unobserved entries are ignored]
Algorithm usable in any rich model involving matrix factorization
Python package: http://github.com/arthurmensch/modl
Article/slides at http://amensch.fr/publications
Questions?
Appendix: Resting-state fMRI
[Figure: spatial maps obtained in three settings]
Online dictionary learning: 235 h run time, 1 full epoch
Online dictionary learning: 10 h run time, 1/24 epoch
Proposed method: 10 h run time, 1/2 epoch, reduction r = 12
Qualitatively, usable maps are obtained 10× faster
Bibliography I
[Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007).
Lessons from the Netflix prize challenge.
ACM SIGKDD Explorations Newsletter, 9(2):75–79.
[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010).
Online learning for matrix factorization and sparse coding.
The Journal of Machine Learning Research, 11:19–60.
[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997).
Sparse coding with an overcomplete basis set: A strategy employed by V1?
Vision Research, 37(23):3311–3325.
[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012).
Scalable coordinate descent approaches to parallel matrix factorization for
recommender systems.
In Proceedings of the International Conference on Data Mining, pages
765–774. IEEE.
Arthur Mensch Dictionary Learning for Massive Matrix Factorization 28 / 28

More Related Content

What's hot (18)

PPT
Max Entropy
jianingy
 
PDF
Machine learning
Shreyas G S
 
PDF
Safe and Efficient Off-Policy Reinforcement Learning
mooopan
 
PDF
The Perceptron - Xavier Giro-i-Nieto - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
PDF
Additive model and boosting tree
Dong Guo
 
PDF
MS CS - Selecting Machine Learning Algorithm
Kaniska Mandal
 
PDF
Dual Learning for Machine Translation (NIPS 2016)
Toru Fujino
 
PDF
Learning to Reconstruct
Jonas Adler
 
PPTX
Intro to machine learning
Akshay Kanchan
 
PDF
Interaction Networks for Learning about Objects, Relations and Physics
Ken Kuroki
 
PDF
A short and naive introduction to using network in prediction models
tuxette
 
PDF
Machine learning
Andrea Iacono
 
PDF
Random Matrix Theory in Array Signal Processing: Application Examples
Förderverein Technische Fakultät
 
PDF
Beginners Guide to Non-Negative Matrix Factorization
Benjamin Bengfort
 
PDF
Estimating Space-Time Covariance from Finite Sample Sets
Förderverein Technische Fakultät
 
PDF
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Universitat Politècnica de Catalunya
 
PDF
Accelerating Random Forests in Scikit-Learn
Gilles Louppe
 
PDF
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Rikiya Takahashi
 
Max Entropy
jianingy
 
Machine learning
Shreyas G S
 
Safe and Efficient Off-Policy Reinforcement Learning
mooopan
 
The Perceptron - Xavier Giro-i-Nieto - UPC Barcelona 2018
Universitat Politècnica de Catalunya
 
Additive model and boosting tree
Dong Guo
 
MS CS - Selecting Machine Learning Algorithm
Kaniska Mandal
 
Dual Learning for Machine Translation (NIPS 2016)
Toru Fujino
 
Learning to Reconstruct
Jonas Adler
 
Intro to machine learning
Akshay Kanchan
 
Interaction Networks for Learning about Objects, Relations and Physics
Ken Kuroki
 
A short and naive introduction to using network in prediction models
tuxette
 
Machine learning
Andrea Iacono
 
Random Matrix Theory in Array Signal Processing: Application Examples
Förderverein Technische Fakultät
 
Beginners Guide to Non-Negative Matrix Factorization
Benjamin Bengfort
 
Estimating Space-Time Covariance from Finite Sample Sets
Förderverein Technische Fakultät
 
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Universitat Politècnica de Catalunya
 
Accelerating Random Forests in Scikit-Learn
Gilles Louppe
 
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Rikiya Takahashi
 

Viewers also liked (10)

PDF
CONTENT2VEC: a Joint Architecture to use Product Image and Text for the task ...
recsysfr
 
PDF
Predictive quality metrics @ tinyclues - Artem Kozhevnikov - Tinyclues
recsysfr
 
PDF
Sequential Learning in the Position-Based Model
recsysfr
 
PDF
Recommendation @ Meetic
recsysfr
 
PDF
What can bring library metadata to the web? Trust, links and love
recsysfr
 
PDF
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
recsysfr
 
PPTX
RecsysFR: Criteo presentation
recsysfr
 
PDF
Injecting semantic links into a graph-based recommender system
recsysfr
 
PDF
Pulpix - Video Recommendation at Scale
recsysfr
 
PDF
Highlights on most interesting RecSys papers - Elena Smirnova, Lowik Chanusso...
recsysfr
 
CONTENT2VEC: a Joint Architecture to use Product Image and Text for the task ...
recsysfr
 
Predictive quality metrics @ tinyclues - Artem Kozhevnikov - Tinyclues
recsysfr
 
Sequential Learning in the Position-Based Model
recsysfr
 
Recommendation @ Meetic
recsysfr
 
What can bring library metadata to the web? Trust, links and love
recsysfr
 
Meta-Prod2Vec: Simple Product Embeddings with Side-Information
recsysfr
 
RecsysFR: Criteo presentation
recsysfr
 
Injecting semantic links into a graph-based recommender system
recsysfr
 
Pulpix - Video Recommendation at Scale
recsysfr
 
Highlights on most interesting RecSys papers - Elena Smirnova, Lowik Chanusso...
recsysfr
 
Ad

Similar to Dictionary Learning for Massive Matrix Factorization (20)

PDF
Dictionary Learning for Massive Matrix Factorization
Arthur Mensch
 
PDF
Matrix Factorizations for Recommender Systems
Dmitriy Selivanov
 
PDF
Talk icml
Bo Li
 
PPT
Jörg Stelzer
butest
 
PDF
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
AIST
 
PPTX
Deep Learning for Search
Bhaskar Mitra
 
PPTX
Deep Learning for Search
Bhaskar Mitra
 
PDF
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...
Yuko Kuroki (黒木祐子)
 
PDF
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Jack Clark
 
PDF
Simple representations for learning: factorizations and similarities
Gael Varoquaux
 
PDF
Machine learning in science and industry — day 1
arogozhnikov
 
PDF
SURF 2012 Final Report(1)
Eric Zhang
 
PDF
ENBIS 2018 presentation on Deep k-Means
tthonet
 
PDF
slides-defense-jie
jie ren
 
PDF
Introduction to Big Data Science
Albert Bifet
 
PDF
Automatic Task-based Code Generation for High Performance DSEL
Joel Falcou
 
PPT
Machine Learning: Foundations Course Number 0368403401
butest
 
PDF
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
South Tyrol Free Software Conference
 
PDF
Review_Cibe Sridharan
Cibe Sridharan
 
PPT
AlgorithmAnalysis2.ppt
REMEGIUSPRAVEENSAHAY
 
Dictionary Learning for Massive Matrix Factorization
Arthur Mensch
 
Matrix Factorizations for Recommender Systems
Dmitriy Selivanov
 
Talk icml
Bo Li
 
Jörg Stelzer
butest
 
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
AIST
 
Deep Learning for Search
Bhaskar Mitra
 
Deep Learning for Search
Bhaskar Mitra
 
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...
Yuko Kuroki (黒木祐子)
 
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Jack Clark
 
Simple representations for learning: factorizations and similarities
Gael Varoquaux
 
Machine learning in science and industry — day 1
arogozhnikov
 
SURF 2012 Final Report(1)
Eric Zhang
 
ENBIS 2018 presentation on Deep k-Means
tthonet
 
slides-defense-jie
jie ren
 
Introduction to Big Data Science
Albert Bifet
 
Automatic Task-based Code Generation for High Performance DSEL
Joel Falcou
 
Machine Learning: Foundations Course Number 0368403401
butest
 
SFSCON23 - Emily Bourne Yaman Güçlü - Pyccel write Python code, get Fortran ...
South Tyrol Free Software Conference
 
Review_Cibe Sridharan
Cibe Sridharan
 
AlgorithmAnalysis2.ppt
REMEGIUSPRAVEENSAHAY
 
Ad

More from recsysfr (14)

PPTX
Multi Task DPP for Basket Completion by Romain WARLOP, Fifty Five
recsysfr
 
PDF
Building a recommender system with Annoy and Word2Vec by Cristian PEREZ, Kern...
recsysfr
 
PDF
An Homophily-based Approach for Fast Post Recommendation in Microblogging Sys...
recsysfr
 
PDF
Recommendations @ Rakuten Group
recsysfr
 
PPTX
Recommender systems
recsysfr
 
PDF
Recommendation @Deezer
recsysfr
 
PPTX
Flexible recommender systems based on graphs
recsysfr
 
PPTX
Using Neural Networks to predict user ratings
recsysfr
 
PDF
Preference Elicitation in Mangaki: Is Your Taste Kinda Weird?
recsysfr
 
PDF
Recommendation @ PriceMinister-Rakuten - Road to personalization
recsysfr
 
PDF
Rakuten Institute of Technology Paris
recsysfr
 
PDF
Tailor-made personalization and recommendation - Sailendra
recsysfr
 
PDF
New tools from the bandit literature to improve A/B Testing
recsysfr
 
PDF
Story of the algorithms behind Deezer Flow
recsysfr
 
Multi Task DPP for Basket Completion by Romain WARLOP, Fifty Five
recsysfr
 
Building a recommender system with Annoy and Word2Vec by Cristian PEREZ, Kern...
recsysfr
 
An Homophily-based Approach for Fast Post Recommendation in Microblogging Sys...
recsysfr
 
Recommendations @ Rakuten Group
recsysfr
 
Recommender systems
recsysfr
 
Recommendation @Deezer
recsysfr
 
Flexible recommender systems based on graphs
recsysfr
 
Using Neural Networks to predict user ratings
recsysfr
 
Preference Elicitation in Mangaki: Is Your Taste Kinda Weird?
recsysfr
 
Recommendation @ PriceMinister-Rakuten - Road to personalization
recsysfr
 
Rakuten Institute of Technology Paris
recsysfr
 
Tailor-made personalization and recommendation - Sailendra
recsysfr
 
New tools from the bandit literature to improve A/B Testing
recsysfr
 
Story of the algorithms behind Deezer Flow
recsysfr
 

Recently uploaded (20)

PPTX
Artificial-Intelligence-in-Daily-Life (2).pptx
nidhigoswami335
 
PPTX
Perkembangan Perangkat jaringan komputer dan telekomunikasi 3.pptx
Prayudha3
 
PPTX
Different Generation Of Computers .pptx
divcoder9507
 
PPTX
dns domain name system history work.pptx
MUHAMMADKAVISHSHABAN
 
PPTX
How tech helps people in the modern era.
upadhyayaryan154
 
PPTX
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
PPTX
Google SGE SEO: 5 Critical Changes That Could Wreck Your Rankings in 2025
Reversed Out Creative
 
PPTX
原版北不列颠哥伦比亚大学毕业证文凭UNBC成绩单2025年新版在线制作学位证书
e7nw4o4
 
PPT
Introduction to dns domain name syst.ppt
MUHAMMADKAVISHSHABAN
 
PDF
GEO Strategy 2025: Complete Presentation Deck for AI-Powered Customer Acquisi...
Zam Man
 
PPTX
The Internet of Things (IoT) refers to a vast network of interconnected devic...
chethana8182
 
PPTX
B2B_Ecommerce_Internship_Simranpreet.pptx
LipakshiJindal
 
PDF
Cybersecurity Awareness Presentation ppt.
banodhaharshita
 
PPTX
The Internet of Things (IoT) refers to a vast network of interconnected devic...
chethana8182
 
PDF
The Internet of Things (IoT) refers to a vast network of interconnected devic...
chethana8182
 
PDF
Latest Scam Shocking the USA in 2025.pdf
onlinescamreport4
 
PPTX
办理方法西班牙假毕业证蒙德拉贡大学成绩单MULetter文凭样本
xxxihn4u
 
PPTX
The Latest Scam Shocking the USA in 2025.pptx
onlinescamreport4
 
PDF
Data Protection & Resilience in Focus.pdf
AmyPoblete3
 
PDF
LOGENVIDAD DANNYFGRETRRTTRRRTRRRRRRRRR.pdf
juan456ytpro
 
Artificial-Intelligence-in-Daily-Life (2).pptx
nidhigoswami335
 
Perkembangan Perangkat jaringan komputer dan telekomunikasi 3.pptx
Prayudha3
 
Different Generation Of Computers .pptx
divcoder9507
 
dns domain name system history work.pptx
MUHAMMADKAVISHSHABAN
 
How tech helps people in the modern era.
upadhyayaryan154
 
Blue and Dark Blue Modern Technology Presentation.pptx
ap177979
 
Google SGE SEO: 5 Critical Changes That Could Wreck Your Rankings in 2025
Reversed Out Creative
 
原版北不列颠哥伦比亚大学毕业证文凭UNBC成绩单2025年新版在线制作学位证书
e7nw4o4
 
Introduction to dns domain name syst.ppt
MUHAMMADKAVISHSHABAN
 
GEO Strategy 2025: Complete Presentation Deck for AI-Powered Customer Acquisi...
Zam Man
 
The Internet of Things (IoT) refers to a vast network of interconnected devic...
chethana8182
 
B2B_Ecommerce_Internship_Simranpreet.pptx
LipakshiJindal
 
Cybersecurity Awareness Presentation ppt.
banodhaharshita
 
The Internet of Things (IoT) refers to a vast network of interconnected devic...
chethana8182
 
The Internet of Things (IoT) refers to a vast network of interconnected devic...
chethana8182
 
Latest Scam Shocking the USA in 2025.pdf
onlinescamreport4
 
办理方法西班牙假毕业证蒙德拉贡大学成绩单MULetter文凭样本
xxxihn4u
 
The Latest Scam Shocking the USA in 2025.pptx
onlinescamreport4
 
Data Protection & Resilience in Focus.pdf
AmyPoblete3
 
LOGENVIDAD DANNYFGRETRRTTRRRTRRRRRRRRR.pdf
juan456ytpro
 

Dictionary Learning for Massive Matrix Factorization

  • 1. Dictionary Learning for Massive Matrix Factorization Arthur Mensch, Julien Mairal Ga¨el Varoquaux, Bertrand Thirion Inria Parietal, Inria Thoth October 6, 2016
  • 2. Introduction Why am I here ? Inria Parietal: machine learning for neuro-imaging (fMRI data) Matrix factorization: major ingredient in fMRI analysis Very large datasets (2 TB): we designed faster algorithms These algorithms can be used in collaborative filtering D AX Voxels Time = k spatial maps Time x 1 Work presented at ICML 2016 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 28
  • 3. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 2 / 28
  • 4. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 3 / 28
  • 5. Collaborative filtering Collaborative platform n users rate a fraction of p items e.g movies, restaurants Estimate ratings for recommendation Use the ratings of other users for recommendation Arthur Mensch Dictionary Learning for Massive Matrix Factorization 4 / 28
  • 6. How to predict ratings ? Credit: [Bell and Koren, 2007] Joe like We were soldiers, Black Hawk down. Bob and Alice like the same films, and also like Saving private Ryan. Joe should watch Saving private Ryan, because all of them indeed likes war films. Need to uncover topics in items Arthur Mensch Dictionary Learning for Massive Matrix Factorization 5 / 28
  • 7. Predicting rate with scalar products Embeddings to model the existence of genre/category/topics Representative vectors for users and items: (αj )1≤j≤n, (di )1≤i≤p ∈ Rk q-th coefficient of di , αj = affinity with the “topic” q xij αj di k topics di αj 1 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
  • 8. Predicting rate with scalar products Embeddings to model the existence of genre/category/topics Representative vectors for users and items: (αj )1≤j≤n, (di )1≤i≤p ∈ Rk q-th coefficient of di , αj = affinity with the “topic” q Ratings xij (item i, user j): xij = di αj ( + biases) = Common affinity for topics xij αj di k topics di αj 1 Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
  • 9. Predicting rate with scalar products Embeddings to model the existence of genre/category/topics Representative vectors for users and items: (αj )1≤j≤n, (di )1≤i≤p ∈ Rk q-th coefficient of di , αj = affinity with the “topic” q Ratings xij (item i, user j): xij = di αj ( + biases) = Common affinity for topics xij αj di k topics di αj 1 Learning problem: estimate D and A with known ratings Arthur Mensch Dictionary Learning for Massive Matrix Factorization 6 / 28
  • 10. Matrix factorization 1Ω X AD p n k n = 1 X ∈ Rp×n ≈ DA ∈ Rp×k × Rk×n Constraints / penalty on factors D and A We only observe 1Ω X — Ω set of ratings provided by users Recommender systems : millions of users, millions of items How to scale matrix factorization to very large datasets ? Arthur Mensch Dictionary Learning for Massive Matrix Factorization 7 / 28
  • 11. Formalism Finding representation in Rk for items and users: min D∈Rp×k A∈Rk×n (i,j)∈Ω (xij − di αj )2 + λ( D 2 F + A 2 F ) = 1Ω (X − DA) 2 2 + λ( D 2 F + A 2 F ) 1Ω set of knownratings 2 reconstruction loss — 2 penalty for generalization Existing methods Alternated minimization Stochastic gradient descent Arthur Mensch Dictionary Learning for Massive Matrix Factorization 8 / 28
  • 12. Existing methods Alternated minimization Minimize over A, D: alternate between D = min D∈Rp×k (i,j)∈Ω (xij − di αj )2 + λ D 2 F A = min A∈Rk×n (i,j)∈Ω (xij − di αj )2 + λ A 2 F No hyperparameters Slow and memory expensive: use all ratings at each iteration a.k.a. coordinate descent (variation in parameter update order) Arthur Mensch Dictionary Learning for Massive Matrix Factorization 9 / 28
  • 13. Existing methods Stochastic gradient descent min A,D (i,j)∈Ω fij (A, B) def = (xij − di αj )2 + 1 cj λ αj 2 2 + 1 ci λ di 2 2 Gradient step for each rating: (At, Dt) ← (At−1, Dt−1)− 1 ct (A,D)fij (At−1, Dt−1) Fast and memory efficient – won the Netflix prize Very sensitive to step sizes (ct) – need to cross-validate Arthur Mensch Dictionary Learning for Massive Matrix Factorization 10 / 28
  • 14. Towards a new algorithm Best of both worlds ? Fast and memory efficient algorithm Little sensitive to hyperparameter setting Subsampled online dictionary learning Builds upon the online dictionary learning algorithm popular in computer vision and interpretable learning (fMRI) Adapt it to handle missing values efficiently Arthur Mensch Dictionary Learning for Massive Matrix Factorization 11 / 28
  • 15. D´eroul´e 1 Matrix factorization for recommender systems Collaborative filtering Matrix factorization formulation Existing methods 2 Subsampled online dictionary learning Dictionary learning – existing methods Handling missing values efficiently New algorithm 3 Results Setting Benchmarks Parameter setting Arthur Mensch Dictionary Learning for Massive Matrix Factorization 12 / 28
  • 16. Dictionary learning Recall: recommender system formalism Non-masked matrix factorization with 2 penalty: min D∈Rp×k A∈Rk×n n j=1 (xj − D αj )2 + λ( D 2 F + A 2 F ) Penalties can be richer, and made into constraints Dictionary learning Learn the left side factor [Olshausen and Field, 1997] min D∈C n j=1 xj −Dαj 2 2 +λΩ(αj ) αj = argmin α∈Rk xi −Dα 2 2 +λΩ(α) Naive approach: alternated minimization Arthur Mensch Dictionary Learning for Massive Matrix Factorization 13 / 28
  • 17. Online dictionary learning [Mairal et al., 2010] At iteration t, select xt in {xj }j (user ratings), improve D Single iteration complexity ∝ sample dimension O(p) (Dt)t converges in a few epochs (one for large n) xt αtD p n k n =Stream 1 Very efficient in computer vision / networks / fMRI / hyperspectral images Can we use it efficiently for recommender systems ? Arthur Mensch Dictionary Learning for Massive Matrix Factorization 14 / 28
  • 18. In short: Handling missing values X p n xt Steam Handle large n n Handle missing values Online → online + partial Batch → online Mtxt Stream Ignore Unknown Unaccessed 1 Leverage streaming + partial access to samples Arthur Mensch Dictionary Learning for Massive Matrix Factorization 15 / 28
  • 19. In detail: online dictionary learning Objective function involves latent codes (right side factor) min D∈C 1 t t i=1 xi − Dα∗ i (D) 2 2, α∗ i (D) = argmin α 1 2 xi − Dα 2 2 + λΩ(α) Replace latent codes by codes computed with old dictionaries Build an upper-bounding surrogate function min 1 t t i=1 xi −Dαi 2 2 αi = argmin α 1 2 xi −Di−1α 2 2+λΩ(α) Minimize surrogate — updateable online at low cost Arthur Mensch Dictionary Learning for Massive Matrix Factorization 16 / 28
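  • 21. In detail: online dictionary learning. Algorithm outline:
    1. Compute the code (complexity depends on $p$ through $x_t$):
       $\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$.
    2. Update the surrogate function (complexity in $O(p)$):
       $g_t(D) = \frac{1}{t} \sum_{i=1}^{t} \|x_i - D\alpha_i\|_2^2$, which equals $\operatorname{Tr}\left(\frac{1}{2} D^\top D A_t - D^\top B_t\right)$ up to a positive factor and an additive constant independent of $D$, with
       $A_t = \left(1 - \frac{1}{t}\right) A_{t-1} + \frac{1}{t}\,\alpha_t\alpha_t^\top$ and $B_t = \left(1 - \frac{1}{t}\right) B_{t-1} + \frac{1}{t}\,x_t\alpha_t^\top$.
    3. Minimize the surrogate (complexity in $O(p)$):
       $D_t = \operatorname{argmin}_{D \in \mathcal{C}} g_t(D)$, using the gradient $\nabla g_t(D) = D A_t - B_t$.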
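The three steps of the outline can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a ridge penalty $\Omega(\alpha) = \|\alpha\|_2^2$ so that the code has a closed form, takes $\mathcal{C}$ to be the set of dictionaries whose columns lie in the unit $\ell_2$ ball, and minimizes the surrogate with a single pass of block coordinate descent per sample. All names are hypothetical.

```python
import numpy as np


def online_dictionary_learning(X, k, lam=0.1, n_epochs=2, seed=0):
    """Stream the columns of X and update the dictionary after each sample."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)  # start inside the constraint set
    A = np.zeros((k, k))            # running average of alpha alpha^T
    B = np.zeros((p, k))            # running average of x alpha^T
    t = 0
    for _ in range(n_epochs):
        for j in rng.permutation(n):
            t += 1
            x = X[:, j]
            # 1. Code computed with the previous dictionary (ridge, closed form).
            alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
            # 2. Surrogate update: A_t and B_t as running averages.
            A += (np.outer(alpha, alpha) - A) / t
            B += (np.outer(x, alpha) - B) / t
            # 3. Surrogate minimization: one block coordinate descent pass.
            for c in range(k):
                if A[c, c] < 1e-12:
                    continue  # column not used so far, leave it unchanged
                D[:, c] -= (D @ A[:, c] - B[:, c]) / A[c, c]
                D[:, c] /= max(1.0, np.linalg.norm(D[:, c]))  # project on unit ball
    return D
```

Steps 2 and 3 both manipulate arrays with $p$ rows, which is the per-iteration cost in $p$ that the slide points out.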
  • 22. Specification for a new algorithm.
    Constrained: use only the known ratings, i.e. the observed entries $\Omega$.
    Efficient: a single iteration in $O(s)$, where $s$ is the number of ratings provided by user $t$.
    Principled: follows the online matrix factorization algorithm as closely as possible.
    [Figure: stream of masked samples $M_t x_t$ from the $p \times n$ matrix; unrated entries are ignored.]
  • 24. Missing values in practice.
    Data stream: $(x_t)_t$ → masked $(M_t x_t)_t$, the ratings from user $t$.
    Dimension: $p$ (all items) → $s$ (rated items).
    Using only $M_t x_t$ in the algorithm's computations brings the complexity to $O(s)$.
    [Figure: stream of masked samples $M_t x_t$ from the $p \times n$ matrix; unrated entries are ignored.]
    Adaptation to make: modify all parts of the algorithm to obtain $O(s)$ complexity:
    1. Code computation
    2. Surrogate update
    3. Surrogate minimization
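  • 25. Subsampled online dictionary learning (see the paper for details).
    Original online MF:
    1. Code computation: $\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$
    2. Surrogate aggregation: $A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i\alpha_i^\top$, $B_t = B_{t-1} + \frac{1}{t}\left(x_t\alpha_t^\top - B_{t-1}\right)$
    3. Surrogate minimization: $D_j \leftarrow p^{\perp}_{\mathcal{C}^r_j}\left(D_j - \frac{1}{(A_t)_{j,j}}\left(D A_t^j - B_t^j\right)\right)$
    Our algorithm:
    1. Code computation with a masked loss: $\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|M_t(x_t - D_{t-1}\alpha)\|_2^2 + \lambda\,\frac{\operatorname{rk} M_t}{p}\,\Omega(\alpha)$
    2. Surrogate aggregation: $A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i\alpha_i^\top$, $B_t = B_{t-1} + \left(\sum_{i=1}^{t} M_i\right)^{-1}\left(M_t x_t\alpha_t^\top - M_t B_{t-1}\right)$
    3. Surrogate minimization: $M_t D_j \leftarrow p^{\perp}_{\mathcal{C}_j}\left(M_t D_j - \frac{1}{(A_t)_{j,j}}\,M_t\left(D A_t^j - B_t^j\right)\right)$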
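To make the masked steps concrete, here is a small sketch of the code computation and of the surrogate aggregation restricted to rated items. It rests on assumptions that are not stated in the slides: a ridge penalty $\Omega(\alpha) = \|\alpha\|_2^2$, the mask $M_t$ represented by an index array of rated items, and a reading of the $B_t$ update in which each row is averaged over the number of times that item has been observed. Function names are hypothetical.

```python
import numpy as np


def masked_code(D, ratings, rated_idx, lam=0.1):
    """Ridge code for one user, computed from the rated rows of D only."""
    p, k = D.shape
    D_s = D[rated_idx]                  # (s, k): rows of the rated items
    scale = lam * len(rated_idx) / p    # penalty rescaled by rk(M_t) / p
    return np.linalg.solve(D_s.T @ D_s + scale * np.eye(k), D_s.T @ ratings)


def masked_surrogate_update(A, B, counts, alpha, ratings, rated_idx, t):
    """Update A, B and the per-item counts in place; only rated rows of B change."""
    A += (np.outer(alpha, alpha) - A) / t
    counts[rated_idx] += 1
    step = np.outer(ratings, alpha) - B[rated_idx]
    B[rated_idx] += step / counts[rated_idx][:, None]
```

Both functions only read or write the $s$ rated rows of the $p$-row arrays, so their cost scales with $s$ rather than $p$ (plus terms in $k$ that do not depend on the number of items).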
  • 26. Outline.
    1. Matrix factorization for recommender systems: collaborative filtering; matrix factorization formulation; existing methods.
    2. Subsampled online dictionary learning: dictionary learning (existing methods); handling missing values efficiently; new algorithm.
    3. Results: setting; benchmarks; parameter setting.
  • 27. Experiments.
    Validation: test RMSE (rating prediction) versus CPU time.
    Baseline: coordinate descent solver [Yu et al., 2012] for
    $\min_{D \in \mathbb{R}^{p \times k},\, A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i\alpha_j)^2 + \lambda \left( \|D\|_F^2 + \|A\|_F^2 \right)$, where $d_i$ is the $i$-th row of $D$.
    It is the fastest solver available apart from SGD, which has more hyperparameters to tune.
    Our method has a learning rate, but it has little influence.
    Datasets: MovieLens and Netflix, both publicly available; larger ones exist in industry.
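As a minimal sketch of the validation metric, assuming the factors are stored as NumPy arrays and the held-out ratings are given as parallel index arrays (the function and argument names are illustrative and are not taken from the slides or the modl package):

```python
import numpy as np


def heldout_rmse(D, A, items, users, ratings):
    """RMSE of the predictions d_i . alpha_j on held-out (item, user) pairs.

    D has shape (p, k), one row per item; A has shape (k, n), one column per user.
    """
    # Row-wise dot products between the selected item rows and user columns.
    predictions = np.einsum("mk,km->m", D[items], A[:, users])
    return float(np.sqrt(np.mean((predictions - ratings) ** 2)))
```

Only the test pairs are evaluated, so the cost is proportional to the number of held-out ratings times $k$.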
  • 28. Results. Scalable algorithm: speed-up improves with size.
  • 29. Performance (CD: coordinate descent; SODL: subsampled online dictionary learning).
    Dataset      Test RMSE         Convergence time    Speed-up
                 CD       SODL     CD        SODL
    ML 1M        0.872    0.866    6 s       8 s       ×0.75
    ML 10M       0.802    0.799    223 s     60 s      ×3.7
    NF (140M)    0.938    0.934    1714 s    256 s     ×6.8
    Outperforms coordinate descent beyond 10M ratings.
    Same prediction performance.
    Speed-up of 6.8× on Netflix.
    Simple model: the RMSE is not state of the art.
  • 30. Robustness to learning rate.
    Theory requires the learning rate $\beta$ of the algorithm to be set in $[0.75, 1]$.
    In practice, just set it in $[0.8, 1]$.
    [Figure: RMSE on the test set versus epoch, for learning rates $\beta$ from 0.75 to 1.00; panels: MovieLens 10M and Netflix.]
  • 32. Conclusion.
    Take-home message: online matrix factorization can be adapted to handle missing values efficiently, with very good performance in recommender systems.
    The algorithm is usable in any rich model involving matrix factorization.
    Python package: http://github.com/arthurmensch/modl
    Article and slides: http://amensch.fr/publications
    Questions?
  • 33. Appendix: resting-state fMRI.
    Online dictionary learning: 235 h run time for 1 full epoch; 10 h run time for 1/24 epoch.
    Proposed method: 10 h run time for 1/2 epoch, with reduction $r = 12$.
    Qualitatively, usable maps are obtained 10× faster.
  • 34. Bibliography
    [Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007). Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79.
    [Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60.
    [Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325.
    [Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012). Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774. IEEE.