August 5, 2013
ML ♡ Hadoop @ Spotify
If it’s slow, buy more racks
I’m Erik Bernhardsson
Master’s in Physics from KTH in Stockholm
Started at Spotify in 2008, managed the Analytics team for two years
Moved to NYC in 2011, now the Engineering Manager of the Discovery team at Spotify in NYC
What’s Spotify? What are the challenges?
Started in 2006
Currently has 24 million users
6 million paying users
Available in 20 countries
About 300 engineers, of which 70 are in NYC
And adding 20K every day...
Big challenge: Spotify has over 20 million tracks
Good and bad news: we also have 100B streams
Let’s use collaborative filtering!
“Hey, I like tracks P, Q, R, S!”
“Well, I like tracks Q, R, S, T!”
“Then you should check out track P!”
“Nice! Btw try track T!”
Hadoop at Spotify
Back in 2009
Matrix factorization causing the cluster to overheat? Don’t worry, put up a curtain.
Hadoop today
700 nodes at our data center in London
The Discover page
Here’s a secret behind the Discover page
It’s precomputed every night.
[Diagram: Log streams → HADOOP → Music recs → hdfs2cass → Cassandra → Bartender]
https://github.com/spotify/luigi
https://github.com/spotify/hdfs2cass
OK so how do we come up with recommendations?
Let’s do collaborative filtering!
In particular, implicit collaborative filtering
In particular, matrix factorization (aka latent factor methods)
Stop!!!
Break it down!!
Step 1: Collect data
[Diagram: access points (AP) log “play track x/y/z” events into Hadoop at about 5k tracks/s, >100B streams in total]
Step 2: Put everything into a big sparse matrix
A very big matrix too:

    M = ( c_11  c_12  ...  c_1n )
        ( c_21  c_22  ...  c_2n )
        ( ...              ...  )
        ( c_m1  c_m2  ...  c_mn )

with roughly 10^7 users (rows) and 10^7 items (columns).
Matrix example
Roughly 25 billion nonzero entries.
Total size is roughly 25 billion * 12 bytes = 300 GB (“medium data”).
[Example entry: row “Erik”, column “Never gonna give you up” = 1, i.e. Erik listened to Never gonna give you up 1 time.]
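As an illustration (not Spotify’s actual code), a matrix like this can be sketched with scipy’s sparse types; the IDs and counts here are made up:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy log: (user_id, item_id, play_count) tuples.
logs = [(0, 0, 7), (0, 2, 1), (1, 2, 3), (2, 1, 5)]
users, items, counts = zip(*logs)

# Sparse storage is essential: a dense 10^7 x 10^7 matrix would never fit,
# but ~25B nonzeros at ~12 bytes each is "only" ~300 GB.
M = coo_matrix((counts, (users, items)), shape=(3, 3)).tocsr()
```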
Step 3: Matrix factorization
Idea is to find vectors for each user and item.
Here’s how it looks algebraically: the idea with matrix factorization is to represent this distribution like this:

    p_ui = a_u^T b_i,    i.e.    M' = A^T B

where the big m × n matrix P = (p_ui) is approximated by the product of an f × m matrix A of user vectors and an f × n matrix B of item vectors (f latent factors).
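A toy numpy sketch of what the factorization means (dimensions shrunk way down; at Spotify scale f is around 40 and the matrix is 10^7 × 10^7):

```python
import numpy as np

rng = np.random.default_rng(0)
f, n_users, n_items = 2, 3, 4   # tiny demo sizes, made up for illustration

A = rng.normal(size=(f, n_users))   # one user vector per column
B = rng.normal(size=(f, n_items))   # one item vector per column

# The predicted affinity p_ui is just a dot product of two small vectors:
u, i = 1, 2
p_ui = A[:, u] @ B[:, i]

# The whole reconstructed matrix is the low-rank product A^T B:
M_hat = A.T @ B
```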
For instance, for PLSA
Probabilistic Latent Semantic Indexing (Hofmann, 1999)
Invented as a method intended for text classification.
PLSA factors the co-occurrence matrix as

    P(u, i) = Σ_z P(u | z) P(i, z)

i.e. the big matrix of P(u, i) values is approximated by the product of a tall matrix of user vectors (entries P(u | z)) and a wide matrix of item vectors (entries P(i, z)).
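A quick numpy sanity check (made-up parameters, not a trained model) that the PLSA decomposition really defines a joint distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, n_z = 3, 4, 2   # toy sizes, made up for illustration

# Random parameters with the right normalizations:
P_u_given_z = rng.dirichlet(np.ones(n_users), size=n_z).T       # each column sums to 1
P_iz = rng.dirichlet(np.ones(n_items * n_z)).reshape(n_items, n_z)  # sums to 1 overall

# P(u, i) = sum_z P(u|z) P(i, z), as a matrix product:
P_ui = P_u_given_z @ P_iz.T
```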
Why are vectors nice?
Super small fingerprints of the musical style or the user’s taste.
Usually something like 40-200 elements.
Hard to illustrate 40 dimensions in a 2-dimensional slide, but here’s an attempt:

    Track X: 0.87  1.17  -0.26  0.56  2.21  0.77  -0.03

[Figure: track x’s vector plotted against latent factors 1 and 2]
Another example of tracks in two dimensions
Implementing matrix factorization is a little tricky
Iterative algorithms that take many steps to converge.
40 parameters for each item and user, so something like 1.2 billion parameters.
See “Google News Personalization: Scalable Online Collaborative Filtering”.
One iteration, one map/reduce job
[Diagram: all log entries fan out to a K × L grid of map shards, one per (u % K, i % L) pair; the reduce step gathers the results back into user vectors (one bucket per u % K) and item vectors (one bucket per i % L).]
Here’s what happens in one map shard
Input is a bunch of (user, item, count) tuples:
user is the same modulo K for all users (u % K = x), and item is the same modulo L for all items (i % L = y).
[Diagram: the mapper reads all user vectors with u % K = x and all item vectors with i % L = y from the distributed cache, emits contributions, and the reducer combines the contributions into a new vector.]
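A minimal sketch of the sharding scheme, with made-up K, L and log tuples:

```python
from collections import defaultdict

K, L = 4, 3  # number of user and item shards (made-up values)

# Hypothetical log tuples (user_id, item_id, count):
logs = [(0, 0, 7), (5, 4, 1), (9, 7, 3), (4, 3, 2)]

# Every tuple lands in the shard keyed by (u % K, i % L), so each
# mapper only needs 1/K of the user vectors and 1/L of the item vectors.
shards = defaultdict(list)
for u, i, count in logs:
    shards[(u % K, i % L)].append((u, i, count))
```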
Might take a while to converge
Start with random vectors around the origin.
Hadoop?
Yeah, we could probably do it in Spark 10x or 100x faster.
Still, Hadoop is a great way to scale things horizontally.
Nice compact vectors and it’s super fast to compute similarity
[Figure: tracks x and y close together in latent-factor space; cos(x, y) = HIGH]

Recall p_ui = a_u^T b_i. Item similarity is

    sim_ij = cos(b_i, b_j) = b_i^T b_j / (|b_i| |b_j|)

computed in O(f) time.

IPMF item-item:

    P(i → j) = exp(b_j^T b_i) / Z_i = exp(b_j^T b_i) / Σ_k exp(b_k^T b_i)

Example similarities:

    i                       j                       sim_ij
    2pac                    2pac                    1.0
    2pac                    Notorious B.I.G.        0.91
    2pac                    Dr. Dre                 0.87
    2pac                    Florence + the Machine  0.26
    Florence + the Machine  Lana Del Rey            0.81

IPMF item-item MDS:

    P(i → j) = exp(-|b_j - b_i|^2) / Σ_k exp(-|b_k - b_i|^2)
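The O(f) cosine similarity is a one-liner; a small sketch with made-up 3-factor item vectors:

```python
import numpy as np

def cosine(bi, bj):
    # O(f) in the number of latent factors f
    return bi @ bj / (np.linalg.norm(bi) * np.linalg.norm(bj))

# Made-up item vectors for two similar-sounding tracks:
b_x = np.array([1.0, 2.0, 0.5])
b_y = np.array([1.1, 1.9, 0.4])
```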
Music recommendations are now just dot products
[Figure: user u’s vector next to tracks x and y in latent-factor space; score each track i as a_u^T b_i]
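Scoring every track for a user is then a single matrix-vector product; a toy sketch with random vectors (the sizes are made up, but f = 40 matches the deck):

```python
import numpy as np

rng = np.random.default_rng(2)
f, n_items = 40, 1000
a_u = rng.normal(size=f)            # user u's vector
B = rng.normal(size=(n_items, f))   # one row per track

scores = B @ a_u                    # one dot product per track
top10 = np.argsort(-scores)[:10]    # highest-scoring tracks first
```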
It’s still tricky to search for similar tracks though
We have many million tracks and you don’t want to compute cosine for all pairs
Approximate nearest neighbors to the rescue!
Cut the space recursively by random planes.
If two points are close, they are more likely to end up on the same side of each plane.
https://github.com/spotify/annoy
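This isn’t annoy itself, just the core random-hyperplane idea in a few lines of numpy: one random plane through the origin hashes points into two buckets, and nearby points tend to share a bucket.

```python
import numpy as np

rng = np.random.default_rng(3)
f = 8
normal = rng.normal(size=f)  # one random hyperplane through the origin

# A point well away from the plane, and a tiny perturbation of it:
p = normal / np.linalg.norm(normal)
q = p + 0.01 * rng.normal(size=f)

# Close points almost always end up on the same side of the plane:
same_side = (p @ normal > 0) == (q @ normal > 0)
```

Annoy builds a whole forest of trees of such splits; this is just a single split to show the idea.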
How do you retrain the model?
It takes a long time to train a full factorization model.
We want to update user vectors much more frequently (at least daily!)
However, item vectors are fairly stable.
Throw away user vectors and recreate them from scratch!
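The deck doesn’t spell out the solver, but one common way to recreate a user vector from scratch with the item vectors held fixed is a small regularized least-squares solve (ALS-style); a sketch with simulated data, where the regularization value is made up:

```python
import numpy as np

rng = np.random.default_rng(4)
f, n_items = 5, 50
lam = 0.1  # regularization strength (made-up value)

B = rng.normal(size=(n_items, f))     # item vectors: stable, kept
m_u = rng.poisson(1.0, size=n_items)  # user u's play counts (simulated)

# Rebuild a_u from scratch with items fixed: one small regularized
# least-squares solve per user, embarrassingly parallel across users.
a_u = np.linalg.solve(B.T @ B + lam * np.eye(f), B.T @ m_u)
```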
The pipeline
“Hack” to recalculate user vectors more frequently.
Is this a little complicated? Yeah, probably.
[Diagram: a full matrix factorization runs each month (May 2013 logs, then June 2013 logs + more logs), producing item vectors and user vectors; in between, user vectors are seeded from the latest factorization and updated again and again as more logs arrive: user vectors (1) → (2) → (3) → (4) → (5) over time.]
Ideal case
Put all vectors in Cassandra/Memcached, use Storm to update in real time
But Hadoop is pretty nice at parallelizing recommendations
24 cores but not a lot of RAM? mmap is your friend.
[Diagram: one map/reduce job; on each machine the mappers (M) share an mmap’d ANN index of all vectors, user vectors arrive via the distributed cache (DC), and recs come out.]
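A sketch of the mmap trick with numpy’s memmap (file name and sizes are made up): every mapper process on the box maps the same read-only file, so the OS page cache keeps one physical copy in RAM no matter how many processes read it.

```python
import numpy as np
import os, tempfile

f, n_items = 40, 10_000  # made-up sizes for the demo

# Write the item vectors to local disk once per machine...
path = os.path.join(tempfile.mkdtemp(), "item_vectors.f32")
np.random.default_rng(5).normal(size=(n_items, f)).astype(np.float32).tofile(path)

# ...then each mapper process mmaps the same file read-only:
B = np.memmap(path, dtype=np.float32, mode="r", shape=(n_items, f))

a_u = np.ones(f, dtype=np.float32)  # some user's vector
scores = B @ a_u                    # works directly on the mapped array
```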
Music recommendations!
Our latest baby, the Discover page, featuring lots of different types of recommendations.
Expect this to change quite a lot in the next few months!
More music recommendations!
Radio!
More music recommendations!
Related artists
Thanks!
Btw, we’re hiring Machine Learning Engineers
and Data Engineers!
Email me at erikbern@spotify.com!