Incremental Item-based Collaborative Filtering

This deck presents incremental item-based collaborative filtering: how to update the item similarity matrix session by session instead of rebuilding it, and how to cope with changing data through two forgetting strategies, sliding windows and fading factors. Experiments on two datasets compare item-based and user-based recommendation. Preliminary results: incremental updates are faster than full rebuilds; item-based filtering is faster at recommendation time and is favoured when users outnumber items; on the PALCO dataset, which has more items than users, user-based filtering performs better.

João Marques da Silva
Palco Workshop - May 13, 2009

Item Similarity

        Clã  Xutos  Gift  DaWeasel
Ana      1     1     0       0
Miguel   1     1     1       0
Ivo      0     1     0       1
Paula    0     0     1       0
Joana    1     0     0       0

Take columns as vectors: $v_{\text{Clã}} = (1, 1, 0, 0, 1)$ and $v_{\text{Gift}} = (0, 1, 0, 1, 0)$

Similarity between Clã and Gift (cosine measure):

$$\mathrm{sim}(\text{Clã}, \text{Gift}) = \cos(v_{\text{Clã}}, v_{\text{Gift}}) = \frac{v_{\text{Clã}} \cdot v_{\text{Gift}}}{\lVert v_{\text{Clã}} \rVert \, \lVert v_{\text{Gift}} \rVert} = \frac{1}{\sqrt{3}\,\sqrt{2}} \simeq 0.41$$

Similarity Matrix

S matrix: M×M, with M = number of items

          Clã   Xutos  Gift  DaWeasel
Clã       1
Xutos     ...   1
Gift      0.41  ...    1
DaWeasel  0     ...    0     1

How do we keep S up-to-date?

• Rebuild S at each new session: $O(m^2 n)$ for m items and n users.
• Incrementally update S with session data: $O(km)$ for k items in session.
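
The full rebuild can be sketched as follows. This is a minimal illustration, not the deck's R code: the function name `rebuild_similarity` and the dictionary layout are assumptions, and the data is the example table from the Item Similarity slide. The nested loops over item pairs, each scanning all users, are where the $O(m^2 n)$ cost comes from. For the example data, the Clã/Gift entry evaluates to $1/\sqrt{6} \approx 0.41$.

```python
import math

# Example ratings from the Item Similarity table (1 = user has the item).
ITEMS = ["Clã", "Xutos", "Gift", "DaWeasel"]
RATINGS = {
    "Ana":    [1, 1, 0, 0],
    "Miguel": [1, 1, 1, 0],
    "Ivo":    [0, 1, 0, 1],
    "Paula":  [0, 0, 1, 0],
    "Joana":  [1, 0, 0, 0],
}


def rebuild_similarity(ratings, n_items):
    """Rebuild the whole item-item cosine matrix from scratch."""
    rows = list(ratings.values())
    # Norm of each item column.
    norms = [math.sqrt(sum(r[i] ** 2 for r in rows)) for i in range(n_items)]
    S = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):          # m items
        for j in range(n_items):      # times m items
            dot = sum(r[i] * r[j] for r in rows)  # times n users
            if norms[i] > 0 and norms[j] > 0:
                S[i][j] = dot / (norms[i] * norms[j])
    return S


S = rebuild_similarity(RATINGS, len(ITEMS))
for name, row in zip(ITEMS, S):
    print(name, [round(x, 2) for x in row])
```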

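Algorithm

Cosine measure for binary ratings:

$$\cos(i, j) = \frac{\#(I \cap J)}{\sqrt{\#I} \times \sqrt{\#J}}$$

where I, J are the sets of users that rated items i, j.

A cache matrix Int stores #(I ∩ J) for all item pairs (i, j):

$$\mathrm{Int}_{i,j} = \#(I \cap J), \qquad \mathrm{Int}_{i,i} = \#I$$

For each new session:
• Increment $\mathrm{Int}_{i,j}$ by 1 for each item pair (i, j) in the session (including i = j, so that $\mathrm{Int}_{i,i} = \#I$ stays current)
• For each item i in the session, update the corresponding row/column in S:

$$S_{i,\cdot} = \frac{\mathrm{Int}_{i,\cdot}}{\sqrt{\mathrm{Int}_{i,i}} \times \sqrt{\mathrm{Int}_{\cdot,\cdot}}}$$

that is, $S_{i,j} = \mathrm{Int}_{i,j} / (\sqrt{\mathrm{Int}_{i,i}} \times \sqrt{\mathrm{Int}_{j,j}})$ for every item j.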
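
The two steps above can be sketched in a few lines. This is a hedged illustration rather than the original R implementation: the class name `IncrementalItemCF`, the nested-dictionary storage, and the treatment of each table row as one session are all assumptions. Only the rows of the k session items are refreshed, each against at most m neighbours, which matches the $O(km)$ update cost.

```python
import math
from collections import defaultdict
from itertools import combinations


class IncrementalItemCF:
    """Keeps the co-occurrence cache Int and the similarity matrix S in sync."""

    def __init__(self):
        # Int[i][j] = #(I ∩ J); Int[i][i] = #I
        self.Int = defaultdict(lambda: defaultdict(int))
        self.S = defaultdict(dict)

    def update(self, session):
        items = set(session)
        # Step 1: increment the cache for every item and item pair in the session.
        for i in items:
            self.Int[i][i] += 1
        for i, j in combinations(items, 2):
            self.Int[i][j] += 1
            self.Int[j][i] += 1
        # Step 2: refresh the row (and, by symmetry, the column) of each session item.
        for i in items:
            for j, count in self.Int[i].items():
                sim = count / (math.sqrt(self.Int[i][i]) * math.sqrt(self.Int[j][j]))
                self.S[i][j] = sim
                self.S[j][i] = sim


model = IncrementalItemCF()
# Each row of the example table is treated as one session (an assumption).
for session in (["Clã", "Xutos"], ["Clã", "Xutos", "Gift"],
                ["Xutos", "DaWeasel"], ["Gift"], ["Clã"]):
    model.update(session)
print(round(model.S["Clã"]["Gift"], 2))
```

Pairs that never co-occur are simply absent from `S` and can be read as similarity 0, which keeps the structure sparse.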

Forgetting

• Usage and content change!
  - News content quickly becomes obsolete
  - Music/Movies/Books: popularity is often volatile

• How can CF adapt to change?
  - Forget older data
  - Two methods: sliding windows and fading factors

Forgetting: Sliding Windows

[Figure: session weight vs. session index; labels: window length, data in window, current session]

Good for non-incremental: rebuild S with data in window.
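
A sliding window paired with a full rebuild might look like the sketch below. The names `cosine_from_sessions` and `WINDOW_LENGTH`, the window size of 3, and the toy stream are all hypothetical; the stream only imitates the shape of an abrupt change and is not the ART dataset. `collections.deque` with `maxlen` drops the oldest session automatically once the window is full.

```python
import math
from collections import deque


def cosine_from_sessions(sessions):
    """Rebuild pairwise cosine similarities from the given sessions only."""
    count = {}  # count[i] = number of sessions containing item i
    co = {}     # co[(i, j)] = number of sessions containing both i and j
    for s in sessions:
        items = sorted(set(s))
        for i in items:
            count[i] = count.get(i, 0) + 1
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                key = (items[a], items[b])
                co[key] = co.get(key, 0) + 1
    return {k: c / math.sqrt(count[k[0]] * count[k[1]]) for k, c in co.items()}


WINDOW_LENGTH = 3                      # hypothetical window size
window = deque(maxlen=WINDOW_LENGTH)   # oldest session falls out when full

# Toy stream with an abrupt change from {a,b,c} to {a,b,d}.
stream = [["a", "b", "c"]] * 2 + [["a", "b", "d"]] * 3
for session in stream:
    window.append(session)
    S = cosine_from_sessions(window)   # rebuild using data in the window only

print(sorted(S))
```

Once the window holds only post-change sessions, item c no longer appears in `S` at all, which is the "forgetting" effect.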

Forgetting: Fading Factors

[Figure: session weight vs. session index, up to the current session]

Good for incremental. Before updating S:

$$S = \alpha S, \quad 0 < \alpha < 1$$

α = 1 means no fading.
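
The sketch below shows one way a fading factor could be combined with the incremental update. The slide only states S = αS; this sketch additionally fades the cached counts Int, which is an assumption made here because rows recomputed from an unfaded cache would overwrite the faded values for items in the session. The function names `fading_update` and `run` and the toy stream are hypothetical.

```python
import math


def fading_update(S, Int, session, alpha):
    """Apply one session with fading factor alpha (0 < alpha <= 1)."""
    # Forget: scale old similarities and (assumption) the cached counts.
    for key in S:
        S[key] *= alpha
    for key in Int:
        Int[key] *= alpha
    items = set(session)
    # The new session enters the cache with full weight 1.
    for i in items:
        for j in items:
            Int[(i, j)] = Int.get((i, j), 0.0) + 1.0
    # Refresh every off-diagonal pair that involves a session item.
    # (A full scan of Int for brevity; a real implementation would index by row.)
    for (i, j), count in Int.items():
        if i != j and (i in items or j in items):
            S[(i, j)] = count / math.sqrt(Int[(i, i)] * Int[(j, j)])


def run(stream, alpha):
    S, Int = {}, {}
    for session in stream:
        fading_update(S, Int, session, alpha)
    return S


# Toy stream with an abrupt change from {a,b,c} to {a,b,d}.
stream = [["a", "b", "c"]] * 5 + [["a", "b", "d"]] * 3
faded = run(stream, alpha=0.5)
plain = run(stream, alpha=1.0)
print("alpha=0.5:", round(faded[("a", "c")], 2), round(faded[("a", "d")], 2))
print("alpha=1.0:", round(plain[("a", "c")], 2), round(plain[("a", "d")], 2))
```

In this toy run, three sessions after the change the faded model already rates the a/d pair above a/c, while the unfaded model still favours a/c.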

Implementation

• Implementation in R
• Code available from previous work (C. Miranda)
• Adapt algorithms to use forgetting mechanisms
• Improvements: sparse matrix handling
• Limitations with R: speed

Experiments

• Aims
  - Forgetting: is it useful?
  - Sliding windows vs fading factors
  - Item-based better than user-based?

• Evaluation method
  - All-but-one protocol (training, test and hidden sets)
  - Artificial disturbances in datasets
  - Accuracy: precision/recall (binary ratings)
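
An all-but-one evaluation with precision and recall could be sketched as below. Several details are assumptions, since the slides do not specify them: the hidden item is taken to be the last item of each test session, recommendations are ranked by summed similarity to the observed items, and precision is computed as hits divided by N recommendations per test session. The names `recommend` and `all_but_one` and the toy similarity values are hypothetical.

```python
def recommend(S, seen, n):
    """Rank unseen items by summed similarity to the seen items; return the top n."""
    scores = {}
    for (i, j), sim in S.items():
        if i in seen and j not in seen:
            scores[j] = scores.get(j, 0.0) + sim
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]


def all_but_one(S, test_sessions, n):
    """Hide one item per test session and check whether it is recommended."""
    hits = 0
    for session in test_sessions:
        hidden = session[-1]            # assumption: hide the last item
        observed = set(session[:-1])
        if hidden in recommend(S, observed, n):
            hits += 1
    recall = hits / len(test_sessions)
    precision = hits / (n * len(test_sessions))
    return precision, recall


# Toy symmetric similarities (hypothetical values).
pairs = {("a", "b"): 0.9, ("a", "c"): 0.5, ("b", "c"): 0.4, ("a", "d"): 0.1}
S = {}
for (i, j), sim in pairs.items():
    S[(i, j)] = sim
    S[(j, i)] = sim

precision, recall = all_but_one(S, [["a", "b"], ["a", "d"]], n=2)
print(precision, recall)
```

With a single hidden item per session, recall is the fraction of sessions where the hidden item was recovered, and precision is that fraction divided by N.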

Experiments: datasets

2 sequential datasets:

Dataset  Origin           # sessions  # items
PALCO*   Palco principal  725         1285
ART*     Artificial       1500        4

PALCO: tracks listened to on Palcoprincipal
ART: dataset with an abrupt change, {a,b,c} → {a,b,d} at session 500

Results (so far)

• Matrix update time
  - Update time < rebuild time
  - Item-based better for #users > #items
  - PALCO: user-based performs better
  - Non-incremental good with small windows
• Recommendation time
  - Item-based is faster
• Recovery from drifts
  - ART: α < 1 recovers faster than α = 1 (as expected)
  - PALCO: α = 1 still better even with 90% drift!

Issues

• Forgetting
  - Not good for PALCO dataset?
  - Good with ART dataset, but not realistic
  - Other datasets (e.g. news)?
• Long-term effects → larger-scale experiments
  - Better hardware: on the way
  - Other implementations (Java, C, SQL…)
• Palcoprincipal
  - More items than users!
  - Item-based possibly better for artist recommendations.

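Slide 12. Accuracy: IBFF (item-based, fading factors) with ART
[Figure: accuracy plot; not recoverable from the extracted text]

Slide 13. Accuracy: UBSW, UBFF (user-based, sliding windows and fading factors) with PALCO
[Figure: accuracy plot; not recoverable from the extracted text]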

Thank you!
