Incremental Item-based
Collaborative Filtering




              João Marques da Silva
          Palco Workshop - May 13, 2009
Item Similarity

                          Clã       Xutos      Gift    DaWeasel
            Ana            1          1         0           0
            Miguel         1          1         1           0
            Ivo            0          1         0           1
            Paula          0          0         1           0
            Joana          1          0         0           0

Take columns as vectors:
  $v_{\text{Clã}} = (1, 1, 0, 0, 1)$   and   $v_{\text{Gift}} = (0, 1, 0, 1, 0)$

Similarity between Clã and Gift (cosine measure):

  $\mathrm{sim}(\text{Clã}, \text{Gift}) = \cos(v_{\text{Clã}}, v_{\text{Gift}}) = \frac{v_{\text{Clã}} \cdot v_{\text{Gift}}}{\|v_{\text{Clã}}\| \, \|v_{\text{Gift}}\|} = \frac{1}{\sqrt{3}\,\sqrt{2}} \simeq 0.41$

Similarity Matrix
                             S matrix
                      m × m, with m = number of items

                       Clã     Xutos      Gift    DaWeasel
           Clã         1
           Xutos       ...       1
           Gift       0.41      ...        1
           DaWeasel    0        ...        0         1
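
A minimal sketch of the cosine computation for the example ratings table, in plain Python. The names `ITEMS`, `RATINGS`, `column` and `cosine` are illustrative only and do not come from the original implementation.

```python
import math

# Ratings table from the Item Similarity slide (rows: users, columns: items).
ITEMS = ["Clã", "Xutos", "Gift", "DaWeasel"]
RATINGS = {
    "Ana":    [1, 1, 0, 0],
    "Miguel": [1, 1, 1, 0],
    "Ivo":    [0, 1, 0, 1],
    "Paula":  [0, 0, 1, 0],
    "Joana":  [1, 0, 0, 0],
}

def column(item):
    """Return the rating vector of one item across all users."""
    idx = ITEMS.index(item)
    return [row[idx] for row in RATINGS.values()]

def cosine(u, v):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Full item-item similarity matrix S as a nested dict.
S = {i: {j: cosine(column(i), column(j)) for j in ITEMS} for i in ITEMS}

for i in ITEMS:
    print(i.ljust(9), [round(S[i][j], 2) for j in ITEMS])
```

Each entry needs one pass over all users, so building the whole matrix this way costs on the order of m² × n operations for m items and n users.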


     How do we keep S up-to-date?

    Rebuild S at each new session:
        O(m²n) for m items and n users.
    Incrementally update S with session data:
         O(km) for k items in session.
Algorithm
 Cosine measure for binary ratings:

   $\cos(i, j) = \frac{\#(I \cap J)}{\sqrt{\#I} \times \sqrt{\#J}}$,   where I, J are the sets of users that rated items i, j

 A cache matrix Int stores #(I ∩ J) for all item pairs (i,j):
    $Int_{i,j} = \#(I \cap J)$
    $Int_{i,i} = \#I$

 For each new session:
      Increment $Int_{i,j}$ by 1 for each item pair (i,j) in session (including i = j)
      For each item i in session, update the corresponding row/column in S, for every item j:

        $S_{i,j} = \frac{Int_{i,j}}{\sqrt{Int_{i,i}} \times \sqrt{Int_{j,j}}}$
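
The update steps above can be sketched as follows. This is an illustrative Python version, not the original R code: the class name `IncrementalItemCF`, the nested-dict layout, and treating each user row of the example table as one session are all assumptions made for the sketch.

```python
import math
from collections import defaultdict

class IncrementalItemCF:
    def __init__(self):
        # cache[i][j] plays the role of Int: sessions containing both i and j;
        # cache[i][i] is the number of sessions containing i.
        self.cache = defaultdict(lambda: defaultdict(int))
        self.sim = defaultdict(dict)   # sparse similarity matrix S

    def update(self, session):
        items = set(session)
        # Step 1: increment the cache for every item pair in the session,
        # including (i, i).
        for i in items:
            for j in items:
                self.cache[i][j] += 1
        # Step 2: recompute only the rows/columns of items in the session.
        for i in items:
            for j, both in self.cache[i].items():
                s = both / math.sqrt(self.cache[i][i] * self.cache[j][j])
                self.sim[i][j] = s
                self.sim[j][i] = s

# Hypothetical usage: one session per user of the example table.
sessions = [
    ["Clã", "Xutos"],           # Ana
    ["Clã", "Xutos", "Gift"],   # Miguel
    ["Xutos", "DaWeasel"],      # Ivo
    ["Gift"],                   # Paula
    ["Clã"],                    # Joana
]
model = IncrementalItemCF()
for s in sessions:
    model.update(s)

print(round(model.sim["Clã"]["Gift"], 2))
```

Only entries involving a session item can change, because only their counts change. That is why a session with k items touches at most k rows of S.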
Forgetting

   Usage and content change!
       News content quickly becomes obsolete
       Music/Movies/Books - popularity is often volatile

   How can CF adapt to change?
       Forget older data
       Two methods: sliding windows and fading factors
Forgetting: Sliding Windows

   [Figure: Sliding Windows. Session weight vs. session index; labels: window length, data in window, current session.]

Good for non-incremental:
Rebuild S with data in window.
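
A sliding window combined with a full rebuild could look like the sketch below. The names `build_similarity` and `SlidingWindowCF` and the window length of 3 are hypothetical choices for illustration.

```python
import math
from collections import deque

def build_similarity(sessions):
    """Rebuild the cosine similarity matrix from scratch for binary sessions."""
    inter = {}   # inter[(i, j)] = number of sessions containing both i and j
    for session in sessions:
        items = set(session)
        for i in items:
            for j in items:
                inter[(i, j)] = inter.get((i, j), 0) + 1
    return {
        (i, j): n / math.sqrt(inter[(i, i)] * inter[(j, j)])
        for (i, j), n in inter.items()
    }

class SlidingWindowCF:
    def __init__(self, window_length):
        # A bounded deque drops the oldest session automatically.
        self.window = deque(maxlen=window_length)
        self.sim = {}

    def add_session(self, session):
        self.window.append(session)
        self.sim = build_similarity(self.window)   # non-incremental rebuild

# Hypothetical stream: item b stops appearing after the second session.
model = SlidingWindowCF(window_length=3)
for s in [["a", "b"], ["a", "b"], ["a", "c"], ["a", "c"], ["a", "c"]]:
    model.add_session(s)

print(model.sim.get(("a", "c"), 0.0))
print(model.sim.get(("a", "b"), 0.0))
```

Once the two sessions containing b have left the window, b no longer contributes to S at all, which is the hard cut-off the figure describes.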
Forgetting: Fading Factors

   [Figure: Fading Factors. Session weight vs. session index, up to the current session.]

Good for incremental. Before updating S:
     S = αS , 0 < α < 1
α = 1 corresponds to no fading.
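
The sketch below illustrates the effect of a fading factor on a toy stream with an abrupt change. It rests on several assumptions that the slides do not confirm: the decay is applied to the co-occurrence counts from which S is derived (rather than to S directly), every session contains the full current item set, and the class name `FadingItemCF` and helper `run` are made up for the example. Scaling every count at every session is done naively here for clarity.

```python
import math

class FadingItemCF:
    """Co-occurrence counts are multiplied by alpha before each new session
    is added (an assumption of this sketch)."""

    def __init__(self, alpha):
        self.alpha = alpha
        # counts[(i, j)] = faded co-occurrence of i and j; (i, i) = faded #I
        self.counts = {}

    def update(self, session):
        for key in self.counts:          # forget: scale every stored count
            self.counts[key] *= self.alpha
        items = set(session)
        for i in items:
            for j in items:
                self.counts[(i, j)] = self.counts.get((i, j), 0.0) + 1.0

    def sim(self, i, j):
        ni = self.counts.get((i, i), 0.0)
        nj = self.counts.get((j, j), 0.0)
        if ni == 0.0 or nj == 0.0:
            return 0.0
        return self.counts.get((i, j), 0.0) / math.sqrt(ni * nj)

def run(alpha):
    """Feed 500 sessions of {a,b,c}, then 100 sessions of {a,b,d}."""
    model = FadingItemCF(alpha)
    for _ in range(500):
        model.update(["a", "b", "c"])
    for _ in range(100):
        model.update(["a", "b", "d"])
    return model.sim("a", "c"), model.sim("a", "d")

old_nofade, new_nofade = run(1.0)    # alpha = 1: no fading
old_fade, new_fade = run(0.99)       # alpha < 1: older sessions weigh less

print("alpha=1.00  sim(a,c)=%.2f  sim(a,d)=%.2f" % (old_nofade, new_nofade))
print("alpha=0.99  sim(a,c)=%.2f  sim(a,d)=%.2f" % (old_fade, new_fade))
```

Under these assumptions a session that is t steps old carries weight α^t. With α = 0.99 the new pair (a, d) already outranks the obsolete pair (a, c) 100 sessions after the change, while with α = 1 it does not.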
Implementation


   Implementation in R
       Code available from previous work (C. Miranda)
       Adapt algorithms to use forgetting mechanisms
       Improvements: sparse matrix handling
       Limitations with R: speed




Experiments

   Aims
       Forgetting – is it useful?
       Sliding windows vs fading factors
       Item-based better than user-based?

   Evaluation method
       All-but-one protocol (training, test and hidden sets)
       Artificial disturbances in datasets
       Accuracy: precision/recall (binary ratings)
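
The slides name the evaluation protocol without giving details, so the following is only a sketch of one common reading of all-but-one: for each test session a single item is hidden, recommendations are generated from the remaining items, and a hit is counted when the hidden item is recommended. Picking the hidden item at random, the metric definitions (precision = hits / recommended items, recall = hits / hidden items) and the names `evaluate_all_but_one` and `toy_recommend` are assumptions.

```python
import random

def evaluate_all_but_one(test_sessions, recommend, n, seed=0):
    """Return (precision, recall) of `recommend` under an all-but-one split.

    `recommend(observed, n)` is any callable returning up to n item ids.
    """
    rng = random.Random(seed)
    hits = recommended = evaluated = 0
    for session in test_sessions:
        items = sorted(set(session))
        if len(items) < 2:
            continue                       # nothing left to recommend from
        hidden = rng.choice(items)         # the single hidden item
        observed = [it for it in items if it != hidden]
        recs = list(recommend(observed, n))[:n]
        evaluated += 1
        recommended += len(recs)
        hits += hidden in recs
    recall = hits / evaluated if evaluated else 0.0
    precision = hits / recommended if recommended else 0.0
    return precision, recall

# Hypothetical recommender: suggest every catalog item not yet in the session.
CATALOG = ["a", "b", "c"]

def toy_recommend(observed, n):
    return [it for it in CATALOG if it not in observed][:n]

precision, recall = evaluate_all_but_one(
    [["a", "b"], ["a", "c"], ["b", "c"]], toy_recommend, n=2
)
print(precision, recall)
```

In this toy case the hidden item is always among the two recommendations, so recall is 1.0 and precision is 0.5 (one hit per two recommended items).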
Experiments: datasets

  2 sequential datasets:
    Dataset   Origin             # sessions   # items
    PALCO*    Palco Principal           725      1285
    ART*      Artificial               1500         4

  PALCO: tracks listened to on Palco Principal
  ART: dataset with abrupt change, {a,b,c} → {a,b,d} at session 500
Results (so far)
   Matrix update time
       Update time < Rebuild time
       Item-based better for #users > #items
       PALCO: user-based performs better
       Non-incremental good with small windows
   Recommendation time
       Item-based is faster
   Recovery from drifts
        ART: α<1 recovers faster than α=1 (as expected)
        PALCO: α=1 still better even with 90% drift!
Accuracy: IBFF (item-based, fading factors) with ART

   [Figure: accuracy plot]

Accuracy: UBSW, UBFF (user-based, sliding windows / fading factors) with PALCO

   [Figure: accuracy plot]

Issues

   Forgetting
       Not good for PALCO dataset?
       Good with ART dataset, but not realistic
       Other datasets (e.g. news)?
   Long term effects → larger scale experiments
       Better hardware - on the way
       Other implementations (Java, C, SQL…)
   Palco Principal
       More items than users!
       Item-based possibly better for artist recommendations.
Thank you!




