Crab
                A Python Framework for Building
                    Recommendation Engines
                  PythonBrasil 2011, São Paulo, SP


Marcel Caraciolo Ricardo Caspirro                Bruno Melo
   @marcelcaraciolo        @ricardocaspirro          @brunomelo
What is Crab ?

 A Python framework for building recommendation engines
A Scikit module for collaborative, content and hybrid filtering
       Mahout Alternative for Python Developers :D
             Open-Source under the BSD license


             https://github.com/muricoca/crab
When started ?

It began one year ago
Community-driven, 4 members
Since April 2011, incorporated into the Muriçoca open-source labs
Since April 2011, being rewritten as a Scikit




                https://github.com/muricoca/
Knowing Scikits
Scikits are SciPy Toolkits - independent projects hosted
                under a common namespace.


                       Scikits Image
                     Scikits MlabWrap
                     Scikits AudioLab
                      Scikit Learn
                             ....

           http://scikits.appspot.com/scikits
Knowing Scikits

                        Scikit-Learn

    Machine Learning Algorithms + scientific Python packages
                (Numpy, Scipy and Matplotlib)

           http://scikit-learn.sourceforge.net/


Our goal: make Crab a Scikit and contribute
           some of its parts to Scikit-learn
Why Recommendations ?
The world is an over-crowded place
 !"#$%&'()$*+$,-$&.#'/0'&%)#)$1(,0#
Why Recommendations ?

                    We are overloaded

Thousands of news articles and blog posts each day

 Millions of movies, books and music tracks online

          Several Places, Offers and Events

  Even Friends - sometimes we are overloaded!
Why Recommendations ?
We really need and consume only a few of them!

   “A lot of times, people don’t know what
   they want until you show it to them.”
                                         Steve Jobs

  “We are leaving the Information age, and
  entering into the Recommendation age.”
                      Chris Anderson, from the book The Long Tail
Why Recommendations ?
Can Google help ?
  Yes, but only when we really know what we are looking for
           But what does it mean by “interesting” ?
Can Facebook help ?
  Yes, I tend to find my friends’ stuff interesting
   What if I had only a few friends, and what they like does not always
                             appeal to me ?
Can experts help ?
  Yes, but it won’t scale well.
    And it is what they like, not what I like - exactly the same advice for everyone!
Why Recommendations ?
         Recommendation Systems
Systems designed to recommend to me something I may like
Why Recommendations ?
     !"#$%&"'$"'(')*#*+,)
     Recommendation Systems

      -+*#)+.               -#/')             0#)1#




                                    !
2'              23&4"+')1               5,6           7),*%'"&863


                      Graph Representation
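
A minimal way to picture this graph in code - a plain nested dict of
user -> item -> rating, the same shape Crab's sample datasets use later in
this deck (user and item names below are made up for illustration):

    # hypothetical toy data: each user maps to the items they rated (graph edges)
    preferences = {
        'Ze':     {'Thor': 4.0, 'Armagedon': 3.0},
        'Amanda': {'Toy Store': 5.0, 'Armagedon': 2.5},
        'Marcel': {'Thor': 4.5, 'Toy Store': 4.0},
    }

    # edges of the bipartite graph as (user, item, rating) triples
    edges = [(user, item, rating)
             for user, items in preferences.items()
             for item, rating in items.items()]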
The current Crab

Collaborative Filtering algorithms
 User-Based, Item-Based and Matrix Factorization (SVD)

Evaluation of the Recommender Algorithms
 Precision, Recall, F1-Score, RMSE




                           Precision-Recall Charts
The current Crab

   [Figure: Precision-Recall chart]
Collaborative Filtering

[Figure: users (Marcel, Rafael, Amanda) connected by "like" edges to items
 (Thor, O Vento Levou, Armagedon, Toy Store); a similar user's liked items
 drive the "recommends" edge]
The current Crab
>>> # load the dataset

>>> from crab.datasets import load_sample_movies
>>> data = load_sample_movies()
>>> data
{'DESCR': 'sample_movies data set was collected by the book called \nProgramming the Collective Intelligence by Toby Segaran \n\nNotes\n-----\nThis data set consists of\n\t* n ratings with (1-5) from n users to n movies.',
 'data': {1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
  2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0},
  3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0},
  4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0},
  5: {2: 4.5, 3: 1.0, 4: 4.0},
  6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},
  7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0}},
 'item_ids': {1: 'Lady in the Water',
  2: 'Snakes on a Planet',
  3: 'You, Me and Dupree',
  4: 'Superman Returns',
  5: 'The Night Listener',
  6: 'Just My Luck'},
 'user_ids': {1: 'Jack Matthews',
  2: 'Mick LaSalle',
  3: 'Claudia Puig',
  4: 'Lisa Rose',
  5: 'Toby',
  6: 'Gene Seymour',
  7: 'Michael Phillips'}}
The current Crab

>>> from crab.models import MatrixPreferenceDataModel
>>> m = MatrixPreferenceDataModel(data.data)

>>> print m
MatrixPreferenceDataModel (7 by 6)
         1          2          3          4            5        ...
1        3.000000   4.000000   3.500000   5.000000   3.000000
2        3.000000   4.000000   2.000000   3.000000   3.000000
3           ---     3.500000   2.500000   4.000000   4.500000
4        2.500000   3.500000   2.500000   3.500000   3.000000
5           ---     4.500000   1.000000   4.000000       ---
6        3.000000   3.500000   3.500000   5.000000   3.000000
7        2.500000   3.000000       ---    3.500000   4.000000
The current Crab
>>> #import pairwise distance
>>> from crab.metrics.pairwise import euclidean_distances
>>> #import similarity
>>> from crab.similarities import UserSimilarity
>>> similarity = UserSimilarity(m, euclidean_distances)
>>> similarity[1]
       [(1, 1.0),
(6, 0.66666666666666663),
(4, 0.34054242658316669),
(3, 0.32037724101704074),
(7, 0.32037724101704074),
(2, 0.2857142857142857),
(5, 0.2674788903885893)]
The current Crab

>>> from crab.recommenders.knn import UserBasedRecommender
>>> recsys = UserBasedRecommender(model=m,
       similarity=similarity, capper=True, with_preference=True)

>>> recsys.recommend(5)
array([[ 5.        , 3.45712869],
       [ 1.        , 2.78857832],
       [ 6.        , 2.38193068]])

>>> recsys.recommended_because(user_id=5,item_id=1)
array([[ 2. , 3. ],
       [ 1. , 3. ],
       [ 6. , 3. ],
       [ 7. , 2.5],
       [ 4. , 2.5]])
The current Crab




Using REST APIs to deploy the recommender
          django-piston, django-rest, django-tastypie
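
As an illustration of that kind of deployment, a minimal sketch of a plain
Django view returning recommendations as JSON (this is not Crab's own API for
it; recsys is assumed to be a recommender built as in the earlier slides, and
the view name is hypothetical):

    import json
    from django.http import HttpResponse

    def recommend_view(request, user_id):
        # recsys.recommend(...) returns [[item_id, score], ...] for the user
        recommendations = recsys.recommend(int(user_id))
        payload = [{'item_id': int(i), 'score': float(s)}
                   for i, s in recommendations]
        return HttpResponse(json.dumps(payload), content_type='application/json')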
Crab is already in production

   News recommendations for the Abril publisher!
                    Collecting over 10 magazines, 20 books and 100+ articles




  Running on Python
      + Scipy +
       Django

Content-Based-Filtering


Easy-to-use interface

  Still in development
Content Based Filtering

[Figure: Marcel likes "Duro de Matar"; items similar to it (O Vento Levou,
 Armagedon, Toy Store) are candidates, and the most similar one is recommended]
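
A tiny sketch of the content-based idea (illustrative only, not Crab's API):
score unseen items by the overlap between their tags and the tags of items the
user already likes - the item data below is invented.

    def content_based_scores(item_tags, liked_items):
        # collect the tags of everything the user liked
        liked_tags = set()
        for item in liked_items:
            liked_tags |= item_tags[item]
        # Jaccard overlap between each unseen item's tags and the liked tags
        scores = {}
        for item, tags in item_tags.items():
            if item in liked_items:
                continue
            scores[item] = len(tags & liked_tags) / float(len(tags | liked_tags) or 1)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    item_tags = {'Duro de Matar': {'action'},
                 'Armagedon':     {'action', 'space'},
                 'Toy Store':     {'animation'},
                 'O Vento Levou': {'drama'}}
    print(content_based_scores(item_tags, ['Duro de Matar']))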
Crab is already in production

        PythonBrasil keynotes Recommender
               Recommending keynotes based on a hybrid approach




  Running on Python
      + Scipy +
       Django
Content-Based-Filtering
          +
Collaborative Filtering

   Schedule your
     keynotes

   Still in development
                   Crab is already in production

                                  Hybrid Meta Approach

[Slide shows excerpts of a paper on a mobile product/service recommender built
 as a meta recommender: content-based filtering over the product/service
 repository combined with collaborative filtering over user reviews, plus text
 mining of review polarity feeding the final recommendation score.
 Figure captions: "Fig. 1. Meta Recommender Architecture",
 "Fig. 2. User Reviews from Foursquare Social Network",
 "Fig. 3. Mobile Recommender System Architecture"]
Crab is already in production

  Brazilian social network called Atepassar.com
         Educational network with more than 60,000 students and 120 video classes




     Running on Python
    + Numpy + Scipy and
          Django


Backend for Recommendations
MongoDB - mongoengine

   Daily Recommendations
    with Explanations
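
For illustration, a hypothetical mongoengine document that such a backend
could use to persist a user's daily recommendations together with their
explanations (field names are invented, not Atepassar's actual schema):

    import datetime
    from mongoengine import (Document, StringField, FloatField,
                             ListField, DateTimeField)

    class DailyRecommendation(Document):
        user_id    = StringField(required=True)
        item_id    = StringField(required=True)
        score      = FloatField()
        because    = ListField(StringField())  # items/friends that explain it
        created_at = DateTimeField(default=datetime.datetime.utcnow)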
Evaluating your recommender
 Crab implements the most used recommender metrics.
     Precision, Recall, F1-Score, RMSE



     Using matplotlib
     for a plotter utility

 Implement new metrics

Simulations support maybe (??)
Evaluating your recommender
>>> from crab.metrics.classes import CfEvaluator
>>> evaluator = CfEvaluator()

>>> evaluator.evaluate(recommender=recsys, metric='rmse')
   {'rmse': 0.69467177857026907}
>>> evaluator.evaluate_on_split(recommender=recsys, at=2)
    ({'error': [{'mae': 0.345, 'nmae': 0.4567, 'rmse': 0.568},
          {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788},
          {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788}],
 'ir': [{'f1score': 0.456, 'precision': 0.78557, 'recall':0.55677},
   {'f1score': 0.64567, 'precision': 0.67865, 'recall': 0.785955},
  {'f1score': 0.45070, 'precision': 0.74744, 'recall': 0.858585}]},
           {'final_score': {'avg': {'f1score': 0.495955,
                            'mae': 0.429292,
                           'nmae': 0.373739,
                        'precision': 0.63932929,
                         'recall': 0.729939393,
                          'rmse': 0.3466868},
                  'stdev': {'f1score': 0.09938383 ,
                           'mae': 0.0593933,
                          'nmae': 0.03393939,
                        'precision': 0.0192929,
                         'recall': 0.031293939,
                        'rmse': 0.234949494}}})
Distributing the recommendation computations


Use Hadoop and Map-Reduce intensively
  Investigating the Yelp mrjob framework     https://github.com/pfig/mrjob



Develop the Netflix-prize and other state-of-the-art techniques
    Matrix Factorization, Singular Value Decomposition (SVD), Boltzmann machines



One of the most commonly used is the Slope One technique.
   Simple algebra: predictors of the form y = x + b (a line with slope one)
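
A minimal, illustrative sketch of weighted Slope One in plain Python (a toy
helper, not Crab's implementation): for each item co-rated with the target we
keep the average rating difference b, then predict y = x + b averaged over the
user's rated items, weighted by how often each pair was co-rated.

    def slope_one_predict(prefs, user, target_item):
        """prefs: {user: {item: rating}}; predict user's rating for target_item."""
        diffs, counts = {}, {}
        for ratings in prefs.values():
            if target_item not in ratings:
                continue
            for item, r in ratings.items():
                if item == target_item:
                    continue
                diffs[item] = diffs.get(item, 0.0) + (ratings[target_item] - r)
                counts[item] = counts.get(item, 0) + 1
        num, den = 0.0, 0
        for item, r in prefs[user].items():
            if item in diffs:
                b = diffs[item] / counts[item]     # mean difference: the "b" in y = x + b
                num += (r + b) * counts[item]      # weight by co-rating frequency
                den += counts[item]
        return num / den if den else None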
Cache/Parallelism with joblib
                               http://packages.python.org/joblib/index.html


     import numpy as np
     from joblib import Memory
     memory = Memory(cachedir='', verbose=0)

     class UserSimilarity(BaseSimilarity):
         ...

         @memory.cache
         def get_similarity(self, source_id, target_id):
             source_preferences = self.model.preferences_from_user(source_id)
             target_preferences = self.model.preferences_from_user(target_id)
             ...
             return self.distance(source_preferences, target_preferences) \
                 if not source_preferences.shape[1] == 0 \
                    and not target_preferences.shape[1] == 0 else np.array([[np.nan]])

         def get_similarities(self, source_id):
             return [(other_id, self.get_similarity(source_id, other_id))
                     for other_id, v in self.model]


>>> #Without memory.cache                       >>> #With memory.cache
>>> timeit similarity.get_similarities          >>> timeit similarity.get_similarities
       ('marcel_caraciolo')                            ('marcel_caraciolo')
   100 loops, best of 3: 978 ms per loop            100 loops, best of 3: 434 ms per loop
Cache/Parallelism with joblib
                      http://packages.python.org/joblib/index.html




 Investigating how to use multiprocessing and parallel packages for the similarity
                                  computations




     from joblib import Parallel, delayed
     ...

     def get_similarities(self, source_id):
         other_ids = [other_id for other_id, v in self.model]
         scores = Parallel(n_jobs=3)(delayed(self.get_similarity)(source_id, other_id)
                                     for other_id in other_ids)
         return list(zip(other_ids, scores))
Distributed Computing with mrJob
                          https://github.com/Yelp/mrjob


                                                """The classic MapReduce job: count the frequency of words.
                                                """
                                                from mrjob.job import MRJob
                                                import re

                                                WORD_RE = re.compile(r"[\w']+")

                                                class MRWordFreqCount(MRJob):

                                                    def mapper(self, _, line):
                                                        for word in WORD_RE.findall(line):
                                                            yield (word.lower(), 1)

                                                    def reducer(self, word, counts):
                                                        yield (word, sum(counts))

                                                if __name__ == '__main__':
                                                    MRWordFreqCount.run()




It supports Amazon’s Elastic MapReduce(EMR) service, your own Hadoop cluster or
                                 local (for testing)
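
Assuming the job above is saved as word_count.py, the same script can be run
locally for testing or shipped to a cluster just by switching mrjob's runner
flag (the input file name below is illustrative):

    $ python word_count.py input.txt             # inline, for quick testing
    $ python word_count.py -r local input.txt    # simulate Hadoop locally
    $ python word_count.py -r emr input.txt      # run on Amazon Elastic MapReduce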
Distributed Computing with mrJob
                                         https://github.com/Yelp/mrjob

Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
Future studies with Sparse Matrices
     Real datasets come with lots of empty values
      http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html



   Solutions:

          scipy.sparse package

          Sharding operations

          Matrix Factorization
           techniques (SVD)




  Crab implements a Matrix
Factorization with Expectation
   Maximization algorithm
      scikits.crab.svd package
                                                      Apontador Reviews Dataset
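
A small sketch of the scipy.sparse option: store only the observed ratings of
a user x item matrix (toy numbers below, not the Apontador dataset):

    import numpy as np
    from scipy.sparse import coo_matrix

    # (user index, item index, rating) triples - everything else stays "empty"
    rows = np.array([0, 0, 1, 2])
    cols = np.array([0, 2, 1, 2])
    vals = np.array([5.0, 3.0, 4.0, 1.5])

    ratings = coo_matrix((vals, (rows, cols)), shape=(3, 4)).tocsr()
    print(ratings[0, 2])   # 3.0, without materializing the missing cells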
Optimizations with Cython
                                                   http://cython.org/


Cython is a Python extension that lets developers annotate functions so they can be compiled to C.

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

# for notes on compiler flags see:
# http://docs.python.org/install/index.html

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("spearman_correlation_cython",
                             ["spearman_correlation_cython.pyx"])]
)


                            http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html
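
For a feel of what goes in such a .pyx file, a hypothetical typed loop of the
kind Cython compiles down to C (a sketch only - not the actual
spearman_correlation_cython.pyx shipped with Crab):

    # distances.pyx (illustrative)
    import numpy as np
    cimport numpy as np

    def sum_squared_diff(np.ndarray[np.float64_t, ndim=1] x,
                         np.ndarray[np.float64_t, ndim=1] y):
        cdef Py_ssize_t i, n = x.shape[0]
        cdef double acc = 0.0
        for i in range(n):          # typed loop: no Python objects inside
            acc += (x[i] - y[i]) * (x[i] - y[i])
        return acc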
Benchmarks

                                         Old Crab                New Crab
             Dataset               Pure Python w/ dicts   Python w/ Scipy and Numpy
     MovieLens 100k                       15.32 s                  9.56 s
       http://www.grouplens.org/node/73


[Figure: bar chart of time elapsed to recommend 5 items, Old Crab vs New Crab]
Why migrate ?
The old Crab ran on pure Python only
     Recommendations demand heavy math and lots of processing

Compatible with the Numpy and Scipy libraries
   High-standard, popular scientific libraries optimized for numerical computing in Python

Scikits projects are amazing!
    Active communities, scientific conferences and updated projects (e.g. scikit-learn)

Make the Crab framework visible to the community
 Join the scientific researchers and machine learning developers around the globe coding with
                                 Python to help us in this project


                              Be Fast and Furious
Why migrate ?



Numpy optimized with PyPy

     2.x - 48.x Faster



  http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-and-roadmap.html
How are we working ?
            Sprints, Online Discussions and Issues




https://github.com/muricoca/crab/wiki/UpcomingEvents
How are we working ?
      Our Project’s Home Page




http://muricoca.github.com/crab
Future Releases
       Planned Release 0.1
   Collaborative Filtering Algorithms working, sample datasets to load and test


       Planned Release 0.11
                Sparse matrices and database models support


       Planned Release 0.12
                Slope One algorithm, new factorization techniques implemented



....
Join us!

1. Read our Wiki Page
    https://github.com/muricoca/crab/wiki/Developer-Resources

2. Check out our current sprints and open issues
    https://github.com/muricoca/crab/issues

3. Fork it - contributions come in as pull requests
4. Join us at irc.freenode.net #muricoca or at our
                     discussion list
                  http://groups.google.com/group/scikit-crab
Recommended Books




Toby Segaran, Programming Collective    Satnam Alag, Collective Intelligence in
Intelligence, O'Reilly, 2007            Action, Manning Publications, 2009



   ACM RecSys, KDD, SBSC...
Crab
              A Python Framework for Building
                  Recommendation Engines

           https://github.com/muricoca/crab

Marcel Caraciolo Ricardo Caspirro                            Bruno Melo
   @marcelcaraciolo           @ricardocaspirro                 @brunomelo

                      {marcel, ricardo,bruno}@muricoca.com

More Related Content

What's hot (6)

PDF
Moose workshop
Ynon Perek
 
PDF
Tom Critchlow - Data Feed SEO & Advanced Site Architecture
auexpo Conference
 
PDF
OO Perl with Moose
Nelo Onyiah
 
ODP
Moose talk at FOSDEM 2011 (Perl devroom)
xSawyer
 
PDF
Writing and Sharing Great Modules with the Puppet Forge
Puppet
 
PDF
Coffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, Flax
Lucidworks
 
Moose workshop
Ynon Perek
 
Tom Critchlow - Data Feed SEO & Advanced Site Architecture
auexpo Conference
 
OO Perl with Moose
Nelo Onyiah
 
Moose talk at FOSDEM 2011 (Perl devroom)
xSawyer
 
Writing and Sharing Great Modules with the Puppet Forge
Puppet
 
Coffee, Danish & Search: Presented by Alan Woodward & Charlie Hull, Flax
Lucidworks
 

Viewers also liked (8)

PDF
Apache Spark Machine Learning
Carol McDonald
 
PPTX
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Varad Meru
 
PDF
MLlib: Spark's Machine Learning Library
jeykottalam
 
PPTX
Collaborative Filtering using KNN
Şeyda Hatipoğlu
 
PDF
Recommender Systems with Apache Spark's ALS Function
Will Johnson
 
PDF
Collaborative Filtering and Recommender Systems By Navisro Analytics
Navisro Analytics
 
PDF
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
PDF
Machine Learning using Apache Spark MLlib
IMC Institute
 
Apache Spark Machine Learning
Carol McDonald
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Varad Meru
 
MLlib: Spark's Machine Learning Library
jeykottalam
 
Collaborative Filtering using KNN
Şeyda Hatipoğlu
 
Recommender Systems with Apache Spark's ALS Function
Will Johnson
 
Collaborative Filtering and Recommender Systems By Navisro Analytics
Navisro Analytics
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
Machine Learning using Apache Spark MLlib
IMC Institute
 
Ad

Similar to Crab: A Python Framework for Building Recommender Systems (20)

PDF
Introduction to Crab - Python Framework for Building Recommender Systems
Marcel Caraciolo
 
KEY
Php Code Audits (PHP UK 2010)
Damien Seguy
 
PDF
Symfony & Javascript. Combining the best of two worlds
Ignacio Martín
 
PDF
Advanced Topics in Continuous Deployment
Mike Brittain
 
PDF
Semantic search for Earth Observation products
Gasperi Jerome
 
PDF
Solving the Riddle of Search: Using Sphinx with Rails
freelancing_god
 
PDF
Machine Learning, Key to Your Classification Challenges
Marc Borowczak
 
PDF
Architectural Tradeoff in Learning-Based Software
Pooyan Jamshidi
 
PDF
CoffeeScript Design Patterns
TrevorBurnham
 
PDF
Cheapass.in — presented at JSFoo 2016
Aakash Goel
 
KEY
Django’s nasal passage
Erik Rose
 
KEY
Socket applications
João Moura
 
PDF
Why GC is eating all my CPU?
Roman Elizarov
 
PDF
You Don t Know JS ES6 Beyond Kyle Simpson
gedayelife
 
PDF
Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin...
Matt Raible
 
PPTX
Automated release management - DevConFu 2014
Kristoffer Deinoff
 
ODP
Choosing JavaScript Libraries - Refresh-Detroit.org
Chris Lee
 
KEY
Python在豆瓣的应用
Qiangning Hong
 
KEY
What's new in Puppet 3.0
Eric Sorenson
 
PDF
Monkeybars in the Manor
martinbtt
 
Introduction to Crab - Python Framework for Building Recommender Systems
Marcel Caraciolo
 
Php Code Audits (PHP UK 2010)
Damien Seguy
 
Symfony & Javascript. Combining the best of two worlds
Ignacio Martín
 
Advanced Topics in Continuous Deployment
Mike Brittain
 
Semantic search for Earth Observation products
Gasperi Jerome
 
Solving the Riddle of Search: Using Sphinx with Rails
freelancing_god
 
Machine Learning, Key to Your Classification Challenges
Marc Borowczak
 
Architectural Tradeoff in Learning-Based Software
Pooyan Jamshidi
 
CoffeeScript Design Patterns
TrevorBurnham
 
Cheapass.in — presented at JSFoo 2016
Aakash Goel
 
Django’s nasal passage
Erik Rose
 
Socket applications
João Moura
 
Why GC is eating all my CPU?
Roman Elizarov
 
You Don t Know JS ES6 Beyond Kyle Simpson
gedayelife
 
Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin...
Matt Raible
 
Automated release management - DevConFu 2014
Kristoffer Deinoff
 
Choosing JavaScript Libraries - Refresh-Detroit.org
Chris Lee
 
Python在豆瓣的应用
Qiangning Hong
 
What's new in Puppet 3.0
Eric Sorenson
 
Monkeybars in the Manor
martinbtt
 
Ad

More from Marcel Caraciolo (20)

PDF
Como interpretar seu próprio genoma com Python
Marcel Caraciolo
 
PDF
Joblib: Lightweight pipelining for parallel jobs (v2)
Marcel Caraciolo
 
PDF
Construindo softwares de bioinformática para análises clínicas : Desafios e...
Marcel Caraciolo
 
PDF
Como Python ajudou a automatizar o nosso laboratório v.2
Marcel Caraciolo
 
PDF
Como Python pode ajudar na automação do seu laboratório
Marcel Caraciolo
 
PDF
Python on Science ? Yes, We can.
Marcel Caraciolo
 
PDF
Oficina Python: Hackeando a Web com Python 3
Marcel Caraciolo
 
PDF
Recommender Systems with Ruby (adding machine learning, statistics, etc)
Marcel Caraciolo
 
PDF
Opensource - Como começar e dá dinheiro ?
Marcel Caraciolo
 
PDF
Big Data com Python
Marcel Caraciolo
 
PDF
Benchy, python framework for performance benchmarking of Python Scripts
Marcel Caraciolo
 
PDF
Python e 10 motivos por que devo conhece-la ?
Marcel Caraciolo
 
PDF
Benchy: Lightweight framework for Performance Benchmarks
Marcel Caraciolo
 
PDF
Python, A pílula Azul da programação
Marcel Caraciolo
 
PDF
Construindo Soluções Científicas com Big Data & MapReduce
Marcel Caraciolo
 
PDF
Como Python está mudando a forma de aprendizagem à distância no Brasil
Marcel Caraciolo
 
PDF
Novas Tendências para a Educação a Distância: Como reinventar a educação ?
Marcel Caraciolo
 
PDF
Aula WebCrawlers com Regex - PyCursos
Marcel Caraciolo
 
PDF
Arquivos Zip com Python - Aula PyCursos
Marcel Caraciolo
 
PDF
PyFoursquare: Python Library for Foursquare
Marcel Caraciolo
 
Como interpretar seu próprio genoma com Python
Marcel Caraciolo
 
Joblib: Lightweight pipelining for parallel jobs (v2)
Marcel Caraciolo
 
Construindo softwares de bioinformática para análises clínicas : Desafios e...
Marcel Caraciolo
 
Como Python ajudou a automatizar o nosso laboratório v.2
Marcel Caraciolo
 
Como Python pode ajudar na automação do seu laboratório
Marcel Caraciolo
 
Python on Science ? Yes, We can.
Marcel Caraciolo
 
Oficina Python: Hackeando a Web com Python 3
Marcel Caraciolo
 
Recommender Systems with Ruby (adding machine learning, statistics, etc)
Marcel Caraciolo
 
Opensource - Como começar e dá dinheiro ?
Marcel Caraciolo
 
Big Data com Python
Marcel Caraciolo
 
Benchy, python framework for performance benchmarking of Python Scripts
Marcel Caraciolo
 
Python e 10 motivos por que devo conhece-la ?
Marcel Caraciolo
 
Benchy: Lightweight framework for Performance Benchmarks
Marcel Caraciolo
 
Python, A pílula Azul da programação
Marcel Caraciolo
 
Construindo Soluções Científicas com Big Data & MapReduce
Marcel Caraciolo
 
Como Python está mudando a forma de aprendizagem à distância no Brasil
Marcel Caraciolo
 
Novas Tendências para a Educação a Distância: Como reinventar a educação ?
Marcel Caraciolo
 
Aula WebCrawlers com Regex - PyCursos
Marcel Caraciolo
 
Arquivos Zip com Python - Aula PyCursos
Marcel Caraciolo
 
PyFoursquare: Python Library for Foursquare
Marcel Caraciolo
 


Crab: A Python Framework for Building Recommender Systems

  • 1. Crab A Python Framework for Building Recommendation Engines PythonBrasil 2011, São Paulo, SP Marcel Caraciolo Ricardo Caspirro Bruno Melo @marcelcaraciolo @ricardocaspirro @brunomelo
  • 2. What is Crab ? A python framework for building recommendation engines A Scikit module for collaborative, content and hybrid filtering Mahout Alternative for Python Developers :D Open-Source under the BSD license https://github.com/muricoca/crab
  • 3. When started ? It began one year ago Community-driven, 4 members Since April,2011 the open-source labs Muriçoca incorporated it Since April,2011 rewritting it as Scikit https://github.com/muricoca/
  • 4. Knowing Scikits Scikits are Scipy Toolkits - independent and projects hosted under a common namespace. Scikits Image Scikits MlabWrap Scikits AudioLab Scikit Learn .... http://scikits.appspot.com/scikits
  • 5. Knowing Scikits Scikit-Learn Machine Learning Algorithms + scientific Python packages (Numpy, Scipy and Matplotlib) http://scikit-learn.sourceforge.net/ Our goal: Incorporate the Crab as Scikit and incorporate some parts of them at Scikit-learn
  • 6. Why Recommendations ? The world is an over-crowded place
  • 7. Why Recommendations ? We are overloaded Thousands of news articles and blog posts each day Millions of movies, books and music tracks online Several Places, Offers and Events Even Friends sometimes we are overloaded !
  • 8. Why Recommendations ? We really need and consume only a few of them! “A lot of times, people don’t know what they want until you show it to them.” Steve Jobs “We are leaving the Information age, and entering into the Recommendation age.” Chris Anderson, from book Long Tail
  • 9. Why Recommendations ? Can Google help ? Yes, but only when we really know what we are looking for But, what’s does it mean by “interesting” ? Can Facebook help ? Yes, I tend to find my friends’ stuffs interesting What if i had only few friends and what they like do not always attract me ? Can experts help ? Yes, but it won’t scale well. But it is what they like, not me! Exactly same advice!
  • 10. Why Recommendations ? Recommendation Systems Systems designed to recommend to me something I may like
  • 11. Why Recommendations ? Recommendation Systems Graph Representation
  • 12. The current Crab Collaborative Filtering algorithms: User-Based, Item-Based and Matrix Factorization (SVD) Evaluation of the Recommender Algorithms: Precision, Recall, F1-Score, RMSE Precision-Recall Charts
  • 13. The current Crab Precision-Recall Charts
  • 14. Collaborative Filtering (diagram): similar users (Marcel, Rafael, Amanda) and the items they like (O Vento Levou, Toy Store, Thor, Armagedon); items liked by similar users are recommended
  • 17. The current Crab >>>#load the dataset >>> from crab.datasets import load_sample_movies
  • 18. The current Crab >>>#load the dataset >>> from crab.datasets import load_sample_movies >>> data = load_sample_movies()
  • 19. The current Crab >>>#load the dataset >>> from crab.datasets import load_sample_movies >>> data = load_sample_movies() >>> data
  • 20. The current Crab
    >>> #load the dataset
    >>> from crab.datasets import load_sample_movies
    >>> data = load_sample_movies()
    >>> data
    {'DESCR': 'sample_movies data set was collected by the book called \nProgramming the Collective Intelligence by Toby Segaran \n\nNotes\n-----\nThis data set consists of\n\t* n ratings with (1-5) from n users to n movies.',
     'data': {1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
      2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0},
      3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0},
      4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0},
      5: {2: 4.5, 3: 1.0, 4: 4.0},
      6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},
      7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0}},
     'item_ids': {1: 'Lady in the Water',
      2: 'Snakes on a Planet',
      3: 'You, Me and Dupree',
      4: 'Superman Returns',
      5: 'The Night Listener',
      6: 'Just My Luck'},
     'user_ids': {1: 'Jack Matthews',
      2: 'Mick LaSalle',
      3: 'Claudia Puig',
      4: 'Lisa Rose',
      5: 'Toby',
      6: 'Gene Seymour',
      7: 'Michael Phillips'}}
  • 22. The current Crab >>> from crab.models import MatrixPreferenceDataModel
  • 23. The current Crab >>> from crab.models import MatrixPreferenceDataModel >>> m = MatrixPreferenceDataModel(data.data)
  • 24. The current Crab
    >>> from crab.models import MatrixPreferenceDataModel
    >>> m = MatrixPreferenceDataModel(data.data)
    >>> print m
    MatrixPreferenceDataModel (7 by 6)
             1         2         3         4         5        ...
    1        3.000000  4.000000  3.500000  5.000000  3.000000
    2        3.000000  4.000000  2.000000  3.000000  3.000000
    3        ---       3.500000  2.500000  4.000000  4.500000
    4        2.500000  3.500000  2.500000  3.500000  3.000000
    5        ---       4.500000  1.000000  4.000000  ---
    6        3.000000  3.500000  3.500000  5.000000  3.000000
    7        2.500000  3.000000  ---       3.500000  4.000000
  • 26. The current Crab >>> #import pairwise distance
  • 27. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances
  • 28. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity
  • 29. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity >>> from crab.similarities import UserSimilarity
  • 30. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity >>> from crab.similarities import UserSimilarity >>> similarity = UserSimilarity(m, euclidean_distances)
  • 31. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity >>> from crab.similarities import UserSimilarity >>> similarity = UserSimilarity(m, euclidean_distances) >>> similarity[1]
  • 32. The current Crab >>> #import pairwise distance >>> from crab.metrics.pairwise import euclidean_distances >>> #import similarity >>> from crab.similarities import UserSimilarity >>> similarity = UserSimilarity(m, euclidean_distances) >>> similarity[1] [(1, 1.0), (6, 0.66666666666666663), (4, 0.34054242658316669), (3, 0.32037724101704074), (7, 0.32037724101704074), (2, 0.2857142857142857), (5, 0.2674788903885893)]
  • 33. The current Crab
    >>> #import pairwise distance
    >>> from crab.metrics.pairwise import euclidean_distances
    >>> #import similarity
    >>> from crab.similarities import UserSimilarity
    >>> similarity = UserSimilarity(m, euclidean_distances)
    >>> similarity[1]
    [(1, 1.0), (6, 0.66666666666666663), (4, 0.34054242658316669),
     (3, 0.32037724101704074), (7, 0.32037724101704074),
     (2, 0.2857142857142857), (5, 0.2674788903885893)]
    MatrixPreferenceDataModel (7 by 6)
             1         2         3         4         5        ...
    1        3.000000  4.000000  3.500000  5.000000  3.000000
    2        3.000000  4.000000  2.000000  3.000000  3.000000
    3        ---       3.500000  2.500000  4.000000  4.500000
    4        2.500000  3.500000  2.500000  3.500000  3.000000
    5        ---       4.500000  1.000000  4.000000  ---
    6        3.000000  3.500000  3.500000  5.000000  3.000000
    7        2.500000  3.000000  ---       3.500000  4.000000
  • 35. The current Crab >>> from crab.recommenders.knn import UserBasedRecommender
  • 36. The current Crab >>> from crab.recommenders.knn import UserBasedRecommender >>> recsys = UserBasedRecommender(model=m, similarity=similarity, capper=True,with_preference=True)
  • 37. The current Crab >>> from crab.recommenders.knn import UserBasedRecommender >>> recsys = UserBasedRecommender(model=m, similarity=similarity, capper=True,with_preference=True) >>> recsys.recommend(5) array([[ 5. , 3.45712869],        [ 1. , 2.78857832],        [ 6. , 2.38193068]])
  • 38. The current Crab >>> from crab.recommenders.knn import UserBasedRecommender >>> recsys = UserBasedRecommender(model=m, similarity=similarity, capper=True,with_preference=True) >>> recsys.recommend(5) array([[ 5. , 3.45712869],        [ 1. , 2.78857832],        [ 6. , 2.38193068]]) >>> recsys.recommended_because(user_id=5,item_id=1) array([[ 2. , 3. ],        [ 1. , 3. ],        [ 6. , 3. ],        [ 7. , 2.5],        [ 4. , 2.5]])
  • 39. The current Crab
    >>> from crab.recommenders.knn import UserBasedRecommender
    >>> recsys = UserBasedRecommender(model=m, similarity=similarity, capper=True, with_preference=True)
    >>> recsys.recommend(5)
    array([[ 5.        ,  3.45712869],
           [ 1.        ,  2.78857832],
           [ 6.        ,  2.38193068]])
    >>> recsys.recommended_because(user_id=5, item_id=1)
    array([[ 2. ,  3. ],
           [ 1. ,  3. ],
           [ 6. ,  3. ],
           [ 7. ,  2.5],
           [ 4. ,  2.5]])
    MatrixPreferenceDataModel (7 by 6)
             1         2         3         4         5        ...
    1        3.000000  4.000000  3.500000  5.000000  3.000000
    2        3.000000  4.000000  2.000000  3.000000  3.000000
    3        ---       3.500000  2.500000  4.000000  4.500000
    4        2.500000  3.500000  2.500000  3.500000  3.000000
    5        ---       4.500000  1.000000  4.000000  ---
    6        3.000000  3.500000  3.500000  5.000000  3.000000
    7        2.500000  3.000000  ---       3.500000  4.000000
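The walkthrough above is user-based; the same pieces can be flipped to compare items instead of users. A minimal sketch, assuming crab exposes ItemSimilarity and ItemBasedRecommender with the same constructor style as the user-based classes (check the current API before relying on these names):

    >>> from crab.metrics.pairwise import euclidean_distances
    >>> from crab.similarities import ItemSimilarity
    >>> from crab.recommenders.knn import ItemBasedRecommender
    >>> item_similarity = ItemSimilarity(m, euclidean_distances)
    >>> item_recsys = ItemBasedRecommender(model=m, similarity=item_similarity,
    ...                                    with_preference=True)
    >>> item_recsys.recommend(5)  # items user 5 has not rated yet, with estimated scores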
  • 40. The current Crab Using REST APIs to deploy the recommender django-piston, django-rest, django-tastypie
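As a rough illustration of that deployment path (not the code of any of the production systems described below), a single Django view can wrap the recommender from the walkthrough and return JSON; the view name and the in-process model here are assumptions made only for this sketch.

    # views.py -- illustrative sketch only
    import json
    from django.http import HttpResponse
    from crab.datasets import load_sample_movies
    from crab.models import MatrixPreferenceDataModel
    from crab.metrics.pairwise import euclidean_distances
    from crab.similarities import UserSimilarity
    from crab.recommenders.knn import UserBasedRecommender

    data = load_sample_movies()
    model = MatrixPreferenceDataModel(data.data)
    recsys = UserBasedRecommender(model=model,
                                  similarity=UserSimilarity(model, euclidean_distances),
                                  with_preference=True)

    def recommendations(request, user_id):
        # recommend() returns rows of [item_id, estimated_preference]
        rows = recsys.recommend(int(user_id))
        payload = [{'item_id': int(i), 'score': float(s)} for i, s in rows]
        return HttpResponse(json.dumps(payload), content_type='application/json')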
  • 41. Crab is already in production News from Abril Publisher recommendations! Collecting over 10 magazines, 20 books and 100+ articles Running on Python + Scipy + Django Content-Based-Filtering Easy-to-use interface Still in development
  • 42. Content Based Filtering (diagram): a user (Marcel) likes some items (Duro de Matar, O Vento Levou, Toy Store, Armagedon); items similar to the ones he already likes are recommended
  • 43. Crab is already in production PythonBrasil keynotes Recommender Recommending keynotes based on a hybrid approach Running on Python + Scipy + Django Content-Based-Filtering + Collaborative Filtering Schedule your keynotes Still in development
  • 44. Crab is already in production -- excerpt from a research-paper slide: a hybrid meta-recommender for mobile product/service recommendations that combines content-based filtering (to filter the product/service repository) with collaborative filtering (to derive recommendations from the reviews of similar users), plus text mining to classify review polarity (positive or negative); the results of both recommenders are aggregated into the final recommendation score, and each recommendation carries an explanation built from relevant reviews of similar users. (Fig. 1: Meta Recommender Architecture; Fig. 2: User Reviews from Foursquare Social Network; Fig. 3: Mobile Recommender System Architecture.)
  • 45. Crab is already in production Brazilian Social Network called Atepassar.com Educational network with more than 60.000 students and 120 video-classes Running on Python + Numpy + Scipy and Django Backend for Recommendations MongoDB - mongoengine Daily Recommendations with Explanations
  • 46. Evaluating your recommender Crab implements the most used recommender metrics. Precision, Recall, F1-Score, RMSE Using matplotlib for a plotter utility Implement new metrics Simulations support maybe (??)
  • 48. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator
  • 49. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator >>> evaluator = CfEvaluator()
  • 50. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator >>> evaluator = CfEvaluator() >>> evaluator.evaluate(recommender=recsys,metric='rmse')
  • 51. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator >>> evaluator = CfEvaluator() >>> evaluator.evaluate(recommender=recsys,metric='rmse') {'rmse': 0.69467177857026907}
  • 52. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator >>> evaluator = CfEvaluator() >>> evaluator.evaluate(recommender=recsys,metric='rmse') {'rmse': 0.69467177857026907} >>> evaluator.evaluate_on_split(recommender=recsys, at =2)
  • 53. Evaluating your recommender >>> from crab.metrics.classes import CfEvaluator >>> evaluator = CfEvaluator() >>> evaluator.evaluate(recommender=recsys,metric='rmse') {'rmse': 0.69467177857026907} >>> evaluator.evaluate_on_split(recommender=recsys, at =2) ({'error': [{'mae': 0.345, 'nmae': 0.4567, 'rmse': 0.568}, {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788}, {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788}], 'ir': [{'f1score': 0.456, 'precision': 0.78557, 'recall':0.55677}, {'f1score': 0.64567, 'precision': 0.67865, 'recall': 0.785955}, {'f1score': 0.45070, 'precision': 0.74744, 'recall': 0.858585}]}, {'final_score': {'avg': {'f1score': 0.495955, 'mae': 0.429292, 'nmae': 0.373739, 'precision': 0.63932929, 'recall': 0.729939393, 'rmse': 0.3466868}, 'stdev': {'f1score': 0.09938383 , 'mae': 0.0593933, 'nmae': 0.03393939, 'precision': 0.0192929, 'recall': 0.031293939, 'rmse': 0.234949494}}})
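Those per-split precision and recall values are exactly what the precision-recall charts from slide 13 plot. A minimal plotting sketch with matplotlib, reusing the tuple returned by evaluate_on_split above:

    >>> import matplotlib.pyplot as plt
    >>> per_split, final_score = evaluator.evaluate_on_split(recommender=recsys, at=2)
    >>> recalls = [s['recall'] for s in per_split['ir']]
    >>> precisions = [s['precision'] for s in per_split['ir']]
    >>> plt.plot(recalls, precisions, 'o')
    >>> plt.xlabel('Recall')
    >>> plt.ylabel('Precision')
    >>> plt.title('Precision-Recall per split (at=2)')
    >>> plt.show()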
  • 54. Distributing the recommendation computations Use Hadoop and Map-Reduce intensively Investigating the Yelp mrjob framework https://github.com/pfig/mrjob Develop the Netflix-prize and other state-of-the-art techniques: Matrix Factorization, Singular Value Decomposition (SVD), Boltzmann machines The most commonly used is the Slope One technique: simple algebra (y = a*x + b)
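Slope One itself fits in a few lines of plain Python. A self-contained sketch of the weighted variant, independent of Crab's API, using the same {user: {item: rating}} layout as the sample movies dataset:

    def slope_one(prefs, target_user, target_item):
        """Predict target_user's rating for target_item (weighted Slope One)."""
        numerator, denominator = 0.0, 0.0
        for other_item, rating in prefs[target_user].items():
            if other_item == target_item:
                continue
            # rating deviations between the two items, over users who rated both
            diffs = [u[target_item] - u[other_item]
                     for u in prefs.values()
                     if target_item in u and other_item in u]
            if not diffs:
                continue
            deviation = sum(diffs) / len(diffs)
            numerator += (rating + deviation) * len(diffs)
            denominator += len(diffs)
        return numerator / denominator if denominator else None

    # usage with the sample dataset from the earlier slides:
    #   prefs = load_sample_movies().data
    #   slope_one(prefs, 5, 1)  -> estimated rating of user 5 for the unseen item 1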
  • 55. Cache/Parallelism with joblib http://packages.python.org/joblib/index.html
    import numpy as np
    from joblib import Memory

    memory = Memory(cachedir='', verbose=0)

    class UserSimilarity(BaseSimilarity):
        ...
        @memory.cache
        def get_similarity(self, source_id, target_id):
            source_preferences = self.model.preferences_from_user(source_id)
            target_preferences = self.model.preferences_from_user(target_id)
            ...
            return self.distance(source_preferences, target_preferences) \
                if not source_preferences.shape[1] == 0 \
                and not target_preferences.shape[1] == 0 else np.array([[np.nan]])

        def get_similarities(self, source_id):
            return [(other_id, self.get_similarity(source_id, other_id))
                    for other_id, v in self.model]
  • 61. Cache/Parallelism with joblib -- timing get_similarities with and without the memory.cache decorator:
    >>> # Without memory.cache
    >>> timeit similarity.get_similarities('marcel_caraciolo')
    100 loops, best of 3: 978 ms per loop
    >>> # With memory.cache
    >>> timeit similarity.get_similarities('marcel_caraciolo')
    100 loops, best of 3: 434 ms per loop
  • 62. Cache/Parallelism with joblib http://packages.python.org/joblib/index.html Investigate how to use multiprocessing and parallel packages with similarities computation
    from joblib import Parallel, delayed
    ...
    def get_similarities(self, source_id):
        other_ids = [other_id for other_id, v in self.model]
        similarities = Parallel(n_jobs=3)(
            delayed(self.get_similarity)(source_id, other_id)
            for other_id in other_ids)
        return zip(other_ids, similarities)
  • 63. Distributed Computing with mrJob https://github.com/Yelp/mrjob
  • 64. Distributed Computing with mrJob https://github.com/Yelp/mrjob It supports Amazon’s Elastic MapReduce(EMR) service, your own Hadoop cluster or local (for testing)
  • 65. Distributed Computing with mrJob https://github.com/Yelp/mrjob It supports Amazon’s Elastic MapReduce(EMR) service, your own Hadoop cluster or local (for testing)
  • 66. Distributed Computing with mrJob https://github.com/Yelp/mrjob
    """The classic MapReduce job: count the frequency of words."""
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")

    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def reducer(self, word, counts):
            yield (word, sum(counts))

    if __name__ == '__main__':
        MRWordFreqCount.run()
    It supports Amazon’s Elastic MapReduce(EMR) service, your own Hadoop cluster or local (for testing)
  • 67. Distributed Computing with mrJob https://github.com/Yelp/mrjob Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
  • 68. Distributed Computing with mrJob https://github.com/Yelp/mrjob Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
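In the spirit of that paper, a toy mrjob that accumulates the dot products needed for pairwise user similarity could look like the sketch below. The input format (one line per item: the item id, a tab, then comma-separated user:rating pairs) is an assumption made only for this sketch, not a Crab or mrjob convention.

    from mrjob.job import MRJob

    class MRPairwiseDotProducts(MRJob):
        """For every pair of users that co-rated an item, emit the product of
        their ratings; the reducer sums these into the dot products that a
        cosine-style similarity needs."""

        def mapper(self, _, line):
            item_id, ratings = line.split('\t')
            scores = [pair.split(':') for pair in ratings.split(',')]
            scores = [(user, float(value)) for user, value in scores]
            for i, (user_a, score_a) in enumerate(scores):
                for user_b, score_b in scores[i + 1:]:
                    yield tuple(sorted((user_a, user_b))), score_a * score_b

        def reducer(self, user_pair, products):
            yield user_pair, sum(products)

    if __name__ == '__main__':
        MRPairwiseDotProducts.run()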
  • 69. Future studies with Sparse Matrices Real datasets come with lots of empty values http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html Solutions: scipy.sparse package Sharding operations Matrix Factorization techniques (SVD) Apontador Reviews Dataset
  • 70. Future studies with Sparse Matrices Real datasets come with lots of empty values http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html Solutions: scipy.sparse package Sharding operations Matrix Factorization techniques (SVD) Crab implements a Matrix Factorization with Expectation Maximization algorithm Apontador Reviews Dataset
  • 71. Future studies with Sparse Matrices Real datasets come with lots of empty values http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html Solutions: scipy.sparse package Sharding operations Matrix Factorization techniques (SVD) Crab implements a Matrix Factorization with Expectation Maximization algorithm scikits.crab.svd package Apontador Reviews Dataset
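For a sense of what the scipy.sparse route looks like, here is a sketch using scipy's generic sparse SVD (scipy.sparse.linalg.svds), not the scikits.crab.svd package itself; densifying the low-rank reconstruction is only reasonable for a toy matrix of this size.

    import numpy as np
    from scipy.sparse import lil_matrix
    from scipy.sparse.linalg import svds
    from crab.datasets import load_sample_movies

    data = load_sample_movies().data
    n_users = max(data)
    n_items = max(max(items) for items in data.values())
    ratings = lil_matrix((n_users, n_items))
    for user, items in data.items():
        for item, rating in items.items():
            ratings[user - 1, item - 1] = rating  # ids are 1-based in the sample set

    # rank-2 decomposition; U * diag(s) * Vt is a dense low-rank approximation
    # whose entries can be read as estimated preferences for the empty cells
    U, s, Vt = svds(ratings.tocsc(), k=2)
    approx = np.dot(U, np.dot(np.diag(s), Vt))
    print approx[4, 0]  # estimated preference of user 5 for item 1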
  • 72. Optimizations with Cython http://cython.org/ Cython is a Python extension that lets developers annotate functions so they can be compiled to C. http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html
  • 73. Optimizations with Cython http://cython.org/ Cython is a Python extension that lets developers annotate functions so they can be compiled to C.
    # setup.py
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Distutils import build_ext

    # for notes on compiler flags see:
    # http://docs.python.org/install/index.html

    setup(
        cmdclass={'build_ext': build_ext},
        ext_modules=[Extension("spearman_correlation_cython",
                               ["spearman_correlation_cython.pyx"])]
    )
    http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html
  • 75. Benchmarks MovieLens 100k dataset (http://www.grouplens.org/node/73) -- time elapsed to recommend 5 items:
    Dataset          Old Crab (pure Python w/ dicts)   New Crab (Python w/ Scipy and Numpy)
    MovieLens 100k   15.32 s                           9.56 s
    (bar chart comparing the two runs on a 0-16 s scale)
  • 79. Why migrate ? Old Crab ran on pure Python only, and recommendations demand heavy math and lots of processing. Compatibility with Numpy and Scipy brings high-standard, popular scientific libraries optimized for numerical work in Python. Scikits projects are amazing: active communities, scientific conferences and actively updated projects (e.g. scikit-learn). It also makes the Crab framework visible to the community, so scientific researchers and machine learning developers around the globe coding with Python can help us in this project. Be Fast and Furious.
  • 80. Why migrate ? Numpy optimized with PyPy: 2x - 48x faster http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-and-roadmap.html
  • 81. How are we working ? Sprints, Online Discussions and Issues https://github.com/muricoca/crab/wiki/UpcomingEvents
  • 82. How are we working ? Our Project’s Home Page http://muricoca.github.com/crab
  • 83. Future Releases Planned Release 0.1: Collaborative Filtering algorithms working, sample datasets to load and test. Planned Release 0.11: Sparse Matrices and Database Models support. Planned Release 0.12: Slope One Algorithm, new factorization techniques implemented ....
  • 84. Join us! 1. Read our Wiki Page https://github.com/muricoca/crab/wiki/Developer-Resources 2. Check out our current sprints and open issues https://github.com/muricoca/crab/issues 3. Forks, Pull Requests mandatory 4. Join us at irc.freenode.net #muricoca or at our discussion list http://groups.google.com/group/scikit-crab
  • 85. Recommended Books Toby Segaran, Programming Collective Intelligence, O'Reilly, 2007; Satnam Alag, Collective Intelligence in Action, Manning Publications, 2009. ACM RecSys, KDD, SBSC...
  • 86. Crab A Python Framework for Building Recommendation Engines https://github.com/muricoca/crab Marcel Caraciolo Ricardo Caspirro Bruno Melo @marcelcaraciolo @ricardocaspirro @brunomelo {marcel, ricardo,bruno}@muricoca.com