Which Algorithms Really Matter?

©MapR Technologies 2013

Me, Us

Ted Dunning, Chief Application Architect, MapR
Committer and PMC member: Mahout, ZooKeeper, Drill
Bought the beer at the first HUG

MapR
Distributes more open source components for Hadoop
Adds major technology for performance, HA, industry-standard APIs

Info
Hash tag - #mapr
See also - @ApacheMahout @ApacheDrill
@ted_dunning and @mapR

Topic For Today

What is important? What is not?
Why?
What is the difference from academic research?
Some examples

What is Important?

Deployable
– Clever prototypes don’t count if they can’t be standardized
Robust
– Mishandling is common
Transparent
– Will degradation be obvious?
Skillset and mindset matched?
– How long will your fancy data scientist enjoy doing standard ops tasks?
Proportionate
– Where is the highest value per minute of effort?

Academic Goals vs Pragmatics

Academic goals
– Reproducible
– Isolate theoretically important aspects
– Work on novel problems
Pragmatics
– Highest net value
– Available data is constantly changing
– Diligence and consistency have larger impact than cleverness
– Many systems feed themselves; exploration and exploitation are both important
– Engineering constraints on budget and schedule

Example 1:
Making Recommendations Better

Recommendation Advances

What are the most important algorithmic advances in recommendations over the last 10 years?
Cooccurrence analysis?
Matrix completion via factorization?
Latent factor log-linear models?
Temporal dynamics?

The Winner – None of the Above


What are the most important algorithmic advances in
recommendations over the last 10 years?

1. Result dithering
2. Anti-flood

The Real Issues

Exploration
Diversity
Speed
Not the last fraction of a percent

Result Dithering

Dithering is used to re-order recommendation results
– Re-ordering is done randomly
Dithering is guaranteed to make off-line performance worse
Dithering also has a near perfect record of making actual performance much better

“Made more difference than any other change”

Simple Dithering Algorithm

Generate synthetic score from log rank plus Gaussian
$s = \log r + N(0, \epsilon)$
Pick noise scale to provide desired level of mixing
$\Delta r \propto r \exp \epsilon$
Typically
$\epsilon \in [0.4, 0.8]$
Oh… use floor(t/T) as seed
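
The dithering steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the speaker's implementation: it assumes ε is the standard deviation of the Gaussian noise, that results are sorted by ascending synthetic score, and that t and T are in seconds. The function name `dither` and the parameter `period` (standing in for T) are hypothetical.

```python
import math
import random


def dither(results, epsilon=0.5, t=0.0, period=600.0):
    """Re-order `results` (best first) by s = log(rank) + Gaussian noise.

    Assumptions: epsilon is the standard deviation of the noise, and
    floor(t / period) seeds the generator so the ordering stays stable
    within one time window of `period` seconds.
    """
    rng = random.Random(math.floor(t / period))
    scored = [
        (math.log(rank) + rng.gauss(0.0, epsilon), item)
        for rank, item in enumerate(results, start=1)
    ]
    scored.sort(key=lambda pair: pair[0])  # smallest synthetic score first
    return [item for _, item in scored]


ranked = list(range(1, 51))                     # original ranks stand in for items
page_a = dither(ranked, epsilon=0.5, t=1000.0)
page_b = dither(ranked, epsilon=0.5, t=1100.0)  # same 600 s window as page_a
page_c = dither(ranked, epsilon=0.5, t=5000.0)  # a later window, new seed
print(page_a[:8])
print(page_c[:8])
```

Seeding from floor(t/T) means a user who reloads the page within one window sees the same ordering, while a later visit gets a fresh shuffle that can pull deeper results onto the first page.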

Example … ε = 0.5
(Each row is one dithered ordering; entries are the original ranks of the first eight results.)

  1   2   6   5   3   4  13  16
  1   2   3   8   5   7   6  34
  1   4   3   2   6   7  11  10
  1   2   4   3  15   7  13  19
  1   6   2   3   4  16   9   5
  1   2   3   5  24   7  17  13
  1   2   3   4   6  12   5  14
  2   1   3   5   7   6   4  17
  4   1   2   7   3   9   8   5
  2   1   5   3   4   7  13   6
  3   1   5   4   2   7   8   6
  2   1   3   4   7  12  17  16

Example … ε = log 2 = 0.69
(Each row is one dithered ordering; entries are the original ranks of the first eight results.)

  1   2   8   3   9  15   7   6
  1   8  14  15   3   2  22  10
  1   3   8   2  10   5   7   4
  1   2  10   7   3   8   6  14
  1   5  33  15   2   9  11  29
  1   2   7   3   5   4  19   6
  1   3   5  23   9   7   4   2
  2   4  11   8   3   1  44   9
  2   3   1   4   6   7   8  33
  3   4   1   2  10  11  15  14
 11   1   2   4   5   7   3  14
  1   8   7   3  22  11   2  33

Exploring The Second Page

Lesson 1:
Exploration is good

Example 2:
Bayesian Bandits

Bayesian Bandits

Based on Thompson sampling
Very general sequential test
Near optimal regret
Trades off exploration and exploitation
Possibly best known solution for exploration/exploitation
Incredibly simple

Thompson Sampling

Select each shell according to the probability that it is the best
Probability that it is the best can be computed using posterior
$P(i \text{ is best}) = \int I\left[ E[r_i \mid \theta] = \max_j E[r_j \mid \theta] \right] P(\theta \mid D) \, d\theta$
But I promised a simple answer

Thompson Sampling – Take 2

Sample θ
$\theta \sim P(\theta \mid D)$
Pick i to maximize reward
$i = \arg\max_j E[r_j \mid \theta]$
Record result from using i
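
The three steps above (sample θ, pick the maximizing i, record the result) can be sketched for the simplest case. This is an illustration under stated assumptions, not the model from the talk: it assumes Bernoulli rewards with a uniform Beta(1, 1) prior per arm, whereas the convergence plot in the deck uses a Gamma-Normal model. The names `thompson_step` and `run_bandit` and the reward rates are hypothetical.

```python
import random


def thompson_step(successes, failures, rng):
    """Sample theta_j from each arm's Beta posterior and return argmax_j.

    For a Bernoulli reward, E[r_j | theta] is simply theta_j.
    """
    samples = [
        rng.betavariate(1 + s, 1 + f)  # Beta(1, 1) uniform prior (assumption)
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_rates, steps=5000, seed=0):
    """Simulate a bandit whose arms pay off with the given probabilities."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(steps):
        i = thompson_step(successes, failures, rng)
        # Record the result from using arm i.
        if rng.random() < true_rates[i]:
            successes[i] += 1
        else:
            failures[i] += 1
    return successes, failures


successes, failures = run_bandit([0.1, 0.2, 0.8])
pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)
```

Each arm is chosen with the probability that it is the best under the current posterior, so the integral on the previous slide never has to be evaluated explicitly.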

Fast Convergence
[Figure: regret (0 to 0.12) versus n (0 to 1100) for ε-greedy with ε = 0.05 and for a Bayesian Bandit with Gamma-Normal]

Thompson Sampling on Ads

An Empirical Evaluation of Thompson Sampling - Chapelle and Li, 2011

Bayesian Bandits versus Result Dithering

Many useful systems are difficult to frame in fully Bayesian form
Thompson sampling cannot be applied without posterior sampling
Can still do useful exploration with dithering
But better to use Thompson sampling if possible

Lesson 2:
Exploration is pretty
easy to do and pays
big benefits.

Example 3:
On-line Clustering

The Problem

K-means clustering is useful for feature extraction or compression
At scale and at high dimension, the desirable number of clusters increases
Very large number of clusters may require more passes through the data
Super-linear scaling is generally infeasible

The Solution

Sketch-based algorithms produce a sketch of the data
Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution
The size of the sketch grows very slowly with increasing data size
Many operations such as clustering are well behaved on sketches

Fast and Accurate k-means For Large Datasets. Michael Shindler, Alex Wong, Adam Meyerson.
Revisiting k-means: New Algorithms via Bayesian Nonparametrics. Brian Kulis, Michael Jordan.

An Example


The Cluster Proximity Features

Every point can be described by the nearest cluster
– 4.3 bits per point in this case
– Significant error that can be decreased (to a point) by increasing number of clusters
Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
– Error is negligible
– Unwinds the data into a simple representation
Or we can increase the number of clusters (n-fold increase adds log n bits per point, decreases error by sqrt(n))
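
The bit counts above follow from log2 of the number of clusters. The sketch below works through that arithmetic and shows one way to compute the two-nearest-cluster feature. The cluster count of 20 is an assumption inferred from log2(20) ≈ 4.3; the slides do not state it. The function names and toy centroids are hypothetical, and the sign bit is not modeled.

```python
import math


def bits_for_cluster_id(k):
    """Bits needed to name one of k clusters."""
    return math.log2(k)


def two_nearest(point, centroids):
    """Return (ids, distances) of the two centroids closest to `point`."""
    ranked = sorted(
        (math.dist(point, c), idx) for idx, c in enumerate(centroids)
    )
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return (i1, i2), (d1, d2)


k = 20  # assumed cluster count; log2(20) is about 4.3 bits
id_bits = bits_for_cluster_id(k)
extra_bits = bits_for_cluster_id(10 * k) - id_bits  # cost of a 10-fold increase

# Toy centroids and point, chosen only for illustration.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
ids, proximities = two_nearest((1.0, 0.0), centroids)
print(round(id_bits, 2), round(extra_bits, 2), ids, proximities)
```

A 10-fold increase in clusters costs about 3.3 extra bits per point, which is the "adds log n bits" trade-off named on the slide.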

Diagonalized Cluster Proximity

Lots of Clusters Are Fine

Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd’s algorithm
Result is that these two clusters get glued together

Streaming k-means Ideas

By using a sketch with lots (k log N) of centroids, we avoid pathological cases
We still get a very good result if the sketch is created
– in one pass
– with approximate search
In fact, adaptive dp-means works just fine
In the end, the sketch can be used for clustering or …
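
A one-pass sketch in the spirit of adaptive dp-means might look like the following. This is a simplified illustration, not Mahout's or the cited papers' algorithm: it uses exact nearest-centroid search instead of approximate search, and the new-centroid probability min(1, d / cutoff), the growth factor `beta`, and all names are assumptions made for the example.

```python
import math
import random


def streaming_sketch(points, max_centroids, cutoff=1.0, beta=1.5, seed=0):
    """Build a sketch: a list of [centroid, weight] pairs, in one pass.

    A point far from every centroid tends to start a new centroid;
    otherwise it is folded into the nearest one. When the sketch outgrows
    max_centroids, the cutoff is raised and the sketch is re-clustered.
    """
    rng = random.Random(seed)
    sketch = []

    def add(point, weight):
        if not sketch:
            sketch.append([tuple(point), weight])
            return
        nearest = min(sketch, key=lambda cw: math.dist(point, cw[0]))
        d = math.dist(point, nearest[0])
        if rng.random() < min(1.0, d / cutoff):
            sketch.append([tuple(point), weight])
        else:
            total = nearest[1] + weight
            # Move the centroid to the weighted mean of itself and the point.
            nearest[0] = tuple(
                (c * nearest[1] + p * weight) / total
                for c, p in zip(nearest[0], point)
            )
            nearest[1] = total

    for p in points:
        add(p, 1.0)
        while len(sketch) > max_centroids:
            cutoff *= beta          # be less eager to create centroids
            old = list(sketch)
            sketch.clear()
            for centroid, weight in old:
                add(centroid, weight)
    return sketch


# Synthetic data: three Gaussian blobs, 1000 points each.
data_rng = random.Random(42)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 8.0)]
points = [
    (cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
    for _ in range(1000)
    for cx, cy in centers
]
sketch = streaming_sketch(points, max_centroids=40)
print(len(sketch), sum(w for _, w in sketch))
```

The sketch keeps the total weight of the data, so a weighted k-means (or another clustering step) can be run on the few dozen centroids instead of the full dataset.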

Lesson 3:
Sketches make big
data small.

Example 4:
Search Abuse

Recommendations

Alice got an apple and a puppy
Bob got an apple
Charles got a bicycle

What else would Bob like?

Log Files
[Figure: log file entries recording the actions of Alice, Bob, and Charles]

History Matrix: Users by Items
[Figure: users-by-items history matrix for Alice, Bob, and Charles, with a check mark for each item a user has interacted with]

Co-occurrence Matrix: Items by Items
How do you tell which co-occurrences are useful?
[Figure: items-by-items matrix of co-occurrence counts]

Co-occurrence Binary Matrix
[Figure: binary co-occurrence matrix]

Indicator Matrix: Anomalous Co-Occurrence
Result: The marked row will be added to the indicator field in the item document…
[Figure: indicator matrix with the anomalous co-occurrences marked]
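
The path from history matrix to indicator matrix can be sketched on the cartoon data. The user histories below are taken from the slides and editor's notes (Alice: apple, puppy, pony; Bob: apple, pony; Charles: bicycle, pony). The slides do not name the test used to find anomalous co-occurrence; this sketch assumes the log-likelihood ratio (G²) score, which is one common choice, and the threshold of 0.1 is arbitrary.

```python
import math
from itertools import permutations

# Cartoon history from the slides and notes: who interacted with what.
history = {
    "Alice": {"apple", "puppy", "pony"},
    "Bob": {"apple", "pony"},
    "Charles": {"bicycle", "pony"},
}
items = sorted(set().union(*history.values()))


def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) score for a 2x2 contingency table."""
    rows = entropy(k11 + k12, k21 + k22)
    cols = entropy(k11 + k21, k12 + k22)
    cells = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (rows + cols - cells))


def cooccurrence(a, b):
    """Number of users whose history contains both items."""
    return sum(1 for seen in history.values() if a in seen and b in seen)


def indicators(threshold=0.1):
    """Map each item to the items that co-occur with it anomalously often."""
    n = len(history)
    result = {item: [] for item in items}
    for a, b in permutations(items, 2):
        k11 = cooccurrence(a, b)
        k12 = cooccurrence(a, a) - k11  # users with a but not b
        k21 = cooccurrence(b, b) - k11  # users with b but not a
        k22 = n - k11 - k12 - k21       # users with neither
        if k11 > 0 and llr(k11, k12, k21, k22) > threshold:
            result[a].append(b)
    return result


indicator_fields = indicators()
print(indicator_fields)
```

Pony co-occurs with everything because everybody got a pony, so its score is zero and it drops out; only the apple and puppy pair survives, which agrees with the note that the only important co-occurrence is puppy with apple.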

Indicator Matrix
That one row from the indicator matrix becomes the indicator field in the Solr
document used to deploy the recommendation engine.

id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)

Note: data for the indicator field is added directly to the metadata for a document in
the Solr index. You don’t need to create a separate index for the indicators.

Internals of the Recommender Engine
[Figure: internals of the recommender engine, with details enlarged on a second slide]

Looking Inside LucidWorks
Real-time recommendation query and results: Evaluation

What to recommend if a new user listened to 2122: Fats Domino & 303: Beatles?
Recommendation is “1710: Chuck Berry”
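
A recommendation request of this kind is an ordinary search against the indicator field, with the user's recent history as the query terms. The sketch below only builds the request URL. The field name `indicators` comes from the Solr document shown earlier; the host, the core name `artists`, and the function name are hypothetical, and a real deployment would likely add more parameters.

```python
from urllib.parse import urlencode


def recommendation_query_url(base_url, history_ids, rows=10):
    """Build a Solr select URL that searches the `indicators` field."""
    query = "indicators:(" + " ".join(str(i) for i in history_ids) + ")"
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return base_url + "/select?" + params


# Hypothetical host and core name; the IDs are the two artists from the slide.
url = recommendation_query_url("http://localhost:8983/solr/artists", [2122, 303])
print(url)
```

Items whose indicator field matches the most history terms score highest, so the search engine's own ranking does the work of the recommender.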

Real-life example

Lesson 4:
Recursive search abuse pays
Search can implement recs
Which can implement search

Summary

Editor's Notes

  • #42: A history of what everybody has done. Obviously this is just a cartoon, because large numbers of users and interactions with items would be required to build a recommender. Next step will be to predict what a new user might like…
  • #43: Bob is the “new user” and getting apple is his history.
  • #44: Here is where the recommendation engine needs to go to work… Note to trainer: you might see if audience calls out the answer before revealing next slide…
  • #45: Note to trainer: This is the situation similar to that in which we started, with three users in our history. The difference is that now everybody got a pony. Bob has apple and pony but not a puppy… yet.
  • #46: Binary matrix is stored sparsely.
  • #47: Convert by MapReduce into a binary matrix. Note to trainer: Whether to consider apple to have occurred with itself is an open question.
  • #48: Old joke: all the world can be divided into 2 categories: Scotch tape and non-Scotch tape… This is a way to think about the co-occurrence.
  • #49: Only important co-occurrence is puppy follows apple.
  • #50: Take that row of the matrix and combine it with all the metadata we might have… The important thing to get from the co-occurrence matrix is this indicator. Cool thing: analogous to what a lot of recommendation engines do. This row forms the indicator field in a Solr document containing metadata (you do NOT have to build a separate index for the indicators). Find the useful co-occurrence and get rid of the rest. Sparsify and get the anomalous co-occurrence.
  • #51: Note to trainer: take a little time to explore this here and on the next couple of slides. Details enlarged on next slide.
  • #52: This indicator field is where the output of the Mahout recommendation engine is stored (the row from the indicator matrix that identified significant or interesting co-occurrence). Keep in mind that this recommendation indicator data is added to the same original document in the Solr index that contains metadata for the item in question.
  • #53: This is a diagnostics window in the LucidWorks Solr index (not the web interface a user would see). It’s a way for the developer to do a rough evaluation (laugh test) of the choices offered by the recommendation engine. In other words, do these indicator artists, represented by their indicator ID, make reasonable recommendations? Note to trainer: artist 303 happens to be The Beatles. Is that a good match for Chuck Berry?
  • #53: This is a diagnostics window in the LucidWorksSolr index (not the web interface a user would see). It’s a way for the developer to do a rough evaluation (laugh test) of the choices offered by the recommendation engine.In other words, do these indicator artists represented by their indicator Id make reasonable recommendations Note to trainer: artist 303 happens to be The Beatles. Is that a good match for Chuck Berry?