Large-scale Data Mining:
MapReduce and Beyond
              Part 2: Algorithms

   Spiros Papadimitriou, IBM Research
          Jimeng Sun, IBM Research
                 Rong Yan, Facebook
Part 2: Mining using MapReduce
       Mining algorithms using MapReduce
         Information retrieval
         Graph algorithms: PageRank
         Clustering: Canopy clustering, KMeans
         Classification: kNN, Naïve Bayes
       MapReduce Mining Summary




2
MapReduce Interface and Data Flow
              Map: (K1, V1) → list(K2, V2)
              Combine: (K2, list(V2)) → list(K2, V2)
              Partition: (K2, V2) → reducer_id
              Reduce: (K2, list(V2)) → list(K3, V3)

   [Figure: data flow for an inverted-index job. Mappers on Hosts 1 and 2 turn
    (id, doc) into list(w, id); combiners deduplicate locally into list(unique_w, id);
    the partitioner routes each word to a reducer; the reducers on Hosts 3 and 4
    merge w1, list(id) and w2, list(id) into w1, list(unique_id) and w2, list(unique_id).]
3
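To make the four-stage interface concrete, here is a minimal in-memory Python sketch (illustrative only, not the Hadoop API; all function names are made up) that mimics the Map → Combine → Partition → Reduce flow of the figure, building a deduplicated inverted index from (doc_id, text) pairs:

```python
from collections import defaultdict

# Map: (doc_id, text) -> list of (word, doc_id) pairs
def map_fn(doc_id, text):
    return [(w, doc_id) for w in text.split()]

# Combine: (word, list(doc_id)) -> (word, deduplicated list(doc_id)), run per mapper
def combine_fn(word, doc_ids):
    return word, sorted(set(doc_ids))

# Partition: word -> reducer id
def partition_fn(word, num_reducers):
    return hash(word) % num_reducers

# Reduce: (word, list of doc_id lists) -> (word, list of unique doc_ids)
def reduce_fn(word, doc_id_lists):
    return word, sorted({d for ids in doc_id_lists for d in ids})

def run_job(docs, num_reducers=2):
    # Map + Combine: one "mapper" per document split
    combined = []
    for doc_id, text in docs:
        local = defaultdict(list)
        for w, d in map_fn(doc_id, text):
            local[w].append(d)
        combined.extend(combine_fn(w, ids) for w, ids in local.items())
    # Shuffle: group by key inside each reducer partition
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, ids in combined:
        partitions[partition_fn(word, num_reducers)][word].append(ids)
    # Reduce
    return [reduce_fn(w, lists) for part in partitions for w, lists in part.items()]

docs = [(1, "data mining data"), (2, "mining graphs")]
print(run_job(docs))   # e.g. [('data', [1]), ('mining', [1, 2]), ('graphs', [2])]
```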
Information retrieval using MapReduce



4
IR: Distributed Grep
       Find the doc_id and line# of a matching pattern
       Map: (id, doc) list(id, line#)
       Reduce: None

    Grep “data mining”
                    Docs                Output
               1       2
                              Map1      <1, 123>

        3      4       5      Map2      <3, 717>, <5, 1231>

                       6      Map3       <6, 1012>




5
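A hedged sketch of the grep mapper in plain Python (not the authors' code; the pattern and helper name are illustrative). Because there is no reducer, the mapper output is the final output:

```python
import re

def grep_map(doc_id, doc_text, pattern="data mining"):
    """Map: (id, doc) -> list of (id, line#) for every line matching the pattern."""
    regex = re.compile(pattern)
    return [(doc_id, line_no)
            for line_no, line in enumerate(doc_text.splitlines(), start=1)
            if regex.search(line)]

print(grep_map(1, "intro\nlarge-scale data mining\nsummary"))   # [(1, 2)]
```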
IR: URL Access Frequency
       Map: (null, log) → (URL, 1)
       Reduce: (URL, list(1)) → (URL, total_count)

    Example:
       Map1 (logs) → <u1,1>, <u1,1>, <u2,1>
       Map2 (logs) → <u3,1>
       Map3 (logs) → <u3,1>
       Reduce      → <u1,2>, <u2,1>, <u3,2>
                      Also described in Part 1
6
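A small illustrative Python sketch of this job (assuming one log record per line with the URL as the first field; a dict stands in for the shuffle):

```python
from collections import defaultdict

def url_map(log_line):
    """Map: (null, log record) -> (URL, 1); the URL is assumed to be the first field."""
    url = log_line.split()[0]
    return (url, 1)

def url_reduce(url, counts):
    """Reduce: (URL, list of counts) -> (URL, total_count)."""
    return (url, sum(counts))

# Tiny local simulation of the shuffle between map and reduce.
logs = ["u1 GET", "u1 GET", "u2 GET", "u3 GET", "u3 GET"]
grouped = defaultdict(list)
for url, one in map(url_map, logs):
    grouped[url].append(one)
print([url_reduce(u, c) for u, c in grouped.items()])  # [('u1', 2), ('u2', 1), ('u3', 2)]
```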
IR: Reverse Web-Link Graph
       Map: (null, page) → (target, source)
       Reduce: (target, source) → (target, list(source))

    Example:
       Map1 (pages) → <t1,s2>
       Map2 (pages) → <t2,s3>, <t2,s5>
       Map3 (pages) → <t3,s5>
       Reduce       → <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>

                       It is the same as matrix transpose
7
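A minimal sketch, assuming each input page is given as (source_url, list of target_urls); again a dict plays the role of the shuffle:

```python
from collections import defaultdict

def link_map(source, targets):
    """Map: (null, page) -> list of (target, source) pairs."""
    return [(t, source) for t in targets]

def link_reduce(target, sources):
    """Reduce: (target, list(source)) -> (target, list(source))."""
    return (target, sorted(sources))

pages = [("s2", ["t1"]), ("s3", ["t2"]), ("s5", ["t2", "t3"])]
grouped = defaultdict(list)
for src, tgts in pages:
    for t, s in link_map(src, tgts):
        grouped[t].append(s)
print([link_reduce(t, s) for t, s in grouped.items()])
# [('t1', ['s2']), ('t2', ['s3', 's5']), ('t3', ['s5'])]
```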
IR: Inverted Index
       Map: (id, doc) → list(word, id)
       Reduce: (word, list(id)) → (word, list(id))

    Example:
       Map1 (doc 1)     → <w1,1>
       Map2 (docs 2, 3) → <w2,2>, <w3,3>
       Map3 (doc 5)     → <w1,5>
       Reduce           → <w1,[1,5]>, <w2,[2]>, <w3,[3]>
8
Graph mining using MapReduce




9
PageRank
        PageRank vector q is defined as
              q = c A^T q + ((1 − c) / N) e
          (the first term models browsing, the second teleporting)
        A is the source-by-destination adjacency matrix; for the example
          4-node graph (1→2,3,4; 2→3,4; 3→4; 4→3):
              A = [ 0 1 1 1
                    0 0 1 1
                    0 0 0 1
                    0 0 1 0 ]
          e is the all-ones vector.
          N is the number of nodes.
          c is a weight between 0 and 1 (e.g. 0.85).
        PageRank indicates the importance of a page.
        Algorithm: iterative powering for finding the first eigenvector
10
MapReduce: PageRank
   [Figure: a 4-node graph (1, 2, 3, 4). Map distributes each page's PageRank over
    its outgoing links; Reduce collects the partial values q1, q2, q3, q4 and
    updates the new PageRank of each page.]

   PageRank Map()
      Input: key = page x, value = (PageRank qx, links [y1…ym])
      Output: key = page x, value = partial_x
      1. Emit(x, 0)   // guarantee all pages will be emitted
      2. For each outgoing link yi:
              Emit(yi, qx / m)

   PageRank Reduce()
      Input: key = page x, value = the list of [partial_x]
      Output: key = page x, value = PageRank qx
      1. qx = 0
      2. For each partial value d in the list:
              qx += d
      3. qx = c·qx + (1 − c)/N
      4. Emit(x, qx)

      Check out Kang et al., ICDM’09
11
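A hedged in-memory Python sketch of one PageRank iteration mirroring the Map()/Reduce() above, using the toy 4-node graph from the slide (assumes every page has at least one outgoing link; this simulates one MapReduce job per iteration rather than running on Hadoop):

```python
from collections import defaultdict

C, N = 0.85, 4  # damping weight and number of pages

def pagerank_map(x, qx, links):
    """Emit (x, 0) so every page appears, plus a share qx/m for each outgoing link."""
    out = [(x, 0.0)]
    m = len(links)
    out.extend((y, qx / m) for y in links)
    return out

def pagerank_reduce(x, partials):
    """Sum the partial contributions and apply the teleport term."""
    return (x, C * sum(partials) + (1 - C) / N)

# Graph from the slide: 1->2,3,4   2->3,4   3->4   4->3
graph = {1: [2, 3, 4], 2: [3, 4], 3: [4], 4: [3]}
q = {x: 1.0 / N for x in graph}          # initial PageRank vector

for _ in range(20):                      # one MapReduce job per iteration
    grouped = defaultdict(list)
    for x, links in graph.items():
        for y, partial in pagerank_map(x, q[x], links):
            grouped[y].append(partial)
    q = dict(pagerank_reduce(x, ps) for x, ps in grouped.items())

print({x: round(v, 3) for x, v in q.items()})
```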
Clustering using MapReduce




12
Canopy: single-pass clustering
 Canopy creation
  Construct overlapping clusters – canopies
  Make sure no two canopies overlap too much
          Key: no two canopy centers are too close to each other

     [Figure: left – overlapping clusters C1–C4; middle – canopies with too much
      overlap; right – the two thresholds T1 > T2 around a canopy center.]

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00
13
Canopy creation
  Input: 1) points, 2) thresholds T1, T2 where T1 > T2
  Output: cluster centroids
   Put all points into a queue Q
   While Q is not empty:
     – p = dequeue(Q)
     – For each canopy c:
           if dist(p,c) < T1: c.add(p)
           if dist(p,c) < T2: strongBound = true
     – If not strongBound: create a canopy at p
   For each canopy c:
     – Set the centroid to the mean of all points in c

  McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

14
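A hedged sequential sketch of the canopy-creation pass above (Python, Euclidean distance; the point data and thresholds are illustrative):

```python
import math

def dist(p, center):
    return math.dist(p, center)

def create_canopies(points, t1, t2):
    """Single pass: a point joins every canopy within T1; a new canopy is created
    only if the point is not within T2 of any existing canopy center."""
    assert t1 > t2
    canopies = []                 # list of (center, member points)
    for p in list(points):        # the "queue" Q
        strong_bound = False
        for center, members in canopies:
            d = dist(p, center)
            if d < t1:
                members.append(p)
            if d < t2:
                strong_bound = True
        if not strong_bound:
            canopies.append((p, [p]))
    # centroid of each canopy = mean of its member points
    return [tuple(sum(x) / len(members) for x in zip(*members))
            for _, members in canopies]

pts = [(0, 0), (0.5, 0.2), (5, 5), (5.2, 4.9), (9, 0)]
print(create_canopies(pts, t1=3.0, t2=1.0))
```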
Canopy creation
  Input: 1) points, 2) thresholds T1, T2 where T1 > T2
  Output: cluster centroids
   Put all points into a queue Q
   While Q is not empty:
     – p = dequeue(Q)
     – For each canopy c:
           if dist(p,c) < T1: c.add(p)
           if dist(p,c) < T2: strongBound = true
     – If not strongBound: create a canopy at p
   For each canopy c:
     – Set the centroid to the mean of all points in c

  [Figure: canopies C1 and C2 with their T1 and T2 radii; the points within T2 of a
   center are the strongly marked (strongly bound) points.]

  McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

15
Canopy creation
  Input: 1) points, 2) thresholds T1, T2 where T1 > T2
  Output: cluster centroids
   Put all points into a queue Q
   While Q is not empty:
     – p = dequeue(Q)
     – For each canopy c:
           if dist(p,c) < T1: c.add(p)
           if dist(p,c) < T2: strongBound = true
     – If not strongBound: create a canopy at p
   For each canopy c:
     – Set the centroid to the mean of all points in c

  [Figure: a canopy center, its strongly marked points (within T2), and the other
   points in the cluster (within T1).]

  McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

16
Canopy creation
  Input: 1) points, 2) thresholds T1, T2 where T1 > T2
  Output: cluster centroids
   Put all points into a queue Q
   While Q is not empty:
     – p = dequeue(Q)
     – For each canopy c:
           if dist(p,c) < T1: c.add(p)
           if dist(p,c) < T2: strongBound = true
     – If not strongBound: create a canopy at p
   For each canopy c:
     – Set the centroid to the mean of all points in c

  McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

17
Canopy creation
  Input: 1) points, 2) thresholds T1, T2 where T1 > T2
  Output: cluster centroids
   Put all points into a queue Q
   While Q is not empty:
     – p = dequeue(Q)
     – For each canopy c:
           if dist(p,c) < T1: c.add(p)
           if dist(p,c) < T2: strongBound = true
     – If not strongBound: create a canopy at p
   For each canopy c:
     – Set the centroid to the mean of all points in c

  McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

18
MapReduce - Canopy Map()
   [Figure: the input points are split across Map1 and Map2; each mapper builds its
    own local canopies.]

   Canopy creation Map()
     Input: a set of points P, thresholds T1, T2
     Output: key = null; value = a list of local canopies (total, count)
     For each p in P:
           For each canopy c:
              • if dist(p,c) < T1 then c.total += p, c.count++
              • if dist(p,c) < T2 then strongBound = true
           If not strongBound then create a canopy at p
   Close()
     For each canopy c:
        • Emit(null, (total, count))

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

19
MapReduce - Canopy Reduce()
   [Figure: the local canopies produced by Map1 and Map2 are merged by a single reducer.]

   Reduce()
     Input: key = null; input values = (total, count) pairs
     Output: key = null; value = cluster centroids
     For each intermediate value (total, count):
            p = total / count
            For each canopy c:
               • if dist(p,c) < T1 then c.total += p, c.count++
               • if dist(p,c) < T2 then strongBound = true
            If not strongBound then create a canopy at p
   Close()
     For each canopy c: emit(null, c.total / c.count)

     For simplicity we assume only one reducer.

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00
20
MapReduce - Canopy Reduce()
   [Figure: the reducer's merged canopies over all points.]

   Reduce()
     Input: key = null; input values = (total, count) pairs
     Output: key = null; value = cluster centroids
     For each intermediate value (total, count):
            p = total / count
            For each canopy c:
               • if dist(p,c) < T1 then c.total += p, c.count++
               • if dist(p,c) < T2 then strongBound = true
            If not strongBound then create a canopy at p
   Close()
     For each canopy c: emit(null, c.total / c.count)

     Remark: this assumes only one reducer.

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00
21
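Putting the Map()/Close() and Reduce() pieces together, a minimal single-machine sketch (Python; assumes small tuple points, Euclidean distance, and a single reducer as in the slides; a real job would stream points from its input split rather than take lists):

```python
import math

def dist(p, q):
    return math.dist(p, q)

def canopy_pass(points, t1, t2):
    """Shared helper: one canopy-creation pass, keeping (total, count) per canopy."""
    canopies = []                       # list of dicts with center, total, count
    for p in points:
        strong_bound = False
        for c in canopies:
            d = dist(p, c["center"])
            if d < t1:
                c["total"] = [t + x for t, x in zip(c["total"], p)]
                c["count"] += 1
            if d < t2:
                strong_bound = True
        if not strong_bound:
            canopies.append({"center": p, "total": list(p), "count": 1})
    return canopies

def canopy_map(points, t1, t2):
    """Map() + Close(): emit (total, count) for each local canopy."""
    return [(c["total"], c["count"]) for c in canopy_pass(points, t1, t2)]

def canopy_reduce(values, t1, t2):
    """Reduce(): canopy-cluster the local canopy means, then emit the centroids."""
    local_means = [tuple(t / n for t in total) for total, n in values]
    return [tuple(t / c["count"] for t in c["total"])
            for c in canopy_pass(local_means, t1, t2)]

split1 = [(0, 0), (0.4, 0.1), (5, 5)]
split2 = [(5.1, 5.2), (9, 0)]
values = canopy_map(split1, 3.0, 1.0) + canopy_map(split2, 3.0, 1.0)  # shuffled to one reducer
print(canopy_reduce(values, 3.0, 1.0))
```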
Clustering Assignment
     Clustering assignment
       For each point p:
           Assign p to the closest canopy center

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

22
MapReduce: Cluster Assignment
   [Figure: points assigned to their nearest canopy center.]

   Cluster assignment Map()
     Input: point p; cluster centroids
     Output:
            key = cluster id
            value = point id
     currentDist = inf
     For each cluster centroid c:
            If dist(p,c) < currentDist
             then bestCluster = c, currentDist = dist(p,c)
     Emit(bestCluster, p)

     Results can be written directly back to HDFS without a reducer, or an identity
     reducer can be applied to get the output sorted by cluster id.

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

23
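A tiny hedged sketch of the assignment mapper (Python; the centroid table is assumed to be shipped to every mapper, e.g. via a side file or the distributed cache):

```python
import math

def assign_map(point_id, p, centroids):
    """Map: emit (closest cluster id, point id)."""
    best_cluster, best_dist = None, float("inf")
    for cid, c in centroids.items():
        d = math.dist(p, c)
        if d < best_dist:
            best_cluster, best_dist = cid, d
    return (best_cluster, point_id)

centroids = {"c1": (0, 0), "c2": (5, 5)}
print(assign_map("p7", (4.5, 4.0), centroids))   # ('c2', 'p7')
```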
KMeans: Multi-pass clustering
 Traditional KMeans alternates two steps until convergence:

AssignCluster():
• For each point p:
     Assign p to the closest centroid c

UpdateCentroids():
 For each cluster:
     Update the cluster center to the mean of its points

Kmeans():
 While not converged:
     AssignCluster()
     UpdateCentroids()

   [Figure: AssignCluster() labels points by their nearest centroid;
    UpdateCentroid() moves each centroid to the mean of its assigned points.]
24
MapReduce – KMeans
   [Figure: the points are split across Map1 and Map2; the initial centroids are shared.]

   KmeansIter()
   Map(p) // Assign Cluster
     For c in clusters:
           If dist(p,c) < minDist,
            then minC = c, minDist = dist(p,c)
     Emit(minC.id, (p, 1))
   Reduce() // Update Centroids
     For each value (p, n) in the list:
           total += p; count += n
     Emit(key, (total, count))
25
MapReduce – KMeans
   [Figure: Map assigns each point (1–4) to its closest centroid; Reduce1 and
    Reduce2 update each centroid with its new location (total, count).]

   KmeansIter()
   Map(p) // Assign Cluster
     For c in clusters:
           If dist(p,c) < minDist,
            then minC = c, minDist = dist(p,c)
     Emit(minC.id, (p, 1))
   Reduce() // Update Centroids
     For each value (p, n) in the list:
           total += p; count += n
     Emit(key, (total, count))
26
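A compact Python sketch of one KmeansIter(), rerun by a driver loop (in-memory simulation with illustrative 2-D points and initial centroids; a fixed iteration count stands in for a convergence test):

```python
import math
from collections import defaultdict

def kmeans_map(p, centroids):
    """Map: assign point p to the closest centroid, emit (centroid id, (p, 1))."""
    min_c = min(centroids, key=lambda cid: math.dist(p, centroids[cid]))
    return (min_c, (p, 1))

def kmeans_reduce(cid, values):
    """Reduce: sum the partial (point, count) pairs and emit the new centroid."""
    dim = len(values[0][0])
    total, count = [0.0] * dim, 0
    for p, n in values:
        total = [t + x for t, x in zip(total, p)]
        count += n
    return (cid, tuple(t / count for t in total))

points = [(0, 0), (0.5, 0.2), (5, 5), (5.2, 4.9), (9, 0), (8.5, 0.5)]
centroids = {0: (0, 0), 1: (5, 5), 2: (9, 0)}    # initial centroids

for _ in range(10):                              # one MapReduce job per iteration
    grouped = defaultdict(list)
    for p in points:
        cid, value = kmeans_map(p, centroids)
        grouped[cid].append(value)
    centroids = dict(kmeans_reduce(cid, vals) for cid, vals in grouped.items())

print(centroids)
```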
Classification using MapReduce




27
MapReduce kNN
   [Figure: K=3; labeled points (classes 0 and 1) are split across Map1 and Map2,
    with a single shared query point.]

     Map()
       Input:
           – all points (of this split)
           – query point p
       Output: k nearest neighbors (local)
       Emit the k closest points to p

     Reduce()
       Input:
           – key: null; values: local neighbors
           – query point p
       Output: k nearest neighbors (global)
       Emit the k closest points to p among all local neighbors
28
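A hedged sketch of the two phases in Python (assumes one shared query point, labeled points, and a single reducer; heapq.nsmallest does the local and global top-k selection):

```python
import math
import heapq

def knn_map(points, query, k=3):
    """Map: emit the k local points closest to the query, with their labels."""
    return heapq.nsmallest(k, points, key=lambda pl: math.dist(pl[0], query))

def knn_reduce(local_neighbors, query, k=3):
    """Reduce: merge all local candidates and keep the global k nearest."""
    return heapq.nsmallest(k, local_neighbors, key=lambda pl: math.dist(pl[0], query))

query = (2.0, 2.0)
split1 = [((0, 0), 1), ((1, 2), 0), ((3, 3), 1)]      # (point, label) pairs on mapper 1
split2 = [((2, 1), 0), ((8, 8), 1), ((2.5, 2.5), 1)]  # mapper 2
candidates = knn_map(split1, query) + knn_map(split2, query)
print(knn_reduce(candidates, query))   # the 3 globally closest labeled points
```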
Naïve Bayes
        Formulation: P(c|d) ∝ P(c) · Π_{w in d} P(w|c)
          where c is a class label, d is a doc, w is a word
        Parameter estimation:
          Class prior: P(c) = Nc / N, where Nc is #docs in c and N is #docs
          Conditional probability: P(w|c) = Tcw / Σ_{w'} Tcw',
            where Tcw is the # of occurrences of w in class c

   [Figure: term vectors of docs d1–d3 in class c1 (N1 = 3, counts T1w) and
    docs d4–d7 in class c2 (N2 = 4, counts T2w).]

        Goals:
        1.   total number of docs N
        2.   number of docs in c: Nc
        3.   word count histogram in c: Tcw
        4.   total word count in c: Σ_{w'} Tcw'
30
MapReduce: Naïve Bayes
        Goals:
        1.   total number of docs N
        2.   number of docs in c: Nc
        3.   word count histogram in c: Tcw
        4.   total word count in c: Σ_{w'} Tcw'

     Naïve Bayes can be implemented using MapReduce jobs of histogram computation.

   ClassPrior()
   Map(doc):
     Emit(class_id, (doc_id, doc_length))
   Combine()/Reduce()
     Nc = 0; sTcw = 0
     For each doc_id:
           Nc++; sTcw += doc_length
     Emit(c, Nc)

   ConditionalProbability()
   Map(doc):
     For each word w in doc:
           Emit(pair(c, w), 1)
   Combine()/Reduce()
     Tcw = 0
     For each value v: Tcw += v
     Emit(pair(c, w), Tcw)
31
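A minimal Python sketch of the two counting jobs (class priors and word counts); the tiny labeled corpus is made up, dicts stand in for the shuffle, and the per-class total word count is derived from the Tcw counts rather than by summing doc lengths as in the ClassPrior() job above:

```python
from collections import Counter, defaultdict

docs = [("c1", "data mining data"), ("c1", "mining graphs"),
        ("c2", "social graphs"), ("c2", "graphs and graphs")]

# --- Job 1: class priors  P(c) = Nc / N ---------------------------------
def prior_map(label, text):
    return (label, 1)                       # Map(doc): emit (class_id, 1)

class_counts = Counter()
for label, text in docs:
    c, one = prior_map(label, text)
    class_counts[c] += one                  # Reduce: Nc = sum of ones
N = sum(class_counts.values())
priors = {c: nc / N for c, nc in class_counts.items()}

# --- Job 2: conditional probabilities  P(w|c) = Tcw / sum_w' Tcw' -------
def cond_map(label, text):
    return [((label, w), 1) for w in text.split()]   # emit (pair(c, w), 1)

word_counts = defaultdict(int)
for label, text in docs:
    for key, one in cond_map(label, text):
        word_counts[key] += one             # Reduce: Tcw = sum of ones

totals = Counter()
for (c, w), t in word_counts.items():
    totals[c] += t                          # total word count per class
cond = {(c, w): t / totals[c] for (c, w), t in word_counts.items()}

print(priors)                               # {'c1': 0.5, 'c2': 0.5}
print(cond[("c2", "graphs")])               # 3 / 5 = 0.6
```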
MapReduce Mining Summary




32
Taxonomy of MapReduce algorithms

                          One Iteration        Multiple Iterations      Not good for MapReduce
     Clustering           Canopy               KMeans
     Classification       Naïve Bayes, kNN     Gaussian Mixture         SVM
     Graphs                                    PageRank
     Information          Inverted Index       Topic modeling
     Retrieval                                 (PLSI, LDA)

        One-iteration algorithms are perfect fits.
        Multiple-iteration algorithms are OK fits,
             but small shared info has to be synchronized across iterations (typically through the
              filesystem).
        Some algorithms are not a good fit for the MapReduce framework.
             Those algorithms typically require large shared info with a lot of synchronization.
             Traditional parallel frameworks like MPI are better suited for those.

33
MapReduce for machine learning algorithms

        The key is to convert the algorithm into summation form
         (Statistical Query model [Kearns’94]):
            y = Σᵢ f(xᵢ), where f(x) corresponds to map()
              and Σ corresponds to reduce().
        Naïve Bayes
          MR job: P(c)
          MR job: P(w|c)

        Kmeans
          MR job: split data into subgroups and compute partial
             sums in Map(), then sum them up in Reduce()
34   Map-Reduce for Machine Learning on Multicore [NIPS’06]
Machine learning algorithms using MapReduce

          Linear Regression:
                       ŷ = (Xᵀ X)⁻¹ Xᵀ y,   where X ∈ R^(m×n) and y ∈ R^m
            MR job1:    A = Xᵀ X = Σᵢ xᵢ xᵢᵀ
            MR job2:    b = Xᵀ y = Σᵢ xᵢ y(i)
          Finally, solve  A ŷ = b

          Locally Weighted Linear Regression:
            MR job1:    A = Σᵢ wᵢ xᵢ xᵢᵀ
            MR job2:    b = Σᵢ wᵢ xᵢ y(i)
35   Map-Reduce for Machine Learning on Multicore [NIPS’06]
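A hedged NumPy sketch of the two summation jobs for ordinary linear regression (the synthetic data and split sizes are illustrative; the coefficient vector the slide calls ŷ is named theta here). Each mapper emits partial sums, the reducer adds them, and the driver solves the normal equations:

```python
import numpy as np

def lr_map(X_split, y_split):
    """Map: partial sums A_part = sum_i x_i x_i^T and b_part = sum_i x_i y(i)."""
    return X_split.T @ X_split, X_split.T @ y_split

def lr_reduce(partials):
    """Reduce: add up the partial (A, b) pairs from all mappers."""
    A = sum(a for a, _ in partials)
    b = sum(b for _, b in partials)
    return A, b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

splits = [(X[:50], y[:50]), (X[50:], y[50:])]        # two "mappers"
A, b = lr_reduce([lr_map(Xs, ys) for Xs, ys in splits])
theta = np.linalg.solve(A, b)                        # solve A theta = b
print(theta)                                         # close to [1, -2, 0.5]
```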
Machine learning algorithms using MapReduce
     (cont.)
      Logistic Regression
      Neural Networks
      PCA
      ICA
      EM for Gaussian Mixture Model




36   Map-Reduce for Machine Learning on Multicore [NIPS’06]
MapReduce Mining Resources




37
Mahout: Hadoop data mining library
        Mahout: http://lucene.apache.org/mahout/
           scalable data mining libraries: mostly implemented in Hadoop
        Data structure for vectors and matrices
           Vectors
               Dense vectors as double[]

               Sparse vectors as HashMap<Integer, Double>

               Operations: assign, cardinality, copy, divide, dot, get,
                 haveSharedCells, like, minus, normalize, plus, set, size, times,
                 toArray, viewPart, zSum and cross
           Matrices
               Dense matrix as a double[][]

               SparseRowMatrix or SparseColumnMatrix as Vector[] as holding the
                 rows or columns of the matrix in a SparseVector
               SparseMatrix as a HashMap<Integer, Vector>

               Operations: assign, assignColumn, assignRow, cardinality, copy,
                 divide, get, haveSharedCells, like, minus, plus, set, size, times,
                 transpose, toArray, viewPart and zSum
38
MapReduce Mining Papers
        [Chu et al. NIPS’06] Map-Reduce for Machine Learning on Multicore
            General framework under MapReduce
        [Papadimitriou et al ICDM’08] DisCo: Distributed Co-clustering with Map-
         Reduce
            Co-clustering
        [Kang et al. ICDM’09] PEGASUS: A Peta-Scale Graph Mining System
         - Implementation and Observations
            Graph algorithms
        [Das et al. WWW’07] Google news personalization: scalable online
         collaborative filtering
            PLSI EM
        [Grossman+Gu KDD’08] Data Mining Using High Performance Data
         Clouds: Experimental Studies Using Sector and Sphere. KDD 2008
            Alternative to Hadoop which supports wide-area data collection and
             distribution.
39
Summary: algorithms
        Best for MapReduce:
          Single pass, keys are uniformly distributed.
        OK for MapReduce:
          Multiple passes, intermediate states are small.
        Bad for MapReduce:
          Key distribution is skewed.
          Fine-grained synchronization is required.




40
Large-scale Data Mining:
MapReduce and Beyond
              Part 2: Algorithms

   Spiros Papadimitriou, IBM Research
          Jimeng Sun, IBM Research
                 Rong Yan, Facebook
