Large-scale Data Mining:
MapReduce and Beyond
              Part 2: Algorithms

   Spiros Papadimitriou, IBM Research
          Jimeng Sun, IBM Research
                 Rong Yan, Facebook
Part 2: Mining using MapReduce
       Mining algorithms using MapReduce
         Information retrieval
         Graph algorithms: PageRank
         Clustering: Canopy clustering, KMeans
         Classification: kNN, Naïve Bayes
       MapReduce Mining Summary




2
MapReduce Interface and Data Flow
              Map: (K1, V1) → list(K2, V2)
              Combine: (K2, list(V2)) → list(K2, V2)
              Partition: (K2, V2) → reducer_id
              Reduce: (K2, list(V2)) → list(K3, V3)

   [Figure: data flow for an inverted-index job. Mappers on Hosts 1 and 2 turn
    (id, doc) into list(w, id); combiners deduplicate locally into list(unique_w, id);
    the partitioner routes each word to a reducer; the reducers on Hosts 3 and 4
    merge w1, list(id) and w2, list(id) into w1, list(unique_id) and w2, list(unique_id).]
3
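To make the four-stage interface concrete, here is a minimal in-memory Python sketch (illustrative only, not the Hadoop API; all function names are made up) that mimics the Map → Combine → Partition → Reduce flow of the figure, building a deduplicated inverted index from (doc_id, text) pairs:

```python
from collections import defaultdict

# Map: (doc_id, text) -> list of (word, doc_id) pairs
def map_fn(doc_id, text):
    return [(w, doc_id) for w in text.split()]

# Combine: (word, list(doc_id)) -> (word, deduplicated list(doc_id)), run per mapper
def combine_fn(word, doc_ids):
    return word, sorted(set(doc_ids))

# Partition: word -> reducer id
def partition_fn(word, num_reducers):
    return hash(word) % num_reducers

# Reduce: (word, list of doc_id lists) -> (word, list of unique doc_ids)
def reduce_fn(word, doc_id_lists):
    return word, sorted({d for ids in doc_id_lists for d in ids})

def run_job(docs, num_reducers=2):
    # Map + Combine: one "mapper" per document split
    combined = []
    for doc_id, text in docs:
        local = defaultdict(list)
        for w, d in map_fn(doc_id, text):
            local[w].append(d)
        combined.extend(combine_fn(w, ids) for w, ids in local.items())
    # Shuffle: group by key inside each reducer partition
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, ids in combined:
        partitions[partition_fn(word, num_reducers)][word].append(ids)
    # Reduce
    return [reduce_fn(w, lists) for part in partitions for w, lists in part.items()]

docs = [(1, "data mining data"), (2, "mining graphs")]
print(run_job(docs))   # e.g. [('data', [1]), ('mining', [1, 2]), ('graphs', [2])]
```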
Information retrieval using MapReduce



4
IR: Distributed Grep
       Find the doc_id and line# of a matching pattern
       Map: (id, doc) list(id, line#)
       Reduce: None

    Grep “data mining”
                    Docs                Output
               1       2
                              Map1      <1, 123>

        3      4       5      Map2      <3, 717>, <5, 1231>

                       6      Map3       <6, 1012>




5
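A hedged sketch of the grep mapper in plain Python (not the authors' code; the pattern and helper name are illustrative). Because there is no reducer, the mapper output is the final output:

```python
import re

def grep_map(doc_id, doc_text, pattern="data mining"):
    """Map: (id, doc) -> list of (id, line#) for every line matching the pattern."""
    regex = re.compile(pattern)
    return [(doc_id, line_no)
            for line_no, line in enumerate(doc_text.splitlines(), start=1)
            if regex.search(line)]

print(grep_map(1, "intro\nlarge-scale data mining\nsummary"))   # [(1, 2)]
```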
IR: URL Access Frequency
       Map: (null, log) → (URL, 1)
       Reduce: (URL, list(1)) → (URL, total_count)

    Example:
       Map1 (logs) → <u1,1>, <u1,1>, <u2,1>
       Map2 (logs) → <u3,1>
       Map3 (logs) → <u3,1>
       Reduce      → <u1,2>, <u2,1>, <u3,2>
                      Also described in Part 1
6
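A small illustrative Python sketch of this job (assuming one log record per line with the URL as the first field; a dict stands in for the shuffle):

```python
from collections import defaultdict

def url_map(log_line):
    """Map: (null, log record) -> (URL, 1); the URL is assumed to be the first field."""
    url = log_line.split()[0]
    return (url, 1)

def url_reduce(url, counts):
    """Reduce: (URL, list of counts) -> (URL, total_count)."""
    return (url, sum(counts))

# Tiny local simulation of the shuffle between map and reduce.
logs = ["u1 GET", "u1 GET", "u2 GET", "u3 GET", "u3 GET"]
grouped = defaultdict(list)
for url, one in map(url_map, logs):
    grouped[url].append(one)
print([url_reduce(u, c) for u, c in grouped.items()])  # [('u1', 2), ('u2', 1), ('u3', 2)]
```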
IR: Reverse Web-Link Graph
       Map: (null, page) → (target, source)
       Reduce: (target, source) → (target, list(source))

    Example:
       Map1 (pages) → <t1,s2>
       Map2 (pages) → <t2,s3>, <t2,s5>
       Map3 (pages) → <t3,s5>
       Reduce       → <t1,[s2]>, <t2,[s3,s5]>, <t3,[s5]>

                       It is the same as matrix transpose
7
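A minimal sketch, assuming each input page is given as (source_url, list of target_urls); again a dict plays the role of the shuffle:

```python
from collections import defaultdict

def link_map(source, targets):
    """Map: (null, page) -> list of (target, source) pairs."""
    return [(t, source) for t in targets]

def link_reduce(target, sources):
    """Reduce: (target, list(source)) -> (target, list(source))."""
    return (target, sorted(sources))

pages = [("s2", ["t1"]), ("s3", ["t2"]), ("s5", ["t2", "t3"])]
grouped = defaultdict(list)
for src, tgts in pages:
    for t, s in link_map(src, tgts):
        grouped[t].append(s)
print([link_reduce(t, s) for t, s in grouped.items()])
# [('t1', ['s2']), ('t2', ['s3', 's5']), ('t3', ['s5'])]
```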
IR: Inverted Index
       Map: (id, doc) → list(word, id)
       Reduce: (word, list(id)) → (word, list(id))

    Example:
       Map1 (doc 1)     → <w1,1>
       Map2 (docs 2, 3) → <w2,2>, <w3,3>
       Map3 (doc 5)     → <w1,5>
       Reduce           → <w1,[1,5]>, <w2,[2]>, <w3,[3]>
8
Graph mining using MapReduce




9
PageRank
        PageRank vector q is defined as
              q = c A^T q + ((1 − c) / N) e
          (the first term models browsing, the second teleporting)
        A is the source-by-destination adjacency matrix; for the example
          4-node graph (1→2,3,4; 2→3,4; 3→4; 4→3):
              A = [ 0 1 1 1
                    0 0 1 1
                    0 0 0 1
                    0 0 1 0 ]
          e is the all-ones vector.
          N is the number of nodes.
          c is a weight between 0 and 1 (e.g. 0.85).
        PageRank indicates the importance of a page.
        Algorithm: iterative powering for finding the first eigenvector
10
MapReduce: PageRank
   [Figure: a 4-node graph (1, 2, 3, 4). Map distributes each page's PageRank over
    its outgoing links; Reduce collects the partial values q1, q2, q3, q4 and
    updates the new PageRank of each page.]

   PageRank Map()
      Input: key = page x, value = (PageRank qx, links [y1…ym])
      Output: key = page x, value = partial_x
      1. Emit(x, 0)   // guarantee all pages will be emitted
      2. For each outgoing link yi:
              Emit(yi, qx / m)

   PageRank Reduce()
      Input: key = page x, value = the list of [partial_x]
      Output: key = page x, value = PageRank qx
      1. qx = 0
      2. For each partial value d in the list:
              qx += d
      3. qx = c·qx + (1 − c)/N
      4. Emit(x, qx)

      Check out Kang et al., ICDM’09
11
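A hedged in-memory Python sketch of one PageRank iteration mirroring the Map()/Reduce() above, using the toy 4-node graph from the slide (assumes every page has at least one outgoing link; this simulates one MapReduce job per iteration rather than running on Hadoop):

```python
from collections import defaultdict

C, N = 0.85, 4  # damping weight and number of pages

def pagerank_map(x, qx, links):
    """Emit (x, 0) so every page appears, plus a share qx/m for each outgoing link."""
    out = [(x, 0.0)]
    m = len(links)
    out.extend((y, qx / m) for y in links)
    return out

def pagerank_reduce(x, partials):
    """Sum the partial contributions and apply the teleport term."""
    return (x, C * sum(partials) + (1 - C) / N)

# Graph from the slide: 1->2,3,4   2->3,4   3->4   4->3
graph = {1: [2, 3, 4], 2: [3, 4], 3: [4], 4: [3]}
q = {x: 1.0 / N for x in graph}          # initial PageRank vector

for _ in range(20):                      # one MapReduce job per iteration
    grouped = defaultdict(list)
    for x, links in graph.items():
        for y, partial in pagerank_map(x, q[x], links):
            grouped[y].append(partial)
    q = dict(pagerank_reduce(x, ps) for x, ps in grouped.items())

print({x: round(v, 3) for x, v in q.items()})
```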
Clustering using MapReduce




12
Canopy: single-pass clustering
 Canopy creation
  Construct overlapping clusters – canopies
  Make sure no two canopies overlap too much
          Key: no two canopy centers are too close to each other

     [Figure: left – overlapping clusters C1–C4; middle – canopies with too much
      overlap; right – the two thresholds T1 > T2 around a canopy center.]

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00
13
Canopy creation
  Input: 1) points, 2) thresholds T1, T2 where T1 > T2
  Output: cluster centroids
   Put all points into a queue Q
   While Q is not empty:
     – p = dequeue(Q)
     – For each canopy c:
           if dist(p,c) < T1: c.add(p)
           if dist(p,c) < T2: strongBound = true
     – If not strongBound: create a canopy at p
   For each canopy c:
     – Set the centroid to the mean of all points in c

  McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

14
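A hedged sequential sketch of the canopy-creation pass above (Python, Euclidean distance; the point data and thresholds are illustrative):

```python
import math

def dist(p, center):
    return math.dist(p, center)

def create_canopies(points, t1, t2):
    """Single pass: a point joins every canopy within T1; a new canopy is created
    only if the point is not within T2 of any existing canopy center."""
    assert t1 > t2
    canopies = []                 # list of (center, member points)
    for p in list(points):        # the "queue" Q
        strong_bound = False
        for center, members in canopies:
            d = dist(p, center)
            if d < t1:
                members.append(p)
            if d < t2:
                strong_bound = True
        if not strong_bound:
            canopies.append((p, [p]))
    # centroid of each canopy = mean of its member points
    return [tuple(sum(x) / len(members) for x in zip(*members))
            for _, members in canopies]

pts = [(0, 0), (0.5, 0.2), (5, 5), (5.2, 4.9), (9, 0)]
print(create_canopies(pts, t1=3.0, t2=1.0))
```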
Canopy creation
  Input: 1) points, 2) thresholds T1, T2 where T1 > T2
  Output: cluster centroids
   Put all points into a queue Q
   While Q is not empty:
     – p = dequeue(Q)
     – For each canopy c:
           if dist(p,c) < T1: c.add(p)
           if dist(p,c) < T2: strongBound = true
     – If not strongBound: create a canopy at p
   For each canopy c:
     – Set the centroid to the mean of all points in c

  [Figure: canopies C1 and C2 with their T1 and T2 radii; the points within T2 of a
   center are the strongly marked (strongly bound) points.]

  McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

15
Canopy creation
  Input: 1) points, 2) thresholds T1, T2 where T1 > T2
  Output: cluster centroids
   Put all points into a queue Q
   While Q is not empty:
     – p = dequeue(Q)
     – For each canopy c:
           if dist(p,c) < T1: c.add(p)
           if dist(p,c) < T2: strongBound = true
     – If not strongBound: create a canopy at p
   For each canopy c:
     – Set the centroid to the mean of all points in c

  [Figure: a canopy center, its strongly marked points (within T2), and the other
   points in the cluster (within T1).]

  McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

16
Canopy creation
  Input: 1) points, 2) thresholds T1, T2 where T1 > T2
  Output: cluster centroids
   Put all points into a queue Q
   While Q is not empty:
     – p = dequeue(Q)
     – For each canopy c:
           if dist(p,c) < T1: c.add(p)
           if dist(p,c) < T2: strongBound = true
     – If not strongBound: create a canopy at p
   For each canopy c:
     – Set the centroid to the mean of all points in c

  McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

17
Canopy creation
  Input: 1) points, 2) thresholds T1, T2 where T1 > T2
  Output: cluster centroids
   Put all points into a queue Q
   While Q is not empty:
     – p = dequeue(Q)
     – For each canopy c:
           if dist(p,c) < T1: c.add(p)
           if dist(p,c) < T2: strongBound = true
     – If not strongBound: create a canopy at p
   For each canopy c:
     – Set the centroid to the mean of all points in c

  McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

18
MapReduce - Canopy Map()
   [Figure: the input points are split across Map1 and Map2; each mapper builds its
    own local canopies.]

   Canopy creation Map()
     Input: a set of points P, thresholds T1, T2
     Output: key = null; value = a list of local canopies (total, count)
     For each p in P:
           For each canopy c:
              • if dist(p,c) < T1 then c.total += p, c.count++
              • if dist(p,c) < T2 then strongBound = true
           If not strongBound then create a canopy at p
   Close()
     For each canopy c:
        • Emit(null, (total, count))

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

19
MapReduce - Canopy Reduce()
   [Figure: the local canopies produced by Map1 and Map2 are merged by a single reducer.]

   Reduce()
     Input: key = null; input values = (total, count) pairs
     Output: key = null; value = cluster centroids
     For each intermediate value (total, count):
            p = total / count
            For each canopy c:
               • if dist(p,c) < T1 then c.total += p, c.count++
               • if dist(p,c) < T2 then strongBound = true
            If not strongBound then create a canopy at p
   Close()
     For each canopy c: emit(null, c.total / c.count)

     For simplicity we assume only one reducer.

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00
20
MapReduce - Canopy Reduce()
   [Figure: the reducer's merged canopies over all points.]

   Reduce()
     Input: key = null; input values = (total, count) pairs
     Output: key = null; value = cluster centroids
     For each intermediate value (total, count):
            p = total / count
            For each canopy c:
               • if dist(p,c) < T1 then c.total += p, c.count++
               • if dist(p,c) < T2 then strongBound = true
            If not strongBound then create a canopy at p
   Close()
     For each canopy c: emit(null, c.total / c.count)

     Remark: this assumes only one reducer.

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00
21
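Putting the Map()/Close() and Reduce() pieces together, a minimal single-machine sketch (Python; assumes small tuple points, Euclidean distance, and a single reducer as in the slides; a real job would stream points from its input split rather than take lists):

```python
import math

def dist(p, q):
    return math.dist(p, q)

def canopy_pass(points, t1, t2):
    """Shared helper: one canopy-creation pass, keeping (total, count) per canopy."""
    canopies = []                       # list of dicts with center, total, count
    for p in points:
        strong_bound = False
        for c in canopies:
            d = dist(p, c["center"])
            if d < t1:
                c["total"] = [t + x for t, x in zip(c["total"], p)]
                c["count"] += 1
            if d < t2:
                strong_bound = True
        if not strong_bound:
            canopies.append({"center": p, "total": list(p), "count": 1})
    return canopies

def canopy_map(points, t1, t2):
    """Map() + Close(): emit (total, count) for each local canopy."""
    return [(c["total"], c["count"]) for c in canopy_pass(points, t1, t2)]

def canopy_reduce(values, t1, t2):
    """Reduce(): canopy-cluster the local canopy means, then emit the centroids."""
    local_means = [tuple(t / n for t in total) for total, n in values]
    return [tuple(t / c["count"] for t in c["total"])
            for c in canopy_pass(local_means, t1, t2)]

split1 = [(0, 0), (0.4, 0.1), (5, 5)]
split2 = [(5.1, 5.2), (9, 0)]
values = canopy_map(split1, 3.0, 1.0) + canopy_map(split2, 3.0, 1.0)  # shuffled to one reducer
print(canopy_reduce(values, 3.0, 1.0))
```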
Clustering Assignment
     Clustering assignment
       For each point p:
           Assign p to the closest canopy center

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

22
MapReduce: Cluster Assignment
   [Figure: points assigned to their nearest canopy center.]

   Cluster assignment Map()
     Input: point p; cluster centroids
     Output:
            key = cluster id
            value = point id
     currentDist = inf
     For each cluster centroid c:
            If dist(p,c) < currentDist
             then bestCluster = c, currentDist = dist(p,c)
     Emit(bestCluster, p)

     Results can be written directly back to HDFS without a reducer, or an identity
     reducer can be applied to get the output sorted by cluster id.

     McCallum, Nigam and Ungar, "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", KDD’00

23
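A tiny hedged sketch of the assignment mapper (Python; the centroid table is assumed to be shipped to every mapper, e.g. via a side file or the distributed cache):

```python
import math

def assign_map(point_id, p, centroids):
    """Map: emit (closest cluster id, point id)."""
    best_cluster, best_dist = None, float("inf")
    for cid, c in centroids.items():
        d = math.dist(p, c)
        if d < best_dist:
            best_cluster, best_dist = cid, d
    return (best_cluster, point_id)

centroids = {"c1": (0, 0), "c2": (5, 5)}
print(assign_map("p7", (4.5, 4.0), centroids))   # ('c2', 'p7')
```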
KMeans: Multi-pass clustering
 Traditional KMeans alternates two steps until convergence:

AssignCluster():
• For each point p:
     Assign p to the closest centroid c

UpdateCentroids():
 For each cluster:
     Update the cluster center to the mean of its points

Kmeans():
 While not converged:
     AssignCluster()
     UpdateCentroids()

   [Figure: AssignCluster() labels points by their nearest centroid;
    UpdateCentroid() moves each centroid to the mean of its assigned points.]
24
MapReduce – KMeans
   [Figure: the points are split across Map1 and Map2; the initial centroids are shared.]

   KmeansIter()
   Map(p) // Assign Cluster
     For c in clusters:
           If dist(p,c) < minDist,
            then minC = c, minDist = dist(p,c)
     Emit(minC.id, (p, 1))
   Reduce() // Update Centroids
     For each value (p, n) in the list:
           total += p; count += n
     Emit(key, (total, count))
25
MapReduce – KMeans
   [Figure: Map assigns each point (1–4) to its closest centroid; Reduce1 and
    Reduce2 update each centroid with its new location (total, count).]

   KmeansIter()
   Map(p) // Assign Cluster
     For c in clusters:
           If dist(p,c) < minDist,
            then minC = c, minDist = dist(p,c)
     Emit(minC.id, (p, 1))
   Reduce() // Update Centroids
     For each value (p, n) in the list:
           total += p; count += n
     Emit(key, (total, count))
26
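A compact Python sketch of one KmeansIter(), rerun by a driver loop (in-memory simulation with illustrative 2-D points and initial centroids; a fixed iteration count stands in for a convergence test):

```python
import math
from collections import defaultdict

def kmeans_map(p, centroids):
    """Map: assign point p to the closest centroid, emit (centroid id, (p, 1))."""
    min_c = min(centroids, key=lambda cid: math.dist(p, centroids[cid]))
    return (min_c, (p, 1))

def kmeans_reduce(cid, values):
    """Reduce: sum the partial (point, count) pairs and emit the new centroid."""
    dim = len(values[0][0])
    total, count = [0.0] * dim, 0
    for p, n in values:
        total = [t + x for t, x in zip(total, p)]
        count += n
    return (cid, tuple(t / count for t in total))

points = [(0, 0), (0.5, 0.2), (5, 5), (5.2, 4.9), (9, 0), (8.5, 0.5)]
centroids = {0: (0, 0), 1: (5, 5), 2: (9, 0)}    # initial centroids

for _ in range(10):                              # one MapReduce job per iteration
    grouped = defaultdict(list)
    for p in points:
        cid, value = kmeans_map(p, centroids)
        grouped[cid].append(value)
    centroids = dict(kmeans_reduce(cid, vals) for cid, vals in grouped.items())

print(centroids)
```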
Classification using MapReduce




27
MapReduce kNN
   [Figure: K=3; labeled points (classes 0 and 1) are split across Map1 and Map2,
    with a single shared query point.]

     Map()
       Input:
           – all points (of this split)
           – query point p
       Output: k nearest neighbors (local)
       Emit the k closest points to p

     Reduce()
       Input:
           – key: null; values: local neighbors
           – query point p
       Output: k nearest neighbors (global)
       Emit the k closest points to p among all local neighbors
28
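A hedged sketch of the two phases in Python (assumes one shared query point, labeled points, and a single reducer; heapq.nsmallest does the local and global top-k selection):

```python
import math
import heapq

def knn_map(points, query, k=3):
    """Map: emit the k local points closest to the query, with their labels."""
    return heapq.nsmallest(k, points, key=lambda pl: math.dist(pl[0], query))

def knn_reduce(local_neighbors, query, k=3):
    """Reduce: merge all local candidates and keep the global k nearest."""
    return heapq.nsmallest(k, local_neighbors, key=lambda pl: math.dist(pl[0], query))

query = (2.0, 2.0)
split1 = [((0, 0), 1), ((1, 2), 0), ((3, 3), 1)]      # (point, label) pairs on mapper 1
split2 = [((2, 1), 0), ((8, 8), 1), ((2.5, 2.5), 1)]  # mapper 2
candidates = knn_map(split1, query) + knn_map(split2, query)
print(knn_reduce(candidates, query))   # the 3 globally closest labeled points
```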
Naïve Bayes
        Formulation: P(c|d) ∝ P(c) · Π_{w in d} P(w|c)
          where c is a class label, d is a doc, w is a word
        Parameter estimation:
          Class prior: P(c) = Nc / N, where Nc is #docs in c and N is #docs
          Conditional probability: P(w|c) = Tcw / Σ_{w'} Tcw',
            where Tcw is the # of occurrences of w in class c

   [Figure: term vectors of docs d1–d3 in class c1 (N1 = 3, counts T1w) and
    docs d4–d7 in class c2 (N2 = 4, counts T2w).]

        Goals:
        1.   total number of docs N
        2.   number of docs in c: Nc
        3.   word count histogram in c: Tcw
        4.   total word count in c: Σ_{w'} Tcw'
30
MapReduce: Naïve Bayes
        Goals:
        1.   total number of docs N
        2.   number of docs in c: Nc
        3.   word count histogram in c: Tcw
        4.   total word count in c: Σ_{w'} Tcw'

     Naïve Bayes can be implemented using MapReduce jobs of histogram computation.

   ClassPrior()
   Map(doc):
     Emit(class_id, (doc_id, doc_length))
   Combine()/Reduce()
     Nc = 0; sTcw = 0
     For each doc_id:
           Nc++; sTcw += doc_length
     Emit(c, Nc)

   ConditionalProbability()
   Map(doc):
     For each word w in doc:
           Emit(pair(c, w), 1)
   Combine()/Reduce()
     Tcw = 0
     For each value v: Tcw += v
     Emit(pair(c, w), Tcw)
31
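A minimal Python sketch of the two counting jobs (class priors and word counts); the tiny labeled corpus is made up, dicts stand in for the shuffle, and the per-class total word count is derived from the Tcw counts rather than by summing doc lengths as in the ClassPrior() job above:

```python
from collections import Counter, defaultdict

docs = [("c1", "data mining data"), ("c1", "mining graphs"),
        ("c2", "social graphs"), ("c2", "graphs and graphs")]

# --- Job 1: class priors  P(c) = Nc / N ---------------------------------
def prior_map(label, text):
    return (label, 1)                       # Map(doc): emit (class_id, 1)

class_counts = Counter()
for label, text in docs:
    c, one = prior_map(label, text)
    class_counts[c] += one                  # Reduce: Nc = sum of ones
N = sum(class_counts.values())
priors = {c: nc / N for c, nc in class_counts.items()}

# --- Job 2: conditional probabilities  P(w|c) = Tcw / sum_w' Tcw' -------
def cond_map(label, text):
    return [((label, w), 1) for w in text.split()]   # emit (pair(c, w), 1)

word_counts = defaultdict(int)
for label, text in docs:
    for key, one in cond_map(label, text):
        word_counts[key] += one             # Reduce: Tcw = sum of ones

totals = Counter()
for (c, w), t in word_counts.items():
    totals[c] += t                          # total word count per class
cond = {(c, w): t / totals[c] for (c, w), t in word_counts.items()}

print(priors)                               # {'c1': 0.5, 'c2': 0.5}
print(cond[("c2", "graphs")])               # 3 / 5 = 0.6
```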
MapReduce Mining Summary




32
Taxonomy of MapReduce algorithms

                          One Iteration        Multiple Iterations      Not good for MapReduce
     Clustering           Canopy               KMeans
     Classification       Naïve Bayes, kNN     Gaussian Mixture         SVM
     Graphs                                    PageRank
     Information          Inverted Index       Topic modeling
     Retrieval                                 (PLSI, LDA)

        One-iteration algorithms are perfect fits.
        Multiple-iteration algorithms are OK fits,
             but small shared info has to be synchronized across iterations (typically through the
              filesystem).
        Some algorithms are not a good fit for the MapReduce framework.
             Those algorithms typically require large shared info with a lot of synchronization.
             Traditional parallel frameworks like MPI are better suited for those.

33
MapReduce for machine learning algorithms

        The key is to convert the algorithm into summation form
         (Statistical Query model [Kearns’94]):
            y = Σᵢ f(xᵢ), where f(x) corresponds to map()
              and Σ corresponds to reduce().
        Naïve Bayes
          MR job: P(c)
          MR job: P(w|c)

        Kmeans
          MR job: split data into subgroups and compute partial
             sums in Map(), then sum them up in Reduce()
34   Map-Reduce for Machine Learning on Multicore [NIPS’06]
Machine learning algorithms using MapReduce

          Linear Regression:
                       ŷ = (Xᵀ X)⁻¹ Xᵀ y,   where X ∈ R^(m×n) and y ∈ R^m
            MR job1:    A = Xᵀ X = Σᵢ xᵢ xᵢᵀ
            MR job2:    b = Xᵀ y = Σᵢ xᵢ y(i)
          Finally, solve  A ŷ = b

          Locally Weighted Linear Regression:
            MR job1:    A = Σᵢ wᵢ xᵢ xᵢᵀ
            MR job2:    b = Σᵢ wᵢ xᵢ y(i)
35   Map-Reduce for Machine Learning on Multicore [NIPS’06]
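A hedged NumPy sketch of the two summation jobs for ordinary linear regression (the synthetic data and split sizes are illustrative; the coefficient vector the slide calls ŷ is named theta here). Each mapper emits partial sums, the reducer adds them, and the driver solves the normal equations:

```python
import numpy as np

def lr_map(X_split, y_split):
    """Map: partial sums A_part = sum_i x_i x_i^T and b_part = sum_i x_i y(i)."""
    return X_split.T @ X_split, X_split.T @ y_split

def lr_reduce(partials):
    """Reduce: add up the partial (A, b) pairs from all mappers."""
    A = sum(a for a, _ in partials)
    b = sum(b for _, b in partials)
    return A, b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

splits = [(X[:50], y[:50]), (X[50:], y[50:])]        # two "mappers"
A, b = lr_reduce([lr_map(Xs, ys) for Xs, ys in splits])
theta = np.linalg.solve(A, b)                        # solve A theta = b
print(theta)                                         # close to [1, -2, 0.5]
```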
Machine learning algorithms using MapReduce
     (cont.)
      Logistic Regression
      Neural Networks
      PCA
      ICA
      EM for Gaussian Mixture Model




36   Map-Reduce for Machine Learning on Multicore [NIPS’06]
MapReduce Mining Resources




37
Mahout: Hadoop data mining library
        Mahout: http://lucene.apache.org/mahout/
           scalable data mining libraries: mostly implemented in Hadoop
        Data structure for vectors and matrices
           Vectors
               Dense vectors as double[]

               Sparse vectors as HashMap<Integer, Double>

               Operations: assign, cardinality, copy, divide, dot, get,
                 haveSharedCells, like, minus, normalize, plus, set, size, times,
                 toArray, viewPart, zSum and cross
           Matrices
               Dense matrix as a double[][]

               SparseRowMatrix or SparseColumnMatrix as Vector[] as holding the
                 rows or columns of the matrix in a SparseVector
               SparseMatrix as a HashMap<Integer, Vector>

               Operations: assign, assignColumn, assignRow, cardinality, copy,
                 divide, get, haveSharedCells, like, minus, plus, set, size, times,
                 transpose, toArray, viewPart and zSum
38
MapReduce Mining Papers
        [Chu et al. NIPS’06] Map-Reduce for Machine Learning on Multicore
            General framework under MapReduce
        [Papadimitriou et al ICDM’08] DisCo: Distributed Co-clustering with Map-
         Reduce
            Co-clustering
        [Kang et al. ICDM’09] PEGASUS: A Peta-Scale Graph Mining System
         - Implementation and Observations
            Graph algorithms
        [Das et al. WWW’07] Google news personalization: scalable online
         collaborative filtering
            PLSI EM
        [Grossman+Gu KDD’08] Data Mining Using High Performance Data
         Clouds: Experimental Studies Using Sector and Sphere. KDD 2008
            Alternative to Hadoop which supports wide-area data collection and
             distribution.
39
Summary: algorithms
        Best for MapReduce:
          Single pass, keys are uniformly distributed.
        OK for MapReduce:
          Multiple passes, intermediate states are small.
        Bad for MapReduce:
          Key distribution is skewed.
          Fine-grained synchronization is required.




40
Large-scale Data Mining:
MapReduce and Beyond
              Part 2: Algorithms

   Spiros Papadimitriou, IBM Research
          Jimeng Sun, IBM Research
                 Rong Yan, Facebook
