Social Network Mining
    Solutions using Google App Engine Map Reduce




     J Singh, DataThinks.org



                                        October 19, 2011
MapReduce: A Genealogical Perspective
• Roots
   – Lisp, Scheme
   – APL


• Google OSDI paper, 2004
   – Exploit extreme parallelism of data


• Apache Top Level Project (Hadoop)

• MapReduce on GAE borrows from these




© J Singh, 2011                            2
Social Network Mining
• Finding people based on data in social networks
   –   Love and Romance
   –   Common interests
   –   Similar buying habits
   –   Similar voting propensities
   –   Location


• It's not a new problem
   – We have additional solutions for the old problem
        • Examples based on proprietary data: eHarmony, etc.
        • Early examples based on social network data: ShoutFlow,
          WhoIsJustLikeMe.



Based on clustering algorithms
• On-line demo of clustering

• Resource intensive.
   – Best done in batch mode

• Exploit data parallelism of the algorithm
   – App Engine Map Reduce, employing one map job for each cluster
   – App Engine Pipeline API, employing one stage of the pipeline for each 'step'

• But first, a detour into Map Reduce…
MapReduce Conceptual Underpinnings
• Based on Functional Programming model
   – From Lisp / Scheme
        • (map square '(1 2 3 4))  →  (1 4 9 16)
        • (reduce plus '(1 4 9 16))  →  30
   – From APL
        • +/ N    N ← 1 2 3 4


• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
   – Matters when hundreds or thousands of low-end servers are running at
     the same time
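
The Lisp and APL one-liners on this slide translate almost directly to Python. A minimal sketch, where `square` is a helper defined here purely for illustration:

```python
from functools import reduce
from operator import add


def square(x):
    """Illustrative helper: the per-element function handed to map."""
    return x * x


numbers = [1, 2, 3, 4]

# map applies square to each element independently, which is why the
# work is easy to distribute across machines.
squares = list(map(square, numbers))

# reduce folds the mapped values into a single result.
total = reduce(add, squares)

# The APL expression +/ N sums the vector N itself.
apl_sum = reduce(add, numbers)
```

Because each call to `square` depends only on its own element, the map phase can be split across workers in any order; only the reduce phase has to bring values together.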



MapReduce Flow




MapReduce Components in GAE 2011
                  • Input Reader
                     – Several provided by GAE, can write your own


                  • Map function: Written by Programmer

                  • Shuffle function:
                     – Provided by GAE, can write your own


                  • Reduce function: Written by Programmer

                  • Output Writer
                     – Several provided by GAE, can write your own
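
How the five components fit together can be sketched in plain Python as an in-memory word count. This is only an illustration of the roles, not the GAE library: every function name below is hypothetical, and a real job would shard the input and run the stages on separate workers.

```python
from collections import defaultdict


def input_reader(lines):
    """Input reader: yield one record at a time from the source."""
    for line in lines:
        yield line


def map_fn(record):
    """Map function (programmer-written): emit (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce function (programmer-written): combine one key's values."""
    return key, sum(values)


def output_writer(results):
    """Output writer: collect the reduced results, here into a dict."""
    return dict(results)


def run_job(lines):
    """Wire the five components together in the order shown on the slide."""
    mapped = (pair for record in input_reader(lines) for pair in map_fn(record))
    grouped = shuffle(mapped)
    reduced = (reduce_fn(key, values) for key, values in grouped.items())
    return output_writer(reduced)


counts = run_job(["map reduce map", "reduce shuffle map"])
```

Only `map_fn` and `reduce_fn` carry problem-specific logic; the reader, shuffle and writer are generic, which matches the slide's split between what the platform provides and what the programmer writes.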




Invoking GAE Map Reduce
class MapreducePipeline (…):
    def run(self,
            job_name,             #   A string
            mapper_spec,          #   Mapper function
            reducer_spec,         #   Reducer function
            input_reader_spec,    #   Input reader fn
            output_writer_spec,   #   Output writer
            mapper_params,        #   A dictionary
            reducer_params,       #   A dictionary
            shards,               #   An int
            ):


GAE Pipeline API
• Based on Python Generator functions

• The old Unix idea on steroids:
   – Perform complex operations by piping data between primitives
   – But the primitives are not so primitive
   – Data lives in permanent storage between pipeline stages


• MapreducePipeline (prev page) was just one type of pipeline
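
The generator idea can be shown without App Engine. The sketch below is a toy driver, not the Pipeline API: the names `run`, `two_stage`, `double` and `increment` are hypothetical, and results are passed back directly in memory, whereas the slide says the real pipeline keeps data in permanent storage between stages.

```python
def double(x):
    """First primitive stage."""
    return x * 2


def increment(x):
    """Second primitive stage."""
    return x + 1


def two_stage(x):
    """A generator 'pipeline': each yield hands one stage to the driver."""
    doubled = yield (double, x)   # the driver sends back the stage's result
    yield (increment, doubled)    # the last stage's result is the output


def run(generator_fn, *args):
    """Toy driver: run each yielded (function, argument) stage in order."""
    gen = generator_fn(*args)
    result = None
    try:
        stage = next(gen)
        while True:
            func, arg = stage
            result = func(arg)
            stage = gen.send(result)  # resume the generator with the result
    except StopIteration:
        return result


output = run(two_stage, 5)
```

The generator only describes which stage follows which; the driver decides when and where each stage runs, which is what lets a framework schedule the stages on different machines.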




Pipeline API Example Code
Split and Merge example


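  class aPipe(pipeline.Pipeline):
      def run(self, e_kind, prop_name, *value_list):
          all_bs = []
          # Split: start one bPipe child stage per value
          for v in value_list:
              # yield hands the child stage to the framework
              stage = yield bPipe(e_kind, prop_name, v)
              all_bs.append(stage)
          # Merge: combine the child results into a single list
          yield common.Append(*all_bs)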




Pause and Assess
• Assertion:
   – GAE Map/Reduce is a complete solution for social network mining
   – We know it will scale; the question is how far.


• Working on one Proof of Concept for Social Network Mining
   – Recruiting a second test case


• Will report back in 3-4 months with data on
   – Performance
   – Cost
   – Limits of scalability


Adapting the algorithm to M/R
• Clustering Algorithm

   1. Create k randomly placed centroids

   2. Find the centroid (1 to k) closest to each data point
        • Map each data point

   3. Move each centroid to the average of its members
        • Reduce each centroid

   4. Repeat 2 and 3 until there is no more change
        • Connect to next stage using Pipeline API
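
The four steps can be sketched in plain Python with the map and reduce roles marked. This is a single-process illustration under stated assumptions, not the GAE job: the function names are hypothetical, the starting centroids are sampled from the data rather than placed anywhere in space, and an ordinary loop stands in for chaining rounds with the Pipeline API.

```python
import math
import random
from collections import defaultdict


def closest(point, centroids):
    """Map step: return the index of the centroid nearest to point."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def average(points):
    """Reduce step: the mean of one cluster's member points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k starting centroids (assumption: sampled from the data).
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2 (map): key each data point by its closest centroid.
        clusters = defaultdict(list)
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Step 3 (reduce): one reduce per centroid; empty clusters stay put.
        new_centroids = [average(clusters[i]) if clusters[i] else centroids[i]
                         for i in range(k)]
        # Step 4: stop when nothing moved, otherwise run another round.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = kmeans(points, k=2)
```

Each pass through the loop corresponds to one map/reduce round; in a distributed setting the convergence check in step 4 is what decides whether another round is scheduled.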

About Us
• Involved with Map/Reduce and NoSQL technologies on several
  platforms
   – Google App Engine, MongoDB


• DataThinks.org is a new service of Early Stage IT
   – Building and operating “Big Data” analytics services




                           Thanks
