Social Media Mining using GAE Map Reduce

Social Network Mining
Solutions using Google App Engine Map Reduce

J Singh, DataThinks.org

October 19, 2011

MapReduce: A Genealogical Perspective
• Roots
– Lisp, Scheme
– APL

• Google OS papers, 2004
– Exploit extreme parallelism of data

• Apache Top Level Project (Hadoop)

• GAE MapReduce borrows from all of these

© J Singh, 2011

Social Network Mining
• Finding people based on data in social networks
– Love and Romance
– Common interests
– Similar buying habits
– Similar voting propensities
– Location

• It's not a new problem
– We have additional solutions for the old problem
• Examples based on proprietary data: eHarmony, etc.
• Early examples based on social network data: ShoutFlow,
WhoIsJustLikeMe.


Based on clustering algorithms
• On-line demo of clustering
• Resource intensive.
– Best done in batch mode

• Exploit data parallelism of the algorithm
– App Engine Map Reduce, employing one map job for each cluster
– App Engine Pipeline API, employing one stage of the pipeline for each 'step'

• But first, a detour into Map Reduce…

MapReduce Conceptual Underpinnings
• Based on Functional Programming model
– From Lisp / Scheme
• (map square '(1 2 3 4)) ⇒ (1 4 9 16)
• (reduce plus '(1 4 9 16)) ⇒ 30
– From APL
• +/ N, where N ← 1 2 3 4

• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
– Needed because hundreds or thousands of low-end servers run at the same time, so individual failures are routine
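
The Lisp examples above translate almost directly into Python. The sketch below is only an illustration: the helper name chunked_sum_of_squares and the chunking scheme are assumptions made here, not part of GAE. It shows why the model is easy to distribute: each chunk can be mapped and partially reduced on its own, and the partial results are then reduced again.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


# Direct analogues of (map square ...) and (reduce plus ...).
squares = list(map(square, [1, 2, 3, 4]))
total = reduce(add, squares)


def chunked_sum_of_squares(values, n_chunks):
    """Hypothetical helper: map and reduce each chunk independently, then merge."""
    # Each chunk could be handled by a different worker.
    chunks = [values[i::n_chunks] for i in range(n_chunks)]
    partials = [reduce(add, map(square, chunk), 0) for chunk in chunks]
    # Addition is associative, so merging the partial sums gives the same total.
    return reduce(add, partials, 0)
```

Because no chunk depends on another, a failed chunk can simply be retried without redoing the rest.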


MapReduce Flow


MapReduce Components in GAE 2011
• Input Reader
– Several provided by GAE, can write your own

• Map function: Written by Programmer

• Shuffle function:
– Provided by GAE, can write your own

• Reduce function: Written by Programmer

• Output Writer
– Several provided by GAE, can write your own
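
To make the five components concrete, here is a minimal in-memory sketch in plain Python. It does not use the GAE library; every function name and the sample data shape are assumptions chosen for illustration. It groups users by shared interest, one of the social network mining tasks listed earlier.

```python
from collections import defaultdict


def input_reader(profiles):
    """Yield one record at a time: (user, list of interests)."""
    for profile in profiles:
        yield profile


def interest_map(record):
    """Map: emit one (interest, user) pair per interest."""
    user, interests = record
    for interest in interests:
        yield interest, user


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def interest_reduce(key, values):
    """Reduce: emit the users who share this interest."""
    yield key, sorted(values)


def output_writer(results):
    """Collect results into a dict; a real writer would persist them."""
    return dict(results)


def run_mapreduce(profiles):
    mapped = (pair for rec in input_reader(profiles) for pair in interest_map(rec))
    reduced = (out for key, vals in shuffle(mapped) for out in interest_reduce(key, vals))
    return output_writer(reduced)
```

Only interest_map and interest_reduce are problem-specific here, which mirrors the division of labor on this slide: the programmer writes the map and reduce functions, and the framework can supply the rest.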


Invoking GAE Map Reduce
class MapreducePipeline(…):
    def run(self,
            job_name,            # A string
            mapper_spec,         # Mapper function, named by a string
            reducer_spec,        # Reducer function, named by a string
            input_reader_spec,   # Input reader, named by a string
            output_writer_spec,  # Output writer, named by a string
            mapper_params,       # A dictionary
            reducer_params,      # A dictionary
            shards,              # An int
            ):
        ...


GAE Pipeline API
• Based on Python Generator functions

• The old Unix idea on steroids:
– Perform complex operations by piping data between primitives
– But the primitives are not so primitive
– Data lives in permanent storage between pipeline stages

• MapreducePipeline (previous slide) is just one type of pipeline
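
The Unix-pipe idea can be sketched with ordinary Python generators. This is a toy, in-memory analogy under assumed names (read_records, only_city, names, names_in_city) and an assumed record shape; unlike the Pipeline API, nothing is written to permanent storage between stages.

```python
def read_records(rows):
    """First stage: produce records one at a time."""
    for row in rows:
        yield row


def only_city(records, city):
    """Middle stage: pass through records for one city."""
    for record in records:
        if record["city"] == city:
            yield record


def names(records):
    """Last stage: project each record down to a name."""
    for record in records:
        yield record["name"]


def names_in_city(rows, city):
    # Equivalent in spirit to: read_records | only_city | names
    return list(names(only_city(read_records(rows), city)))
```

Each stage only knows about the stream it consumes, so stages can be recombined freely, which is the property the Pipeline API scales up.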


Pipeline API Example Code
Split and Merge example

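import pipeline              # GAE Pipeline library; the import path depends on how it is bundled
from pipeline import common

class aPipe(pipeline.Pipeline):
    def run(self, e_kind, prop_name, *value_list):
        all_bs = []
        for v in value_list:
            # Split: one child bPipe stage per value; yield hands back a future for its output
            stage = yield bPipe(e_kind, prop_name, v)
            all_bs.append(stage)
        # Merge: combine the child outputs in a final stage
        yield common.Append(*all_bs)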
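
The split-and-merge shape can be tried without App Engine using a toy driver. Everything below (Stage, count_matches, split_and_merge, run) is a hypothetical stand-in written for this illustration. One important simplification: this driver runs each child immediately and sends back its actual result, whereas the real Pipeline API hands back a future and runs children asynchronously.

```python
class Stage:
    """Hypothetical unit of work: a function plus its arguments."""

    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args


def count_matches(records, prop_name, value):
    """Child stage: count records whose property equals the given value."""
    return sum(1 for r in records if r.get(prop_name) == value)


def split_and_merge(records, prop_name, *value_list):
    results = []
    for v in value_list:
        # Split: one child stage per value.
        result = yield Stage(count_matches, records, prop_name, v)
        results.append(result)
    # Merge: a final stage that gathers the child results into one list.
    yield Stage(lambda *rs: list(rs), *results)


def run(gen):
    """Toy driver: run each yielded stage and send its result back in."""
    result = None
    try:
        stage = next(gen)
        while True:
            result = stage.fn(*stage.args)
            stage = gen.send(result)
    except StopIteration:
        return result
```

The value returned by run is the output of the last stage yielded, which here is the merged list of counts in the order the values were given.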


Pause and Assess
• Assertion:
– GAE Map/Reduce is a complete solution for social network mining analysis
– We know it will scale; the question is how far.

• Working on one Proof of Concept for Social Network Mining
– Recruiting a second test case

• Will report back in 3-4 months with data on
– Performance
– Cost
– Limits of scalability


Adapting the algorithm to M/R
• Clustering Algorithm

1. Create k randomly placed centroids

2. Find the centroid (1-k) closest to each data point
– Map: each data point

3. Move each centroid to the average of its members
– Reduce: each centroid

4. Repeat 2 and 3 until there is no more change
– Connect to next stage using Pipeline API
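
The four steps can be sketched in plain Python with the map and reduce roles made explicit. This is a single-process illustration, not the GAE implementation. The function names, the squared Euclidean distance, the max_iter cap, and the choice to seed centroids from k randomly chosen data points (a common variation on step 1) are all assumptions.

```python
import random
from collections import defaultdict


def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(point, centroids):
    """Step 2 (map, per data point): emit (nearest centroid index, point)."""
    yield closest(point, centroids), point


def kmeans_reduce(key, points):
    """Step 3 (reduce, per centroid): average the member points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1 (assumed variation)
    for _ in range(max_iter):  # Step 4: repeat until nothing changes
        groups = defaultdict(list)  # stands in for the shuffle phase
        for point in points:
            for key, value in kmeans_map(point, centroids):
                groups[key].append(value)
        # A centroid with no members keeps its previous position.
        new_centroids = [
            kmeans_reduce(i, groups[i]) if groups[i] else centroids[i]
            for i in range(k)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

In a pipelined version, each pass through the loop body would be one MapReduce job, with the loop itself expressed as successive pipeline stages.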


About Us
• Involved with Map/Reduce and NoSQL technologies on several
platforms
– Google App Engine, MongoDB

• DataThinks.org is a new service of Early Stage IT
– Building and operating “Big Data” analytics services

Thanks
