CSI conference PPT on Performance Analysis of Map/Reduce to compute the frequency of tweets

“25th CSI Karnataka Student Convention”

Map/Reduce Algorithm: Performance Analysis in Computing the Frequency of Tweets

Shravanthi U M & Nagashree N
Information Science and Engineering
Bangalore Institute of Technology, Bangalore

AGENDA

• Data
• Big Data
• Twitter and Big Data
• Classical Approach
• Why Hadoop Framework
• Map/Reduce
• Our Proposed Approach
• Conclusion
• Q&A

Big Data

Big data refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set.
Examples: web logs, social network data, Internet search indexes.

Classical Approach
egrep _____ files[0-1000]

[Diagram: separate egrep processes scan file0 through file1000, each reading its files from a remote file system.]

Hadoop Framework
• Fault tolerance
• Streaming data access: HDFS emphasizes high throughput.
• Extreme scalability: HDFS scales to petabytes (for example, at Facebook).
• Portability: HDFS is portable across operating systems.
• Write once, read many times
• Locality of computation: move the program near the data

HDFS
egrep _____ files[0-1000]
Move Computation to Data

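[Diagram: file blocks (such as f0, f2, f3, and f1000) are spread across the cluster nodes, 40 nodes/rack. egrep runs on the nodes that already hold the blocks of file0 through file1000, so no data is pulled from a remote file system.]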

Map/Reduce
Map()
Input: any file (e.g. documents)
Output: stream of <key, value> pairs (e.g. <word, count> pairs)

Reduce()
Input: all <key, value> pairs with the same key, grouped (e.g. all <word, count> pairs where word = “the”)
Output: anything (e.g. sum of counts for a specific word)
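The word-count example on this slide can be sketched in plain Python. This is a minimal single-machine illustration, not Hadoop code: the function names (map_words, group_by_key, reduce_counts, word_count) are hypothetical, and the grouping step stands in for the shuffle that the framework performs between Map() and Reduce().

```python
from collections import defaultdict


def map_words(document):
    """Map(): emit a <word, 1> pair for every word in the input text."""
    for word in document.lower().split():
        yield word, 1


def group_by_key(pairs):
    """Shuffle stand-in: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce(): sum the counts for one specific word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, group, and reduce over a list of documents."""
    pairs = (pair for doc in documents for pair in map_words(doc))
    grouped = group_by_key(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "The end"]
    print(word_count(docs)["the"])  # 3
```

Because each Map() call only sees its own document and each Reduce() call only sees one key, both phases can be split across many machines without changing the result.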

Advantages:
• Fine-grained Map and Reduce tasks
◦ Improved load balancing
◦ Faster recovery from failed tasks
• Automatic re-execution on failure
◦ In a large cluster, some nodes are always slow or flaky
◦ Framework re-executes failed tasks
• Locality optimizations
◦ Map-Reduce queries HDFS for locations of input data
◦ When possible, map tasks are scheduled close to the inputs (local access, local rack access, remote rack access)
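The locality preference (local access, then local rack, then remote rack) can be illustrated with a small sketch. This is a simplified model and not Hadoop's actual scheduler: the functions locality_level and pick_node, the node names, and the rack_of mapping are all hypothetical.

```python
def locality_level(node, replica_nodes, rack_of):
    """Return 0 for node-local, 1 for rack-local, 2 for remote-rack access."""
    if node in replica_nodes:
        return 0
    replica_racks = {rack_of[replica] for replica in replica_nodes}
    if rack_of[node] in replica_racks:
        return 1
    return 2


def pick_node(free_nodes, replica_nodes, rack_of):
    """Choose the free node that is closest to a replica of the input block."""
    return min(free_nodes, key=lambda n: locality_level(n, replica_nodes, rack_of))


if __name__ == "__main__":
    # Hypothetical cluster: two racks with two nodes each.
    rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
    replicas = {"n1"}  # the input block is stored on n1
    print(pick_node(["n3", "n1"], replicas, rack_of))  # n1 (local access)
    print(pick_node(["n3", "n2"], replicas, rack_of))  # n2 (local rack access)
    print(pick_node(["n3", "n4"], replicas, rack_of))  # n3 (remote rack access)
```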

What did we do…
Python code to extract tweets using
“twitter.Search” API
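import re
import urllib.request

query = "AnnaHazare"  # search term
tweets = []

# "tweets.txt" is an assumed output file name; the slide does not show how f was opened.
with open("tweets.txt", "w") as f:
    for page in range(10):
        url = ("http://search.twitter.com/search.atom?lang=en&q="
               + query + "&rpp=100&page=" + str(page))
        feed = urllib.request.urlopen(url).read().decode("utf-8")
        # Each <updated> element holds the timestamp of one entry in the feed.
        timestamps = re.findall(r"<updated>(.*?)</updated>", feed)
        print("Got the Page No.", page + 1)
        for stamp in timestamps:
            tweets.append(stamp)
            f.write(stamp + "\n")  # one timestamp per line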

Map/Reduce Implementation

[Diagram: the map phase emits one <date, 1> pair per tweet, e.g. <6/4/11, 1>, <6/6/11, 1>, <15/8/11, 1>. Pairs with the same date are grouped and passed to a Reduce() task, which sums them. Diagram labels: Server 1, Final Result File.]

Final Result File
6/4/11    85
6/6/11    36
15/8/11   125
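The <date, 1> flow shown above can be sketched in Python as follows. This is a single-machine illustration under stated assumptions, not the authors' Hadoop job: the function names are hypothetical, the input is assumed to be Atom-style timestamps of the form 2011-04-06T10:00:00Z, and the keys are assumed to be day/month/year, as the key 15/8/11 suggests.

```python
from datetime import datetime
from itertools import groupby


def date_key(timestamp):
    """Turn an Atom-style timestamp into a day/month/year key such as 6/4/11."""
    # Assumed input format; adjust the pattern if the saved timestamps differ.
    moment = datetime.strptime(timestamp.strip(), "%Y-%m-%dT%H:%M:%SZ")
    return f"{moment.day}/{moment.month}/{moment.year % 100:02d}"


def map_tweets(timestamps):
    """Map(): emit a <date, 1> pair for every tweet timestamp."""
    for stamp in timestamps:
        yield date_key(stamp), 1


def reduce_tweets(pairs):
    """Reduce(): sum the 1s for each date; sorting stands in for the shuffle."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    for date, group in groupby(ordered, key=lambda pair: pair[0]):
        yield date, sum(count for _, count in group)


def tweet_frequency(timestamps):
    """Count tweets per date from a list of timestamps."""
    return dict(reduce_tweets(map_tweets(timestamps)))


if __name__ == "__main__":
    stamps = [
        "2011-04-06T10:00:00Z",
        "2011-04-06T11:30:00Z",
        "2011-08-15T09:00:00Z",
    ]
    for date, count in tweet_frequency(stamps).items():
        print(date, count)
```

On a real cluster the map and reduce functions would run as separate tasks, with the framework handling the grouping by date between them.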

What’s UNIQUE…

• Business analytics: a practical approach to spotting the popularity of a “New Product”
• Sentiment analysis
