“25th CSI Karnataka Student Convention”


      Map/Reduce Algorithm
Performance Analysis in Computing
          Frequency of
             Tweets

                         Shravanthi U M & Nagashree N
                  Information Science and Engineering
           Bangalore Institute of Technology, Bangalore
AGENDA

   Data
   Big Data
   Twitter and Big Data
   Classical Approach
   Why Hadoop Framework
   Map/Reduce
   Our Proposed Approach
   Conclusion
   Q&A
It's All About Data
Big Data

Data sets whose size is beyond the ability of commonly
used software tools to capture, manage, and process
within a tolerable elapsed time.
Big data sizes are a constantly moving target, currently
ranging from a few dozen terabytes to many petabytes
of data in a single data set.
Ex: web logs, social network data, Internet search
indexes, etc.
“BigData”
Classical Approach
egrep _____ files[0-1000]
[Figure: several egrep processes each read file0 to file1000 from a remote file system]
Hadoop Framework
   Fault tolerance
   Streaming data access - HDFS
    emphasizes high throughput.
   Extreme scalability - HDFS will
    scale to petabytes; Example: at
    Facebook.
   Portability - HDFS is portable
    across operating systems.
   Write once read many times
   Locality of computation - move
    the program near the data
HDFS
egrep _____ files[0-1000]
Move Computation to Data
[Figure: egrep processes run on cluster nodes (40 nodes/rack) that hold blocks f0, f2, f3, ..., f1000 of file0 to file1000]
Map/Reduce
   Map()
     Input: any file (e.g. documents)
     Output: stream of <key, value> pairs
             (e.g. <word, count> pairs)
   Reduce()
     Input: all <key, value> pairs with the same key grouped
            (e.g. all <word, count> pairs where word = “the”)
     Output: anything
             (e.g. sum of counts for a specific word)
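
The word-count example on this slide can be sketched in a few lines of plain Python. This is only an illustration of the Map(), grouping, and Reduce() steps on one machine, not Hadoop code; the function names and the two sample documents are assumptions made for the sketch.

```python
from collections import defaultdict

def map_words(document):
    """Map(): emit a <word, 1> pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all values that share a key (done by the framework between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(word, counts):
    """Reduce(): sum the counts collected for one word."""
    return word, sum(counts)

# Hypothetical input documents, for illustration only.
documents = ["the cat sat on the mat", "the dog"]
pairs = [pair for doc in documents for pair in map_words(doc)]
result = dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())
print(result["the"])  # 3
```

Each Reduce() call sees only the values for a single key, which is why the calls can run on different machines.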
Advantages:
   Fine-grained Map and Reduce tasks
    ◦ Improved load balancing
    ◦ Faster recovery from failed tasks
   Automatic re-execution on failure
    ◦ In a large cluster, some nodes are always slow or flaky
    ◦ Framework re-executes failed tasks
   Locality optimizations
    ◦ Map-Reduce queries HDFS for locations of input data
    ◦ When possible, map tasks are scheduled close to the
      inputs (local access, local rack access, remote rack
      access)
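
The locality preference in the last bullet (local, then same rack, then remote rack) can be pictured with a toy ranking function. This is a simplified sketch, not Hadoop's scheduler; the node and rack names and the helper functions are hypothetical.

```python
# Lower rank means closer to the data.
NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2

def locality(candidate, replicas):
    """Rank a candidate (node, rack) against the (node, rack) replicas of a block."""
    node, rack = candidate
    if any(node == r_node for r_node, _ in replicas):
        return NODE_LOCAL
    if any(rack == r_rack for _, r_rack in replicas):
        return RACK_LOCAL
    return OFF_RACK

def pick_node(free_nodes, replicas):
    """Choose the free node that is closest to any replica of the input block."""
    return min(free_nodes, key=lambda c: locality(c, replicas))

# Hypothetical cluster state: the block has replicas on n1 and n7.
replicas = [("n1", "rackA"), ("n7", "rackB")]
free = [("n9", "rackC"), ("n2", "rackA"), ("n7", "rackB")]
chosen = pick_node(free, replicas)            # n7 holds a replica
fallback = pick_node(free[:2], replicas)      # n7 busy: n2 shares a rack with n1
```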
What did we do…
Python code to extract tweets using
“twitter.Search” API
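import re
import urllib.request

query = "AnnaHazare"  # search term, passed as a string
tweets = []
f = open("tweets.txt", "w")  # file name assumed; the slide does not show how f is opened
for page in range(10):
    # Fetch one page of up to 100 English-language results as an Atom feed
    turl = urllib.request.urlopen("http://search.twitter.com/search.atom?lang=en&q="
                                  + query + "&rpp=100&page=" + str(page))
    # Each <updated> element carries a timestamp
    tweettext = re.findall('<updated>(.*?)</updated>', turl.read().decode("utf-8"))
    print("Got the Page No. ", page + 1)
    for stamp in tweettext:
        tweets.append(stamp)
        f.write(stamp + "\n")
f.close()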
Extracted DATA
Map/Reduce Implementation

   Map() output: one <date, 1> pair per extracted timestamp
      <6/4/11, 1>   <6/4/11, 1>   <6/4/11, 1>   ...
      <6/6/11, 1>   <6/6/11, 1>   <6/6/11, 1>   ...
      <15/8/11, 1>  <15/8/11, 1>  <15/8/11, 1>  ...
   Reduce(): one call per date sums the 1s for that date
   Final result file (Server 1):
      6/4/11     85
      6/6/11     36
      15/8/11   125
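
A minimal single-machine sketch of this date-frequency job is given below. It assumes the extracted timestamps look like `2011-04-06T09:15:00Z` and that the slide's keys are day/month/year; the function names and the three sample timestamps are made up for illustration and are not the extracted data.

```python
from collections import defaultdict
from datetime import datetime

def map_tweet(timestamp):
    """Map(): turn one timestamp into a <date, 1> pair (key format d/m/yy)."""
    day = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")  # assumed format
    return f"{day.day}/{day.month}/{day.year % 100}", 1

def reduce_date(date, ones):
    """Reduce(): add up the 1s collected for one date."""
    return date, sum(ones)

def tweets_per_date(timestamps):
    """Run Map(), group by date, then Reduce() each group."""
    groups = defaultdict(list)
    for ts in timestamps:
        date, one = map_tweet(ts)
        groups[date].append(one)
    return dict(reduce_date(d, v) for d, v in groups.items())

# Hypothetical timestamps, for illustration only.
sample = ["2011-04-06T09:15:00Z", "2011-04-06T18:40:12Z", "2011-08-15T07:01:59Z"]
counts = tweets_per_date(sample)
print(counts)  # {'6/4/11': 2, '15/8/11': 1}
```

On a cluster, the grouping step is done by the framework and each date's Reduce() call can run on a different node.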
What’s UNIQUE…

   Business analytics - a practical approach to spotting
    the popularity of a “New Product”
   Sentiment analysis