R and Hadoop

    Ram Venkat
   Dawn Analytics
What is Hadoop?
•   Hadoop is open source Apache software for
    running distributed applications on 'big data'
•   It contains a distributed file system (HDFS) and a
    parallel 'batch' processing framework
•   Hadoop is written in Java and runs on Unix/Linux for
    development and production
•   Windows and Mac can be used as development
    platforms
•   Yahoo runs Hadoop on more than 43,000 nodes, and
    Facebook has over 100 PB (1 PB = 1 million GB) of data in
    Hadoop clusters
Hadoop overview (1/2)
     Central idea: move the computation to the data and
    compute across nodes in parallel
•   Data Loading
Hadoop overview (2/2)
Parallel Computation: MapReduce
MapReduce: Example 'Hello world'
•   Mathematically, this is what MapReduce is about:
    – map(k1, v1) ➔ list(k2, v2)
    – reduce(k2, list(v2)) ➔ list(k3, v3)
•   Implementation of the 'hello world' (word count):
    Mapper: k1 -> file name, v1 -> text of the file
            k2 -> word, v2 -> “1”
    Reducer: sums up the '1's and produces a list of
    words and their counts
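The map and reduce signatures above can be tried on a single machine without Hadoop. The following is a minimal sketch in plain R: `run_mapreduce`, `map_wordcount` and `reduce_wordcount` are hypothetical names introduced here for illustration, and the grouping step done with `split()` only imitates the shuffle that Hadoop performs between the two phases.

```r
# Map step: turn one (file name, text) pair into a list of (word, 1) pairs.
map_wordcount <- function(k1, v1) {
  words <- strsplit(v1, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  lapply(words, function(w) list(key = w, value = 1))
}

# Reduce step: collapse all values collected for one key into a single count.
reduce_wordcount <- function(k2, values) {
  list(key = k2, value = sum(unlist(values)))
}

# Hypothetical single-machine driver: map, group by key ("shuffle"), reduce.
run_mapreduce <- function(inputs, map_fn, reduce_fn) {
  # Apply the mapper to every (name, text) pair and flatten one level.
  pairs <- unlist(
    Map(map_fn, names(inputs), inputs),
    recursive = FALSE, use.names = FALSE
  )
  keys <- vapply(pairs, function(p) p$key, character(1))
  values <- lapply(pairs, function(p) p$value)
  # Group the emitted values by key, then reduce each group.
  grouped <- split(values, keys)
  Map(reduce_fn, names(grouped), grouped)
}

docs <- c(file1 = "hello world", file2 = "hello hadoop")
result <- run_mapreduce(docs, map_wordcount, reduce_wordcount)
counts <- vapply(result, function(r) r$value, numeric(1))
print(counts)
```

On a real cluster the mapper calls run in parallel on the nodes that hold the data, and the framework handles the grouping; only the two user-supplied functions stay the same in spirit.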
Word Count (slide from Yahoo)
R libraries to work with Hadoop
•    'Hadoop Streaming' is an alternative to the Java
     MapReduce API
•    Hadoop Streaming lets you write jobs in any
     language that can read stdin and write stdout
•    R offers several libraries/ways to work with
     Hadoop:
      – Write your own mapper.R and reducer.R and run them from a shell
        script
      – 'rmr' and 'RHadoop' from Revolution Analytics
      – 'RHIPE' from Purdue University statistical computing
      – 'RHive', which interacts with Hadoop via Hive queries
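For the first option (your own mapper.R and reducer.R), Hadoop Streaming passes data as lines of text: the mapper reads raw lines on stdin and writes tab-separated key/value lines, and the reducer receives those lines sorted by key. A rough sketch of the two pieces of logic follows; the function names `map_lines` and `reduce_lines` are made up for this example, and the reducer here reads all its input at once, which is fine for a demo but not for large partitions.

```r
# Mapper logic: each input line becomes one "word<TAB>1" line per word.
map_lines <- function(lines) {
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  paste(words, 1, sep = "\t")
}

# Reducer logic: parse "key<TAB>value" lines and sum the values per key.
reduce_lines <- function(lines) {
  if (length(lines) == 0) return(character(0))
  parts <- strsplit(lines, "\t", fixed = TRUE)
  keys <- vapply(parts, `[`, character(1), 1)
  vals <- as.numeric(vapply(parts, `[`, character(1), 2))
  totals <- tapply(vals, keys, sum)
  paste(names(totals), totals, sep = "\t")
}

# In mapper.R the wiring would be roughly:
#   writeLines(map_lines(readLines(file("stdin"))))
# and in reducer.R:
#   writeLines(reduce_lines(readLines(file("stdin"))))

# Local dry run: sort() stands in for the shuffle between the two scripts.
mapped <- map_lines(c("hello world", "hello hadoop"))
reduced <- reduce_lines(sort(mapped))
print(reduced)
```

The shell script then hands both files to the Hadoop Streaming jar through its `-mapper` and `-reducer` options; the jar location depends on the installation.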
Word Count Demo with R (rmr)
  # Mapper: split the text on spaces and emit (word, 1) for each word
  mapper.wordcount = function(key, val) {
    lapply(
      strsplit(x = val, split = " ")[[1]],
      function(w) keyval(w, 1)
    )
  }

  # Reducer: sum the 1s collected for each word
  reducer.wordcount = function(key, val.list) {
    output.key = key
    output.val = sum(unlist(val.list))
    return( keyval(output.key, output.val) )
  }
More advanced example –
    Sentiment Analysis in R (rmr)
•   One area where Hadoop could help traders is
    sentiment analysis
•   O'Reilly Strata blog, 'Trading on sentiment':
    http://strata.oreilly.com/2011/05/sentiment-analysis-finance.html

•   Demo2 is modified from Jeffrey Breen's example
    of airline sentiment analysis:
    http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/

•   Jeffrey has been very active in R groups in the Chicago
    area. He gave another tutorial on R and Hadoop
    last month:
    http://www.slideshare.net/jeffreybreen/getting-started-with-r-hadoop
Demo2 Sentiment Analysis with rmr
 mapper.score = function(key, val) {

 # clean up tweets with R's regex-driven global substitute, gsub():
   val = gsub('[[:punct:]]', '', val)
   val = gsub('[[:cntrl:]]', '', val)
   val = gsub('\\d+', '', val)

 # Key is the Airline we added as tag to the tweets
   airline = substr(val,1,2)

 # Run the sentiment analysis
   output.key = c(as.character(airline), score.sentiment(val,pos.words,neg.words))

 # our interest is in computing the counts by airlines and scores, so 'this' count is 1
   output.val = 1
   return( keyval(output.key, output.val) )
 }
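The mapper above relies on `score.sentiment()`, `pos.words` and `neg.words`, none of which are defined on the slide. As a rough idea of what such a scorer does, here is a simplified stand-in that counts positive words minus negative words; the tiny word lists and the sample tweet are invented for illustration, and the function used in the actual demo may differ in its details.

```r
# Simplified stand-in for score.sentiment():
# score = (number of positive words) - (number of negative words).
score.sentiment <- function(text, pos.words, neg.words) {
  words <- unlist(strsplit(tolower(text), "[[:space:]]+"))
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

# Tiny illustrative word lists (hypothetical; real lexicons hold thousands of words).
pos.words <- c("good", "great", "love")
neg.words <- c("bad", "late", "lost")

# Sample tweet with a two-letter airline tag at the front, as the mapper expects.
tweet <- "AA great crew but late departure and lost luggage"
score <- score.sentiment(tweet, pos.words, neg.words)
print(score)
```

Because the mapper emits the pair (airline, score) as its key with a value of 1, a summing reducer like the one from the word count demo yields the number of tweets per airline at each score.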
Demo3 - Hive

•   Hive is a data warehousing infrastructure for Hadoop
•   It provides a familiar SQL-like interface to create tables
    and to insert and query data
•   Behind the scenes, it turns queries into MapReduce jobs
•   Hive is an alternative to the Hadoop Streaming approach
    covered earlier
•   Demo3 – stock query with Hive
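To make the link between a SQL-like query and MapReduce concrete, the sketch below mimics in plain R what a grouped aggregate roughly turns into. The `stocks` table, its columns and the query in the comment are hypothetical and are not the query used in Demo3; Hive's actual execution plan is more involved than this two-step picture.

```r
# Hypothetical stock table; in Hive this would be a table stored on HDFS.
stocks <- data.frame(
  symbol = c("AAA", "BBB", "AAA", "BBB"),
  close  = c(10, 20, 14, 22),
  stringsAsFactors = FALSE
)

# A HiveQL query of this shape:
#   SELECT symbol, AVG(close) FROM stocks GROUP BY symbol;
# corresponds roughly to:
#   map:    emit (symbol, close) for each row
#   reduce: average the close values collected for each symbol
emitted <- split(stocks$close, stocks$symbol)   # map + shuffle: group values by key
avg_close <- vapply(emitted, mean, numeric(1))  # reduce: one average per key
print(avg_close)
```

The appeal of Hive is that the user writes only the query and never the mapper or reducer.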
Use cases for Traders
•    Stock sentiment analysis
•    Stock trading pattern analysis
•    Default prediction
•    Fraud/anomaly detection
•    NextGen data warehousing
Hadoop support - Cloudera
•   Cloudera's distribution of Hadoop is one of the most
    popular distributions (I used their CDH3v5 in my 2
    demos above)
•   Doug Cutting, the creator of Hadoop, is the architect
    at Cloudera
•   Adam Muise, a Cloudera engineer in Toronto, is the
    organizer of the Toronto Hadoop User Group (TOHUG)
•   Upcoming meeting organized by TOHUG on 30
    October: “PIG-fest”
Hadoop Tutorials and Books
•   http://hadoop.apache.org/docs/r0.20.2/quickstart.html
•   Cloudera: http://university.cloudera.com/
•   Book: “Hadoop in Action” – Manning
•   Book: “Hadoop: The Definitive Guide” – O'Reilly
•   Hadoop Streaming:
    http://hadoop.apache.org/docs/mapreduce/r0.21.0/streaming.html
•   Google Code University:
    http://code.google.com/edu/parallel/mapreduce-tutorial.html
•   Yahoo's Tutorial:
    http://developer.yahoo.com/hadoop/tutorial/module1.html
Thank You

For any clarification, send e-mail to
ram@dawnanalytics.com

