Distributed Data Analysis with Hadoop and R
Jonathan Seidman and Ramesh Venkataramaiah, Ph. D.
                 StrangeLoop2011
               September 20 | 2011
Flow of this Talk

•  Introductions

•  Hadoop, R and Interfacing the two


•  Our Prototypes

•  A use case for interfacing Hadoop and R

•  Alternatives for Running R on Hadoop

•  Alternatives to Hadoop and R

•  Conclusions

•  References
Who We Are
•  Ramesh Venkataramaiah, Ph. D.
   –  Principal Engineer, TechOps
   –  rvenkataramaiah@orbitz.com
   –  @rvenkatar

•  Jonathan Seidman
   –  Lead Engineer, Business Intelligence/Big Data Team
   –  Co-founder/organizer of Chicago Hadoop User Group
      (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG) and
      Chicago Big Data (http://www.meetup.com/Chicago-Big-Data/)
   –  jseidman@orbitz.com
   –  @jseidman


•  Orbitz Careers
   –  http://careers.orbitz.com/
   –  @OrbitzTalent
Launched in 2001




                   Over 160 million
                   bookings
                   7th Largest seller of
                   travel in the world
Hadoop and R
as an analytic platform?
What is Hadoop?


Distributed file system (HDFS) and parallel processing
framework.


       Uses the MapReduce programming model at its core.

                Provides fault-tolerant, scalable storage
                 of very large datasets across machines in a cluster.
What is R? When do we need it?


Open-source stat package with visualization
        Vibrant community support.
                 One-line calculations galore!
                          Steep learning curve but worth it!




Insight into statistical properties and trends…
                        or for machine learning purposes…
                                  or to understand Big Data well.



Our Options

•  Data volume reduction by sampling
   –  Very bad for long-tail data distributions
   –  Approximations lead to bad conclusions
•  Scaling R
   –  Still in-memory
   –  But make it parallel using segue, Rhipe, R-Hive…
•  Use SQL-like interfaces
   –  Apache Hive with Hadoop
   –  File sprawl and process issues
•  Regular DBMS
   –  Like fitting a square peg into a round hole
   –  No in-line R calls from SQL, but commercial efforts are underway.


•  This Talk: How to bring Hadoop’s parallel processing
   capability to the R environment.
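
The sampling concern in the first bullet can be illustrated with a small sketch in base R. The data here is simulated (a log-normal draw standing in for a long-tailed metric), and the names `full`, `smp`, and `summary_stats` are hypothetical, not from the talk. It only suggests why a sample can preserve the median while losing most of the tail.

```r
# Hypothetical illustration: simulated long-tailed data, not Orbitz data.
set.seed(42)
full <- rlnorm(1e6, meanlog = 0, sdlog = 2)   # heavy right tail

# A 1% simple random sample, as a data-volume reduction step might take.
idx <- sample(length(full), size = 1e4)
smp <- full[idx]

# Compare a central statistic with two tail statistics.
summary_stats <- function(x) {
  c(median = median(x), p999 = unname(quantile(x, 0.999)), max = max(x))
}
print(rbind(full = summary_stats(full), sample = summary_stats(smp)))

# Share of the 100 largest values in the full data that the sample retained.
top100 <- order(full, decreasing = TRUE)[1:100]
retained <- mean(top100 %in% idx)
cat("Share of top-100 values retained by the sample:", retained, "\n")
```

The median is usually close between the two rows, while the extreme values mostly fall outside the sample, which is the risk for long-tail analyses.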
Our prototypes
  User segmentations
  Hotel bookings
  Airline performance*

  * Public dataset





We have two distinct dataspaces serving different
constituents




      Semi-structured data      Transactional data
        (e.g. searches)           (e.g. bookings)



      Hadoop Cluster           Data Warehouse

Our Hadoop infrastructure allows us to record and process
user activity at the individual level




         Detailed Non-          Transactional Data
      Transactional Data          (e.g. bookings)
       (What Each User
       Sees and Clicks)

                               Data Warehouse



          Hadoop

Getting Buy-in
Presented a long-term, semi-structured data growth story and
explained how this will help harness long-tail opportunities at
the lowest cost.
         - Traditional DW                - Big Data
         - Classical stats               - Specific spikes
         - Sampling                      - Median is not the message

                                         - Create a universal key
                                         - Always keep source data
                                         - Operationalize the
                                           infrastructure

                                                                    * From a blog
An example of “median is not the message”

•  Positional Bias during Hotel Searches
Our Customers pick top positions the most…
Safari Users Seem to be Interested in More Expensive Hotels




Seasonal variations
•  Customer hotel stays get longer during the summer months.
•  Could help in designing search based on season.




Workload and Resource Partition


Airline Performance




Description of Use Case


•  Analyze an openly available dataset: airline on-time performance.
•  The dataset was used in the Visualization Poster Competition 2009.
   –  Consists of flight arrival/departure details from 1987 to 2008.
   –  Approximately 120 million records totaling 12 GB.
•  Available at: http://stat-computing.org/dataexpo/2009/




Our dataset




Airline Delay Plot: R code



> deptdelays.monthly.full <- read.delim("~/OSCON2011/Delays_by_Month.dat", header=F)
> View(deptdelays.monthly.full)
> names(deptdelays.monthly.full) <- c("Year","Month","Count","Airline","Delay")

> # Sort by delay, largest first
> Delay_by_Month <- deptdelays.monthly.full[order(deptdelays.monthly.full$Delay,
+                                                  decreasing=TRUE),]

> Top_10_Delay_by_Month <- Delay_by_Month[1:10,]
> attach(Top_10_Delay_by_Month)   # assumed step: exposes Month, Delay, Airline by name
> Top_10_Normal <- ((Delay - mean(Delay)) / sd(Delay))

> symbols( Month, Delay, circles= Top_10_Normal, inches=.3, fg="white",bg="red",…)
> text(Month, Delay, Airline, cex= 0.5)




Airline delay




Multiple Distributions: R code
>   library(lattice)
>   deptdelays.monthly.full$Year <- as.character(deptdelays.monthly.full$Year)
>   h <- histogram(~Delay|Year,data=deptdelays.monthly.full,layout=c(5,5))
>   update(h)




Running R on Hadoop:
 Hadoop Streaming



Hadoop Streaming – Overview


•  An alternative to the Java MapReduce API which allows you to
   write jobs in any language supporting stdin/stdout.
•  Limited to text data in current versions of Hadoop. Support for
   binary streams added in 0.21.0.
•  Requires installation of R on all DataNodes.




Hadoop Streaming – Dataflow



Input to map*:

   1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...
   1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…
   1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...
   1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...
   1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…
   1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...

Output from map (key, value):

   PI|1988|1     17
   PI|1988|1     0
   PS|1987|10    11
   PS|1987|10    -2
   PS|1987|10    1
   DL|1987|10    14

* Map function receives input records line-by-line via standard input.
Hadoop Streaming – Dataflow Continued



Input to reduce*:

   DL|1987|10    14
   PI|1988|1     0
   PI|1988|1     17
   PS|1987|10    1
   PS|1987|10    11
   PS|1987|10    -2

Output from reduce:

   1987    10    1    DL    14
   1988    1     2    PI    8.5
   1987    10    3    PS    3.333333

* Reduce receives map output key/value pairs sorted by key, line-by-line.
Hadoop Streaming Example – map.R
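
The code from this slide did not survive in the transcript. The following is a minimal sketch of what a streaming mapper for the dataflow shown earlier might look like, not the authors' actual map.R. It assumes the Data Expo column order (carrier in field 9, departure delay in field 16), which matches the sample records; the function names `map_line` and `run_mapper` are hypothetical.

```r
#!/usr/bin/env Rscript
# map.R (sketch): emit "CARRIER|YEAR|MONTH<TAB>DEPDELAY" for each input record.

# Turn one CSV record into a key/value line, or NULL if it should be skipped.
map_line <- function(line) {
  fields <- unlist(strsplit(line, ",", fixed = TRUE))
  if (length(fields) < 16) return(NULL)        # malformed record
  year <- fields[1]
  month <- fields[2]
  carrier <- fields[9]                         # assumed column position
  dep_delay <- fields[16]                      # assumed column position
  # Skip the CSV header row and records with a missing departure delay.
  if (year == "Year" || dep_delay == "NA") return(NULL)
  paste0(carrier, "|", year, "|", month, "\t", dep_delay)
}

# Read records from a connection one line at a time and write map output.
run_mapper <- function(con, out = stdout()) {
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    kv <- map_line(line)
    if (!is.null(kv)) writeLines(kv, out)
  }
}

# As a streaming mapper, the script would end with something like:
#   con <- file("stdin", open = "r"); run_mapper(con); close(con)
```

Keeping the per-record logic in its own function makes it easy to check against a few sample lines before running on the cluster.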




Hadoop Streaming Example – reduce.R
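
The code from this slide is also missing from the transcript. As a sketch under the same assumptions as the dataflow slides (key-sorted `key<TAB>value` lines in; year, month, count, airline, mean delay out), a streaming reducer might look like this. It is not the authors' reduce.R, and `format_result` and `run_reducer` are hypothetical names.

```r
#!/usr/bin/env Rscript
# reduce.R (sketch): average the delays for each "CARRIER|YEAR|MONTH" key.

# Format one output record: year, month, count, carrier, mean delay.
format_result <- function(key, total, count) {
  parts <- unlist(strsplit(key, "|", fixed = TRUE))
  paste(parts[2], parts[3], count, parts[1], format(total / count), sep = "\t")
}

# Consume key-sorted "key<TAB>value" lines; emit a record when the key changes.
run_reducer <- function(con, out = stdout()) {
  current <- NULL
  total <- 0
  count <- 0
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    kv <- unlist(strsplit(line, "\t", fixed = TRUE))
    if (length(kv) < 2) next                   # skip malformed lines
    key <- kv[1]
    delay <- as.numeric(kv[2])
    if (!is.null(current) && key != current) {
      writeLines(format_result(current, total, count), out)
      total <- 0
      count <- 0
    }
    current <- key
    total <- total + delay
    count <- count + 1
  }
  # Flush the last key once the input is exhausted.
  if (!is.null(current)) writeLines(format_result(current, total, count), out)
}

# As a streaming reducer, the script would end with something like:
#   con <- file("stdin", open = "r"); run_reducer(con); close(con)
```

Because streaming delivers one line at a time, the reducer has to detect key boundaries itself, which is why the running total and count are carried across iterations.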




Running R on Hadoop:
 Hadoop Interactive



Hadoop Interactive (hive) – Overview


•  Very unfortunate acronym.
•  Provides an interface to Hadoop from the R environment.
   –  Functions to access HDFS
   –  Control Hadoop
   –  And run streaming jobs directly from R
•  Allows HDFS data, including the output from MapReduce
   processing, to be manipulated and analyzed directly from R.
•  Seems to still have some rough edges.




Hadoop Interactive – Example




Running R on Hadoop:
       RHIPE



RHIPE – Overview

•  Active project with frequent updates and active community.
•  RHIPE is based on Hadoop streaming source, but provides
   some significant enhancements, such as support for binary
   files (sort of).
•  Developed to provide R users with access to same Hadoop
   functionality available to Java developers.
   –  For example, provides rhcounter() and rhstatus(),
      analogous to counters and the reporter interface in the Java
      API.




RHIPE – Overview

•  Can be somewhat confusing and intimidating.
  –  Then again, the same can be said for the Java API.
  –  Worth taking the time to get comfortable with.




RHIPE – Overview


•  Allows developers to work directly on data stored in HDFS in
   the R environment.
•  Also allows developers to write MapReduce jobs in R and
   execute them on the Hadoop cluster.
•  RHIPE uses Google protocol buffers to serialize data. Most R
   data types are supported.
   –  Using protocol buffers increases efficiency and provides
      interoperability with other languages.
•  Must be installed on all DataNodes.




RHIPE – MapReduce

map <- expression({})
reduce <- expression(
       pre = {…},
       reduce = {…},
       post = {…}
 )
z <- rhmr(map=map,reduce=reduce,
             inout=c("text","sequence"),
             ifolder=INPUT_PATH,
             ofolder=OUTPUT_PATH,
             …)
rhex(z)




RHIPE – Dataflow



Input to map*:

   Keys = […]
   Values =
    [1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...
     1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…
     1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...
     1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...
     1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…
     1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...]

Output from map (key, value):

   PI|1988|1     17
   PI|1988|1     0
   PS|1987|10    11
   PS|1987|10    -2
   PS|1987|10    1
   DL|1987|10    14

* Note that Input to map is a vector of keys and a vector of values.
RHIPE – Dataflow Continued



Input to reduce*:

   DL|1987|10    [14]
   PI|1988|1     [0, 17]
   PS|1987|10    [1,11,-2]

Output from reduce:

   1987    10    1    DL    14
   1988    1     2    PI    8.5
   1987    10    3    PS    3.333333

* Input to reduce is a key and a vector containing a subset of intermediate
  values associated with that key. The reduce will get called until no more
  values exist for the key.
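
Because the reduce step may see only a subset of a key's values per call, any aggregate has to accumulate state across calls. The following base R sketch simulates that lifecycle locally. It does not call RHIPE; `simulate_reduce` and `chunk_size` are hypothetical, and the three commented phases only mirror the pre/reduce/post slots of the skeleton on the "RHIPE – MapReduce" slide.

```r
# Local simulation only: this does not use RHIPE.
simulate_reduce <- function(key, values, chunk_size = 2) {
  # pre: runs once per key, before any values are seen.
  total <- 0
  count <- 0

  # reduce: runs once per chunk of values, so state must accumulate.
  chunks <- split(values, ceiling(seq_along(values) / chunk_size))
  for (chunk in chunks) {
    total <- total + sum(chunk)
    count <- count + length(chunk)
  }

  # post: runs once per key, after the last chunk has been processed.
  list(key = key, count = count, mean = total / count)
}

# Values for PS|1987|10 from the dataflow slide, delivered in two chunks.
res <- simulate_reduce("PS|1987|10", c(1, 11, -2))
str(res)
```

Computing the mean inside the per-chunk step instead of in the final step would give a wrong answer whenever a key's values arrive in more than one chunk.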

RHIPE – Example




RHIPE – Example




RHIPE – Example




Running R on Hadoop:
        rmr



rmr Overview


•  New project from Revolution Analytics introduced August 2011.
•  Part of RHadoop, a suite of open-source projects which also
   includes:
   –  rhdfs – functions to access and manage HDFS from within
      R.
   –  rhbase – functions providing basic connectivity to HBase.
•  Goals are to provide productive environment for MapReduce
   programming in an R-like way - “…stay true to map reduce and
   true to R …”
•  Reduce gets all intermediate values for each key (yay!).
•  Like RHIPE, based on streaming source.




rmr – Example




Running R on Hadoop:
       Segue



Segue – Overview


•  Intended to work around single-threading in R by taking
   advantage of Hadoop streaming to provide simple parallel
   processing.
   –  For example, running multiple simulations in parallel.
•  Suitable for embarrassingly (pleasantly) parallel problems – big
   CPU, not big data.
•  Runs on Amazon’s Elastic MapReduce (EMR).
   –  Not intended for internal clusters.
•  Provides emrlapply(), a parallel version of lapply().
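
To make the "big CPU, not big data" shape concrete, here is a small serial sketch using base R's lapply(). Each list element is an independent simulation, which is the kind of call a parallel lapply could distribute. The Segue call itself is not shown because its arguments are not given in these slides; `estimate_pi` and `seeds` are hypothetical names.

```r
# One independent simulation: estimate pi from n random points in the unit square.
estimate_pi <- function(seed, n = 100000) {
  set.seed(seed)
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

# Serial version: every element is processed independently of the others,
# so the work could be farmed out without any shared state.
seeds <- 1:8
estimates <- unlist(lapply(seeds, estimate_pi))
pi_hat <- mean(estimates)
cat("Estimate of pi from", length(seeds), "simulations:", pi_hat, "\n")
```

The input here is eight integers and the output eight numbers, so nearly all of the cost is computation.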




Segue – Example




Performance Comparison:
 Streaming and RHIPE



Performance Testing – Environment and Setup


•  Twenty-eight DataNodes:
   –  Dual hex-core
   –  24GB RAM
   –  Shared cluster.


•  Data
   –  Airline dataset
   –  22 input files
   –  About 12GB uncompressed data




Performance Comparison


Number of Reducers   Streaming      RHIPE
264                  246 seconds*   96 seconds*




*All numbers are an average of 3 runs.

Alternatives


Alternate languages/libraries:


•  Apache Mahout
   –  Scalable machine learning library.
   –  Offers clustering, classification, collaborative filtering on
      Hadoop.
•  Python
   –  Many modules available to support scientific and statistical
      computing.




Alternatives


Alternative parallel processing frameworks:


•  Revolution Analytics
   –  Provides commercial packages to support processing big
      data with R.
•  Other HPC/parallel processing packages for R, e.g. Rmpi or
   snow.




Alternatives


Apache Hive + RJDBC?


•  We haven't been able to get it to work yet.
•  You can, however, wrap calls to the Hive client in R to return R
   objects. See https://github.com/satpreetsingh/rDBwrappers/wiki
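
A minimal sketch of the wrapping idea, not the code from the linked wiki: shell out to the `hive` command-line client and parse its tab-separated output into a data frame. It assumes `hive` is on the PATH and that NULLs are printed as the string NULL; `parse_hive_output` and `hive_query` are hypothetical names.

```r
# Parse tab-separated Hive CLI output lines into a data frame.
parse_hive_output <- function(lines, col_names) {
  if (length(lines) == 0) return(data.frame())
  read.delim(text = paste(lines, collapse = "\n"), header = FALSE,
             col.names = col_names, stringsAsFactors = FALSE,
             na.strings = "NULL")   # assumption: NULLs printed as "NULL"
}

# Run a query through the Hive client and return the result as an R object.
# -e passes the query string; -S asks the client for silent mode.
hive_query <- function(sql, col_names) {
  lines <- system2("hive", args = c("-S", "-e", shQuote(sql)), stdout = TRUE)
  parse_hive_output(lines, col_names)
}

# Example call (needs a working Hive installation; table name is made up):
#   delays <- hive_query("SELECT airline, AVG(delay) FROM delays GROUP BY airline",
#                        c("airline", "mean_delay"))
```

Separating the parsing from the system call lets the parsing be checked without a cluster.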




Alternatives


Interesting solutions that you can't have:


•  Ricardo
   –  Developed as part of a research project at IBM.
   –  Interesting paper published, but apparently no plans to
      make available.




Conclusions


•  If practical, consider using Hadoop to aggregate data for input
   to R analyses.
•  Avoid using R for general purpose MapReduce use.




Conclusions


•  For simple MapReduce jobs, or embarrassingly parallel jobs
   on a local cluster, consider Hadoop streaming.
  –  Limited to processing text only.
  –  But easy to test at the command line outside of Hadoop:
     •  $ cat DATAFILE | ./map.R | sort | ./reduce.R
•  To run compute-bound analyses with a relatively small amount of
   data in the cloud, look at Segue.




Conclusions


•  Otherwise, your best bet is RHIPE, but definitely check out rmr.
•  Also consider alternatives – Mahout, Python, etc.




Conclusions


On an operational note:


•  Make sure your cluster nodes are consistent – same version of
   R installed, required libraries are installed on each node, etc.
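
One way to check this is to run a small script on each node and compare the output. The sketch below reports the R version and any packages from a required list that are not installed. `node_report` and the package list are hypothetical; substitute whatever your jobs actually load.

```r
# Report the R version and any required packages missing on this node.
node_report <- function(required) {
  installed <- rownames(installed.packages())
  list(
    r_version = R.version.string,
    missing = setdiff(required, installed)
  )
}

# Hypothetical package list for illustration.
report <- node_report(c("stats", "lattice"))
cat(report$r_version, "\n")
if (length(report$missing) > 0) {
  cat("Missing packages:", paste(report$missing, collapse = ", "), "\n")
}
```

Differences in either line of output between nodes point to the inconsistency described above.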




Example Code


•  https://github.com/jseidman/hadoop-R




References


•  Hadoop
   –  Apache Hadoop project: http://hadoop.apache.org/
   –  Hadoop: The Definitive Guide, Tom White, O'Reilly Press,
      2011
•  R
   –  R Project for Statistical Computing: http://www.r-project.org/
   –  R Cookbook, Paul Teetor, O'Reilly Press, 2011
   –  Getting Started With R: Some Resources:
      http://quanttrader.info/public/gettingStartedWithR.html




References


•  Hadoop Streaming
  –  Documentation on Apache Hadoop Wiki:
     http://hadoop.apache.org/mapreduce/docs/current/streaming.html
  –  Word count example in R:
     https://forums.aws.amazon.com/thread.jspa?messageID=129163




References


•  Hadoop InteractiVE
  –  Project page on CRAN:
     http://cran.r-project.org/web/packages/hive/index.html
  –  Simple Parallel Computing in R Using Hadoop:
     http://www.rmetrics.org/Meielisalp2009/Presentations/Theussl1.pdf




References


•  RHIPE
  –  RHIPE - R and Hadoop Integrated Processing Environment:
     http://www.stat.purdue.edu/~sguha/rhipe/
  –  Code: http://code.google.com/p/rhipe/
  –  Wiki: http://code.google.com/p/rhipe/w/list
  –  Installing RHIPE on CentOS:
     https://groups.google.com/forum/#!topic/brumail/qH1wjtBgwYI
  –  Introduction to using RHIPE:
     http://ml.stat.purdue.edu/rhafen/rhipe/
  –  RHIPE combines Hadoop and the R analytics language, SD
     Times: http://www.sdtimes.com/link/34792


References


•  RHIPE
  –  Using R and Hadoop to Analyze VoIP Network Data for
     QoS, Hadoop World 2010:
     •  video:
        http://www.cloudera.com/videos/hw10_video_using_r_and_hadoop_to_analyze_voip_network_data_for_qos
     •  slides:
        http://www.cloudera.com/resource/hw10_voice_over_ip_studying_traffic_characteristics_for_quality_of_service
  –  RHIPE examples (k-means, etc.):
     http://groups.google.com/group/brumail/browse_thread/thread/e403db404f039e31?pli=1

References


•  RHadoop (including rmr)
  –  GitHub: https://github.com/RevolutionAnalytics/RHadoop
  –  Advanced Big Data Analytics with R and Hadoop
     whitepaper:
     http://info.revolutionanalytics.com/R-and-Hadoop-Big-Data-Analytics-White-Paper.html




References


•  Segue
  –  Project page: http://code.google.com/p/segue/
  –  Google Group: http://groups.google.com/group/segue-r
  –  Abusing Amazon's Elastic MapReduce Hadoop service…
     easily, from R, Jeffrey Breen:
     http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
  –  Presentation at Chicago Hadoop Users Group, March 23,
     2011:
     http://files.meetup.com/1634302/segue-presentation-RUG.pdf




References

•  Sawmill (A framework for integrating a PMML-compliant Scoring
   Engine with Hadoop).
  –  More information:
     •  Open Data Group www.opendatagroup.com
     •  oscon-info@opendatagroup.com
  –  Augustus, an open source system for building & scoring
     statistical models
     •  augustus.googlecode.com
  –  PMML
     •  Data Mining Group: dmg.org
  –  Analytics over Clouds using Hadoop, presentation at Chicago
     Hadoop User Group:
     http://files.meetup.com/1634302/CHUG 20100721 Sawmill.pdf

References


•  Ricardo
   –  Ricardo: Integrating R and Hadoop, paper:
      http://www.cs.ucsb.edu/~sudipto/papers/sigmod2010-das.pdf
   –  Ricardo: Integrating R and Hadoop, PowerPoint
      presentation:
      http://www.uweb.ucsb.edu/~sudipto/talks/Ricardo-SIGMOD10.pptx




References


•  General references on Hadoop and R
  –  Pete Skomoroch's R and Hadoop bookmarks:
     http://www.delicious.com/pskomoroch/R+hadoop
  –  Pigs, Bees, and Elephants: A Comparison of Eight
     MapReduce Languages:
     http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  –  Quora – How can R and Hadoop be used together?:
     http://www.quora.com/How-can-R-and-Hadoop-be-used-together




References


•  Mahout
   –  Mahout project: http://mahout.apache.org/
   –  Mahout in Action, Owen et al., Manning Publications, 2011
•  Python
   –  Think Stats: Probability and Statistics for Programmers, Allen
      B. Downey, O'Reilly Press, 2011
•  CRAN Task View: High-Performance and Parallel Computing with
   R, a set of resources compiled by Dirk Eddelbuettel:
   http://cran.r-project.org/web/views/HighPerformanceComputing.html




References


•  Other examples of airline data analysis with R:
   –  A simple Big Data analysis using the RevoScaleR package
      in Revolution R:
      http://www.r-bloggers.com/a-simple-big-data-analysis-using-the-revoscaler-package-in-revolution-r/




And finally…


 Parallel R (working title), Q. Ethan McCallum, Stephen
  Weston, O'Reilly Press, due autumn 2011


    R meets Big Data - a basket of strategies to help you use R
      for large-scale analysis and computation.





More Related Content

What's hot (20)

PDF
Big data Hadoop Analytic and Data warehouse comparison guide
Danairat Thanabodithammachari
 
PPTX
Big Data & Hadoop Tutorial
Edureka!
 
PDF
Introduction to Big data & Hadoop -I
Edureka!
 
PPTX
Big Data Warehousing: Pig vs. Hive Comparison
Caserta
 
PPTX
Big data concepts
Serkan Özal
 
PDF
BI, Hive or Big Data Analytics?
Datameer
 
PPTX
Hadoop and Big Data
Harshdeep Kaur
 
PPTX
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Skillspeed
 
PDF
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Edureka!
 
PDF
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Yahoo Developer Network
 
PDF
Integration of HIve and HBase
Hortonworks
 
PPTX
Hadoop for beginners free course ppt
Njain85
 
PDF
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Edureka!
 
PDF
Intro to HDFS and MapReduce
Ryan Tabora
 
PDF
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Mahantesh Angadi
 
PPT
Big Data and Hadoop Basics
Sonal Tiwari
 
PDF
Hadoop Ecosystem Architecture Overview
Senthil Kumar
 
PDF
Hadoop tools with Examples
Joe McTee
 
PDF
Introduction to Big Data & Hadoop
Edureka!
 
PDF
VMUGIT UC 2013 - 08a VMware Hadoop
VMUG IT
 
Big data Hadoop Analytic and Data warehouse comparison guide
Danairat Thanabodithammachari
 
Big Data & Hadoop Tutorial
Edureka!
 
Introduction to Big data & Hadoop -I
Edureka!
 
Big Data Warehousing: Pig vs. Hive Comparison
Caserta
 
Big data concepts
Serkan Özal
 
BI, Hive or Big Data Analytics?
Datameer
 
Hadoop and Big Data
Harshdeep Kaur
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Skillspeed
 
Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Ha...
Edureka!
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Yahoo Developer Network
 
Integration of HIve and HBase
Hortonworks
 
Hadoop for beginners free course ppt
Njain85
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Edureka!
 
Intro to HDFS and MapReduce
Ryan Tabora
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Mahantesh Angadi
 
Big Data and Hadoop Basics
Sonal Tiwari
 
Hadoop Ecosystem Architecture Overview
Senthil Kumar
 
Hadoop tools with Examples
Joe McTee
 
Introduction to Big Data & Hadoop
Edureka!
 
VMUGIT UC 2013 - 08a VMware Hadoop
VMUG IT
 

Viewers also liked (20)

PDF
Log analysis with Hadoop in livedoor 2013
SATOSHI TAGOMORI
 
PDF
HW09 Social network analysis with Hadoop
Cloudera, Inc.
 
PPTX
Video Analysis in Hadoop
DataWorks Summit
 
PDF
Large-scale social media analysis with Hadoop
jakehofman
 
PPTX
Hadoop Cluster Configuration and Data Loading - Module 2
Rohit Agrawal
 
PPTX
Taller hadoop
Christian Ariza Porras
 
PPTX
Big Data and Hadoop - An Introduction
Nagarjuna Kanamarlapudi
 
PPTX
Amazon Elastic Computing 2
Athanasios Anastasiou
 
PPTX
Hadoop administration
Aneesh Pulickal Karunakaran
 
PPTX
Introduction to Hadoop and Hadoop component
rebeccatho
 
PDF
Hadoop Trends
Hortonworks
 
PPTX
Hadoop fault-tolerance
Ravindra Bandara
 
PPTX
Introduction to Apache Hadoop Ecosystem
Mahabubur Rahaman
 
PDF
Hadoop, HDFS and MapReduce
fvanvollenhoven
 
PPTX
Hadoop as data refinery
Steve Loughran
 
PDF
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
gethue
 
PDF
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Cloudera, Inc.
 
PPTX
Hadoop data analysis
Vakul Vankadaru
 
DOCX
Resume of Vimal 4.1
Vimal Suthar
 
PPTX
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Jongwook Woo
 
Log analysis with Hadoop in livedoor 2013
SATOSHI TAGOMORI
 
HW09 Social network analysis with Hadoop
Cloudera, Inc.
 
Video Analysis in Hadoop
DataWorks Summit
 
Large-scale social media analysis with Hadoop
jakehofman
 
Hadoop Cluster Configuration and Data Loading - Module 2
Rohit Agrawal
 
Taller hadoop
Christian Ariza Porras
 
Big Data and Hadoop - An Introduction
Nagarjuna Kanamarlapudi
 
Amazon Elastic Computing 2
Athanasios Anastasiou
 
Hadoop administration
Aneesh Pulickal Karunakaran
 
Introduction to Hadoop and Hadoop component
rebeccatho
 
Hadoop Trends
Hortonworks
 
Hadoop fault-tolerance
Ravindra Bandara
 
Introduction to Apache Hadoop Ecosystem
Mahabubur Rahaman
 
Hadoop, HDFS and MapReduce
fvanvollenhoven
 
Hadoop as data refinery
Steve Loughran
 
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
gethue
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Cloudera, Inc.
 
Hadoop data analysis
Vakul Vankadaru
 
Resume of Vimal 4.1
Vimal Suthar
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Jongwook Woo
 
Ad




Distributed Data Analysis with Hadoop and R - Strangeloop 2011

  • 1. Distributed Data Analysis with Hadoop and R. Jonathan Seidman and Ramesh Venkataramaiah, Ph.D. StrangeLoop 2011, September 20, 2011
  • 2. Flow of this Talk
      •  Introductions
      •  Hadoop, R and Interfacing the two
      •  Our Prototypes
      •  A use case for interfacing Hadoop and R
      •  Alternatives for Running R on Hadoop
      •  Alternatives to Hadoop and R
      •  Conclusions
      •  References
  • 3. Who We Are
      •  Ramesh Venkataramaiah, Ph.D.
         –  Principal Engineer, TechOps
         –  rvenkataramaiah@orbitz.com
         –  @rvenkatar
      •  Jonathan Seidman
         –  Lead Engineer, Business Intelligence/Big Data Team
         –  Co-founder/organizer of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG) and Chicago Big Data (http://www.meetup.com/Chicago-Big-Data/)
         –  jseidman@orbitz.com
         –  @jseidman
      •  Orbitz Careers
         –  http://careers.orbitz.com/
         –  @OrbitzTalent
  • 4. Launched in 2001. Over 160 million bookings. 7th largest seller of travel in the world.
  • 5. Hadoop and R as an analytic platform?
  • 6. What is Hadoop?
      •  Distributed file system (HDFS) and parallel processing framework.
      •  Uses the MapReduce programming model as its core.
      •  Provides fault-tolerant and scalable storage of very large datasets across machines in a cluster.
  • 7. What is R? When do we need it?
      •  Open-source stat package with visualization.
      •  Vibrant community support.
      •  One-line calculations galore!
      •  Steep learning curve but worth it!
      •  Insight into statistical properties and trends… or for machine learning purposes… or Big Data to be understood well.
  • 8. Our Options
      •  Data volume reduction by sampling
         –  Very bad for long-tail data distributions
         –  Approximations lead to bad conclusions
      •  Scaling R
         –  Still in-memory
         –  But make it parallel using segue, Rhipe, R-Hive…
      •  Use SQL-like interfaces
         –  Apache Hive with Hadoop
         –  File sprawl and process issues
      •  Regular DBMS
         –  How to fit a square peg in a round hole
         –  No in-line R calls from SQL, but commercial efforts are underway.
      •  This talk: how to bring Hadoop's parallel processing capability to the R environment.
  • 9. Our prototypes
      •  User segmentations
      •  Hotel bookings
      •  Airline Performance*
      * Public dataset
  • 10. We have two distinct dataspaces serving different constituents
      •  Semi-structured data (e.g. searches): Hadoop cluster
      •  Transactional data (e.g. bookings): data warehouse
  • 11. Our Hadoop infrastructure allows us to record and process user activity at the individual level
      •  Transactional data (e.g. bookings): data warehouse
      •  Detailed non-transactional data (what each user sees and clicks): Hadoop
  • 12. Getting a Buy-in: presented a long-term, semi-structured data growth story and explained how this will help harness long-tail opportunities at the lowest cost.
      •  Traditional DW
         –  Classical stats
         –  Sampling
      •  Big Data
         –  Specific spikes
         –  Median is not the message
      •  Create a universal key
      •  Always keep source data
      •  Operationalize the infrastructure
      * From a blog
  • 13. An example of “median is not the message” •  Positional Bias during Hotel Searches
  • 14. Our Customers pick top positions the most…
  • 15. Safari Users Seem to be Interested in More Expensive Hotels
  • 16. Seasonal variations
      •  Customer hotel stays get longer during summer months.
      •  Could help in designing search based on seasons.
  • 17. Workload and Resource Partition
  • 19. Description of Use Case
      •  Analyze openly available dataset: Airline on-time performance.
      •  Dataset was used in Visualization Poster Competition 2009
         –  Consists of flight arrival/departure details from 1987-2008.
         –  Approximately 120 MM records totaling 12GB.
      •  Available at: http://stat-computing.org/dataexpo/2009/
  • 20. Our dataset
  • 21. Airline Delay Plot: R code
      > deptdelays.monthly.full <- read.delim("~/OSCON2011/Delays_by_Month.dat", header=F)
      > View(deptdelays.monthly.full)
      > names(deptdelays.monthly.full) <- c("Year","Month","Count","Airline","Delay")
      > # Sort by average delay, largest first, and keep the ten worst rows.
      > Delay_by_Month <- deptdelays.monthly.full[order(deptdelays.monthly.full$Delay,decreasing=TRUE),]
      > Top_10_Delay_by_Month <- Delay_by_Month[1:10,]
      > # The slide uses bare Month/Delay/Airline; with() assumes they refer to the top-10 rows.
      > Top_10_Normal <- with(Top_10_Delay_by_Month, (Delay - mean(Delay)) / sd(Delay))
      > with(Top_10_Delay_by_Month, symbols(Month, Delay, circles=Top_10_Normal, inches=.3, fg="white", bg="red", …))
      > with(Top_10_Delay_by_Month, text(Month, Delay, Airline, cex=0.5))
  • 22. Airline delay
  • 23. Multiple Distributions: R code
      > library(lattice)
      > deptdelays.monthly.full$Year <- as.character(deptdelays.monthly.full$Year)
      > h <- histogram(~Delay|Year,data=deptdelays.monthly.full,layout=c(5,5))
      > update(h)
  • 24. Running R on Hadoop: Hadoop Streaming
  • 25. Hadoop Streaming – Overview
      •  An alternative to the Java MapReduce API which allows you to write jobs in any language supporting stdin/stdout.
      •  Limited to text data in current versions of Hadoop. Support for binary streams added in 0.21.0.
      •  Requires installation of R on all DataNodes.
  • 26. Hadoop Streaming – Dataflow
      Input to map*:
         1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...
         1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…
         1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...
         1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...
         1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…
         1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...
      Output from map:
         PI|1988|1    17
         PI|1988|1    0
         PS|1987|10   11
         PS|1987|10   -2
         PS|1987|10   1
         DL|1987|10   14
      * Map function receives input records line-by-line via standard input.
  • 27. Hadoop Streaming – Dataflow Continued
      Input to reduce*:
         DL|1987|10   14
         PI|1988|1    0
         PI|1988|1    17
         PS|1987|10   1
         PS|1987|10   11
         PS|1987|10   -2
      Output from reduce:
         1987   10   1   DL   14
         1988   1    2   PI   8.5
         1987   10   3   PS   3.333333
      * Reduce receives map output key/value pairs sorted by key, line-by-line.
  • 28. Hadoop Streaming Example – map.R
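The code on this slide was an image and is not in the transcript. What follows is a minimal sketch, not the authors' map.R, of a streaming mapper consistent with the dataflow on slide 26. The names `map_record` and `run_mapper` are hypothetical, and the column positions (carrier in field 9, departure delay in field 16) are assumptions read off the sample records.

```r
#!/usr/bin/env Rscript
# Sketch of a Hadoop Streaming mapper in R (illustrative only).
# Input:  one comma-separated flight record per line.
# Output: "carrier|year|month<TAB>departure delay".

# Hypothetical helper: turn one input line into one output line, or NULL to skip it.
map_record <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  # Skip short rows, a header row, and rows with a missing delay (assumed policy).
  if (length(fields) < 16 || fields[1] == "Year" || fields[16] == "NA") {
    return(NULL)
  }
  key <- paste(fields[9], fields[1], fields[2], sep = "|")
  paste(key, fields[16], sep = "\t")
}

# Hypothetical driver: read records line-by-line and write key/value pairs to stdout.
run_mapper <- function(con = file("stdin")) {
  if (!isOpen(con)) open(con, "r")
  on.exit(close(con))
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    out <- map_record(line)
    if (!is.null(out)) cat(out, "\n", sep = "")
  }
}

# When deployed as a script, the last line of the file would call run_mapper().
```

Keeping the per-record logic in its own function makes it easy to check against the sample records before the script ever touches a cluster.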
  • 29. Hadoop Streaming Example – reduce.R
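This slide's code was also an image. As a hedged sketch rather than the authors' reduce.R, the function below shows one way to produce the reduce output from slide 27: it walks key-sorted "key<TAB>value" lines, keeps a running sum and count per key, and emits year, month, count, carrier and mean delay whenever the key changes. The name `reduce_sorted` is hypothetical; a real streaming reducer would read the lines from standard input instead of a vector.

```r
# Sketch of the reducer logic for Hadoop Streaming (illustrative only).
# `lines` must already be sorted by key, as Hadoop guarantees for reducer input.
reduce_sorted <- function(lines) {
  out <- character(0)
  current <- NULL   # key currently being accumulated
  total <- 0        # running sum of delays for `current`
  n <- 0L           # number of values seen for `current`

  # Format one output row: year, month, count, carrier, mean delay.
  emit <- function(key, total, n) {
    parts <- strsplit(key, "|", fixed = TRUE)[[1]]  # carrier, year, month
    paste(parts[2], parts[3], n, parts[1], total / n, sep = "\t")
  }

  for (line in lines) {
    kv <- strsplit(line, "\t", fixed = TRUE)[[1]]
    if (!is.null(current) && kv[1] != current) {
      # Key changed: flush the finished group and reset the accumulators.
      out <- c(out, emit(current, total, n))
      total <- 0
      n <- 0L
    }
    current <- kv[1]
    total <- total + as.numeric(kv[2])
    n <- n + 1L
  }
  # Flush the final group, if any input was seen.
  if (!is.null(current)) out <- c(out, emit(current, total, n))
  out
}
```

Because streaming hands the reducer one line at a time, the group boundary has to be detected by comparing each key with the previous one; nothing delivers "all values for a key" in a single call.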
  • 30. Running R on Hadoop: Hadoop Interactive
  • 31. Hadoop Interactive (hive) – Overview
      •  Very unfortunate acronym.
      •  Provides an interface to Hadoop from the R environment.
         –  Functions to access HDFS
         –  Control Hadoop
         –  And run streaming jobs directly from R
      •  Allows HDFS data, including the output from MapReduce processing, to be manipulated and analyzed directly from R.
      •  Seems to still have some rough edges.
  • 32. Hadoop Interactive – Example
  • 33. Running R on Hadoop: RHIPE
  • 34. RHIPE – Overview
      •  Active project with frequent updates and active community.
      •  RHIPE is based on Hadoop streaming source, but provides some significant enhancements, such as support for binary files (sort of).
      •  Developed to provide R users with access to the same Hadoop functionality available to Java developers.
         –  For example, provides rhcounter() and rhstatus(), analogous to counters and the reporter interface in the Java API.
  • 35. RHIPE – Overview
      •  Can be somewhat confusing and intimidating.
         –  Then again, the same can be said for the Java API.
         –  Worth taking the time to get comfortable with.
  • 36. RHIPE – Overview
      •  Allows developers to work directly on data stored in HDFS in the R environment.
      •  Also allows developers to write MapReduce jobs in R and execute them on the Hadoop cluster.
      •  RHIPE uses Google protocol buffers to serialize data. Most R data types are supported.
         –  Using protocol buffers increases efficiency and provides interoperability with other languages.
      •  Must be installed on all DataNodes.
  • 37. RHIPE – MapReduce
      map <- expression({})
      reduce <- expression(
        pre = {…},
        reduce = {…},
        post = {…}
      )
      z <- rhmr(map=map,reduce=reduce,
                inout=c("text","sequence"),
                ifolder=INPUT_PATH,
                ofolder=OUTPUT_PATH,
                …)
      rhex(z)
  • 38. RHIPE – Dataflow
      Input to map*:
         Keys = […]
         Values = [1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...
                   1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…
                   1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...
                   1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...
                   1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…
                   1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...]
      Output from map:
         PI|1988|1    17
         PI|1988|1    0
         PS|1987|10   11
         PS|1987|10   -2
         PS|1987|10   1
         DL|1987|10   14
      * Note that Input to map is a vector of keys and a vector of values.
  • 39. RHIPE – Dataflow Continued
      Input to reduce*:
         DL|1987|10   [14]
         PI|1988|1    [0, 17]
         PS|1987|10   [1,11,-2]
      Output from reduce:
         1987   10   1   DL   14
         1988   1    2   PI   8.5
         1987   10   3   PS   3.333333
      * Input to reduce is a key and a vector containing a subset of intermediate values associated with that key. The reduce will get called until no more values exist for the key.
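Because the reduce step may be called several times for one key, each time with only a subset of the values, a mean has to be built from a running sum and count rather than computed in one call. The sketch below illustrates that pattern in plain R without RHIPE installed: `simulate_reduce` is a hypothetical local driver, `rhcollect` inside it is a local stand-in for RHIPE's collector, and the variable names `reduce.key` and `reduce.values` are assumed to follow RHIPE's conventions.

```r
# Reducer written as the three phases from the rhmr() skeleton:
# pre runs once per key, reduce runs once per chunk of values, post runs once at the end.
reduce <- expression(
  pre    = { total <- 0; n <- 0 },
  reduce = {
    vals  <- unlist(reduce.values)       # works for a list or a plain vector
    total <- total + sum(vals)
    n     <- n + length(vals)
  },
  post   = { rhcollect(reduce.key, c(count = n, mean = total / n)) }
)

# Hypothetical local driver (not part of RHIPE): feeds the values for one key
# to the reducer in chunks, mimicking repeated calls to the reduce phase.
simulate_reduce <- function(reduce, key, values, chunk_size = 2) {
  collected <- list()
  env <- new.env()
  env$rhcollect <- function(k, v) collected[[k]] <<- v  # local stand-in
  env$reduce.key <- key
  eval(reduce[["pre"]], env)
  chunks <- split(values, ceiling(seq_along(values) / chunk_size))
  for (chunk in chunks) {
    env$reduce.values <- chunk
    eval(reduce[["reduce"]], env)
  }
  eval(reduce[["post"]], env)
  collected
}

# The PS|1987|10 values arrive as two chunks, c(1, 11) and then -2.
result <- simulate_reduce(reduce, "PS|1987|10", c(1, 11, -2), chunk_size = 2)
```

Calling `mean(reduce.values)` directly in the reduce phase would be wrong here, since it would only see the last chunk.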
  • 40. RHIPE – Example
  • 41. RHIPE – Example
  • 42. RHIPE – Example
  • 43. Running R on Hadoop: rmr
  • 44. rmr Overview
      •  New project from Revolution Analytics introduced August 2011.
      •  Part of RHadoop, a suite of open-source projects which also includes:
         –  rhdfs – functions to access and manage HDFS from within R.
         –  rhbase – functions providing basic connectivity to HBase.
      •  Goals are to provide a productive environment for MapReduce programming in an R-like way - “…stay true to map reduce and true to R …”
      •  Reduce gets all intermediate values for each key (yay!).
      •  Like RHIPE, based on streaming source.
  • 45. rmr – Example
  • 46. Running R on Hadoop: Segue
  • 47. Segue – Overview
      •  Intended to work around single-threading in R by taking advantage of Hadoop streaming to provide simple parallel processing.
         –  For example, running multiple simulations in parallel.
      •  Suitable for embarrassingly (pleasantly) parallel problems – big CPU, not big data.
      •  Runs on Amazon's Elastic Map Reduce (EMR).
         –  Not intended for internal clusters.
      •  Provides emrlapply(), a parallel version of lapply()
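To make "big CPU, not big data" concrete, here is a small sketch of the kind of workload that fits: independent simulations whose inputs are just seeds. It runs locally with `lapply()`; the function `estimate_pi` is a hypothetical example task, and the Segue call shown in the comment (including the cluster handle and `createCluster()`) is an assumption about how the same job would be handed to emrlapply(), not something taken from the slides.

```r
# One independent, CPU-bound task: a Monte Carlo estimate of pi.
# Each task needs only a seed as input, so almost no data has to move.
estimate_pi <- function(seed, n = 100000) {
  set.seed(seed)
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)   # share of points inside the quarter circle, times 4
}

seeds <- 1:8

# Local, single-threaded version.
estimates <- lapply(seeds, estimate_pi)

# Assumed Segue equivalent (not run here; requires an EMR cluster):
#   myCluster <- createCluster(numInstances = 4)
#   estimates <- emrlapply(myCluster, seeds, estimate_pi)

combined <- mean(unlist(estimates))
```

The tasks share nothing and can run in any order, which is what makes a drop-in parallel `lapply()` sufficient.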
  • 48. Segue – Example
  • 50. Performance Testing – Environment and Setup
      •  Twenty-eight DataNodes:
         –  Dual hex-core
         –  24GB RAM
         –  Shared cluster.
      •  Data
         –  Airline dataset
         –  22 input files
         –  About 12GB uncompressed data
  • 51. Performance Comparison
      Number of Reducers   Streaming      RHIPE
      264                  246 seconds*   96 seconds*
      * All numbers are an average of 3 runs.
  • 52. Alternatives
      Alternate languages/libraries:
      •  Apache Mahout
         –  Scalable machine learning library.
         –  Offers clustering, classification, collaborative filtering on Hadoop.
      •  Python
         –  Many modules available to support scientific and statistical computing.
  • 53. Alternatives
      Alternative parallel processing frameworks:
      •  Revolution Analytics
         –  Provides commercial packages to support processing big data with R.
      •  Other HPC/parallel processing packages for R, e.g. Rmpi or snow.
  • 54. Alternatives
      Apache Hive + RJDBC?
      •  We haven't been able to get it to work yet.
      •  You can, however, wrap calls to the Hive client in R to return R objects. See https://github.com/satpreetsingh/rDBwrappers/wiki
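Wrapping the Hive client can be as simple as shelling out and parsing the tab-separated result. The sketch below is one hedged way to do that and is not the rDBwrappers code: `hive_query` and its `runner` argument are hypothetical names, and it assumes a `hive` executable on the PATH that prints query results as tab-delimited text when called with `-S -e`.

```r
# Run a Hive query through the command-line client and return a data frame.
# `runner` is injectable so the parsing can be exercised without a Hive install.
hive_query <- function(sql,
                       runner = function(cmd) system(cmd, intern = TRUE)) {
  cmd <- paste("hive -S -e", shQuote(sql))   # -S: silent mode, -e: inline query
  out <- runner(cmd)                         # character vector, one row per element
  read.delim(text = paste(out, collapse = "\n"),
             header = FALSE, stringsAsFactors = FALSE)
}

# Example with a stubbed runner standing in for the real client; the table and
# column names in the query are made up for illustration.
fake_hive <- function(cmd) c("DL\t14", "PI\t8.5")
delays <- hive_query("SELECT airline, AVG(delay) FROM delays GROUP BY airline",
                     runner = fake_hive)
```

Result columns come back unnamed (V1, V2, ...), so a caller would normally assign `names()` afterwards.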
  • 55. Alternatives
      Interesting solutions that you can't have:
      •  Ricardo
         –  Developed as part of a research project at IBM.
         –  Interesting paper published, but apparently no plans to make available.
  • 56. Conclusions
      •  If practical, consider using Hadoop to aggregate data for input to R analyses.
      •  Avoid using R for general purpose MapReduce use.
  • 57. Conclusions
      •  For simple MapReduce jobs, or embarrassingly parallel jobs on a local cluster, consider Hadoop streaming.
         –  Limited to processing text only.
         –  But easy to test at the command line outside of Hadoop:
            $ cat DATAFILE | ./map.R | sort | ./reduce.R
      •  To run compute-bound analyses with a relatively small amount of data on the cloud, look at Segue.
  • 58. Conclusions
      •  Otherwise, your best bet is RHIPE, but definitely check out rmr.
      •  Also consider alternatives – Mahout, Python, etc.
  • 59. Conclusions
      On an operational note:
      •  Make sure your cluster nodes are consistent – same version of R installed, required libraries are installed on each node, etc.
  • 61. References
      •  Hadoop
         –  Apache Hadoop project: http://hadoop.apache.org/
         –  Hadoop: The Definitive Guide, Tom White, O'Reilly Press, 2011
      •  R
         –  R Project for Statistical Computing: http://www.r-project.org/
         –  R Cookbook, Paul Teetor, O'Reilly Press, 2011
         –  Getting Started With R: Some Resources: http://quanttrader.info/public/gettingStartedWithR.html
  • 62. References
      •  Hadoop Streaming
         –  Documentation on Apache Hadoop Wiki: http://hadoop.apache.org/mapreduce/docs/current/streaming.html
         –  Word count example in R: https://forums.aws.amazon.com/thread.jspa?messageID=129163
  • 63. References
      •  Hadoop InteractiVE
         –  Project page on CRAN: http://cran.r-project.org/web/packages/hive/index.html
         –  Simple Parallel Computing in R Using Hadoop: http://www.rmetrics.org/Meielisalp2009/Presentations/Theussl1.pdf
  • 64. References
      •  RHIPE
         –  RHIPE - R and Hadoop Integrated Processing Environment: http://www.stat.purdue.edu/~sguha/rhipe/
         –  Code: http://code.google.com/p/rhipe/
         –  Wiki: http://code.google.com/p/rhipe/w/list
         –  Installing RHIPE on CentOS: https://groups.google.com/forum/#!topic/brumail/qH1wjtBgwYI
         –  Introduction to using RHIPE: http://ml.stat.purdue.edu/rhafen/rhipe/
         –  RHIPE combines Hadoop and the R analytics language, SD Times: http://www.sdtimes.com/link/34792
  • 65. References
      •  RHIPE
         –  Using R and Hadoop to Analyze VoIP Network Data for QoS, Hadoop World 2010:
            •  video: http://www.cloudera.com/videos/hw10_video_using_r_and_hadoop_to_analyze_voip_network_data_for_qos
            •  slides: http://www.cloudera.com/resource/hw10_voice_over_ip_studying_traffic_characteristics_for_quality_of_service
         –  RHIPE examples (k-means, etc.): http://groups.google.com/group/brumail/browse_thread/thread/e403db404f039e31?pli=1
  • 66. References
      •  RHadoop (including rmr)
         –  Github: https://github.com/RevolutionAnalytics/RHadoop
         –  Advanced Big Data Analytics with R and Hadoop whitepaper: http://info.revolutionanalytics.com/R-and-Hadoop-Big-Data-Analytics-White-Paper.html
  • 67. References
      •  Segue
         –  Project page: http://code.google.com/p/segue/
         –  Google Group: http://groups.google.com/group/segue-r
         –  Abusing Amazon's Elastic MapReduce Hadoop service… easily, from R, Jeffrey Breen: http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
         –  Presentation at Chicago Hadoop Users Group March 23, 2011: http://files.meetup.com/1634302/segue-presentation-RUG.pdf
  • 68. References
      •  Sawmill (A framework for integrating a PMML-compliant Scoring Engine with Hadoop).
         –  More information:
            •  Open Data Group www.opendatagroup.com
            •  [email protected]
         –  Augustus, an open source system for building & scoring statistical models
            •  augustus.googlecode.com
         –  PMML
            •  Data Mining Group: dmg.org
         –  Analytics over Clouds using Hadoop, presentation at Chicago Hadoop User Group: http://files.meetup.com/1634302/CHUG 20100721 Sawmill.pdf
  • 69. References
      •  Ricardo
         –  Ricardo: Integrating R and Hadoop, paper: http://www.cs.ucsb.edu/~sudipto/papers/sigmod2010-das.pdf
         –  Ricardo: Integrating R and Hadoop, Powerpoint presentation: http://www.uweb.ucsb.edu/~sudipto/talks/Ricardo-SIGMOD10.pptx
  • 70. References
      •  General references on Hadoop and R
         –  Pete Skomoroch's R and Hadoop bookmarks: http://www.delicious.com/pskomoroch/R+hadoop
         –  Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages: http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
         –  Quora – How can R and Hadoop be used together?: http://www.quora.com/How-can-R-and-Hadoop-be-used-together
  • 71. References
      •  Mahout
         –  Mahout project: http://mahout.apache.org/
         –  Mahout in Action, Owen et al., Manning Publications, 2011
      •  Python
         –  Think Stats, Probability and Statistics for Programmers, Allen B. Downey, O'Reilly Press, 2011
      •  CRAN Task View: High-Performance and Parallel Computing with R, a set of resources compiled by Dirk Eddelbuettel: http://cran.r-project.org/web/views/HighPerformanceComputing.html
  • 72. References
      •  Other examples of airline data analysis with R:
         –  A simple Big Data analysis using the RevoScaleR package in Revolution R: http://www.r-bloggers.com/a-simple-big-data-analysis-using-the-revoscaler-package-in-revolution-r/
  • 73. And finally… Parallel R (working title), Q. Ethan McCallum, Stephen Weston, O'Reilly Press, due autumn 2011. R meets Big Data - a basket of strategies to help you use R for large-scale analysis and computation.