Distributed Data Analysis with Hadoop and R - Strangeloop 2011

Distributed Data Analysis with Hadoop and R
Jonathan Seidman and Ramesh Venkataramaiah, Ph. D.
StrangeLoop2011
September 20 | 2011

Flow of this Talk

•  Introductions

•  Hadoop, R and Interfacing the two

•  Our Prototypes

•  A use case for interfacing Hadoop and R

•  Alternatives for Running R on Hadoop

•  Alternatives to Hadoop and R

•  Conclusions

•  References

Who We Are
•  Ramesh Venkataramaiah, Ph. D.
–  Principal Engineer, TechOps
–  rvenkataramaiah@orbitz.com
–  @rvenkatar

•  Jonathan Seidman
–  Lead Engineer, Business Intelligence/Big Data Team
–  Co-founder/organizer of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG) and Chicago Big Data (http://www.meetup.com/Chicago-Big-Data/)
–  jseidman@orbitz.com
–  @jseidman

•  Orbitz Careers
–  http://careers.orbitz.com/
–  @OrbitzTalent

•  Launched in 2001
•  Over 160 million bookings
•  7th largest seller of travel in the world

Hadoop and R
as an analytic platform?

What is Hadoop?

Distributed file system (HDFS) and parallel processing
framework.

Uses MapReduce programming model as the core.

Provides fault tolerant and scalable storage
of very large datasets across machines in a cluster.

What is R? When do we need it?

Open-source stat package with visualization
Vibrant community support.
One-line calculations galore!
Steep learning curve but worth it!

Use it to gain insight into statistical properties and trends,
for machine learning purposes,
or to understand Big Data well.

Our Options

•  Data volume reduction by sampling
–  Very bad for long-tail data distributions
–  Approximations lead to bad conclusions
•  Scaling R
–  Still in-memory
–  But make it parallel using Segue, RHIPE, R-Hive…
•  Use SQL-like interfaces
–  Apache Hive with Hadoop
–  File sprawl and process issues
•  Regular DBMS
–  How to fit a square peg in a round hole
–  No in-line R calls from SQL, but commercial efforts are underway.

•  This talk: how to bring Hadoop’s parallel processing capability to the R environment.

Our prototypes
User segmentations
Hotel bookings
Airline Performance*

* Public dataset


We have two distinct dataspaces serving different
constituents

•  Hadoop cluster: semi-structured data (e.g. searches)
•  Data warehouse: transactional data (e.g. bookings)

Our Hadoop infrastructure allows us to record and process
user activity at the individual level

•  Hadoop: detailed non-transactional data (what each user sees and clicks)
•  Data warehouse: transactional data (e.g. bookings)

Getting a Buy-in
We presented a long-term, semi-structured data growth story and explained how this will help harness long-tail opportunities at the lowest cost.

•  Traditional DW
–  Classical stats
–  Sampling
•  Big Data
–  Specific spikes
–  Median is not the message

•  Create a universal key
•  Always keep source data
•  Operationalize the infrastructure

* From a blog

An example of “median is not the message”

•  Positional Bias during Hotel Searches

Our Customers pick top positions the most…

Safari Users Seem to be Interested in More Expensive Hotels


Seasonal variations
•  Customer hotel stays get longer during the summer months.
•  This could help in designing search based on seasons.

Workload and Resource Partition


Airline Performance


Description of Use Case

•  Analyze openly available dataset: Airline on-time performance.
•  Dataset was used in Visualization Poster Competition 2009
–  Consists of flight arrival/departure details from 1987-2008.
–  Approximately 120 MM records totaling 12GB.
•  Available at: http://stat-computing.org/dataexpo/2009/


Our dataset


Airline Delay Plot: R code

> deptdelays.monthly.full <- read.delim("~/OSCON2011/Delays_by_Month.dat", header=FALSE)
> View(deptdelays.monthly.full)
> names(deptdelays.monthly.full) <- c("Year","Month","Count","Airline","Delay")

> Delay_by_Month <- deptdelays.monthly.full[order(deptdelays.monthly.full$Delay, decreasing=TRUE),]

> Top_10_Delay_by_Month <- Delay_by_Month[1:10,]
> # Month, Delay and Airline below are assumed to be columns of an attached data frame
> Top_10_Normal <- ((Delay - mean(Delay)) / sd(Delay))

> symbols(Month, Delay, circles=Top_10_Normal, inches=.3, fg="white", bg="red", …)
> text(Month, Delay, Airline, cex=0.5)

Airline delay


Multiple Distributions: R code
> library(lattice)
> deptdelays.monthly.full$Year <- as.character(deptdelays.monthly.full$Year)
> h <- histogram(~Delay|Year, data=deptdelays.monthly.full, layout=c(5,5))
> update(h)

Running R on Hadoop:
Hadoop Streaming


Hadoop Streaming – Overview

•  An alternative to the Java MapReduce API which allows you to
write jobs in any language supporting stdin/stdout.
•  Limited to text data in current versions of Hadoop. Support for
binary streams added in 0.21.0.
•  Requires installation of R on all DataNodes.


Hadoop Streaming – Dataflow

Input to map*:

1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...
1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…
1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...
1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...
1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…
1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...

Output from map (key, value):

PI|1988|1     17
PI|1988|1     0
PS|1987|10    11
PS|1987|10    -2
PS|1987|10    1
DL|1987|10    14

* Map function receives input records line-by-line via standard input.
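
The map step above can be sketched in R as follows. This is a minimal sketch, not the map.R from the talk: the names map_line and run_mapper are hypothetical, the field positions (1 = year, 2 = month, 9 = carrier, 16 = departure delay) are read off the sample records, a header line starting with "Year" is assumed, and the tab between key and value is the usual Hadoop streaming separator.

```r
# Hypothetical helper (not from the talk): turn one CSV record into a
# "carrier|year|month<TAB>delay" line, the shape of the map output above.
# Assumed field positions: 1 = year, 2 = month, 9 = carrier, 16 = departure delay.
map_line <- function(line) {
  fields <- unlist(strsplit(line, ",", fixed = TRUE))
  # Skip short records, the (assumed) header line, and missing delays.
  if (length(fields) < 16 || fields[1] == "Year" || fields[16] == "NA") {
    return(NULL)
  }
  key <- paste(fields[9], fields[1], fields[2], sep = "|")
  paste(key, fields[16], sep = "\t")
}

# Under Hadoop streaming the records arrive on standard input, one per line.
# Defined here for illustration only; it is not called in this sketch.
run_mapper <- function() {
  con <- file("stdin", open = "r")
  on.exit(close(con))
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    out <- map_line(line)
    if (!is.null(out)) cat(out, "\n", sep = "")
  }
}

# Try the helper on the first sample record shown above.
map_line("1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI")
```

Keeping the per-record logic in its own function means it can be checked at the R prompt before the script is ever submitted to the cluster.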


Hadoop Streaming – Dataflow Continued

Input to reduce*:

DL|1987|10    14
PI|1988|1     0
PI|1988|1     17
PS|1987|10    1
PS|1987|10    11
PS|1987|10    -2

Output from reduce (year, month, count, airline, mean delay):

1987    10    1    DL    14
1988    1     2    PI    8.5
1987    10    3    PS    3.333333

* Reduce receives map output key/value pairs sorted by key, line-by-line.
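
The reduce step above can be sketched in R as follows. This is a minimal sketch, not the reduce.R from the talk: the name reduce_lines is hypothetical, and for brevity it takes all sorted lines as one character vector instead of reading standard input key by key, which a real streaming reducer working on large data would do.

```r
# Hypothetical helper (not from the talk): aggregate sorted
# "carrier|year|month<TAB>delay" lines into one output line per key,
# laid out as year, month, count, carrier, mean delay.
reduce_lines <- function(lines) {
  parts  <- strsplit(lines, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  delays <- as.numeric(vapply(parts, `[`, character(1), 2))
  out <- character(0)
  for (key in unique(keys)) {
    values <- delays[keys == key]
    k <- unlist(strsplit(key, "|", fixed = TRUE))  # carrier, year, month
    out <- c(out, paste(k[2], k[3], length(values), k[1],
                        format(mean(values)), sep = "\t"))
  }
  out
}

# The sorted sample input shown above.
sorted_input <- c("DL|1987|10\t14", "PI|1988|1\t0", "PI|1988|1\t17",
                  "PS|1987|10\t1", "PS|1987|10\t11", "PS|1987|10\t-2")
writeLines(reduce_lines(sorted_input))
```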


Hadoop Streaming Example – map.R


Hadoop Streaming Example – reduce.R


Hadoop Interactive


Hadoop Interactive (hive) – Overview

•  Very unfortunate acronym.
•  Provides an interface to Hadoop from the R environment.
–  Functions to access HDFS
–  Control Hadoop
–  And run streaming jobs directly from R
•  Allows HDFS data, including the output from MapReduce
processing, to be manipulated and analyzed directly from R.
•  Seems to still have some rough edges.


Hadoop Interactive – Example


RHIPE


RHIPE – Overview

•  Active project with frequent updates and active community.
•  RHIPE is based on Hadoop streaming source, but provides
some significant enhancements, such as support for binary
files (sort of).
•  Developed to provide R users with access to same Hadoop
functionality available to Java developers.
–  For example, provides rhcounter() and rhstatus(), analogous to counters and the reporter interface in the Java API.

RHIPE – Overview

•  Can be somewhat confusing and intimidating.
–  Then again, the same can be said for the Java API.
–  Worth taking the time to get comfortable with.


RHIPE – Overview

•  Allows developers to work directly on data stored in HDFS in
the R environment.
•  Also allows developers to write MapReduce jobs in R and
execute them on the Hadoop cluster.
•  RHIPE uses Google protocol buffers to serialize data. Most R
data types are supported.
–  Using protocol buffers increases efficiency and provides
interoperability with other languages.
•  Must be installed on all DataNodes.


RHIPE – MapReduce

map <- expression({})
reduce <- expression(
  pre = {…},
  reduce = {…},
  post = {…}
)
z <- rhmr(map=map, reduce=reduce,
  inout=c("text","sequence"),
  ifolder=INPUT_PATH,
  ofolder=OUTPUT_PATH,
  …)
rhex(z)

RHIPE – Dataflow

Input to map*:

Keys = […]

Values =
[1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI...
1988,1,17,7,1331,1331,1440,1435,PI,942,NA,69,64,NA,5,0,SYR,BWI…
1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO...
1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO...
1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO…
1987,10,30,5,1712,1658,1811,1800,DL,475,NA,59,62,NA,11,14,LEX,ATL...]

Output from map (key, value):

PI|1988|1     17
PI|1988|1     0
PS|1987|10    11
PS|1987|10    -2
PS|1987|10    1
DL|1987|10    14

* Note that Input to map is a vector of keys and a vector of values.

RHIPE – Dataflow Continued

Input to reduce*:

DL|1987|10    [14]
PI|1988|1     [0, 17]
PS|1987|10    [1,11,-2]

Output from reduce (year, month, count, airline, mean delay):

1987    10    1    DL    14
1988    1     2    PI    8.5
1987    10    3    PS    3.333333

* Input to reduce is a key and a vector containing a subset of the intermediate values associated with that key. The reduce will get called until no more values exist for the key.
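
Because the reduce may be called several times for one key, a mean cannot be computed from a single call; the reducer has to carry a running sum and count. The sketch below simulates that in plain R with no RHIPE calls. The name make_mean_reducer and the way the values are split into chunks are hypothetical; creating the reducer plays the role of the pre step and post() plays the role of the post step.

```r
# Plain-R simulation (no RHIPE calls): the reduce step may see the values for
# one key in several chunks, so it keeps a running sum and count.
make_mean_reducer <- function() {
  total <- 0
  n <- 0
  list(
    reduce = function(values) {   # called once per chunk of values
      total <<- total + sum(values)
      n <<- n + length(values)
      invisible(NULL)
    },
    post = function() total / n   # called after the last chunk
  )
}

r <- make_mean_reducer()          # plays the role of "pre"
r$reduce(c(1, 11))                # first chunk for key PS|1987|10 (assumed split)
r$reduce(-2)                      # second chunk for the same key
r$post()                          # mean over all three values
```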


RHIPE – Example


RHIPE – Example


RHIPE – Example


rmr


rmr Overview

•  New project from Revolution Analytics introduced August 2011.
•  Part of RHadoop, a suite of open-source projects which also
includes:
–  rhdfs – functions to access and manage HDFS from within
R.
–  rhbase – functions providing basic connectivity to HBase.
•  Goals are to provide a productive environment for MapReduce programming in an R-like way: “…stay true to map reduce and true to R …”
•  Reduce gets all intermediate values for each key (yay!).
•  Like RHIPE, based on streaming source.

rmr – Example


Segue


Segue – Overview

•  Intended to work around single-threading in R by taking
advantage of Hadoop streaming to provide simple parallel
processing.
–  For example, running multiple simulations in parallel.
•  Suitable for embarrassingly ("pleasantly") parallel problems: big CPU, not big data.
•  Runs on Amazon’s Elastic MapReduce (EMR).
–  Not intended for internal clusters.
•  Provides emrlapply(), a parallel version of lapply().
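
Since emrlapply() is described above as a parallel version of lapply(), a job suited to Segue is one that already works with plain lapply(). The sketch below shows such a job running locally. It makes no Segue or EMR calls, and the simulation function estimate_pi is a hypothetical stand-in for a CPU-bound task.

```r
# Hypothetical CPU-bound task: estimate pi from n random points in the unit
# square by counting how many fall inside the quarter circle.
estimate_pi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

set.seed(42)
# Each element is an independent simulation, so the calls could run in any
# order or on separate machines; here they simply run one after another.
estimates <- lapply(rep(10000, 8), estimate_pi)
mean(unlist(estimates))
```

The property that matters is that no call depends on the result of another, which is what makes the work "big CPU, not big data".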


Segue – Example


Performance Comparison:
Streaming and RHIPE


Performance Testing – Environment and Setup

•  Twenty-eight DataNodes:
–  Dual hex-core
–  24GB RAM
–  Shared cluster.

•  Data
–  Airline dataset
–  22 input files
–  About 12GB uncompressed data


Performance Comparison

Number of Reducers    Streaming       RHIPE
264                   246 seconds*    96 seconds*

*All numbers are an average of 3 runs.

Alternatives

Alternate languages/libraries:

•  Apache Mahout
–  Scalable machine learning library.
–  Offers clustering, classification, collaborative filtering on
Hadoop.
•  Python
–  Many modules available to support scientific and statistical
computing.


Alternatives

Alternative parallel processing frameworks:

•  Revolution Analytics
–  Provides commercial packages to support processing big
data with R.
•  Other HPC/parallel processing packages for R, e.g. Rmpi or
snow.


Alternatives

Apache Hive + RJDBC?

•  We haven't been able to get it to work yet.
•  You can, however, wrap calls to the Hive client in R to return R objects. See https://github.com/satpreetsingh/rDBwrappers/wiki

Alternatives

Interesting solutions that you can't have:

•  Ricardo
–  Developed as part of a research project at IBM.
–  Interesting paper published, but apparently no plans to make it available.

Conclusions

•  If practical, consider using Hadoop to aggregate data for input
to R analyses.
•  Avoid using R for general purpose MapReduce use.


Conclusions

•  For simple MapReduce jobs, or embarrassingly parallel jobs
on a local cluster, consider Hadoop streaming.
–  Limited to processing text only.
–  But easy to test at the command line outside of Hadoop:
•  $ cat DATAFILE | ./map.R | sort | ./reduce.R
•  To run compute-bound analyses with a relatively small amount of data in the cloud, look at Segue.

Conclusions

•  Otherwise, your best bet is RHIPE, but definitely check out rmr.
•  Also consider alternatives – Mahout, Python, etc.


Conclusions

On an operational note:

•  Make sure your cluster nodes are consistent – same version of
R installed, required libraries are installed on each node, etc.


Example Code

•  https://github.com/jseidman/hadoop-R


References

•  Hadoop
–  Apache Hadoop project: http://hadoop.apache.org/
–  Hadoop: The Definitive Guide, Tom White, O'Reilly Press, 2011
•  R
–  R Project for Statistical Computing: http://www.r-project.org/
–  R Cookbook, Paul Teetor, O'Reilly Press, 2011
–  Getting Started With R: Some Resources: http://quanttrader.info/public/gettingStartedWithR.html

References

•  Hadoop Streaming
–  Documentation on Apache Hadoop Wiki: http://hadoop.apache.org/mapreduce/docs/current/streaming.html
–  Word count example in R: https://forums.aws.amazon.com/thread.jspa?messageID=129163

References

•  Hadoop InteractiVE
–  Project page on CRAN:
http://cran.r-project.org/web/packages/hive/index.html
–  Simple Parallel Computing in R Using Hadoop: http://www.rmetrics.org/Meielisalp2009/Presentations/Theussl1.pdf

References

•  RHIPE
–  RHIPE - R and Hadoop Integrated Processing Environment:
http://www.stat.purdue.edu/~sguha/rhipe/
–  Code: http://code.google.com/p/rhipe/
–  Wiki: http://code.google.com/p/rhipe/w/list
–  Installing RHIPE on CentOS: https://groups.google.com/forum/#!topic/brumail/qH1wjtBgwYI
–  Introduction to using RHIPE: http://ml.stat.purdue.edu/rhafen/rhipe/
–  RHIPE combines Hadoop and the R analytics language, SD Times: http://www.sdtimes.com/link/34792

References

•  RHIPE
–  Using R and Hadoop to Analyze VoIP Network Data for QoS, Hadoop World 2010:
•  video: http://www.cloudera.com/videos/hw10_video_using_r_and_hadoop_to_analyze_voip_network_data_for_qos
•  slides: http://www.cloudera.com/resource/hw10_voice_over_ip_studying_traffic_characteristics_for_quality_of_service
–  RHIPE examples (k-means, etc.): http://groups.google.com/group/brumail/browse_thread/thread/e403db404f039e31?pli=1

References

•  RHadoop (including rmr)
–  Github: https://github.com/RevolutionAnalytics/RHadoop
–  Advanced Big Data Analytics with R and Hadoop whitepaper: http://info.revolutionanalytics.com/R-and-Hadoop-Big-Data-Analytics-White-Paper.html

References

•  Segue
–  Project page: http://code.google.com/p/segue/
–  Google Group: http://groups.google.com/group/segue-r
–  Abusing Amazon's Elastic MapReduce Hadoop service… easily, from R, Jeffrey Breen: http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
–  Presentation at Chicago Hadoop Users Group, March 23, 2011: http://files.meetup.com/1634302/segue-presentation-RUG.pdf

References

•  Sawmill (A framework for integrating a PMML-compliant Scoring
Engine with Hadoop).
–  More information:
•  Open Data Group www.opendatagroup.com
•  oscon-info@opendatagroup.com
–  Augustus, an open source system for building & scoring
statistical models
•  augustus.googlecode.com
–  PMML
•  Data Mining Group: dmg.org
–  Analytics over Clouds using Hadoop, presentation at Chicago
Hadoop User Group:
http://files.meetup.com/1634302/CHUG 20100721 Sawmill.pdf


References

•  Ricardo
–  Ricardo: Integrating R and Hadoop, paper: http://www.cs.ucsb.edu/~sudipto/papers/sigmod2010-das.pdf
–  Ricardo: Integrating R and Hadoop, PowerPoint presentation: http://www.uweb.ucsb.edu/~sudipto/talks/Ricardo-SIGMOD10.pptx

References

•  General references on Hadoop and R
–  Pete Skomoroch's R and Hadoop bookmarks: http://www.delicious.com/pskomoroch/R+hadoop
–  Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages: http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
–  Quora – How can R and Hadoop be used together?: http://www.quora.com/How-can-R-and-Hadoop-be-used-together

References

•  Mahout
–  Mahout project: http://mahout.apache.org/
–  Mahout in Action, Owen et al., Manning Publications, 2011
•  Python
–  Think Stats: Probability and Statistics for Programmers, Allen B. Downey, O'Reilly Press, 2011
•  CRAN Task View: High-Performance and Parallel Computing with R, a set of resources compiled by Dirk Eddelbuettel: http://cran.r-project.org/web/views/HighPerformanceComputing.html

References

•  Other examples of airline data analysis with R:
–  A simple Big Data analysis using the RevoScaleR package in Revolution R: http://www.r-bloggers.com/a-simple-big-data-analysis-using-the-revoscaler-package-in-revolution-r/

And finally…

Parallel R (working title), Q. Ethan McCallum, Stephen Weston, O'Reilly Press, due autumn 2011

R meets Big Data: a basket of strategies to help you use R for large-scale analysis and computation.
