RHive : Integrating R and Hive
                             Introduction



                                                     JunHo Cho
                                           Data Analysis Platform Team




Friday, November 11, 11
Analysis of Data

                    [Figure labels: MapReduce; CF, Classifier, Decision Tree, Recommendation, Graph, Clustering]


Related Works


                    •     RHIPE

                    •     RHadoop

                    •     hive (Hadoop InteractiVE)

                    •     segue

                    Must understand MapReduce


RHive is inspired by ...

                    •     Many analysts have used R for a long time

                    •     Many analysts can use SQL

                    •     R already has a lot of statistical functions

                    •     R needs a capability to analyze big data

                    •     Hive supports a SQL-like query language (HQL)

                    •     Hive executes HQL as MapReduce jobs




                          R is the best solution for familiarity
                          Hive is the best solution for capability


RHive Components


                   •      Hadoop

                          •   store and analyze big data

                   •      Hive

                          •   use HQL instead of MapReduce programming

                   •      R

                          •   provide a familiar environment for analysts


RHive - Architecture
                   Execute R Function Objects and R Objects
                   through Hive Query

                   Execute Hive Query through R

                   rcal <- function(arg1,arg2) {
                       coeff * sum(arg1,arg2)
                   }

                   SELECT R('rcal',col1,col2)
                   FROM tab1

                   [Figure: RHive architecture. Labels: RServe; R Function, R Object, RUDF, RUDAF]


RHive API

                •         Extension R Functions
                      •     rhive.connect       •   rhive.napply        •   rhive.load.table
                      •     rhive.query         •   rhive.sapply        •   rhive.desc.table
                      •     rhive.assign        •   rhive.aggregate
                      •     rhive.export        •   rhive.list.tables




                •         Extension Hive Functions
                      •     RUDF

                      •     RUDAF

                      •     GenericUDTFExpand

                      •     GenericUDTFUnFold


RUDF - R User-defined Functions

                   SELECT R('R function-name',col1,col2,...,TYPE)
                    •     The UDF cannot know the return type before it calls the R function

                          •    TYPE : return type, given as a value (0.0 in the example below)

                Example : R function which sums all passed columns


                sumCols <- function(arg1,...) {
                   sum(arg1,...)
                }
                rhive.assign('sumCols',sumCols)
                # hadoop_clusters is a placeholder for the Hadoop cluster hosts
                rhive.exportAll('sumCols',hadoop_clusters)
                result <- rhive.query("SELECT R('sumCols', col1, col2, col3, col4, 0.0) FROM tab")
                plot(result)
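
                The row-by-row behaviour of an RUDF can be tried without a cluster. The sketch below is
                an illustration in base R only, not RHive code: the data frame `tab` and its values are
                made up, and mapply() stands in for Hive calling the function once per row.

```r
# Local stand-in for a Hive table; the values are made up for illustration.
tab <- data.frame(col1 = c(1, 2),
                  col2 = c(10, 20),
                  col3 = c(100, 200),
                  col4 = c(1000, 2000))

# Same function as on the slide: sums every argument it receives.
sumCols <- function(arg1, ...) {
  sum(arg1, ...)
}

# An RUDF is evaluated once per row; mapply() mimics that locally.
result <- mapply(sumCols, tab$col1, tab$col2, tab$col3, tab$col4)
print(result)
```

                Each element of `result` corresponds to one input row, which is the shape a numeric
                return type (0.0) describes.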



RUDAF - R User-defined Aggregation Function
                            SELECT RA('R function-name',col1,col2,...)
                    •        R alone cannot manipulate a large dataset

                    •        Supports the UDAF life cycle

                           •    iterate, partial-merge, merge, terminate

                    •        The return type is always a string delimited by ',' - "data1,data2,data3,..."

                    [Figure: UDAF data flow through FUN, FUN.partial, FUN.merge, and FUN.terminate, with partial aggregations passed between stages as aggregation values]
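
                    The life cycle above can be simulated in base R. This is a sketch under assumptions:
                    the function names (avg_iterate and so on), the (sum, count) state, and the two
                    partitions are hypothetical, and the exact contract RHive expects from FUN,
                    FUN.partial, FUN.merge, and FUN.terminate is not shown on the slide.

```r
# Hypothetical four-stage aggregate that computes a mean.
# The state is a (sum, count) pair so partial results can be merged.
avg_iterate <- function(state, value) c(state[1] + value, state[2] + 1)

# Serialize the state as a comma-delimited string, as RUDAF results are.
avg_partial <- function(state) paste(state, collapse = ",")

# Fold one serialized partial state into the running state.
avg_merge <- function(state, partial) {
  state + as.numeric(strsplit(partial, ",", fixed = TRUE)[[1]])
}

# Produce the final string result (a single value here).
avg_terminate <- function(state) as.character(state[1] / state[2])

# Two made-up partitions, standing in for data held by two mappers.
partitions <- list(c(1, 2, 3), c(4, 5))

# Map side: iterate over each partition, then emit a partial aggregation.
partials <- vapply(partitions, function(values) {
  state <- c(0, 0)
  for (v in values) state <- avg_iterate(state, v)
  avg_partial(state)
}, character(1))

# Reduce side: merge the partial aggregations and terminate.
merged <- c(0, 0)
for (p in partials) merged <- avg_merge(merged, p)
final <- avg_terminate(merged)
print(final)
```

                    Keeping the state mergeable is the design point: no stage ever needs the whole
                    dataset in one R process.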




UDTF : unfold and expand
                    •     RUDAF only returns a string delimited by ','

                    •     unfold and expand convert the RUDAF result so it can be loaded as an R data.frame



                   unfold(string_value,type1,type2,...,delimiter)
                   expand(string_value,type,delimiter)

                  RA('newcenter',...) returns "num1,num2,num3" per cluster-key

                  SELECT unfold(tb1.x,0.0,0.0,0.0,',') AS (col1,col2,col3)
                  FROM (SELECT RA('newcenter',attr1,attr2,attr3,attr4) AS x
                        FROM table GROUP BY cluster-key) tb1
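
                  What unfold does to an RA result can be pictured locally. The sketch below is a
                  base R stand-in, not the Hive UDTF itself: `unfold_local` is a hypothetical helper,
                  the input strings are made up, and every field is assumed to be numeric.

```r
# Hypothetical local stand-in for the unfold UDTF: split each delimited
# string into one column per field. All fields are assumed numeric.
unfold_local <- function(strings, col_names, delimiter = ",") {
  parts <- strsplit(strings, delimiter, fixed = TRUE)
  values <- do.call(rbind, lapply(parts, as.numeric))
  df <- as.data.frame(values)
  names(df) <- col_names
  df
}

# Made-up RA output: one "num1,num2,num3" string per cluster key.
ra_output <- c("1.5,2.0,3.25", "4.0,5.5,6.0")

centers <- unfold_local(ra_output, c("col1", "col2", "col3"))
print(centers)
```

                  The result has one row per cluster key and one typed column per field, which is the
                  form an R data.frame needs.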




napply and sapply

                   rhive.napply(table-name,FUN,col1,...)
                   rhive.sapply(table-name,FUN,col1,...)
                  •       napply : apply an R function with a Numeric result type

                  •       sapply : apply an R function with a String result type



                      Example : R function which sums all passed columns

                      sumCols <- function(arg1,...) {
                         sum(arg1,...)
                      }
                      result <- rhive.napply("tab", sumCols, "col1", "col2", "col3", "col4")
                      rhive.load.table(result)




napply

         •       'napply' is similar to the R apply function

         •       Stores a large result in HDFS as a Hive table


   rhive.napply       <- function(tablename, FUN, col = NULL, ...) {
         # Build the column list that follows the function name in R(...)
         if(is.null(col))
              cols <- ""
         else
              cols <- paste(",",col)

         for(element in c(...)) {
              cols <- paste(cols,",",element)
         }

         # Export FUN to the cluster under a unique name
         exportname <- paste(tablename,"_sapply",as.integer(Sys.time()),sep="")

         rhive.assign(exportname,FUN)
         rhive.exportAll(exportname)
         tmptable <- paste(exportname,"_table",sep="")
         rhive.query(
                   paste("CREATE TABLE ", tmptable," AS SELECT ","R('",exportname,"'",cols,",0.0) FROM ",tablename,sep=""))

         # Return the name of the Hive table that holds the result
         tmptable
   }
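
   To see the HiveQL that this function sends, the string-building step can be run on its own.
   The helper below is a hypothetical, simplified version written for illustration: it takes the
   export name as an argument instead of deriving it from Sys.time(), joins the columns without
   extra spaces, and calls no RHive functions.

```r
# Hypothetical helper: reproduces only the query-building step of napply.
build_napply_query <- function(tablename, exportname, cols) {
  col_list <- paste(cols, collapse = ",")
  tmptable <- paste(exportname, "_table", sep = "")
  paste("CREATE TABLE ", tmptable,
        " AS SELECT R('", exportname, "',", col_list, ",0.0) FROM ",
        tablename, sep = "")
}

# "tab_sapply1" is a made-up export name.
query <- build_napply_query("tab", "tab_sapply1", c("col1", "col2"))
print(query)
```

   The trailing 0.0 is the TYPE argument of the RUDF, which is why napply results are numeric.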




aggregate

                 rhive.aggregate(table-name,hive-FUN,...,groups)

            •      Aggregates data stored in HDFS using a Hive aggregation function




                  Example : Aggregate using SUM (Hive aggregation function)

                  result <- rhive.aggregate("emp", "SUM", "sal", groups="deptno")
                  rhive.load.table(result)




Examples - predict flight delay
    library(RHive)
    rhive.connect()

    # Retrieve training set from large dataset stored in HDFS
    train <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand())")

    # Native R code: clean the sample and fit a linear model
    train$arrdelay <- as.numeric(train$arrdelay)
    train$distance <- as.numeric(train$distance)
    train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),]
    model <- lm(arrdelay ~ distance + dayofweek,data=train)

    # Export R object data
    rhive.assign("model", model)

    # Hive query + R code: analyze big data using the model calculated by R
    predict_table <- rhive.napply("airlines",function(arg1,arg2,arg3) {
            if(is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0)
            res <- predict.lm(model, data.frame(dayofweek=arg1,arrdelay=arg2,distance=arg3))
            return(as.numeric(res)) }, 'dayofweek', 'arrdelay', 'distance')
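
    The same workflow can be rehearsed locally before touching a cluster. The sketch below is an
    illustration in base R only: the training data is synthetic (generated with made-up
    coefficients, not real flight records), `predict_row` is a hypothetical name for the function
    passed to rhive.napply, and mapply() stands in for Hive calling it once per row.

```r
set.seed(1)

# Synthetic stand-in for the sampled airlines rows; not real flight data.
train <- data.frame(
  dayofweek = sample(1:7, 200, replace = TRUE),
  distance  = runif(200, 100, 2500)
)
train$arrdelay <- 5 + 0.01 * train$distance + 2 * train$dayofweek + rnorm(200)

# Native R code: fit the same linear model as on the slide.
model <- lm(arrdelay ~ distance + dayofweek, data = train)

# Same shape as the function passed to rhive.napply:
# one row of column values in, one number out.
predict_row <- function(arg1, arg2, arg3) {
  if (is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0)
  res <- predict(model, data.frame(dayofweek = arg1, arrdelay = arg2, distance = arg3))
  as.numeric(res)
}

# Local equivalent of applying the function to every row of the table.
predicted <- mapply(predict_row, train$dayofweek, train$arrdelay, train$distance)
print(head(predicted))
```

    On the cluster, the exported `model` object plays the role of the global `model` used here.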


DEMO

Conclusion

                    •     RHive uses HQL rather than MapReduce-style programming

                    •     RHive lets analysts do everything in the R console

                    •     RHive lets R data and HDFS data interact



                    •     Current and Future Work
                          •   Integrate Hadoop HDFS

                          •   Support Transform/Map-Reduce scripts

                          •   Distributed Rserve

                          •   Support a more R-style API

                          •   Support machine learning algorithms (k-means, classifier, ...)


Cooperators


                   •      JunHo Cho

                   •      Seonghak Hong

                   •      Choonghyun Ryu

                   •      YOU!

How to join the RHive project


               •          Logo

               •          GitHub (https://github.com/nexr/RHive)

               •          CRAN (http://cran.r-project.org/web/packages/RHive)

               •          You are welcome to join the RHive project


References



              •       Ricardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf)

              •       RHIPE (http://ml.stat.purdue.edu/rhipe)

              •       Hive (http://hive.apache.org)

              •       Parallel R by Q. Ethan McCallum and Stephen Weston


jun.cho@nexr.com




Appendix




RHIPE

                •         The R and Hadoop Integrated Processing Environment

                •         Must understand the MapReduce model

            map <- function() {...}
            reduce <- function() {...}
            rmr <- rhmr(map,reduce,...)

            [Figure: RHIPE architecture. Labels: R, Fork, RHMR, R Objects (map, reduce), R Conf, PersonalServer, ProtocolBuf, Mapper with R Objects (map), shuffle / sort, Reducer with R Objects (reduce), HDFS]

RHadoop
                  •       Manipulate Hadoop data stores and HBASE directly from R
                  •       Write MapReduce models in R using Hadoop Streaming
                  •       Must understand the MapReduce model


           map <- function() {...}
           reduce <- function() {...}
           mapreduce(input,output,map,reduce,...)

           [Figure: RHadoop architecture. Labels: R with rhbase, rhdfs, rmr; manipulate; store R objects as file; execute Hadoop Streaming; Mapper, shuffle / sort, Reducer, each running R; HBASE; HDFS]

hive (Hadoop InteractiVE)
                •         R extension facilitating distributed computing via the MapReduce paradigm
                •         Provides an interface to Hadoop, HDFS and Hadoop Streaming
                •         Must understand the MapReduce model

           map <- function() {...}
           reduce <- function() {...}
           hive_stream(map,reduce,...)

           [Figure: hive architecture. Labels: R with hive, DFS, hive_stream; manipulate; save R script on local; execute Hadoop Streaming; Mapper, shuffle / sort, Reducer, each running R; HDFS]

segue

            •       Simple parallel processing with a fast and easy setup on Amazon Web Services.
            •       Parallel lapply function for the EMR engine using Hadoop Streaming.
            •       Supports only the map step, not the full MapReduce model.



           data <- list(...)
           emrlapply(clusterObject,data,FUN,...)

           [Figure: segue architecture. Labels: R with emrlapply, awsFunctions; save R objects (data + FUN) on local; upload R objects; Amazon S3; EMR; bootstrap (setup R); mapper.R; Hadoop Streaming; Mapper running R]

Ricardo
                    •     Integrates R and Jaql (JSON Query Language)
                    •     Requires knowing Jaql, an uncommon query language
                    •     Not open-source

                    Ref : Ricardo-SIGMOD10