R ON HADOOP
Kostiantyn Kudriavtsev
Lviv Hadoop User Group, June 19, 2014
Agenda
• What is R?
• Linear Regression
• R on Hadoop
• Summary
What is R?
Object-oriented and functional language for statistics, math and
data science, created by statisticians, with comprehensive
data visualisation and statistical modelling capabilities;
5000+ (and growing) freely available specialised algorithms for
finance, economics, genomics, linguistics and so on;
2M+ users with specialised domain skills;
… but some drawbacks are:
- limited by RAM
- single-threaded
R development environment
RStudio is the de facto standard IDE for R development and is
available in local or server mode.
It can be used not only for coding, but also for visualisation.
It is suitable for developing R solutions on top of Hadoop.
What is Hadoop?
Apache Hadoop is a software framework that supports data-intensive
distributed applications based on the MapReduce (MR) algorithm.
Main idea: move computation to the data.
MR idea:
- Map step: Map(k1, v1) → list(k2, v2)
- Shuffle and sort: the "magic" happens here (sort by k2, data transfer
between nodes, etc.)
- Reduce step: Reduce(k2, list(v2)) → (k3, v3)
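The Map, shuffle and Reduce steps can be illustrated with a toy, in-memory word count in R. This is only a sketch of the data flow, not of Hadoop itself: the input records and the names `map_fn`, `reduce_fn` and `map_reduce` are made up for the example.

```r
# Toy, in-memory illustration of the MapReduce data flow (no Hadoop involved).
# The input records and all function names are invented for this example.

# Map step: Map(k1, v1) -> list(k2, v2)
# k1 is a line number, v1 is a line of text; emit one (word, 1) pair per word.
map_fn <- function(k1, v1) {
  words <- strsplit(v1, " ", fixed = TRUE)[[1]]
  lapply(words, function(w) list(key = w, value = 1L))
}

# Reduce step: Reduce(k2, list(v2)) -> (k3, v3)
reduce_fn <- function(k2, values) {
  list(key = k2, value = sum(unlist(values)))
}

map_reduce <- function(records, map_fn, reduce_fn) {
  # Apply the mapper to every (k1, v1) record and flatten the results.
  pairs <- unlist(
    lapply(seq_along(records), function(i) map_fn(i, records[[i]])),
    recursive = FALSE
  )
  keys <- vapply(pairs, function(p) p$key, character(1))
  vals <- lapply(pairs, function(p) p$value)
  # "Magic": group the values by key, in key order.
  # On a cluster this is where data moves between nodes.
  grouped <- split(vals, keys)
  lapply(names(grouped), function(k) reduce_fn(k, grouped[[k]]))
}

records <- c("r on hadoop", "hadoop streaming", "r streaming")
result <- map_reduce(records, map_fn, reduce_fn)
counts <- setNames(
  vapply(result, function(r) r$value, integer(1)),
  vapply(result, function(r) r$key, character(1))
)
print(counts)
```

Keeping the mapper and reducer free of shared state is what lets Hadoop run them on different nodes.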
Linear regression
A web store might use linear
regression to predict sales of
goods or discover trends.
sale(Product) ~ visitors(Product)
A linear model of this relationship:
sale = α * visitors + β
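As a worked example of the formula, α and β can be computed by ordinary least squares and checked against `lm()`. The numbers below are invented purely for illustration and are not taken from the talk's data.

```r
# Illustrative data only: these numbers are invented for the example.
visitors <- c(10, 20, 30, 40, 50)
sale     <- c(3, 5, 7, 9, 11)   # constructed as 0.2 * visitors + 1

# Ordinary least squares for a single predictor:
#   alpha = cov(visitors, sale) / var(visitors)
#   beta  = mean(sale) - alpha * mean(visitors)
alpha <- cov(visitors, sale) / var(visitors)
beta  <- mean(sale) - alpha * mean(visitors)

# lm() fits the same model; coef() returns (intercept, slope) = (beta, alpha).
fit <- lm(sale ~ visitors)
print(c(alpha = alpha, beta = beta))
print(coef(fit))
```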
Linear regression in R
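library(ggplot2)  # provides qplot() and geom_smooth()
df <- read.csv("Phone.csv", header=TRUE)
# scatter plot of purchases against visits, coloured by product page
qq <- qplot(visited, purchased, colour=product_page, data=df)
# overlay a least-squares regression line for each product page
qq + geom_smooth(method='lm', formula=y~x)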
Linear regression in R
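# keep only the rows for one product page
df.p2 <- df[df$product_page == 'phone_2', ]
# fit purchased = α * visited + β by least squares
m <- lm(purchased ~ visited, data=df.p2)
summary(m)  # coefficients, standard errors, R-squared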
R on Hadoop
Several options:
• Hadoop streaming
• RHadoop
• RHIPE
• SparkR
• Oracle R Advanced Analytics for Hadoop
• etc.
R Hadoop streaming
Hadoop was designed mainly for Java and
provides a comprehensive Java API.
Other languages can be used through the “Streaming
API”. The Streaming API uses the operating system's standard
input (for reading) and standard output (for writing). It
provides a lightweight API for MapReduce compared
to the Java API.
Streaming requires writing two separate scripts (one for the
mapper and one for the reducer) in any language (Python,
Ruby, R, C#, Go, OCaml, Lisp, etc.)
R Hadoop streaming
Streaming API drawbacks:
● while the inputs to the reducer are grouped by key, they are still iterated
over line-by-line, and the boundaries between keys must be detected by the
user
● no way to use different mappers in one MapReduce job
● no way to create different outputs from the reducer
● counters are updated through stderr
Additional disadvantage of implementing streaming in R:
● strict control of R function output is required: R functions are “noisy”, but
only meaningful data must be written to standard output
R Hadoop streaming: Mapper
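The code from this slide is not part of the transcript. The following is a minimal sketch of what a streaming mapper in R could look like, assuming (hypothetically) that the input consists of CSV lines of the form `product_page,visited,purchased`. The names `map_line` and `run_mapper` are illustrative and do not belong to any library.

```r
#!/usr/bin/env Rscript
# Sketch of a Hadoop streaming mapper in R.
# Assumed (hypothetical) input: CSV lines "product_page,visited,purchased".
# Output: one "key<TAB>value" line per record, keyed by product page.

map_line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  # Skip the header row and malformed records instead of failing the task.
  if (length(fields) != 3 || fields[1] == "product_page") return(NULL)
  paste0(fields[1], "\t", fields[2], ",", fields[3])
}

run_mapper <- function(input = file("stdin"), output = stdout()) {
  if (!isOpen(input)) {
    open(input, "r")
    on.exit(close(input))
  }
  # Read one line at a time so memory use does not grow with the input split.
  while (length(line <- readLines(input, n = 1, warn = FALSE)) > 0) {
    out <- map_line(line)
    # Only key/value data may reach stdout; any other output corrupts the job.
    if (!is.null(out)) writeLines(out, output)
  }
}

# In the deployed mapper script the last line would be: run_mapper()
```

Keeping the per-line logic in `map_line` makes it possible to check the mapper locally on a few sample lines before submitting a job.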
R Hadoop streaming: Reducer
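The code from this slide is not part of the transcript. The following is a minimal sketch of a streaming reducer in R, assuming (hypothetically) that it receives lines of the form `product_page<TAB>visited,purchased`, sorted by key, and fits one regression per product page. The names `fit_group` and `reduce_lines` are illustrative. The sketch shows the key-boundary detection that streaming leaves to the user.

```r
#!/usr/bin/env Rscript
# Sketch of a Hadoop streaming reducer in R.
# Assumed (hypothetical) input, sorted by key:
#   product_page<TAB>visited,purchased
# Output per key: product_page<TAB>slope,intercept of purchased ~ visited.

fit_group <- function(key, visited, purchased) {
  coefs <- coef(lm(purchased ~ visited))
  slope <- round(coefs[["visited"]], 6)
  intercept <- round(coefs[["(Intercept)"]], 6)
  paste0(key, "\t", format(slope), ",", format(intercept))
}

reduce_lines <- function(lines) {
  out <- character(0)
  current <- NULL
  visited <- numeric(0)
  purchased <- numeric(0)
  for (line in lines) {
    parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
    nums <- as.numeric(strsplit(parts[2], ",", fixed = TRUE)[[1]])
    # Streaming does not group values for us: a change of key marks a boundary.
    if (!is.null(current) && parts[1] != current) {
      out <- c(out, fit_group(current, visited, purchased))
      visited <- numeric(0)
      purchased <- numeric(0)
    }
    current <- parts[1]
    visited <- c(visited, nums[1])
    purchased <- c(purchased, nums[2])
  }
  # Flush the last group, which has no following key to trigger it.
  if (!is.null(current)) out <- c(out, fit_group(current, visited, purchased))
  out
}

# In the deployed reducer script the last line would be (reading the whole
# partition at once, for brevity):
# writeLines(reduce_lines(readLines(file("stdin"))))
```

Only the final `writeLines` call writes to standard output, which keeps the "noisy" output of functions such as `lm()` or `summary()` out of the job results.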
RHadoop
RHadoop is a set of libraries (written in R)
that aims to make it easier to use R
with Hadoop streaming to develop MR
jobs. As a result, it inherits the general drawbacks of Hadoop
streaming.
RHadoop
RHadoop is still R through Hadoop Streaming
Advantages compared to Streaming:
● no need to manage key changes in the Reducer
● no need to control function output manually
● a simple R API wraps the Streaming API
● R code can be run locally or on Hadoop without
changes
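These advantages can be sketched with the `rmr2` package from the RHadoop project. The example assumes `rmr2` is installed (it is not distributed through CRAN) and is skipped otherwise. The product names and counts are invented for illustration. The `local` backend runs the job without a cluster.

```r
# Assumes the rmr2 package from the RHadoop project is installed;
# the whole example is skipped if it is not.
# The product names and counts below are invented for illustration.
if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  # "local" runs the job in-process; "hadoop" submits the same code to a cluster.
  rmr.options(backend = "local")

  pages     <- c("phone_1", "phone_2", "phone_1", "phone_2", "phone_1")
  purchased <- c(2, 1, 3, 4, 5)
  input <- to.dfs(keyval(pages, purchased))

  job <- mapreduce(
    input  = input,
    map    = function(k, v) keyval(k, v),
    # reduce is called once per key with all of that key's values,
    # so no key-boundary detection is needed.
    reduce = function(k, vv) keyval(k, sum(vv))
  )

  out <- from.dfs(job)
  totals <- setNames(values(out), keys(out))
  print(totals[order(names(totals))])
}
```

Changing only the `rmr.options(backend = ...)` line is what allows the same map and reduce functions to be developed locally and later run on Hadoop.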
Demo time
R on Hadoop in Real Life
Several steps are required to achieve the goal:
1. Data ingestion
2. Data preparation
3. R processing
4. Postprocessing
Lessons Learned
R is slow… for millions of calculations
it is slow even with Hadoop!
How to improve the speed?
Rework the flow: do as much preprocessing as possible before the R
step.
Hadoop streaming supports mappers and reducers written in
different languages.
Think twice. R is great for exploratory analysis and
research, but in production it might cause a performance
penalty.
Q&A
• Thank you for your attention
