R Hadoop integration
- Objectives
- Contents:
• Introduction to R
• When to use R in combination with Hadoop
• Implementation of R integration with Hadoop
• Examples using Hadoop
- Q&A
- References
Objectives
• Understand R
• Understand when to use R in combination with Hadoop
• Understand the implementation of the integration
Introduction to R
• Software for Statistical Data Analysis
• Based on S
• Programming Environment
• Interpreted Language
• Data Storage, Analysis, Graphing
• Free and Open Source Software
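
To make this concrete, here is a short sketch of a typical interactive R session using only base functions and a built-in dataset (no Hadoop involved yet): a linear model is fitted, summarized and plotted in a few lines.

  # Fit a linear model on a built-in dataset and inspect/plot it --
  # a typical interactive workflow using only base R functions.
  data(mtcars)
  fit <- lm(mpg ~ wt, data = mtcars)   # miles per gallon vs. car weight
  summary(fit)                         # coefficients, R-squared, p-values
  plot(mtcars$wt, mtcars$mpg,
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
  abline(fit, col = "red")             # overlay the fitted regression line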
R's strengths:
• Free and open source
• Strong user community
• Highly extensible and flexible
• Implementation of high-end statistical methods
• Flexible graphics and intelligent defaults
But ...
• Steep learning curve
• Slow for large datasets
When to use R in combination with Hadoop
• Use Hadoop to execute R code
• Use R to access data stored in Hadoop
Guidelines for deciding when to integrate (factor – mantra – guideline):

1. R's natural strength – "Use R for statistical computing": consider integrating when your project can be solved using code available in R, or when it is not easily solved in other languages.
2. Hadoop's natural strength – "Use Hadoop for distributed storage & batch computing": consider integrating when your problem requires lots of storage or when it could benefit from parallelization.
3. Coding effort – "Work smart, not hard": R and Hadoop are tools, not "cure-all" panaceas; consider not integrating if it is easier to solve your problem with other tools.
4. Processing time – "Work smart, not hard": although some problems can benefit from parallelization, consider not integrating if the gains are negligible, since this can help you reduce the complexity of your project.
Typical scenarios (scenario – use R with Hadoop? – why – example):

1. Analyzing small data stored in Hadoop – Yes. R can quickly download the data and analyze it locally. Example: analyzing summary datasets derived from MapReduce jobs run in Hadoop.
2. Extracting complex features from large data stored in Hadoop – Yes. R has more built-in and contributed functions for analyzing data than many standard programming languages. Example: R is a natural language for writing an algorithm or classifier that extracts information about objects contained in images.
3. Applying prediction and classification models to datasets – Yes. R is better at modeling than many standard programming languages. Example: using a logistic regression model to generate predictions on a large dataset (a minimal sketch follows this list).
4. Implementing an "iteration-based" machine learning algorithm – Maybe. (1) Other languages may be faster than R for your analysis; (2) Hadoop reads and writes a lot of data to disk, while other "big data" tools such as Spark (and SparkR) are designed for speed in these scenarios by working in memory. Example: training a k-means clustering algorithm or a logistic regression on a large dataset.
5. Simple preprocessing of large data stored in Hadoop – No. Standard programming languages are much faster than R at many basic text and image processing tasks. Example: preprocessing Twitter tweets for use in a natural language processing project.
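
As a minimal illustration of scenario 3, the sketch below trains a logistic regression with glm() and scores new records with predict(). The data frame and column names (train, new_batch, converted, age, visits) are hypothetical; on a genuinely large dataset the scoring step would typically be distributed across the cluster, for example with the rmr package described below.

  # Hypothetical: 'train' and 'new_batch' are data frames with a binary
  # outcome 'converted' and predictors 'age' and 'visits'.
  model <- glm(converted ~ age + visits, data = train, family = binomial)
  # Score a new batch of records; type = "response" returns probabilities.
  new_batch$p_convert <- predict(model, newdata = new_batch, type = "response")
  head(new_batch)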
Implementation of R integration with Hadoop

The integration is typically implemented through three R packages: rhdfs (HDFS access), rhbase (HBase access) and rmr (MapReduce from R).
rhdfs:
• Manipulate HDFS directly from R
• Mimic as much of the HDFS Java API as possible
• Examples:
– Read an HDFS text file into a data frame
– Serialize/Deserialize a model to HDFS
– Write an HDFS file to local storage
• rhdfs/pkg/inst/unitTests
• rhdfs/pkg/inst/examples
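
A minimal sketch of how these operations look in practice, assuming rhdfs is installed, the HADOOP_CMD environment variable points at the hadoop binary, and the HDFS paths shown are placeholders:

  library(rhdfs)
  hdfs.init()                              # connect to HDFS
  hdfs.ls("/user/analyst")                 # list a directory (hypothetical path)
  # Read a small CSV stored in HDFS into an R data frame
  con <- hdfs.file("/user/analyst/summary.csv", "r")
  raw <- hdfs.read(con)                    # returns a raw byte vector
  df  <- read.csv(textConnection(rawToChar(raw)))
  hdfs.close(con)
  # Copy an HDFS file to local storage
  hdfs.get("/user/analyst/summary.csv", "/tmp/summary.csv")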
rhbase:
• Manipulate HBase tables and their content
• Uses the Thrift C++ API to communicate with HBase
• Examples:
– Create a data frame from a collection of rows and columns in an HBase table
– Update an HBase table with values from a data frame
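
A minimal, hypothetical sketch of the corresponding rhbase calls. It assumes the HBase Thrift server is running and reachable with default settings; the table, row and column names are placeholders, and the exact argument shapes can vary between rhbase versions:

  library(rhbase)
  hb.init()                                # connect via the HBase Thrift gateway
  hb.list.tables()                         # see which tables are available
  # Fetch a couple of rows from a (hypothetical) table into R
  rows <- hb.get("metrics", list("day-2014-01-01", "day-2014-01-02"))
  # Insert a value back: each change is list(row key, column names, values)
  hb.insert("metrics", list(list("day-2014-01-03", c("cf:visits"), list(42))))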
rmr:
• Designed to be the simplest and most elegant way to
write MapReduce programs
• Gives the R programmer the tools necessary to perform
data analysis in a way that is “R” like
• Provides an abstraction layer to hide the implementation
details
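
A commonly cited "hello world" for rmr is a map-only job that squares a vector of integers; a minimal sketch (assuming the rmr2 package is installed and the Hadoop environment variables are set) looks like this:

  library(rmr2)
  ints <- to.dfs(1:1000)                         # push an R object into HDFS
  out  <- mapreduce(
    input = ints,
    map   = function(k, v) keyval(v, v^2))       # emit (i, i^2) pairs
  result <- from.dfs(out)                        # pull the result back into R
  head(result$val)                               # the squared values
  # For quick experiments without a cluster:
  # rmr.options(backend = "local")

The same mapreduce() call runs unchanged on the local backend and on a full cluster, which is exactly the abstraction layer described above.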
References
- http://cran.r-project.org
- http://revolutionanalytics.com
- Hadoop for Dummies
- R – a brief introduction, Gilberto Câmara
Editor's Notes

• #6: R is software that provides a programming environment for statistical data analysis. It was written by Robert Gentleman and Ross Ihaka, and its name comes from the first initial of its creators' first names. It is a free implementation of S, another popular statistical language, and can be used effectively for data storage, data analysis and a wide variety of graphing. R is distributed free of charge and is open source.
• #7: R is a great piece of software. It is freely distributed (free both in price and in freedom of use, with no restrictions). It has a very strong user community that is ready to help newcomers and share information, and it has extensive documentation. Best of all, it is highly scalable: everything from the simplest to the most advanced statistical methods can be implemented in R. Its graphics are very flexible and come with many intelligent defaults, meaning R can often guess what you are trying to do and act accordingly. On the downside, learning to use R effectively takes time; the process can be slow and sometimes frustrating, but it is ultimately rewarding. R can also be slow on very large datasets, although there are several ways to speed it up, and newer versions are invariably faster than older ones, so keeping the software up to date is a good way to speed things up.