Big Data Analytics with R
Derek McCrae Norton, Senior Sales Engineer
April 2, 2014
Agenda
• Introduction
• Big Data
• Analytics
• R
• Revolution R Enterprise
• Synergy
• Conclusion
© 2013 Revolution Analytics
Who are you anyway?
• Statistician
– My degrees are all in statistics.
• Consultant
– My experience has been mostly in Marketing Analytics, focusing on Predictive Analytics.
• Sales Engineer
– Still consulting, just with a much heavier emphasis on client interaction.
• Founder/Director, Atlanta R Users Group
– Shameless plug. Please join if interested.
– http://www.meetup.com/R-Users-Atlanta/
• Husband, Father, Outdoorsman, Serial Hobbyist, …
Big Data
Big Data and Big Opportunities
“Big data is data that exceeds the processing capability of conventional database systems”
Edd Dumbill, O’Reilly Radar*, Jan 2012
[Chart: worldwide data created and replicated, in zettabytes; labeled values as extracted: 1, 2, 35]
* radar.oreilly.com/2012/01/what-is-big-data.html
What is Big Data?
Big Data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using standard statistical software.
Snijders, Matzat, & Reips (2012)
Does Big Data Mean Hadoop?
• The short answer is no.
• The longer answer is maybe.
• Hadoop adoption is turning that maybe into a probably.
Analytics
What is Analytics?
Analytics is the combination of mathematical, statistical, and heuristic techniques to glean useful insights from data and to implement actions derived from those insights.
Derek McCrae Norton
Analytics
• The current buzzword is “Data Science,” but I don’t really agree with that nomenclature.
– What statistician, analyst, or “data scientist” actually follows the scientific method?
• That being said, the current definition of “Data Science” is a pretty good surrogate for what we are discussing.
• Whatever descriptors you use, one thing is clear: you must use something to help you carry out the actual work.
– R, Python, SAS, etc.
– RDBMS, Hadoop, etc.
What is the R language?
• A Platform…
– A Procedural Language for Stats, Math and Data Science
– A Complete Data Visualization Framework
– Provided as Open Source
• A Community…
– 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and Machine Learning Projects
– Active User Groups Across the World
• An Ecosystem…
– CRAN: 5000+ Freely Available Packages
– Applicable to Big Data if scaled
THE R USER COMMUNITY
A brief history of R
• 1993: Research project in Auckland, NZ
– Ross Ihaka and Robert Gentleman
• 1995: Released as open-source software
– Generally compatible with the “S” language
• 1997: R core group formed
• 2000: R 1.0.0 released
• 2004: First international user conference in Vienna
• 2013: R 3.0.0 released
R is Free
• Open Source, licensed under GPL (like Linux!)
– Free as in beer
– Free as in freedom
• Flexible
• Open for integration
– Data (SAS, SPSS, Excel, SQL Server, Oracle, …)
– Systems (applications, web servers, …)
• Broad user base
– De facto standard for data analysis teaching
R is exploding in popularity & function
• Web Site Popularity: number of links to main web site [chart comparing R, SAS, SPSS, S-Plus, Stata]
• Scholarly Activity: Google Scholar hits (’05-’09 CAGR): R 46%, SAS -11%, SPSS -27%, S-Plus 0%, Stata 10%
• Internet Discussion: mean monthly traffic on email discussion list [chart comparing R, SAS, Stata, SPSS, S-Plus]
• Package Growth: number of R packages listed on CRAN: 4,332 as of Feb 2013
So why isn’t everyone using R?
“The best thing about R is that it was developed by statisticians. The worst thing about R is that it was developed by statisticians.”
Bo Cowgill, Google (at SF R Meetup)
Otherwise R is Great! Right?
• Who here has used R?
– Thoughts?
• Who has never seen this?
• Who here has more than 1 core/processor?
• Who has ever used r-help?
– “‘They’ did write documentation that told you that Perl was needed, but ‘they’ can’t read it for you.” - Brian D. Ripley, R-help (February 2001)
– “This is all documented in TFM. Those who WTFM don’t want to have to WTFM again on the mailing list. RTFM.” - Barry Rowlingson, R-help (October 2003)
What is Revolution R Enterprise?
Motivators
Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors
Speed of Analysis | Single threaded | Parallel threading | Shrinks analysis time
Enterprise Readiness | Community support | Commercial support | Delivers full service production support
Analytic Breadth & Depth | 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R
Commercial Viability | Risk of deployment of open source | Commercial license | Eliminate risk with open source
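The move from "in-memory bound" to "hybrid memory & disk scalability" can be illustrated with a small base R sketch. This is not how RRE implements it (the slides do not show that); it only shows the general idea of streaming a file in chunks and keeping running totals, so that memory use depends on the chunk size rather than the file size. The helper name `chunked_mean` and its parameters are hypothetical.

```r
# Hypothetical helper: mean of one numeric column of a CSV file,
# reading at most `chunk_rows` data rows into memory at a time.
chunked_mean <- function(path, column, chunk_rows = 1000L) {
  # Read the column names from the first row only.
  header <- names(read.csv(path, nrows = 1L))
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = 1L)  # skip the header line
  total <- 0
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break  # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    # Only running totals survive between chunks.
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

# Small demonstration on a temporary file (toy data, not from the slides).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10, y = (1:10)^2), tmp, row.names = FALSE)
result <- chunked_mean(tmp, "x", chunk_rows = 3L)
```

The same pattern extends to any statistic that can be built from per-chunk summaries (sums, counts, cross-products), which is why it suits descriptive statistics and linear models.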
Introducing Revolution R Enterprise (RRE)
The Big Data Big Analytics Platform
[Diagram: DistributedR, DevelopR, DeployR, ScaleR, ConnectR]
• Big Data Big Analytics Ready
– Enterprise readiness
– High performance analytics
– Multi-platform architecture
– Data source integration
– Development tools
– Deployment tools
The Platform Step by Step: R Capabilities
R+CRAN
• Open source R interpreter
• UPDATED R 3.0.2
• Freely-available R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% Compatible with existing R scripts, functions and packages
RevoR
• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math
Available On:
• Platform™ LSF™ Linux®
• Microsoft® HPC Clusters
• Windows® & Linux Servers
• Windows & Linux Workstations
• IBM® Netezza®
• NEW Cloudera Hadoop®
• NEW Hortonworks Hadoop
• NEW Teradata® Database
• Intel® Hadoop
• IBM BigInsights™
The Platform Step by Step: Parallelization & Data Sourcing
ConnectR
• High-speed & direct connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• EDWs and ADWs
• ODBC
ScaleR
• Ready-to-Use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Correlation & covariance matrices
• Predictive Models – linear, logistic, GLM
• Machine learning
• Monte Carlo simulation
• NEW Tools for distributing customized algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers portability across platforms
Available on:
• Windows Servers
• Red Hat and NEW SuSE Linux Servers
• IBM Platform LSF Linux
• Microsoft HPC Clusters
• NEW Teradata Database
• NEW Cloudera Hadoop
• NEW Hortonworks Hadoop
A single package (RevoScaleR)
The Platform Step by Step: Tools & Deployment
DeployR
• Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs
• Integrates R into application infrastructures
Capabilities:
• Invokes R scripts from web services calls
• RESTful interface for easy integration
• Works with web & mobile apps, leading BI & visualization tools and business rules engines
DevelopR
• Integrated development environment for R
• Visual ‘step-into’ debugger
Available on:
• Windows
Write Once. Deploy Anywhere.
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
[Diagram: DistributedR, ScaleR, ConnectR, DeployR]
• In the Cloud: Amazon AWS
• Workstations & Servers: Desktop, Server
• Clustered Systems: IBM Platform LSF, Microsoft HPC
• EDW: Teradata
• Hadoop: Hortonworks, Cloudera
Synergy
Put it all together
• Talent fresh out of school knows R.
• RRE is R plus more.
• RRE provides a unified way of carrying out analytics (small or big).
• RRE code is portable…
Scale and Portability
• Set “compute context” to define hardware (one line of code)
– Native job-scheduler handles distribution, monitoring, failover, etc.
• Same code runs on other supported architectures
– Just change compute context
42 seconds instead of 6 minutes on the local machine
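The "just change compute context" idea can be sketched with base R's `parallel` package. This is an analogy under stated assumptions, not RRE code: in RevoScaleR the switch is, as far as I know, made with `rxSetComputeContext()`, but the slides do not show the call, so the names below (`local_context`, `cluster_context`, `distributed_mean`, `chunk_summary`) are hypothetical stand-ins. The point is that the analysis function is written once and only the line that selects the context changes.

```r
library(parallel)

# Hypothetical stand-ins for a "compute context": each one is a function
# that applies f to every chunk, either in this session or on local workers.
local_context <- function(chunks, f) lapply(chunks, f)

cluster_context <- function(workers = 2L) {
  function(chunks, f) {
    cl <- makeCluster(workers)   # start local worker processes
    on.exit(stopCluster(cl))     # always shut them down again
    parLapply(cl, chunks, f)
  }
}

# Per-chunk summary that each worker computes independently.
chunk_summary <- function(v) c(sum = sum(v), n = length(v))

# Analysis code written once: split, summarise each chunk, combine.
distributed_mean <- function(x, context, n_chunks = 4L) {
  chunks <- split(x, cut(seq_along(x), n_chunks, labels = FALSE))
  partial <- context(chunks, chunk_summary)
  totals <- Reduce(`+`, partial)
  unname(totals["sum"] / totals["n"])
}

x <- as.numeric(1:1000)

context <- local_context          # the one line that defines the hardware
mean_local <- distributed_mean(x, context)

context <- cluster_context(2L)    # same analysis, different context
mean_cluster <- distributed_mean(x, context)
```

Both calls return the same value; only where the chunk summaries are computed differs. A real scheduler-backed context would also handle monitoring and failover, which this sketch does not attempt.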
References
1. Snijders, C., Matzat, U., & Reips, U.-D. (2012). ‘Big Data’: Big gaps of knowledge in the field of Internet. International Journal of Internet Science, 7, 1-5. http://www.ijis.net/ijis7_1/ijis7_1_editorial.html
2. Conway, D. The Data Science Venn Diagram.
