Recent Developments in
SparkR for Advanced Analytics
Xiangrui Meng
meng@databricks.com
2016/06/07 - Spark Summit 2016
About Me
• Software Engineer at Databricks
• tech lead of machine learning and data science
• Committer and PMC member of Apache Spark
• Ph.D. from Stanford in computational mathematics
Outline
• Introduction to SparkR
• Descriptive analytics in SparkR
• Predictive analytics in SparkR
• Future directions
Introduction to SparkR
Bridging the gap between R and Big Data
SparkR
• Introduced in Spark 1.4
• Wrappers over DataFrames and DataFrame-based APIs
• In SparkR, we make the APIs similar to existing ones in R
(or R packages), rather than Python/Java/Scala APIs.
• R is very convenient for analytics and users love it.
• Scalability is the main issue, not the API.
DataFrame-based APIs
• Storage: s3 / HDFS / local / …
• Data sources: csv / parquet / json / …
• DataFrame operations:
• select / subset / groupBy / agg / collect / …
• rand / sample / avg / var / …
• Conversion to/from R data.frame
SparkR Architecture
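[Figure: SparkR architecture. On the Spark driver, R communicates with the JVM through RBackend; the driver JVM dispatches work to JVM workers, which read from data sources.]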
Data Conversion between R and SparkR
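[Figure: R and the JVM exchange data through RBackend. SparkR::createDataFrame() turns an R data.frame into a Spark DataFrame; SparkR::collect() returns a Spark DataFrame to R as a data.frame.]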
Descriptive Analytics
Big Data at a glance in SparkR
Summary Statistics
• count, min, max, mean, standard deviation, variance
describe(df)
df %>% groupBy("dept") %>% agg(avgAge = avg(df$age))
• covariance, correlation
df %>% select(covar_samp(df$x, df$y), corr(df$x, df$y))
• skewness, kurtosis
df %>% select(skewness(df$x), kurtosis(df$x))
Sampling Algorithms
• Bernoulli sampling (without replacement)

df %>% sample(FALSE, 0.01)
• Poisson sampling (with replacement)

df %>% sample(TRUE, 0.01)
• stratified sampling

df %>% sampleBy("key", c(positive = 1.0, negative = 0.1))
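The three sampling schemes above can be sketched outside Spark. This is a minimal pure-Python illustration, not SparkR code: the function names (`bernoulli_sample`, `poisson_sample`, `sample_by`) and the toy `rows` data are hypothetical, and the Poisson draw uses Knuth's multiplication method as an assumed stand-in for whatever sampler Spark uses internally.

```python
import math
import random

def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction` (no repeats)."""
    return [r for r in rows if rng.random() < fraction]

def poisson_draw(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def poisson_sample(rows, fraction, rng):
    """Emit each row Poisson(fraction) times, so a row may appear repeatedly."""
    return [r for r in rows for _ in range(poisson_draw(fraction, rng))]

def sample_by(rows, key, fractions, rng):
    """Stratified sampling: Bernoulli sampling with one fraction per stratum."""
    return [r for r in rows if rng.random() < fractions.get(key(r), 0.0)]

# Hypothetical data: every tenth row is labeled "positive".
rng = random.Random(42)
rows = [("positive" if i % 10 == 0 else "negative", i) for i in range(1000)]
stratified = sample_by(rows, lambda r: r[0],
                       {"positive": 1.0, "negative": 0.1}, rng)
```

With a fraction of 1.0 every "positive" row is kept, while "negative" rows are thinned to roughly 10%, which mirrors the intent of the `sampleBy` call on the slide.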
Approximate Algorithms
• frequent items [Karp03]
df %>% freqItems(c("title", "gender"), support = 0.01)
• approximate quantiles [Greenwald01]
df %>% approxQuantile("value", c(0.1, 0.5, 0.9), relativeError = 0.01)
• single pass with aggregate pattern
• trade-off between accuracy and space
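To make the accuracy/space trade-off concrete, here is a rough single-pass sketch of the counter-based frequent-items idea from [Karp03]. It is an illustration in plain Python, not Spark's `freqItems` implementation; `frequent_items` and the sample `titles` list are hypothetical names. It keeps at most ceil(1/support) - 1 counters, so every item whose relative frequency exceeds `support` survives, but the result may also contain false positives.

```python
import math

def frequent_items(items, support):
    """Single-pass candidate set for items with frequency above `support`."""
    capacity = max(1, math.ceil(1.0 / support) - 1)  # number of counters kept
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

# Hypothetical column values: "a" makes up half of the data.
titles = ["a", "b", "a", "c", "a", "d", "a", "e"]
candidates = frequent_items(titles, support=0.4)
```

A smaller `support` means more counters (more space) and fewer missed candidates; a second pass over the data could remove the false positives if exact answers are needed.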
Implementation: Aggregation Pattern
split + aggregate + combine in a single pass
• split data into multiple partitions
• calculate partially aggregated result on each partition
• combine partial results into final result
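The split + aggregate + combine steps above can be sketched in a few lines. This is a local Python illustration under stated assumptions: the helper names (`split`, `aggregate`, `combine`, `describe`) are hypothetical, partitions are plain lists rather than Spark partitions, and the input is assumed to be non-empty.

```python
from functools import reduce

def split(values, num_partitions):
    """Split the data into roughly equal partitions."""
    return [values[i::num_partitions] for i in range(num_partitions)]

def aggregate(partition):
    """Partial result for one partition: (count, sum, min, max)."""
    return (len(partition), sum(partition), min(partition), max(partition))

def combine(a, b):
    """Merge two partial results into one."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))

def describe(values, num_partitions=4):
    """Single pass over the data: each value is read once by `aggregate`."""
    partials = [aggregate(p) for p in split(values, num_partitions) if p]
    count, total, lo, hi = reduce(combine, partials)
    return {"count": count, "mean": total / count, "min": lo, "max": hi}

stats = describe([float(v) for v in range(1, 11)], num_partitions=3)
```

Because `combine` is associative, the partial results can be merged in any order, which is what lets the per-partition work run in parallel.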
Implementation: High-Performance
• new online update formulas of summary statistics
• code generation to achieve high performance
kurtosis of 1 billion values on a MacBook Pro (2 cores):
Tool                        Time
scipy.stats                 250s
octave                      120s
CRAN::moments               70s
SparkR / Spark / PySpark    5.5s
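The slide does not spell out the update formulas, so the sketch below uses one well-known set of pairwise merge formulas for central moments as an assumption. It shows how kurtosis can be computed from partial aggregates (n, mean, M2, M3, M4) that are merged across partitions; the function names and sample data are hypothetical, and the per-partition `moments` helper is written for clarity rather than speed.

```python
from functools import reduce

def moments(values):
    """Partial aggregate for one partition: (n, mean, M2, M3, M4)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    m3 = sum((v - mean) ** 3 for v in values)
    m4 = sum((v - mean) ** 4 for v in values)
    return (n, mean, m2, m3, m4)

def merge(a, b):
    """Combine two partial aggregates without revisiting the raw data."""
    na, mean_a, m2a, m3a, m4a = a
    nb, mean_b, m2b, m3b, m4b = b
    n = na + nb
    d = mean_b - mean_a
    mean = mean_a + d * nb / n
    m2 = m2a + m2b + d ** 2 * na * nb / n
    m3 = (m3a + m3b + d ** 3 * na * nb * (na - nb) / n ** 2
          + 3 * d * (na * m2b - nb * m2a) / n)
    m4 = (m4a + m4b
          + d ** 4 * na * nb * (na * na - na * nb + nb * nb) / n ** 3
          + 6 * d ** 2 * (na * na * m2b + nb * nb * m2a) / n ** 2
          + 4 * d * (na * m3b - nb * m3a) / n)
    return (n, mean, m2, m3, m4)

def kurtosis(agg):
    """Excess kurtosis from a merged aggregate."""
    n, _, m2, _, m4 = agg
    return n * m4 / (m2 * m2) - 3.0

data = [1.0, 2.0, 4.0, 8.0, 16.0, 3.0, 5.0, 7.0, 9.0]
partitions = [data[0:3], data[3:6], data[6:9]]
merged = reduce(merge, [moments(p) for p in partitions])
```

Only five numbers per partition cross partition boundaries, which is why this fits the single-pass aggregation pattern.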
Predictive Analytics
Enabling large-scale machine learning in SparkR
MLlib + SparkR
MLlib and SparkR integration started in Spark 1.5.
API design choices:
1. mimic the methods implemented in R or R packages
• no new method to learn
• similar but not the same / shadows existing methods
• inconsistent APIs
2. create a new set of APIs
Generalized Linear Models (GLMs)
• Linear models are simple but extremely popular.
• A GLM is specified by the following:
• a distribution of the response (from the exponential family),
• a link function g such that $\mathbb{E}[y] = g^{-1}(x^\top \beta)$,
• coefficients $\beta$ that maximize the sum of log-likelihoods
Distributions and Link Functions
SparkR supports all families supported by R in Spark 2.0.
Model                   Distribution    Link
linear least squares    normal          identity
logistic regression     binomial        logit
Poisson regression      Poisson         log
gamma regression        gamma           inverse
…                       …               …
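As a rough illustration of the table, the sketch below pairs each family with its default link and applies the inverse link to a linear predictor to get the mean response. It is plain Python, not the SparkR or MLlib API; `LINKS`, `DEFAULT_LINK`, `predict_mean`, and the lowercase family keys are hypothetical names chosen for this example.

```python
import math

# Each entry maps a link name to (link g, inverse link g^{-1}).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}

# Default link per family, following the table above.
DEFAULT_LINK = {
    "gaussian": "identity",
    "binomial": "logit",
    "poisson": "log",
    "gamma": "inverse",
}

def predict_mean(family, coefficients, features):
    """Mean response: inverse link applied to the linear predictor."""
    eta = sum(c * x for c, x in zip(coefficients, features))
    _, inverse_link = LINKS[DEFAULT_LINK[family]]
    return inverse_link(eta)

gaussian_mean = predict_mean("gaussian", [1.0, 2.0], [1.0, 3.0])
binomial_mean = predict_mean("binomial", [0.0], [1.0])
```

Swapping the family changes only the link pair; the linear predictor is computed the same way in every case.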
GLMs in SparkR
# Create the DataFrame for training
df <- read.df(sqlContext, "path/to/training")
# Fit a Gaussian linear model
model <- glm(y ~ x1 + x2, data = df, family = "gaussian") # mimic R
model <- spark.glm(df, y ~ x1 + x2, family = "gaussian")
# Get the model summary
summary(model)
# Make predictions
predict(model, newDF)
Implementation: SparkR::glm
`SparkR::glm` is a simple wrapper over an ML pipeline
that consists of the following stages:
• RFormula, which itself embeds an ML pipeline for
feature preprocessing and encoding,
• an estimator (GeneralizedLinearRegression).
[Figure: the pipeline behind SparkR::glm. RWrapper contains RFormula and GLM; the expanded view shows StringIndexer, VectorAssembler, and IndexToString stages.]
Implementation: R Formula
• R provides model formulas to express models.
• We support the following R formula operators in SparkR:
• `~` separate target and terms
• `+` concat terms, "+ 0" means removing intercept
• `-` remove a term, "- 1" means removing intercept
• `:` interaction (multiplication for numeric values, or binarized
categorical values)
• `.` all columns except target
• The implementation is in Scala.
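To show how the operators above interact, here is a toy formula parser. It is a hedged Python sketch and not Spark's Scala RFormula implementation: `parse_formula` is a hypothetical helper, it assumes terms contain no spaces (so an interaction is written `a:b`), and it leaves interaction terms as strings instead of expanding them into features.

```python
import re

def parse_formula(formula, columns):
    """Return (target, terms, has_intercept) for a simple R-style formula."""
    target, rhs = [part.strip() for part in formula.split("~")]
    terms, intercept = [], True
    # Split the right-hand side into signed tokens such as "+x1" or "-1".
    for sign, token in re.findall(r"([+-]?)\s*([^+\-\s]+)", rhs):
        remove = sign == "-"
        if token in ("0", "1"):
            # "+ 0" and "- 1" drop the intercept; "+ 1" keeps it.
            intercept = (token == "1") != remove
            continue
        # "." expands to every column except the target.
        expanded = [c for c in columns if c != target] if token == "." else [token]
        for term in expanded:
            if remove:
                if term in terms:
                    terms.remove(term)
            elif term not in terms:
                terms.append(term)
    return target, terms, intercept

parsed = parse_formula("b ~ . - 1", ["a1", "a2", "b"])
```

Here `parsed` describes a model on all columns except `b` with the intercept removed, which matches the first formula used in the test-against-R slide.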
Implementation: Test against R
Besides normal tests, we also verify our implementation using R.
/*
  df <- as.data.frame(cbind(A, b))
  for (formula in c(b ~ . -1, b ~ .)) {
    model <- lm(formula, data=df, weights=w)
    print(as.vector(coef(model)))
  }

  [1] -3.727121 3.009983
  [1] 18.08 6.08 -0.60
*/
val expected = Seq(Vectors.dense(0.0, -3.727121, 3.009983),
  Vectors.dense(18.08, 6.08, -0.60))
ML Models in SparkR
• generalized linear models (GLMs)
• glm / spark.glm (stats::glm)
• accelerated failure time (AFT) model for survival analysis
• spark.survreg (survival)
• k-means clustering
• spark.kmeans (stats::kmeans)
• Bernoulli naive Bayes
• spark.naiveBayes (e1071)
Model Persistence in SparkR
• model persistence supported for all ML models in SparkR
• thin wrappers over pipeline persistence from MLlib
model <- spark.glm(df, x ~ y + z, family = "gaussian")
write.ml(model, path)
model <- read.ml(path)
summary(model)
• feasible to pass saved models to Scala/Java engineers
Work with R Packages in SparkR
• There are ~8500 community packages on CRAN.
• It is impossible for SparkR to match all existing features.
• Not every dataset is large.
• Many people work with small/medium datasets.
• SparkR helps in those scenarios by:
• connecting to different data sources,
• filtering or downsampling big datasets,
• parallelizing training/tuning tasks.
Work with R Packages in SparkR
df <- sqlContext %>% read.df(…) %>% collect()
points <- data.matrix(df)
run_kmeans <- function(k) {
  kmeans(points, centers = k)
}
kk <- 1:6
lapply(kk, run_kmeans) # R’s apply
spark.lapply(sc, kk, run_kmeans) # parallelize the tasks
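The same "collect small data, then parallelize independent tasks" idea can be sketched locally. This Python example is only an analogy for the lapply / spark.lapply pair above: a thread pool stands in for Spark executors, and `POINTS` and the toy one-dimensional `run_kmeans` are hypothetical stand-ins for the collected data and for `stats::kmeans`.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical small dataset that fits in local memory.
POINTS = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1, 8.7]

def run_kmeans(k, iterations=10):
    """Toy 1-D k-means; returns the total within-cluster sum of squares."""
    lo, hi = min(POINTS), max(POINTS)
    # Deterministic start: centers spread evenly over the data range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in POINTS:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in POINTS)

kk = range(1, 7)
sequential = list(map(run_kmeans, kk))            # one task after another
with ThreadPoolExecutor(max_workers=3) as pool:   # tasks run concurrently
    parallel = list(pool.map(run_kmeans, kk))
```

Each value of k is an independent task, so the parallel run returns the same results as the sequential one, in the same order.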
summary(this.talk)
• SparkR enables big data analytics in R
• descriptive analytics on top of DataFrames
• predictive analytics from MLlib integration
• SparkR works well with existing R packages
Thanks to the Apache Spark community for developing and
maintaining SparkR: Alteryx, Berkeley AMPLab, Databricks,
Hortonworks, IBM, Intel, etc, and individual contributors!!
Future Directions
• CRAN release of SparkR
• more consistent APIs with existing R packages: dplyr, etc
• better R formula support
• more algorithms from MLlib: decision trees, ALS, etc
• better integration with existing R packages: gapply / UDFs
• integration with Spark packages: GraphFrames, CoreNLP, etc
We’d greatly appreciate feedback from the R community!
Try Apache Spark with Databricks
http://databricks.com/try
• Download a companion notebook of this talk at: http://dbricks.co/1rbujoD
• Try the latest version of Apache Spark and a preview of Spark 2.0
Thank you.
• SparkR user guide on Apache Spark website
• MLlib roadmap for Spark 2.1
• Office hours:
• 2-3:30pm at Expo Hall Theater; 3:45-6pm at Databricks booth
• Databricks Community Edition and blog posts
