SCALABLE DATA SCIENCE WITH SPARKR
Felix Cheung
Principal Engineer - Spark @ Microsoft & Apache Spark Committer
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Disclaimer:
Apache Spark community contributions
Spark in 5 seconds
• General-purpose cluster computing system
• Spark SQL + DataFrame/Dataset + data sources
• Streaming/Structured Streaming
• ML
• GraphX
R
• A programming language for statistical computing and graphics
• S – 1975
• S4 - advanced object-oriented features
• R – 1993
• S + lexical scoping
• Interpreted
• Matrix arithmetic
• Comprehensive R Archive Network (CRAN) – 10k+ packages
SparkR
• R language APIs for Spark and Spark SQL
• Exposes Spark functionality through R-friendly DataFrame APIs
• Runs as its own REPL, sparkR
• or as an R package loaded in IDEs like RStudio

library(SparkR)

sparkR.session()
Architecture
• Native R classes and methods
• RBackend
• Scala “helper” methods (ML pipeline etc.)
www.slideshare.net/SparkSummit/07-venkataraman-sun
Advantages
• JVM processing, full access to DAG capabilities and Catalyst optimizer, predicate pushdown, code generation, etc.
databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
Features - What’s new in SparkR
• SQL
• Data sources (JSON, CSV, PostgreSQL, libsvm)
• SparkSession & default session (streamlined parameter)
as.DataFrame(iris)
• Catalog (external data table management)
• Spark packages, spark.addFiles()
• ML
• R-native UDF
• Cluster support (YARN, Mesos, standalone)
SparkR for Data Science
Decisions, decisions?
[Decision diagram: Distributed? No → native R; Yes → UDF or spark.ml]
Spark ML Pipeline
• Pre-processing, feature extraction, model fitting, validation stages
• Transformer
• Estimator
• Cross-validation/hyperparameter tuning
[Diagram: Tokenizer → HashTF → Logistic Regression]
SparkR API for ML Pipeline
spark.lda(data = text, k = 20, maxIter = 25, optimizer = "em")
[Diagram: a single-entrypoint R API call builds a JVM ML Pipeline: RegexTokenizer → StopWordsRemover → CountVectorizer → LDA]
Model Operations
• summary - print a summary of the fitted model
• predict - make predictions on new data
• write.ml/read.ml - save/load fitted models (slight layout difference: pipeline model plus R metadata)
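The three model operations can be tried end to end on a small local session. The sketch below assumes a local Spark installation that `sparkR.session()` can start; the choice of model, the `iris` data, and the temporary save path are illustrative and not taken from the talk.

```r
library(SparkR)

sparkR.session()  # assumption: Spark is installed and can run locally

# Column names such as Sepal.Length become Sepal_Length in the Spark DataFrame
df <- suppressWarnings(as.DataFrame(iris))

model <- spark.glm(df, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")

# summary: coefficients and fit statistics of the fitted model
print(summary(model))

# predict: returns a SparkDataFrame with an added "prediction" column
pred <- predict(model, df)
head(select(pred, "Sepal_Length", "prediction"))

# write.ml / read.ml: save the fitted model and load it back
path <- file.path(tempdir(), "glm-model-example")  # hypothetical location
write.ml(model, path, overwrite = TRUE)
restored <- read.ml(path)
```

The restored object can be passed to `summary` and `predict` in the same way as the original model.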
Spark.ml in SparkR 2.0.0
• Generalized Linear Model (GLM)
• Naive Bayes Model
• k-means Clustering
• Accelerated Failure Time (AFT) Survival Model
Spark.ml in SparkR 2.1.0
• Generalized Linear Model (GLM)
• Naive Bayes Model
• k-means Clustering
• Accelerated Failure Time (AFT) Survival Model
• Isotonic Regression Model
• Gaussian Mixture Model (GMM)
• Latent Dirichlet Allocation (LDA)
• Alternating Least Squares (ALS)
• Multilayer Perceptron Model (MLP)
• Kolmogorov-Smirnov Test (K-S test)
• Multiclass Logistic Regression
• Random Forest
• Gradient Boosted Tree (GBT)
RFormula
• Specify modeling in symbolic form
y ~ f0 + f1
response y is modeled linearly by f0 and f1
• Supports a subset of R formula operators: ~ , . , : , + , -
• Implemented as feature transformer in core Spark, available to Scala/Java, Python
• String label column is indexed
• String term columns are one-hot encoded
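To see what "string term columns are one-hot encoded" means in practice, base R's `model.matrix` gives a local analogue of expanding a formula into a numeric feature matrix. This is only a sketch in plain R, not Spark's RFormula: the toy data frame is made up, and Spark may pick a different reference category and column order.

```r
# Toy data: one numeric term (f0) and one string term (f1); values are made up
df <- data.frame(
  y  = c(1.0, 2.0, 3.0, 4.0),
  f0 = c(0.5, 1.5, 2.5, 3.5),
  f1 = c("a", "b", "c", "a"),
  stringsAsFactors = TRUE
)

# Expand the symbolic formula into a numeric design matrix.
# The string term f1 becomes indicator columns, with one level used as reference.
X <- model.matrix(y ~ f0 + f1, data = df)
print(colnames(X))
print(X)
```

The numeric term passes through unchanged, while the three-level string term turns into two 0/1 indicator columns.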
Generalized Linear Model


# R-like
glm(Sepal_Length ~ Sepal_Width + Species, gaussianDF, family = "gaussian")

spark.glm(binomialDF, Species ~ Sepal_Length + Sepal_Width, family = "binomial")
• “binomial”: outputs string label, prediction
Multilayer Perceptron Model


spark.mlp(df, label ~ features, blockSize = 128,
          layers = c(4, 5, 4, 3), solver = "l-bfgs",
          maxIter = 100, tol = 0.5, stepSize = 1)
Multiclass Logistic Regression


spark.logit(df, label ~ ., regParam = 0.3, elasticNetParam = 0.8,
            family = "multinomial", thresholds = c(0, 1, 1))
• binary or multiclass
Random Forest


spark.randomForest(df, Employed ~ ., type = "regression",
                   maxDepth = 5, maxBins = 16)
spark.randomForest(df, Species ~ Petal_Length + Petal_Width,
                   "classification", numTrees = 30)
• For “classification”, the label is indexed and the predicted label is converted back to a string
Gradient Boosted Tree


spark.gbt(df, Employed ~ ., type = "regression", maxDepth = 5, maxBins = 16)
spark.gbt(df, IndexedSpecies ~ ., type = "classification", stepSize = 0.1)
• For “classification”, the label is indexed and the predicted label is converted back to a string
• Binary classification
Modeling Parameters


spark.randomForest
function(data, formula, type = c("regression", "classification"),
         maxDepth = 5, maxBins = 32, numTrees = 20, impurity = NULL,
         featureSubsetStrategy = "auto", seed = NULL,
         subsamplingRate = 1.0,
         minInstancesPerNode = 1, minInfoGain = 0.0,
         checkpointInterval = 10,
         maxMemoryInMB = 256, cacheNodeIds = FALSE)
Spark.ml Challenges
• Limited API sets
• Non-trivial to map spark.ml API to R API
• Keeping up with changes
• Almost all (except One vs Rest)
• Simple API, but fixed ML pipeline
• Debugging is hard
• Not an ML-specific problem
• Getting better?
Native-R UDF
• User-Defined Functions - custom transformation
• Apply by Partition
• Apply by Group
[Diagram: data.frame → UDF → data.frame]
Parallel Processing By Partition
[Diagram: each partition is handed to its own R process as a data.frame, run through the UDF, and returned as a data.frame]
UDF: Apply by Partition
• Similar to R apply
• Function to process each partition of a DataFrame
• Mapping of Spark/R data types

dapply(carsSubDF,
       function(x) {
         x <- cbind(x, x$mpg * 1.61)
       },
       schema)
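The apply-by-partition contract (data.frame in, data.frame out, results concatenated) can be emulated locally in base R before moving to a cluster. This is a sketch of the idea only, not of Spark's implementation: the three-way split of `mtcars` and the column name `kmpg` are assumptions made for illustration.

```r
# Stand-in for the distributed DataFrame: two columns of the built-in mtcars data
cars_sub <- mtcars[, c("mpg", "cyl")]

# Hypothetical partitioning: three contiguous chunks of rows
part_id <- cut(seq_len(nrow(cars_sub)), breaks = 3, labels = FALSE)
partitions <- split(cars_sub, part_id)

# The UDF receives one partition as a data.frame and returns a data.frame
udf <- function(x) {
  cbind(x, kmpg = x$mpg * 1.61)
}

# Apply the UDF to every partition, then concatenate the results
out <- do.call(rbind, unname(lapply(partitions, udf)))
head(out)
```

In the Spark version, the schema argument has to describe the columns of the returned data.frame, here `mpg`, `cyl` and the added `kmpg`.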
UDF: Apply by Partition + Collect
• No schema

out <- dapplyCollect(
carsSubDF,
function(x) {
x <- cbind(x, "kmpg" = x$mpg*1.61)
})
Example - UDF
results <- dapplyCollect(train,
  function(x) {
    model <- randomForest::randomForest(
      as.factor(dep_delayed_15min) ~ Distance + night + early,
      data = x, importance = TRUE, ntree = 20)
    predictions <- predict(model, t)
    data.frame(UniqueCarrier = t$UniqueCarrier, delayed = predictions)
  })
• closure capture - serialize & broadcast “t”
• access package “randomForest::” at each invocation
UDF: Apply by Group
• By grouping columns

gapply(carsDF, "cyl",
function(key, x) {
y <- data.frame(key, max(x$mpg))
},
schema)
UDF: Apply by Group + Collect
• No Schema

out <- gapplyCollect(carsDF, "cyl",
function(key, x) {
y <- data.frame(key, max(x$mpg))
names(y) <- c("cyl", "max_mpg")
y
})
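The apply-by-group contract, where the function receives the grouping key and that group's rows, can likewise be emulated in base R. This local sketch uses the built-in `mtcars` data in place of `carsDF` and is not how Spark executes `gapply`.

```r
# Group the rows by the grouping column, as gapply(carsDF, "cyl", ...) would
groups <- split(mtcars, mtcars$cyl)
keys <- as.numeric(names(groups))

# The UDF receives the key and the group's rows, and returns a data.frame
udf <- function(key, x) {
  data.frame(cyl = key, max_mpg = max(x$mpg))
}

# Call the UDF once per group, then concatenate the per-group results
out <- do.call(rbind, unname(Map(udf, keys, groups)))
print(out)
```

Each call sees only one group's rows, which is why a single very large group can become a bottleneck in the distributed version.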
UDF: data type mapping (not a complete list)
R                    Spark
byte                 byte
integer              integer
float                float
double, numeric      double
character, string    string
binary, raw          binary
logical              boolean
POSIXct, POSIXlt     timestamp
Date                 date
array, list          array
env                  map
UDF Challenges
• “struct”
• No support for nested structures as columns
• Scaling up / data skew
• What if partition or group too big for single R process?
• Not enough data variety to run model?
• Performance costs
• Serialization/deserialization, data transfer
• esp. beware of closure capture
UDF: lapply
• Like R lapply or doParallel
• Good for “embarrassingly parallel” tasks
• Such as hyperparameter tuning
UDF: lapply
• Take a native R list, distribute it
• Run the UDF in parallel
[Diagram: vector/list → each element → UDF → *anything* → list]
UDF: parallel distributed processing
• Output is a list - needs to fit in memory at the driver
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
  model <- e1071::svm(Species ~ ., iris, cost = cost)
  summary(model)
}
summaries <- spark.lapply(costs, train)
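Because `spark.lapply` has the same calling shape as base R's `lapply`, the grid and the function can be prototyped locally first. The sketch below shows that the cost grid is log-spaced and runs a stand-in function over it with plain `lapply`; `train_stub` is hypothetical and only takes the place of an expensive model fit.

```r
# Five costs spaced evenly on a log scale between 1 and 1000
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
print(round(costs, 2))  # 1, ~5.62, ~31.62, ~177.83, 1000

# Hypothetical stand-in for an expensive fit: returns a small result per cost
train_stub <- function(cost) {
  list(cost = cost, log10_cost = log10(cost))
}

# Local, sequential equivalent of spark.lapply(costs, train_stub)
summaries <- lapply(costs, train_stub)
str(summaries[[2]])
```

Once the function works locally, switching `lapply` to `spark.lapply` distributes the same calls, with the whole result list still collected at the driver.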
Walkthrough
Demo at felixcheung.github.io
One last thing…
SparkR as a Package (target 2.1.1)
• Goal: simple one-line installation of SparkR from CRAN
install.packages("SparkR")
• Spark Jar downloaded from official release and cached
automatically, or manually install.spark() since Spark 2.0.0
• R vignettes
• Community can write packages that depend on the SparkR package
• Advanced Spark JVM interop APIs
sparkR.newJObject

sparkR.callJMethod

sparkR.callJStatic
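A minimal sketch of how the three interop calls fit together, assuming a SparkR release that exports them and a local Spark installation. The Java classes used here (`java.util.ArrayList`, `java.lang.System`) are ordinary JDK classes picked for illustration.

```r
library(SparkR)

sparkR.session()  # assumption: Spark is installed and can run locally

# Create an object on the JVM side and get a handle to it in R
jlist <- sparkR.newJObject("java.util.ArrayList")

# Call instance methods on that JVM object
sparkR.callJMethod(jlist, "add", 42L)
first <- sparkR.callJMethod(jlist, "get", 0L)

# Call a static method on a JVM class
now_ms <- sparkR.callJStatic("java.lang.System", "currentTimeMillis")
```

These calls bypass the R-friendly API layer, so arguments need to be R values that map onto the JVM method's parameter types.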
Ecosystem
• RStudio sparklyr
• RevoScaleR/RxSpark, R Server
• H2O R
• Apache SystemML (R-like API)
• Renjin (not Spark)
• IBM BigInsights Big R (not Spark!)
Recap: SparkR 2.0.0, 2.1.0
• SparkSession
• ML
• UDF
What’s coming in SparkR 2.1.1
• Fix Gamma family with GLM, optimizer in LDA (SPARK-19133, SPARK-19066)
• Partitioning DataFrame (SPARK-18335, SPARK-18788)
df <- as.DataFrame(cars, numPartitions = 10)
getNumPartitions(df)
• Setting column R-friendly shortcuts (SPARK-19130, SPARK-18823)
df$foo <- 1
df[[myname]] <- 1; df[[2]] <- df$eruptions / 60
• Spark UI URL sparkR.uiWebUrl (SPARK-18903)
• install.spark better download error handling (SPARK-19231)
What’s coming in SparkR 2.2.0
• More, richer ML - Bisecting K-means
More in-planning and not committed - feedback appreciated!
• Tweedie GLM
• collect performance (SPARK-18924)
• ML Pipeline in SparkR (SPARK-18822)
• Richer RFormula support (SPARK-18570, SPARK-18569)
• Better tree ensemble summary (SPARK-18348)
• ML persistence format (SPARK-15572)
Thank You.
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI
