Using SparkR to Scale Data Science Applications in Production. Lessons from the Field: Spark Summit East talk by Heiko Korndorf

Scaling SparkR in Production.
Lessons from the Field.
Heiko Korndorf
Wireframe, CEO & Founder

About me
Heiko Korndorf
• CEO & Founder Wireframe
• MS in Computer Science
• Application Areas: ERP, CRM, BI, EAI
• Helping companies in
• Manufacturing
• Telecommunications
• Financial Services
• Utilities
• Oil & Gas
• Professional Services
Rapid Application Development for Hadoop/Spark
Data Science-as-a-Service

What we’ll talk about
Classify this talk ….
• Data Science: Scaling your R application with SparkR
• Data Engineering: How to bring Data Science applications into your production pipelines, i.e. adding R to your toolset
• Management: Integrating Data Science and Data Engineering with SparkR

Agenda
• SparkR Architecture 1.x/2.x
• Reference Projects I + II
• Approach with Spark 1.5/1.6
• Parallelization via YARN
• Dynamic R Deployment, incl. dependencies/packages
• Approach with Spark 2.0
• Parallelization via SparkR
• R-Graphics: headless environment, concurrency
• Use Spark APIs: SQL, MLlib
• On-Prem vs Cloud (Elasticity/decouple storage and compute)
• Integrating Data Science and Data Engineering
• A Broader Look at the Ecosystem
• Outlook and Next Steps

Data Science with R
• Very popular language
• Designed by statisticians
• Large community
• > 10,000 packages
• plus: integrated package management
• But: Limited as Single-Node platform
• Data has to fit in memory
• Limited concurrency for processing

SparkR as seen from R
• Import SparkR-package and initialize SparkSession
• Convert data frames from local R data frames to Spark DataFrame and back
• Read and write data stored in Hadoop HDFS, HBase, Cassandra, and more
• Use Spark Libraries, such as SparkSQL and ML
• Use cluster hardware to distribute data frames and parallelize computation

SparkR Architecture
• Execute R on cluster
• Data Integration
• Spark DataFrame – R data frame
• Access Big Data File Formats
• Parallelization with UDFs
• Use Spark APIs
• SparkSQL
• Spark MLlib

SparkSQL from R
• Execute SQL against Spark DataFrame
• SELECT
• Specify Projection
• WHERE
• Filter criteria
• GROUP BY
• Group/Aggregate
• JOIN
• Join tables
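
The four clauses listed on this slide can be seen together in one small query. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Spark, and the tables `orders` and `customers` and their columns are hypothetical. In SparkR 2.x a query string like this would typically be passed to `sql()` after registering the DataFrames as temporary views.

```python
import sqlite3

# In-memory database standing in for Spark DataFrames (assumption: toy data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, region TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'north');
INSERT INTO orders VALUES (1, 10.0), (1, 25.0), (2, 5.0), (3, 40.0), (3, 2.0);
""")

query = """
SELECT c.region, SUM(o.amount) AS total      -- SELECT: projection and aggregate
FROM orders o
JOIN customers c ON o.customer_id = c.id     -- JOIN: combine the two tables
WHERE o.amount >= 5.0                        -- WHERE: filter criteria
GROUP BY c.region                            -- GROUP BY: one row per region
ORDER BY c.region
"""
totals = conn.execute(query).fetchall()
print(totals)
```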

Time Series Forecasting
• Time Series: a series of data points indexed in time order
• Methods:
• Exponential Smoothing
• Neural Networks
• ARIMA
• ARIMA(p,d,q)
• AR: p = order of the autoregressive part
• I: d = degree of first differencing involved
• MA: q = order of the moving average part
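
To make the role of d concrete, here is a minimal sketch of first differencing applied d times. The function name `difference` and the sample series are illustrative assumptions and do not come from the talk. Fitting the AR and MA parts is not shown.

```python
def difference(series, d=1):
    """Apply first differencing d times: y'_t = y_t - y_{t-1}."""
    values = list(series)
    for _ in range(d):
        # Each pass shortens the series by one element.
        values = [b - a for a, b in zip(values, values[1:])]
    return values

# Toy series with a quadratic trend (assumption: illustrative data only).
squares = [1, 4, 9, 16, 25]
once = difference(squares, d=1)    # linear trend remains
twice = difference(squares, d=2)   # constant: the trend is removed
print(once, twice)
```

A series that becomes roughly constant in mean after d passes is the usual reason for choosing that d in ARIMA(p,d,q).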

“Pedestrian” Challenges
Installing R (+Pkg’s) on cluster
• Modify some Spark and R (custom-build)
• Submit Spark job with R (incl. packages) as YARN dependency
• Challenge: R not installed on cluster
• R’s installation location is hard-coded in R
Creating PNGs, HTML, PDF, …
• “R Markdown” produces HTML, PDF, and more
• Best way to manage those outputs?
• Producing additional output during run
• Creating graphics in headless environments

Parallelization with SparkR 1.x
• Sequential computation: > 20 hrs.
• Single-Server, parallelized: > 4.5 hrs
• SparkR 1.6.2, 25 nodes, 4 cores: ca. 12 mins.
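
The speedups implied by these figures can be worked out with simple arithmetic. This is a rough sketch: it treats "> 20 hrs" and "> 4.5 hrs" as lower bounds and "ca. 12 mins" as exact, and it assumes the slide means 4 cores per node (100 cores in total), which the slide does not state explicitly.

```python
# Runtimes in minutes, taken from the slide (lower bounds / approximations).
sequential_min = 20 * 60
single_server_min = 4.5 * 60
cluster_min = 12

# Assumption: "25 nodes, 4 cores" means 4 cores per node.
total_cores = 25 * 4

speedup_vs_sequential = sequential_min / cluster_min
speedup_vs_single_server = single_server_min / cluster_min
parallel_efficiency = speedup_vs_sequential / total_cores

print(speedup_vs_sequential, speedup_vs_single_server, parallel_efficiency)
```

Under these assumptions the cluster run is at least about 100 times faster than the sequential run, which would correspond to close to linear scaling over 100 cores.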

Microsoft R Server for Spark
• Microsoft R Server for HDInsight
integrates Spark and R
• Based on Revolution Analytics
• UDFs via rxExec()
• Data Sources
• RxXdfFile
• RxTextFile
• RxHiveData
• RxParquetData

Support for User-Defined Functions
• dapply (dapplyCollect)
• input: DataFrame, func [, Schema]
• output: DataFrame
• gapply (gapplyCollect)
• input: DataFrame | GroupedData, groupBy, func [, Schema]
• output: DataFrame
• spark.lapply
• input: parameters, func
• Access to data/HDFS
• output: List
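
The input/output contracts above can be mimicked in a few lines of plain Python. This is not the SparkR API: `gapply_like` and `lapply_like` are hypothetical helpers that only imitate the shape of gapply (group rows by a key, apply a function per group, concatenate the results) and spark.lapply (apply a function to each element of a parameter list in parallel, return a list). Threads stand in for Spark executors.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def gapply_like(rows, key, func):
    """Group rows by `key`, call func(key_value, group_rows), concatenate results."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    out = []
    for key_value in sorted(groups):
        out.extend(func(key_value, groups[key_value]))
    return out

def lapply_like(params, func, workers=4):
    """Apply func to every parameter in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, params))

# Toy data (assumption: illustrative only).
rows = [
    {"store": "a", "sales": 3},
    {"store": "b", "sales": 5},
    {"store": "a", "sales": 4},
]

def total_per_store(store, group):
    # Returns a list of output rows, like a UDF returning a data frame.
    return [{"store": store, "total": sum(r["sales"] for r in group)}]

totals = gapply_like(rows, "store", total_per_store)
squares = lapply_like([1, 2, 3], lambda x: x * x)
print(totals, squares)
```

In SparkR the per-group function likewise receives the grouping key and the group's data, while spark.lapply takes a list of parameters rather than a DataFrame, which makes it a natural fit for running many independent model fits.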

The (Data) Science Process
Public Perception of Science
Science in Reality
Source: Birth of a Theorem – with Cedric Villani (https://www.youtube.com/watch?v=yYwydG_aHPE)

Integrating Dev and Prod
• No Need to Re-Write Applications for Production
• Common Environment for Development, Test and Production
• “Looks like R to Data Science, looks like Spark to Data Engineers”

2-Level Parallelization
(1) Submit multiple jobs to your cluster:
- Cluster Manager (YARN, Spark, Mesos)
- Spark Job: Driver and Executors
(2) Use GPGPU
- Spark Job: Driver and Executor
- Let Executor use GPGPU
(3) Combine 1 and 2

Mix Scala and R
• Call R from Scala
• Add Data Science Module to your Spark Application
• Use Spark/Scala for ETL, R for Science code
• Call Spark from R
• Implement high-performance code in Spark
• More granular control over cluster resources

Spark & R: A Dynamic Ecosystem
Hadoop, Spark & R: Many interesting projects and options
• SparkR (Apache, Databricks)
• R Server for Spark (Microsoft)
• Sparklyr (RStudio)
• SystemML (IBM)
• FastR (Oracle)
• Renjin (BeDataDriven)

Outlook & Misc
• Organizational: Deepen Integration of Data Engineering & Data Science
• Source Code Control & Versioning (git …)
• Continuous Build
• Test Management (RUnit, testthat…?)
• Multi-Output (R Markdown)
• Technical: New Approaches
• Simplify/Unify Data Pipelines (SparkSQL)
• Performance Improvement: use MLlib
• Performance Improvement: move calculation to GPU

Thank You.
Heiko Korndorf
heiko.korndorf@wireframe.li
