Scaling SparkR in
Production.
Lessons from the Field.
Heiko Korndorf
Wireframe, CEO & Founder
About me
Heiko Korndorf
• CEO & Founder Wireframe
• MS in Computer Science
• Application Areas: ERP, CRM, BI, EAI
• Helping companies in
• Manufacturing
• Telecommunications
• Financial Services
• Utilities
• Oil & Gas
• Professional Services
Rapid Application Development
for Hadoop/Spark
Data Science-as-a-Service
What we’ll talk about
Classify this talk ….
• Data Science: Scaling your R application with SparkR
• Data Engineering: How to bring Data Science applications into
your production pipelines, i.e. adding R to your toolset.
• Management: Integrating Data Science and Data Engineering with
SparkR
Agenda
• SparkR Architecture 1.x/2.x
• Reference Projects I + II
• Approach with Spark 1.5/1.6
• Parallelization via YARN
• Dynamic R Deployment, incl. dependencies/packages
• Approach with Spark 2.0
• Parallelization via SparkR
• R-Graphics: headless environment, concurrency
• Use Spark APIs: SQL, MLlib
• On-Prem vs Cloud (Elasticity/decouple storage and compute)
• Integrating Data Science and Data Engineering
• A Broader Look at the Ecosystem
• Outlook and Next Steps
Data Science with R
• Very popular language
• Designed by statisticians
• Large community
• > 10,000 packages
• plus: integrated package management
• But: Limited as Single-Node platform
• Data has to fit in memory
• Limited concurrency for processing
SparkR Projects
SparkR as seen from R
• Import SparkR-package and initialize SparkSession
• Convert between local R data frames and Spark DataFrames
• Read and write data stored in Hadoop HDFS, HBase, Cassandra, and more
• Use Spark Libraries, such as SparkSQL and ML
• Use cluster hardware to distribute data frames and parallelize computation
SparkR Architecture
• Execute R on cluster
• Data Integration
• Spark DataFrame – R data frame
• Access Big Data File Formats
• Parallelization with UDFs
• Use Spark APIs
• SparkSQL
• Spark MLlib
SparkSQL from R
• Execute SQL against
Spark DataFrame
• SELECT
• Specify Projection
• WHERE
• Filter criteria
• GROUP BY
• Group/Aggregate
• JOIN
• Join tables
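
The four clauses on this slide can be combined in a single statement. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for SparkSQL, and the `readings` and `sites` tables and their columns are hypothetical. In SparkR a statement of the same shape would be run against DataFrames registered as views.

```python
import sqlite3

# Assumption: sqlite3 stands in for SparkSQL; tables and data are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (site_id INTEGER, value REAL);
    CREATE TABLE sites (site_id INTEGER, region TEXT);
    INSERT INTO readings VALUES (1, 10.0), (1, 30.0), (2, 5.0), (3, 50.0);
    INSERT INTO sites VALUES (1, 'north'), (2, 'south'), (3, 'north');
""")

query = """
    SELECT s.region, AVG(r.value) AS avg_value   -- SELECT: projection
    FROM readings r
    JOIN sites s ON r.site_id = s.site_id        -- JOIN: join tables
    WHERE r.value >= 10.0                        -- WHERE: filter criteria
    GROUP BY s.region                            -- GROUP BY: group/aggregate
    ORDER BY s.region
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The only reading from the 'south' site is below the filter threshold, so only the 'north' group remains in the result.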
Native Spark ML
Time Series Forecasting
• Time Series: a series of data points indexed in time order
• Methods:
• Exponential Smoothing
• Neural Networks
• ARIMA(p,d,q):
• AR: p = order of the autoregressive part
• I: d = degree of first differencing involved
• MA: q = order of the moving average part
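
To make the d parameter concrete, here is a minimal sketch of differencing in plain Python. The `difference` helper and the sample series are hypothetical and not taken from the talk. It only shows the "I" step of ARIMA, not model fitting.

```python
def difference(series, d=1):
    """Apply first differencing d times: y'[t] = y[t] - y[t-1]."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# A linear trend becomes constant after one round of differencing (d = 1).
linear = [3, 5, 7, 9, 11]
linear_d1 = difference(linear, d=1)

# A quadratic trend needs two rounds (d = 2) before it becomes constant.
quadratic = [t * t for t in range(6)]
quadratic_d1 = difference(quadratic, d=1)
quadratic_d2 = difference(quadratic, d=2)

print(linear_d1, quadratic_d1, quadratic_d2)
```

Each round of differencing shortens the series by one point, which is why d is usually kept small.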
“Pedestrian” Challenges
Installing R (+Pkg’s) on cluster
• Modify some Spark and R (custom-build)
• Submit Spark job with R (incl. packages) as YARN dependency
• Challenge: R not installed on cluster
• R’s installation location is hard-coded in R
Creating PNGs, HTML, PDF, …
• “R Markdown” produces HTML, PDF, and more
• Best way to manage those outputs?
• Producing additional output during run
• Creating graphics in headless environments
Parallelization with SparkR 1.x
• Sequential computation: > 20 hrs.
• Single server, parallelized: > 4.5 hrs.
• SparkR 1.6.2, 25 nodes, 4 cores: ca. 12 mins.
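
As a rough back-of-the-envelope reading of these figures, the sketch below computes the implied speedups. It treats the slide's bounds ("> 20 hrs", "> 4.5 hrs", "ca. 12 mins") as exact values and assumes "4 cores" means four cores per node. Neither is stated in the talk, so the results are only indicative.

```python
# Runtimes from the slide, converted to minutes (bounds treated as exact values).
sequential_min = 20 * 60       # "> 20 hrs"
single_server_min = 4.5 * 60   # "> 4.5 hrs"
cluster_min = 12               # "ca. 12 mins"

# Assumption: 25 nodes with 4 cores each.
cores = 25 * 4

speedup_single = sequential_min / single_server_min
speedup_cluster = sequential_min / cluster_min
efficiency = speedup_cluster / cores  # speedup per core, relative to sequential

print(f"single server: {speedup_single:.1f}x")
print(f"cluster: {speedup_cluster:.0f}x on {cores} cores "
      f"(efficiency {efficiency:.0%})")
```

Under these assumptions the cluster run is about 100 times faster than the sequential run, which would correspond to near-linear scaling across 100 cores.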
Microsoft R Server for Spark
• Microsoft R Server for HDInsight
integrates Spark and R
• Based on Revolution Analytics
• UDFs via rxExec()
• Data Sources
• RxXdfFile
• RxTextFile
• RxHiveData
• RxParquetData
Parallelization with SparkR 2.x
Support for User-Defined Functions
• dapply (dapplyCollect)
• input: DataFrame, func [, Schema]
• output: DataFrame
• gapply (gapplyCollect)
• input: DataFrame | GroupedData, groupBy, func [, Schema]
• output: DataFrame
• spark.lapply
• input: parameters, func
• Access to data/HDFS
• output: List
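
The difference between the grouped and the list-based patterns can be sketched in plain Python. This is not SparkR code: `grouped_apply` and `parallel_lapply` are hypothetical helpers that mimic the semantics of gapply (split by key, apply a function per group, combine) and spark.lapply (run a function once per parameter, return a list). A local thread pool stands in for cluster executors.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def grouped_apply(rows, key, func):
    """gapply-like: split rows by key, apply func(key, group), combine results."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    out = []
    for k in sorted(groups):
        out.extend(func(k, groups[k]))
    return out

def parallel_lapply(params, func, workers=4):
    """spark.lapply-like: call func once per parameter, results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, params))

# Hypothetical data: sales rows grouped by store.
rows = [
    {"store": "a", "sales": 10},
    {"store": "b", "sales": 7},
    {"store": "a", "sales": 5},
]

def total_per_store(store, group):
    return [{"store": store, "total": sum(r["sales"] for r in group)}]

totals = grouped_apply(rows, "store", total_per_store)
squares = parallel_lapply([1, 2, 3], lambda x: x * x)
print(totals, squares)
```

The grouped pattern fits per-key work over one large data set, while the list pattern fits independent runs over a set of parameters.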
Cultural Integration
The (Data) Science Process
Public Perception of Science vs. Science in Reality
Source: Birth of a Theorem – with Cedric Villani (https://www.youtube.com/watch?v=yYwydG_aHPE)
Integrating Dev and Prod
• No Need to Re-Write Applications
for Production
• Common Environment for
Development, Test and Production
• “Looks like R to Data Science,
looks like Spark to Data
Engineers”
2-Level Parallelization
(1) Submit multiple jobs to your cluster:
- Cluster Manager (YARN, Spark, Mesos)
- Spark Job: Driver and Executors
(2) Use GPGPU
- Spark Job: Driver and Executor
- Let Executor use GPGPU
(3) Combine 1 and 2
Mix Scala and R
• Call R from Scala
• Add Data Science Module to
your Spark Application
• Use Spark/Scala for ETL, R for
Science code
• Call Spark from R
• Implement high-performance
code in Spark
• More granular control over
cluster resources
Spark & R: A Dynamic Ecosystem
Hadoop, Spark & R: Many interesting projects and options
• SparkR (Apache, Databricks)
• R Server for Spark (Microsoft)
• sparklyr (RStudio)
• SystemML (IBM)
• FastR (Oracle)
• Renjin (BeDataDriven)
Outlook & Misc
• Organizational: Deepen Integration of Data Engineering & Data Science
• Source Code Control & Versioning (git …)
• Continuous Build
• Test Management (RUnit, testthat…?)
• Multi-Output (Rmarkdown)
• Technical: New Approaches
• Simplify/Unify Data Pipelines (SparkSQL)
• Performance Improvement: use MLlib
• Performance Improvement: move calculation to GPU
Thank You.
Heiko Korndorf
heiko.korndorf@wireframe.li

Using SparkR to Scale Data Science Applications in Production. Lessons from the Field: Spark Summit East talk by Heiko Korndorf
