Enabling Exploratory Data Science
with Apache Spark and R
Hossein Falaki (@mhfalaki)
About the speaker: Hossein Falaki
Hossein Falaki is a software engineer at Databricks
working on the next big thing. Prior to that, he was
a data scientist at Apple's personal assistant, Siri.
He graduated with a Ph.D. in Computer Science
from UCLA, where he was a member of the Center
for Embedded Networked Sensing (CENS).
2
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with
Databricks; he is a hands-on data science engineer
with more than 15 years of experience developing
internet-scale infrastructure, data platforms, and
distributed systems for both on-premises and cloud.
3
We are Databricks, the company behind Apache Spark
• Founded by the creators of Apache Spark in 2013
• 75% share of Spark code contributed by Databricks in 2014
• Created Databricks on top of Spark to make big data simple
4
Why do we like R?
5
• Open source
• Highly dynamic
• Interactive environment
• Rich ecosystem of packages
• Powerful visualization infrastructure
• Data frames make data manipulation convenient
• Taught by many schools to stats and computing students
What would be ideal?
Seamless manipulation and analysis of very large data in R
• R’s flexible syntax
• R’s rich package ecosystem
• R’s interactive environment
• Scalability (scale up and out)
• Integration with distributed data sources / storage
6
Augmenting R with other frameworks
In practice, data scientists use R in conjunction with other frameworks
(Hadoop MR, Hive, Pig, relational databases, etc.)
7
[Workflow diagram] Framework X (Language Y) with distributed and local storage:
1. Load, clean, transform, aggregate, sample
2. Save to local storage
3. Read and analyze in R
Iterate
What is SparkR?
An R package distributed with Apache Spark:
• Provides an R frontend to Spark
• Exposes Spark DataFrames (inspired by R and Pandas)
• Convenient interoperability between R and Spark DataFrames
8
R (dynamic environment, interactivity, packages, visualization)
+ Spark (distributed/robust processing, data sources, off-memory data structures)
How does SparkR solve our problems?
No local storage involved
Write everything in R
Use Spark's distributed cache for interactive/iterative analysis at
speed of thought
9
Example SparkR program
# Loading distributed data
df <- read.df("hdfs://bigdata/logs", source = "json")
# Distributed filtering and aggregation
errors <- subset(df, df$type == "error")
counts <- agg(groupBy(errors, df$code), num = count(df$code))
# Collecting and plotting small data
qplot(code, num, data = collect(counts), geom = "bar", stat = "identity") + coord_flip()
10
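For readers without a Spark cluster at hand, the same filter-then-group-then-count logic can be sketched locally in plain Python; the sample records below are hypothetical, standing in for the distributed JSON logs:

```python
import json
from collections import Counter

# Hypothetical log records, standing in for the JSON logs on HDFS
raw = """
{"type": "error", "code": 500}
{"type": "info",  "code": 200}
{"type": "error", "code": 404}
{"type": "error", "code": 500}
"""
logs = [json.loads(line) for line in raw.strip().splitlines()]

# Analogue of subset(df, df$type == "error")
errors = [r for r in logs if r["type"] == "error"]

# Analogue of agg(groupBy(errors, df$code), num = count(df$code))
counts = Counter(r["code"] for r in errors)
print(dict(counts))  # {500: 2, 404: 1}
```

In SparkR the filtering and aggregation run on the cluster and only the small `counts` result is collected; here everything is local, but the shape of the computation is the same.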
SparkR architecture
11
[Architecture diagram] The R process communicates with the Spark driver JVM through the R Backend; the driver coordinates the JVM workers, which read from the data sources.
Overview of SparkR API
IO
• read.df / write.df
• createDataFrame / collect
Caching
• cache / persist / unpersist
• cacheTable / uncacheTable
Utility functions
• dim / head / take
• names / rand / sample / ...
12
ML Lib
• glm / kmeans /
DataFrame API
select / subset / groupBy
head / showDF / unionAll
agg / avg / column / ...
SQL
sql / table / saveAsTable
registerTempTable / tables
Overview of SparkR API :: SQLContext
SQLContext is your interface to Spark functionality in R
o SparkR DataFrames are implemented on top of Spark SQL tables
o All DataFrame operations go through a SQL optimizer (Catalyst)
13
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
From now on, you don't need the Spark context (sc) any more.
Moving data between R and JVM
14
[Diagram] read.df() and write.df() move data between the JVM workers and distributed storage (HDFS/S3/…).
Moving data between R and JVM
15
[Diagram] SparkR::collect() and SparkR::createDataFrame() move data between the R process and the JVM, via the R Backend.
Overview of SparkR API :: Caching
16
Controls caching of distributed data:
o persist(sparkDF, storageLevel)
o DISK_ONLY
o MEMORY_AND_DISK
o MEMORY_AND_DISK_SER
o MEMORY_ONLY
o MEMORY_ONLY_SER
o OFF_HEAP
o cache(sparkDF) == persist(sparkDF, "MEMORY_ONLY")
o cacheTable(sqlContext, "table_name")
Overview of SparkR API :: DataFrame API
A SparkR DataFrame behaves similarly to an R data.frame
o sparkDF$newCol <- sparkDF$col + 1
o subsetDF <- sparkDF[, c("date", "type")]
o recentData <- subset(sparkDF, sparkDF$date == "2015-10-24")
o firstRow <- sparkDF[1, ]
o names(subsetDF) <- c("Date", "Type")
o dim(recentData)
o head(collect(count(group_by(subsetDF, “Date”))))
17
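A rough local sketch of these operations, using a plain Python list of dicts in place of the distributed DataFrame (the `date`/`type` column names come from the slide; the sample rows are hypothetical):

```python
from collections import Counter

# Hypothetical rows; "date" and "type" mirror the columns on the slide
rows = [
    {"date": "2015-10-24", "type": "error"},
    {"date": "2015-10-24", "type": "info"},
    {"date": "2015-10-23", "type": "error"},
]

# Analogue of sparkDF[, c("date", "type")] followed by names(subsetDF) <- c("Date", "Type")
subset_df = [{"Date": r["date"], "Type": r["type"]} for r in rows]

# Analogue of subset(sparkDF, sparkDF$date == "2015-10-24") and dim(recentData)
recent = [r for r in rows if r["date"] == "2015-10-24"]
print(len(recent))  # 2 rows

# Analogue of head(collect(count(group_by(subsetDF, "Date"))))
by_date = Counter(r["Date"] for r in subset_df)
print(dict(by_date))  # {'2015-10-24': 2, '2015-10-23': 1}
```

The point of the SparkR API is that the R-style expressions above describe lazy, distributed operations, not eager local ones; only `collect()` brings data into R.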
Overview of SparkR API :: SQL
You can register a DataFrame as a table and query it in SQL
o logs <- read.df(sqlContext, "data/logs", source = "json")
o registerTempTable(logs, "logsTable")
o errorsByCode <- sql(sqlContext, "select code, count(*) as num from logsTable where type = 'error' group by code")
o reviewsDF <- table(sqlContext, "reviewsTable")
o registerTempTable(filter(reviewsDF, reviewsDF$rating == 5), "fiveStars")
18
Mixing R and SQL
Pass a query to the SQLContext and get the result back as a DataFrame
19
# Register DataFrame as a table
registerTempTable(df, "dataTable")
# Complex SQL query; the result is returned as another DataFrame
aggCount <- sql(sqlContext, "select count(*) as num, type, date from dataTable group by type, date order by date desc")
qplot(date, num, data = collect(aggCount), geom = "line")
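The register-then-query pattern can be imitated locally with Python's built-in sqlite3; the table name, columns, and rows below are hypothetical, chosen to mirror the slide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Register" a small table, akin to registerTempTable(df, "dataTable")
conn.execute("create table dataTable (type text, date text)")
conn.executemany(
    "insert into dataTable values (?, ?)",
    [("error", "2015-10-24"), ("error", "2015-10-24"), ("info", "2015-10-23")],
)
# Aggregate in SQL; rows come back to the host language, akin to collect() on the result DataFrame
agg = conn.execute(
    "select count(*) as num, type, date from dataTable "
    "group by type, date order by date desc"
).fetchall()
print(agg)  # [(2, 'error', '2015-10-24'), (1, 'info', '2015-10-23')]
```

The key difference is that in SparkR the SQL runs through Catalyst across the cluster, and the result is itself a distributed DataFrame until you collect it.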
Moving between languages
20
R:
df <- read.df(...)
wiki <- filter(df, ...)
registerTempTable(wiki, "wiki")

Scala:
val wiki = table("wiki")
val parsed = wiki.map {
  case Row(_, _, text: String, _, _) => text.split(' ')
}
val model = KMeans.train(parsed)
Demo
21
How to get started with SparkR?
• On your computer
1. Download the latest version of Spark (2.0)
2. Build it (Maven or sbt)
3. Run ./install-dev.sh inside the R directory
4. Start R shell by running ./bin/sparkR
• Deploy Spark on your cluster
• Sign up for Databricks Community Edition:
https://databricks.com/try-databricks
22
Community Edition Waitlist
23
Summary
1. SparkR is an R frontend to Apache Spark
2. Distributed data resides in the JVM
3. Workers are not running R processes (yet)
4. Distinction between Spark DataFrames and R data frames
24
25
Thank you
