Enabling Exploratory Data Science
with Apache Spark and R
Hossein Falaki (@mhfalaki)
About the speaker: Hossein Falaki
Hossein Falaki is a software engineer at Databricks
working on the next big thing. Prior to that, he was
a data scientist at Apple's personal assistant, Siri.
He graduated with a Ph.D. in Computer Science
from UCLA, where he was a member of the Center
for Embedded Networked Sensing (CENS).
2
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with
Databricks; he is a hands-on data science engineer
with more than 15 years of experience developing
internet-scale infrastructure, data platforms, and
distributed systems for both on-premises and cloud.
3
We are Databricks, the company behind Apache Spark
• Founded by the creators of Apache Spark in 2013
• 75% share of Spark code contributed by Databricks in 2014
• Created Databricks on top of Spark to make big data simple
4
Why do we like R?
5
• Open source
• Highly dynamic
• Interactive environment
• Rich ecosystem of packages
• Powerful visualization infrastructure
• Data frames make data manipulation convenient
• Taught by many schools to stats and computing students
What would be ideal?
Seamless manipulation and analysis of very large data in R
• R’s flexible syntax
• R’s rich package ecosystem
• R’s interactive environment
• Scalability (scale up and out)
• Integration with distributed data sources / storage
6
Augmenting R with other frameworks
In practice, data scientists use R in conjunction with other frameworks
(Hadoop MR, Hive, Pig, relational databases, etc.)
7
[Workflow diagram] Framework X (Language Y) with distributed and local storage:
1. Load, clean, transform, aggregate, sample
2. Save to local storage
3. Read and analyze in R
Iterate
What is SparkR?
An R package distributed with Apache Spark:
• Provides an R frontend to Spark
• Exposes Spark DataFrames (inspired by R and Pandas)
• Convenient interoperability between R and Spark DataFrames
8
R (dynamic environment, interactivity, packages, visualization)
+ Spark (distributed/robust processing, data sources, off-memory data structures)
How does SparkR solve our problems?
No local storage involved
Write everything in R
Use Spark's distributed cache for interactive/iterative analysis at
speed of thought
9
Example SparkR program
# Loading distributed data
df <- read.df("hdfs://bigdata/logs", source = "json")
# Distributed filtering and aggregation
errors <- subset(df, df$type == "error")
counts <- agg(groupBy(errors, df$code), num = count(df$code))
# Collecting and plotting small data
qplot(code, num, data = collect(counts), geom = "bar", stat = "identity") + coord_flip()
10
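For readers without a Spark cluster at hand, the same filter-then-group-then-count logic can be sketched locally in plain Python; the sample records below are hypothetical, standing in for the distributed JSON logs:

```python
import json
from collections import Counter

# Hypothetical log records, standing in for the JSON logs on HDFS
raw = """
{"type": "error", "code": 500}
{"type": "info",  "code": 200}
{"type": "error", "code": 404}
{"type": "error", "code": 500}
"""
logs = [json.loads(line) for line in raw.strip().splitlines()]

# Analogue of subset(df, df$type == "error")
errors = [r for r in logs if r["type"] == "error"]

# Analogue of agg(groupBy(errors, df$code), num = count(df$code))
counts = Counter(r["code"] for r in errors)
print(dict(counts))  # {500: 2, 404: 1}
```

In SparkR the filtering and aggregation run on the cluster and only the small `counts` result is collected; here everything is local, but the shape of the computation is the same.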
SparkR architecture
11
[Architecture diagram] The R process communicates with the Spark driver JVM through the R Backend; the driver coordinates the JVM workers, which read from the data sources.
Overview of SparkR API
IO
• read.df / write.df
• createDataFrame / collect
Caching
• cache / persist / unpersist
• cacheTable / uncacheTable
Utility functions
• dim / head / take
• names / rand / sample / ...
12
ML Lib
• glm / kmeans /
DataFrame API
select / subset / groupBy
head / showDF / unionAll
agg / avg / column / ...
SQL
sql / table / saveAsTable
registerTempTable / tables
Overview of SparkR API :: SQLContext
SQLContext is your interface to Spark functionality in R
o SparkR DataFrames are implemented on top of Spark SQL tables
o All DataFrame operations go through a SQL optimizer (Catalyst)
13
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
From now on, you don't need the Spark context (sc) any more.
Moving data between R and JVM
14
[Diagram] read.df() and write.df() move data between the JVM workers and distributed storage (HDFS/S3/…).
Moving data between R and JVM
15
[Diagram] SparkR::collect() and SparkR::createDataFrame() move data between the R process and the JVM, via the R Backend.
Overview of SparkR API :: Caching
16
Controls caching of distributed data:
o persist(sparkDF, storageLevel)
o DISK_ONLY
o MEMORY_AND_DISK
o MEMORY_AND_DISK_SER
o MEMORY_ONLY
o MEMORY_ONLY_SER
o OFF_HEAP
o cache(sparkDF) == persist(sparkDF, "MEMORY_ONLY")
o cacheTable(sqlContext, "table_name")
Overview of SparkR API :: DataFrame API
A SparkR DataFrame behaves similarly to an R data.frame
o sparkDF$newCol <- sparkDF$col + 1
o subsetDF <- sparkDF[, c("date", "type")]
o recentData <- subset(sparkDF, sparkDF$date == "2015-10-24")
o firstRow <- sparkDF[1, ]
o names(subsetDF) <- c("Date", "Type")
o dim(recentData)
o head(collect(count(group_by(subsetDF, “Date”))))
17
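A rough local sketch of these operations, using a plain Python list of dicts in place of the distributed DataFrame (the `date`/`type` column names come from the slide; the sample rows are hypothetical):

```python
from collections import Counter

# Hypothetical rows; "date" and "type" mirror the columns on the slide
rows = [
    {"date": "2015-10-24", "type": "error"},
    {"date": "2015-10-24", "type": "info"},
    {"date": "2015-10-23", "type": "error"},
]

# Analogue of sparkDF[, c("date", "type")] followed by names(subsetDF) <- c("Date", "Type")
subset_df = [{"Date": r["date"], "Type": r["type"]} for r in rows]

# Analogue of subset(sparkDF, sparkDF$date == "2015-10-24") and dim(recentData)
recent = [r for r in rows if r["date"] == "2015-10-24"]
print(len(recent))  # 2 rows

# Analogue of head(collect(count(group_by(subsetDF, "Date"))))
by_date = Counter(r["Date"] for r in subset_df)
print(dict(by_date))  # {'2015-10-24': 2, '2015-10-23': 1}
```

The point of the SparkR API is that the R-style expressions above describe lazy, distributed operations, not eager local ones; only `collect()` brings data into R.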
Overview of SparkR API :: SQL
You can register a DataFrame as a table and query it in SQL
o logs <- read.df(sqlContext, "data/logs", source = "json")
o registerTempTable(logs, "logsTable")
o errorsByCode <- sql(sqlContext, "select code, count(*) as num from logsTable where type = 'error' group by code")
o reviewsDF <- table(sqlContext, "reviewsTable")
o registerTempTable(filter(reviewsDF, reviewsDF$rating == 5), "fiveStars")
18
Mixing R and SQL
Pass a query to the SQLContext and get the result back as a DataFrame
19
# Register DataFrame as a table
registerTempTable(df, "dataTable")
# Complex SQL query; the result is returned as another DataFrame
aggCount <- sql(sqlContext, "select count(*) as num, type, date from dataTable group by type, date order by date desc")
qplot(date, num, data = collect(aggCount), geom = "line")
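The register-then-query pattern can be imitated locally with Python's built-in sqlite3; the table name, columns, and rows below are hypothetical, chosen to mirror the slide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Register" a small table, akin to registerTempTable(df, "dataTable")
conn.execute("create table dataTable (type text, date text)")
conn.executemany(
    "insert into dataTable values (?, ?)",
    [("error", "2015-10-24"), ("error", "2015-10-24"), ("info", "2015-10-23")],
)
# Aggregate in SQL; rows come back to the host language, akin to collect() on the result DataFrame
agg = conn.execute(
    "select count(*) as num, type, date from dataTable "
    "group by type, date order by date desc"
).fetchall()
print(agg)  # [(2, 'error', '2015-10-24'), (1, 'info', '2015-10-23')]
```

The key difference is that in SparkR the SQL runs through Catalyst across the cluster, and the result is itself a distributed DataFrame until you collect it.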
Moving between languages
20
R:
df <- read.df(...)
wiki <- filter(df, ...)
registerTempTable(wiki, "wiki")

Scala:
val wiki = table("wiki")
val parsed = wiki.map {
  case Row(_, _, text: String, _, _) => text.split(' ')
}
val model = KMeans.train(parsed)
Demo
21
How to get started with SparkR?
• On your computer
1. Download the latest version of Spark (2.0)
2. Build it (Maven or sbt)
3. Run ./install-dev.sh inside the R directory
4. Start R shell by running ./bin/sparkR
• Deploy Spark on your cluster
• Sign up for Databricks Community Edition:
https://databricks.com/try-databricks
22
Community Edition Waitlist
23
Summary
1. SparkR is an R frontend to Apache Spark
2. Distributed data resides in the JVM
3. Workers are not running R processes (yet)
4. Distinction between Spark DataFrames and R data frames
24
25
Thank you
