A DataFrame Abstraction
Layer for SparkR
Chris Freeman (Alteryx)
Agenda
• What is SparkR?
• History of DataFrames
• Why DataFrames?
• How do DataFrames work?
• Demo
• On the Roadmap
• Questions
2
What is SparkR?
• New R language API for Spark and SparkSQL
• Exposes existing Spark functionality in an
R-friendly syntax via the DataFrame API
• Has its own shell, but can also be imported like a
standard R package and used with RStudio.
3
What is SparkR?
• An opportunity to make Spark accessible to the
large community of R developers who already
have clear ideas about how to do analytics in R
• No need to learn a new programming paradigm
when working with Spark
4
History of DataFrames
• SparkR began as an R package that ported
Spark’s core functionality (RDDs) to the R
language.
• The next logical step was to add SparkSQL and
SchemaRDDs.
• An initial implementation of SQLContext and
SchemaRDDs is now working in SparkR
5
History of DataFrames
6
History of DataFrames
7
History of DataFrames
Me:
8
History of DataFrames
Me:
9
Reynold:
Maybe this isn’t such a bad thing…
10
How can I use Spark to do something
simple?
"Michael, 29"
"Andy, 30"
"Justin, 19"
"Bob, 22"
"Chris, 28"
"Garth, 36"
"Tasha, 24"
"Mac, 30"
"Neil, 32"
11
Let’s say we wanted to compute the average age using regular RDDs. What would that look like?
How can I use Spark to do something
simple?
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD,
function(line) {
strsplit(line, ", ")
})
12
How can I use Spark to do something
simple?
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD,
function(line) {
strsplit(line, ", ")
})
ageInt <- lapply(lines,
function(line) {
as.numeric(line[2])
})
13
How can I use Spark to do something
simple?
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD,
function(line) {
strsplit(line, ", ")
})
ageInt <- lapply(lines,
function(line) {
as.numeric(line[2])
})
sum <- reduce(ageInt, function(x, y) { x + y })
14
How can I use Spark to do something
simple?
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD,
function(line) {
strsplit(line, ", ")
})
ageInt <- lapply(lines,
function(line) {
as.numeric(line[2])
})
sum <- reduce(ageInt, function(x, y) { x + y })
avg <- sum / count(peopleRDD)
15
How can I use Spark to do something
simple?
peopleRDD <- textFile(sc, "people.txt")
lines <- flatMap(peopleRDD,
function(line) {
strsplit(line, ", ")
})
ageInt <- lapply(lines,
function(line) {
as.numeric(line[2])
})
sum <- reduce(ageInt, function(x, y) { x + y })
avg <- sum / count(peopleRDD)
16
There’s got to be a better way.
17
What I’d hoped to see
{"name":"Michael", "age":29}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Bob", "age":22}
{"name":"Chris", "age":28}
{"name":"Garth", "age":36}
{"name":"Tasha", "age":24}
{"name":"Mac", "age":30}
{"name":"Neil", "age":32}
18
What I’d hoped to see
df <- read.df(sqlCtx, "people.json", "json")
19
What I’d hoped to see
df <- read.df(sqlCtx, "people.json", "json")
avg <- select(df, avg(df$age))
20
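To bring the one-row aggregate back into a local R data.frame, a minimal sketch (assuming the same `sqlCtx` and `people.json` as above, SparkR 1.4-era API):

```r
# Compute the average age and collect it to the driver as a local data.frame
df  <- read.df(sqlCtx, "people.json", "json")
avg <- select(df, avg(df$age))

localResult <- collect(avg)   # an ordinary R data.frame: one row, one column
localResult[1, 1]             # the average age as a plain numeric
```

`collect` is the boundary between the distributed DataFrame and ordinary R objects, so it should only be called on results small enough to fit in local memory.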
Why DataFrames?
• Uses the distributed, parallel capabilities offered
by RDDs, but imposes a schema on the data
• More structure == Easier access and
manipulation
• Natural extension of existing R conventions since
DataFrames are already the standard
21
Why DataFrames?
• Super awesome distributed, in-memory
collections
22
Why DataFrames?
• Super awesome distributed, in-memory
collections
• Schemas == metadata, structure, declarative
instead of imperative
23
Why DataFrames?
• Super awesome distributed, in-memory
collections
• Schemas == metadata, structure, declarative
instead of imperative
• ????
24
Why DataFrames?
• Super awesome distributed, in-memory
collections
• Schemas == metadata, structure, declarative
instead of imperative
• ????
• Profit
25
DataFrames in SparkR
• Multiple Components:
– A set of native S4 classes and methods that
live inside a standard R package
– A SparkR backend that passes data
structures and method calls to the JVM
– A set of “helper” methods written in Scala
26
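As an illustration of the first two components (a simplified sketch, not the actual SparkR source; the real classes carry more state), the S4 layer can be as thin as a class wrapping a reference to a JVM object, with methods that forward calls through the backend:

```r
# Sketch: an S4 class holding a handle to a JVM-side DataFrame
setClass("DataFrame", representation(env = "environment", sdf = "jobj"))

setGeneric("count", function(x) standardGeneric("count"))
setMethod("count", signature("DataFrame"), function(x) {
  # callJMethod is SparkR's internal helper: it serializes the method call,
  # ships it to the JVM backend, and deserializes the returned value
  callJMethod(x@sdf, "count")
})
```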
Why does the structure matter?
• Native R classes allow us to extend the existing
DataFrame API by adding R-like syntax and
interactions
• Handoff to the JVM gives us full access to
Spark’s DAG capabilities and Catalyst
optimizations, e.g. constant-folding, predicate
pushdown, and code generation.
27
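One way to see Catalyst at work from R is `explain`, which prints the query plans (a sketch; the exact plan text depends on the Spark version):

```r
df     <- read.df(sqlCtx, "people.json", "json")
adults <- filter(df, df$age > 21)

# extended = TRUE shows the parsed, analyzed, optimized, and physical plans;
# the age predicate may appear pushed down into the data-source scan
explain(adults, extended = TRUE)
```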
SparkR DataFrame Features
• Column access using ‘$’ or ‘[ ]’ just like in R
• dplyr-like DataFrame manipulation:
– filter
– groupBy
– summarize
– mutate
• Access to external R packages that extend R
syntax
28
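A sketch of how those pieces compose on the `people` DataFrame from earlier (column names assumed to be `name` and `age`):

```r
df <- read.df(sqlCtx, "people.json", "json")

# Column access, as in base R
df$age
df[, "name"]

# dplyr-like verbs
adults     <- filter(df, df$age >= 21)
byName     <- groupBy(df, df$name)
avgByName  <- summarize(byName, avgAge = avg(df$age))
withMonths <- mutate(df, ageInMonths = df$age * 12)

head(withMonths)   # peek at the first rows, like head() on a local data.frame
```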
Demo Time!
29
On the Roadmap
• Spark 1.4: SparkR becomes an official API
– Primarily focused on SparkSQL/DataFrame
implementation
• Spark 1.5: Extend SparkR to include machine learning
capabilities (e.g. sparkML)
• For more information, be sure to check out “SparkR: The
Past, Present, and Future” at 4:30 on the Data Science
track.
30
Integration with Alteryx
• Drag-and-drop GUI for data analysis
• Spark functionality built directly into existing
tools using SparkR
• Interact with a remote Spark cluster from your
desktop via Alteryx Designer
• Combine local and in-database data sources in
one workflow.
31
Developer Community
• SparkR originated at UC Berkeley’s AMPLab, with
additional contributions from Alteryx, Intel,
Databricks, and others.
• Working on integration with Spark Packages
– Easily extend Spark with new functionality and
distribute via the Spark Package repository
32
Questions?
Slides, Demo, and Data available on GitHub at:
https://github.com/cafreeman/SparkR_DataFrame_Demo
@15lettermax
cafreeman
33